Re: [PATCH v3 0/4] kernfs: proposed locking and concurrency improvement

2021-04-19 Thread Ian Kent
On Mon, 2021-04-19 at 15:56 +0800, Fox Chen wrote:
> On Fri, Apr 9, 2021 at 9:14 AM Ian Kent  wrote:
> > There have been a few instances of contention on the kernfs_mutex
> > during path walks, a case on very large IBM systems seen by myself,
> > a report by Brice Goglin followed up by Fox Chen, and I've since
> > seen a couple of other reports by CoreOS users.
> > 
> > The common thread is a large number of kernfs path walks leading to
> > slowness of path walks due to kernfs_mutex contention.
> > 
> > The problem is that changes to the VFS over time have increased its
> > concurrency capabilities to an extent that kernfs's use of a mutex
> > is no longer appropriate. There's also a less common problem of
> > walks for non-existent paths causing contention when there are
> > quite a few of them.
> > 
> > This patch series is relatively straightforward.
> > 
> > All it does is add the ability to take advantage of VFS negative
> > dentry caching to avoid needless dentry alloc/free cycles for
> > lookups of paths that don't exist, and change the kernfs_mutex to
> > a read/write semaphore.
> > 
> > The patch that tried to stay in VFS rcu-walk mode during path walks
> > has been dropped for two reasons. First, it doesn't actually give
> > very much improvement and, second, if there's a place where
> > mistakes could go unnoticed it would be in that path. This makes
> > the patch series simpler to review and reduces the likelihood of
> > problems going unnoticed and popping up later.
> > 
> > The patch to use a revision to identify if a directory has changed
> > has also been dropped. If the directory has changed the dentry
> > revision needs to be updated to avoid subsequent rb tree searches,
> > and after changing to use a read/write semaphore the update also
> > requires a lock. But the d_lock is the only lock available at this
> > point, which might itself be contended.
> > 
> > Changes since v2:
> > - actually fix the inode attribute update locking.
> > - drop the patch that tried to stay in rcu-walk mode.
> > - drop the "use a revision to identify if a directory has changed"
> >   patch.
> > 
> > Changes since v1:
> > - fix locking in .permission() and .getattr() by re-factoring the
> >   attribute handling code.
> > 
> > ---
> > 
> > Ian Kent (4):
> >   kernfs: move revalidate to be near lookup
> >   kernfs: use VFS negative dentry caching
> >   kernfs: switch kernfs to use an rwsem
> >   kernfs: use i_lock to protect concurrent inode updates
> > 
> > 
> >  fs/kernfs/dir.c |  240 +++----
> >  fs/kernfs/file.c|4 -
> >  fs/kernfs/inode.c   |   18 ++-
> >  fs/kernfs/kernfs-internal.h |5 +
> >  fs/kernfs/mount.c   |   12 +-
> >  fs/kernfs/symlink.c |4 -
> >  include/linux/kernfs.h  |2
> >  7 files changed, 155 insertions(+), 130 deletions(-)
> > 
> > --
> > 
> 
> Hi Ian,
> 
> I tested this patchset with my benchmark
> (https://github.com/foxhlchen/sysfs_benchmark) on a 96 CPU
> (aws c5) machine.
> 
> The result was promising:
> Before, one open+read+close cycle took 500us without much variation.
> With this patch, the fastest one takes only 30us, though the slowest
> is still around 100us (due to the spinlock). A perf report shows no
> more significant mutex contention.

Thanks for this Fox.
I'll have a look through the data a bit later.

For now, I'd like to keep the series as simple as possible.

But there shouldn't be a problem reading and comparing those
attributes between the kernfs node and the inode without taking
the additional lock. So a check could be done and the lock only
taken if an update is needed.

That may well improve that worst case quite a bit but, as I say,
it would need to be a follow-up change.
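Sketched in userspace C, that check-before-locking idea looks something like the following (the field names are hypothetical stand-ins; the real code would compare the kernfs_iattrs against the inode with the rwsem held for read):

```c
#include <pthread.h>

/* Hypothetical stand-ins for the kernfs node / inode attributes. */
struct attrs {
	int mode;
	int nlink;
};

static pthread_mutex_t i_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_taken;	/* counts how often the slow path ran */

/* Refresh 'inode' from 'kn', taking i_lock only when an update is
 * actually needed -- the common no-change case stays lock-free. */
static void refresh(const struct attrs *kn, struct attrs *inode)
{
	if (kn->mode == inode->mode && kn->nlink == inode->nlink)
		return;	/* nothing changed: no lock, no stores */

	pthread_mutex_lock(&i_lock);
	inode->mode = kn->mode;
	inode->nlink = kn->nlink;
	lock_taken++;
	pthread_mutex_unlock(&i_lock);
}
```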

Ian



Re: [PATCH v3 2/4] kernfs: use VFS negative dentry caching

2021-04-09 Thread Ian Kent
On Fri, 2021-04-09 at 16:26 +0800, Ian Kent wrote:
> On Fri, 2021-04-09 at 01:35 +, Al Viro wrote:
> > On Fri, Apr 09, 2021 at 09:15:06AM +0800, Ian Kent wrote:
> > > + parent = kernfs_dentry_node(dentry->d_parent);
> > > + if (parent) {
> > > + const void *ns = NULL;
> > > +
> > > + if (kernfs_ns_enabled(parent))
> > > + ns = kernfs_info(dentry->d_parent->d_sb)->ns;
> > 
> > For any dentry d, we have d->d_parent->d_sb == d->d_sb.  All
> > the time.
> > If you ever run into the case where that would not be true, you've
> > found
> > a critical bug.
> 
> Right, yes.
> 
> > > + kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > > + if (kn)
> > > + goto out_bad;
> > > + }
> > 
> > Umm...  What's to prevent a race with successful rename(2)?  IOW,
> > what's
> > there to stabilize ->d_parent and ->d_name while we are in that
> > function?
> 
> Indeed, glad you looked at this.
> 
> Now I'm wondering how kernfs_iop_rename() protects itself from
> concurrent kernfs_rename_ns() ... 

As I thought ... I haven't done an exhaustive search, but I can't find
any file system that doesn't call back into kernfs from its
kernfs_syscall_ops (if provided at kernfs root creation).

I don't see anything that uses kernfs that defines a .rename() op,
but if there were one it would be expected to call back into kernfs,
at which point it would block on the kernfs_mutex (kernfs_rwsem)
until it's released.

So I don't think there can be changes in this case due to the lock
taken just above the code you're questioning.

I need to think a bit about whether the dentry being negative (i.e.
not having a kernfs node) could allow bad things to happen ...

Or am I misunderstanding the race you're pointing out here?

Ian



Re: [PATCH v3 2/4] kernfs: use VFS negative dentry caching

2021-04-09 Thread Ian Kent
On Fri, 2021-04-09 at 01:35 +, Al Viro wrote:
> On Fri, Apr 09, 2021 at 09:15:06AM +0800, Ian Kent wrote:
> > +   parent = kernfs_dentry_node(dentry->d_parent);
> > +   if (parent) {
> > +   const void *ns = NULL;
> > +
> > +   if (kernfs_ns_enabled(parent))
> > +   ns = kernfs_info(dentry->d_parent->d_sb)->ns;
> 
>   For any dentry d, we have d->d_parent->d_sb == d->d_sb.  All
> the time.
> If you ever run into the case where that would not be true, you've
> found
> a critical bug.

Right, yes.

> 
> > +   kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
> > +   if (kn)
> > +   goto out_bad;
> > +   }
> 
> Umm...  What's to prevent a race with successful rename(2)?  IOW,
> what's
> there to stabilize ->d_parent and ->d_name while we are in that
> function?

Indeed, glad you looked at this.

Now I'm wondering how kernfs_iop_rename() protects itself from
concurrent kernfs_rename_ns() ... 



[PATCH v3 4/4] kernfs: use i_lock to protect concurrent inode updates

2021-04-08 Thread Ian Kent
The inode operations .permission() and .getattr() take the kernfs node
write lock, but all that's needed is to keep the rb tree stable while
updating the inode attributes, and to protect the update itself
against concurrent changes.

And .permission() is called frequently during path walks and can cause
quite a bit of contention between kernfs node operations and path
walks when the number of concurrent walks is high.

To change kernfs_iop_getattr() and kernfs_iop_permission() to take
the rw sem read lock instead of the write lock an additional lock is
needed to protect against multiple processes concurrently updating
the inode attributes and link count in kernfs_refresh_inode().

The inode i_lock seems like the sensible thing to use to protect these
inode attribute updates so use it in kernfs_refresh_inode().

Signed-off-by: Ian Kent 
---
 fs/kernfs/inode.c |   10 ++
 fs/kernfs/mount.c |4 ++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3b01e9e61f14e..6728ecd81eb37 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 {
	struct kernfs_iattrs *attrs = kn->iattr;
 
+	spin_lock(&inode->i_lock);
	inode->i_mode = kn->mode;
	if (attrs)
		/*
@@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 
	if (kernfs_type(kn) == KERNFS_DIR)
		set_nlink(inode, kn->dir.subdirs + 2);
+	spin_unlock(&inode->i_lock);
 }
 
 int kernfs_iop_getattr(struct user_namespace *mnt_userns,
@@ -191,9 +193,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
	struct inode *inode = d_inode(path->dentry);
	struct kernfs_node *kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
	generic_fillattr(&init_user_ns, inode, stat);
	return 0;
@@ -284,9 +286,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
 
	kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
 
	return generic_permission(&init_user_ns, inode, mask);
 }
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index baa4155ba2edf..f2f909d09f522 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
	sb->s_shrink.seeks = 0;
 
	/* get root inode, initialize and unlock it */
-	down_write(&kernfs_rwsem);
+	down_read(&kernfs_rwsem);
	inode = kernfs_get_inode(sb, info->root->kn);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
	if (!inode) {
		pr_debug("kernfs: could not get root inode\n");
		return -ENOMEM;




[PATCH v3 3/4] kernfs: switch kernfs to use an rwsem

2021-04-08 Thread Ian Kent
The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel with path walk lookups.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |  117 ---
 fs/kernfs/file.c|4 +
 fs/kernfs/inode.c   |   16 +++---
 fs/kernfs/kernfs-internal.h |5 +-
 fs/kernfs/mount.c   |   12 ++--
 fs/kernfs/symlink.c |4 +
 include/linux/kernfs.h  |2 -
 7 files changed, 86 insertions(+), 74 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index edfeee1bf38ec..9bea235f2ec66 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@
 
 #include "kernfs-internal.h"
 
-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
 static DEFINE_SPINLOCK(kernfs_rename_lock);	/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];	/* protected by rename_lock */
 static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
@@ -26,10 +26,21 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);	/* root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-	lockdep_assert_held(&kernfs_mutex);
	return atomic_read(&kn->active) >= 0;
 }
 
+static bool kernfs_active_write(struct kernfs_node *kn)
+{
+	lockdep_assert_held_write(&kernfs_rwsem);
+	return kernfs_active(kn);
+}
+
+static bool kernfs_active_read(struct kernfs_node *kn)
+{
+	lockdep_assert_held_read(&kernfs_rwsem);
+	return kernfs_active(kn);
+}
+
 static bool kernfs_lockdep(struct kernfs_node *kn)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -340,7 +351,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
  * @kn->parent->dir.children.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
  *
  * RETURNS:
  * 0 on susccess -EEXIST on failure.
@@ -385,7 +396,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
  * removed, %false if @kn wasn't on the rbtree.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
  */
 static bool kernfs_unlink_sibling(struct kernfs_node *kn)
 {
@@ -455,14 +466,14 @@ void kernfs_put_active(struct kernfs_node *kn)
  * return after draining is complete.
  */
 static void kernfs_drain(struct kernfs_node *kn)
-	__releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+	__releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
 {
	struct kernfs_root *root = kernfs_root(kn);
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held_write(&kernfs_rwsem);
	WARN_ON_ONCE(kernfs_active(kn));
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
	if (kernfs_lockdep(kn)) {
		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -481,7 +492,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
	kernfs_drain_open_files(kn);
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 }
 
 /**
@@ -720,7 +731,7 @@ int kernfs_add_one(struct kernfs_node *kn)
	bool has_ns;
	int ret;
 
-	mutex_lock(&kernfs_mutex);
+	down_write(&kernfs_rwsem);
 
	ret = -EINVAL;
	has_ns = kernfs_ns_enabled(parent);
@@ -735,7 +746,7 @@ int kernfs_add_one(struct kernfs_node *kn)
	if (parent->flags & KERNFS_EMPTY_DIR)
		goto out_unlock;
 
-	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
+	if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active_write(parent))
		goto out_unlock;
 
	kn->hash = kernfs_name_hash(kn->name, kn->ns);
@@ -751,7 +762,7 @@ int kernfs_add_one(struct kernfs_node *kn)
		ps_iattr->ia_mtime = ps_iattr->ia_ctime;
	}
 
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
 
	/*
	 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -765,7 +776,7 @@ int kernfs_add_one(struct kernfs_node *kn)
	return 0;
 
 out_unlock:
-	mutex_unlock(&kernfs_mutex);
+	up_write(&kernfs_rwsem);
	return ret;
 }
 
@@ -786,7 +797,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
	bool has_ns = kernfs_ns_enabled(parent);
	unsigned int hash;
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held(&kernfs_rwsem);
 
	if (has_ns != (bool)ns) {
		WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -818,7 +829,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
	size_t len;
	char *p, *name;
 
-	lockdep_assert_held(&kernfs_mutex);
+	lockdep_assert_held_read(&kernfs_rwsem);
 
	/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
	spin_lock_irq(&kernfs_rename_lock);
@@ -858,10 +869,10 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
 {
	struct kernfs_no

[PATCH v3 2/4] kernfs: use VFS negative dentry caching

2021-04-08 Thread Ian Kent
If there are many lookups for non-existent paths these negative lookups
can lead to a lot of overhead during path walks.

The VFS allows dentries to be created as negative and hashed, and caches
them so they can be used to reduce the fairly high overhead alloc/free
cycle that occurs during these lookups.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   55 +--
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 4c69e2af82dac..edfeee1bf38ec 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1037,12 +1037,33 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
	if (flags & LOOKUP_RCU)
		return -ECHILD;
 
-	/* Always perform fresh lookup for negatives */
-	if (d_really_is_negative(dentry))
-		goto out_bad_unlocked;
+	mutex_lock(&kernfs_mutex);
 
	kn = kernfs_dentry_node(dentry);
-	mutex_lock(&kernfs_mutex);
+
+	/* Negative hashed dentry? */
+	if (!kn) {
+		struct kernfs_node *parent;
+
+		/* If the kernfs node can be found this is a stale negative
+		 * hashed dentry so it must be discarded and the lookup redone.
+		 */
+		parent = kernfs_dentry_node(dentry->d_parent);
+		if (parent) {
+			const void *ns = NULL;
+
+			if (kernfs_ns_enabled(parent))
+				ns = kernfs_info(dentry->d_parent->d_sb)->ns;
+			kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+			if (kn)
+				goto out_bad;
+		}
+
+		/* The kernfs node doesn't exist, leave the dentry negative
+		 * and return success.
+		 */
+		goto out;
+	}
 
	/* The kernfs node has been deactivated */
	if (!kernfs_active_read(kn))
@@ -1060,12 +1081,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
	    kernfs_info(dentry->d_sb)->ns != kn->ns)
		goto out_bad;
-
+out:
	mutex_unlock(&kernfs_mutex);
	return 1;
 out_bad:
	mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
	return 0;
 }
 
@@ -1080,33 +1100,24 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
	struct dentry *ret;
	struct kernfs_node *parent = dir->i_private;
	struct kernfs_node *kn;
-	struct inode *inode;
+	struct inode *inode = NULL;
	const void *ns = NULL;
 
	mutex_lock(&kernfs_mutex);
-
	if (kernfs_ns_enabled(parent))
		ns = kernfs_info(dir->i_sb)->ns;
 
	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
-
-	/* no such entry */
-	if (!kn || !kernfs_active(kn)) {
-		ret = NULL;
-		goto out_unlock;
-	}
-
	/* attach dentry and inode */
-	inode = kernfs_get_inode(dir->i_sb, kn);
-	if (!inode) {
-		ret = ERR_PTR(-ENOMEM);
-		goto out_unlock;
+	if (kn && kernfs_active(kn)) {
+		inode = kernfs_get_inode(dir->i_sb, kn);
+		if (!inode)
+			inode = ERR_PTR(-ENOMEM);
	}
-
-	/* instantiate and hash dentry */
+	/* instantiate and hash (possibly negative) dentry */
	ret = d_splice_alias(inode, dentry);
- out_unlock:
	mutex_unlock(&kernfs_mutex);
+
	return ret;
 }
 




[PATCH v3 1/4] kernfs: move revalidate to be near lookup

2021-04-08 Thread Ian Kent
While the dentry operation kernfs_dop_revalidate() is grouped with
dentry type functions it also has a strong affinity to the inode
operation ->lookup().

In order to take advantage of the VFS negative dentry caching that
can be used to reduce path lookup overhead on non-existent paths it
will need to call kernfs_find_ns(). So, to avoid a forward declaration,
move it to be near kernfs_iop_lookup().

There's no functional change from this patch.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   86 ---
 1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7e0e62deab53c..4c69e2af82dac 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -548,49 +548,6 @@ void kernfs_put(struct kernfs_node *kn)
 }
 EXPORT_SYMBOL_GPL(kernfs_put);
 
-static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
-{
-	struct kernfs_node *kn;
-
-	if (flags & LOOKUP_RCU)
-		return -ECHILD;
-
-	/* Always perform fresh lookup for negatives */
-	if (d_really_is_negative(dentry))
-		goto out_bad_unlocked;
-
-	kn = kernfs_dentry_node(dentry);
-	mutex_lock(&kernfs_mutex);
-
-	/* The kernfs node has been deactivated */
-	if (!kernfs_active(kn))
-		goto out_bad;
-
-	/* The kernfs node has been moved? */
-	if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
-		goto out_bad;
-
-	/* The kernfs node has been renamed */
-	if (strcmp(dentry->d_name.name, kn->name) != 0)
-		goto out_bad;
-
-	/* The kernfs node has been moved to a different namespace */
-	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns != kn->ns)
-		goto out_bad;
-
-	mutex_unlock(&kernfs_mutex);
-	return 1;
-out_bad:
-	mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
-	return 0;
-}
-
-const struct dentry_operations kernfs_dops = {
-	.d_revalidate	= kernfs_dop_revalidate,
-};
-
 /**
  * kernfs_node_from_dentry - determine kernfs_node associated with a dentry
  * @dentry: the dentry in question
@@ -1073,6 +1030,49 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
	return ERR_PTR(rc);
 }
 
+static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct kernfs_node *kn;
+
+	if (flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	/* Always perform fresh lookup for negatives */
+	if (d_really_is_negative(dentry))
+		goto out_bad_unlocked;
+
+	kn = kernfs_dentry_node(dentry);
+	mutex_lock(&kernfs_mutex);
+
+	/* The kernfs node has been deactivated */
+	if (!kernfs_active_read(kn))
+		goto out_bad;
+
+	/* The kernfs node has been moved? */
+	if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+		goto out_bad;
+
+	/* The kernfs node has been renamed */
+	if (strcmp(dentry->d_name.name, kn->name) != 0)
+		goto out_bad;
+
+	/* The kernfs node has been moved to a different namespace */
+	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+	    kernfs_info(dentry->d_sb)->ns != kn->ns)
+		goto out_bad;
+
+	mutex_unlock(&kernfs_mutex);
+	return 1;
+out_bad:
+	mutex_unlock(&kernfs_mutex);
+out_bad_unlocked:
+	return 0;
+}
+
+const struct dentry_operations kernfs_dops = {
+	.d_revalidate	= kernfs_dop_revalidate,
+};
+
 static struct dentry *kernfs_iop_lookup(struct inode *dir,
					struct dentry *dentry,
					unsigned int flags)




[PATCH v3 0/4] kernfs: proposed locking and concurrency improvement

2021-04-08 Thread Ian Kent
There have been a few instances of contention on the kernfs_mutex during
path walks, a case on very large IBM systems seen by myself, a report by
Brice Goglin and followed up by Fox Chen, and I've since seen a couple
of other reports by CoreOS users.

The common thread is a large number of kernfs path walks leading to
slowness of path walks due to kernfs_mutex contention.

The problem is that changes to the VFS over time have increased its
concurrency capabilities to an extent that kernfs's use of a mutex is
no longer appropriate. There's also a less common problem of walks for
non-existent paths causing contention when there are quite a few of
them.

This patch series is relatively straightforward.

All it does is add the ability to take advantage of VFS negative dentry
caching to avoid needless dentry alloc/free cycles for lookups of paths
that don't exist, and change the kernfs_mutex to a read/write semaphore.

The patch that tried to stay in VFS rcu-walk mode during path walks has
been dropped for two reasons. First, it doesn't actually give very much
improvement and, second, if there's a place where mistakes could go
unnoticed it would be in that path. This makes the patch series simpler
to review and reduces the likelihood of problems going unnoticed and
popping up later.

The patch to use a revision to identify if a directory has changed has
also been dropped. If the directory has changed the dentry revision
needs to be updated to avoid subsequent rb tree searches, and after
changing to use a read/write semaphore the update also requires a lock.
But the d_lock is the only lock available at this point, which might
itself be contended.

Changes since v2:
- actually fix the inode attribute update locking.
- drop the patch that tried to stay in rcu-walk mode.
- drop the use a revision to identify if a directory has changed patch.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.

---

Ian Kent (4):
  kernfs: move revalidate to be near lookup
  kernfs: use VFS negative dentry caching
  kernfs: switch kernfs to use an rwsem
  kernfs: use i_lock to protect concurrent inode updates


 fs/kernfs/dir.c |  240 +++
 fs/kernfs/file.c|4 -
 fs/kernfs/inode.c   |   18 ++-
 fs/kernfs/kernfs-internal.h |5 +
 fs/kernfs/mount.c   |   12 +-
 fs/kernfs/symlink.c |4 -
 include/linux/kernfs.h  |2 
 7 files changed, 155 insertions(+), 130 deletions(-)

--



Re: [RFC PATCH] autofs: find_autofs_mount overmounted parent support

2021-03-23 Thread Ian Kent
On Tue, 2021-03-09 at 13:43 +0300, Alexander Mikhalitsyn wrote:
> On Sat, 06 Mar 2021 17:13:32 +0800
> Ian Kent  wrote:
> 
> > On Fri, 2021-03-05 at 14:55 +0300, Alexander Mikhalitsyn wrote:
> > > On Fri, 05 Mar 2021 18:10:02 +0800
> > > Ian Kent  wrote:
> > > 
> > > > On Thu, 2021-03-04 at 13:11 +0300, Alexander Mikhalitsyn wrote:
> > > > > On Thu, 04 Mar 2021 14:54:11 +0800
> > > > > Ian Kent  wrote:
> > > > > 
> > > > > > On Wed, 2021-03-03 at 18:28 +0300, Alexander Mikhalitsyn
> > > > > > wrote:
> > > > > > > It was discovered that the find_autofs_mount() function
> > > > > > > in autofs does not support cases where the autofs mount's
> > > > > > > parent is overmounted. In this case the function will
> > > > > > > always return -ENOENT.
> > > > > > 
> > > > > > Ok, I get this shouldn't happen.
> > > > > > 
> > > > > > > Real-life reproducer is fairly simple.
> > > > > > > Consider the following mounts on root mntns:
> > > > > > > --
> > > > > > > 35 24 0:36 / /proc/sys/fs/binfmt_misc ... shared:16 -
> > > > > > > autofs
> > > > > > > systemd-
> > > > > > > 1 ...
> > > > > > > 654 35 0:57 / /proc/sys/fs/binfmt_misc ... shared:322 -
> > > > > > > binfmt_misc
> > > > > > > ...
> > > > > > > --
> > > > > > > and some process which calls
> > > > > > > ioctl(AUTOFS_DEV_IOCTL_OPENMOUNT)
> > > > > > > $ unshare -m -p --fork --mount-proc ./process-bin
> > > > > > > 
> > > > > > > Due to "mount-proc" /proc will be overmounted and
> > > > > > > ioctl() will fail with -ENOENT
> > > > > > 
> > > > > > I think I need a better explanation ...
> > > > > 
> > > > > Thank you for the quick reply, Ian.
> > > > > I'm sorry if my patch description was not sufficiently clear
> > > > > and detailed.
> > > > > 
> > > > > That problem is connected with the CRIU (Checkpoint-Restore
> > > > > in Userspace) project.
> > > > > In CRIU we have support for autofs mount C/R. To achieve that
> > > > > we need to use ioctls from /dev/autofs to get data about
> > > > > mounts, restore a mount as catatonic (if needed), change the
> > > > > pipe fd and so on. But the problem is that during a CRIU dump
> > > > > we may meet a situation where the VFS subtree in which the
> > > > > autofs mount is present has been overmounted as a whole.
> > > > > 
> > > > > The simplest example is /proc/sys/fs/binfmt_misc. This mount
> > > > > is present on most GNU/Linux distributions by default. For
> > > > > instance on my Fedora 33:
> > > > 
> > > > Yes, I don't know why systemd uses this direct mount, there
> > > > must
> > > > have been a reason for it.
> > > > 
> > > > > trigger automount of binfmt_misc
> > > > > $ ls /proc/sys/fs/binfmt_misc
> > > > > 
> > > > > $ cat /proc/1/mountinfo | grep binfmt
> > > > > 35 24 0:36 / /proc/sys/fs/binfmt_misc rw,relatime shared:16 -
> > > > > autofs
> > > > > systemd-1 rw,...,direct,pipe_ino=223
> > > > > 632 35 0:56 / /proc/sys/fs/binfmt_misc rw,...,relatime
> > > > > shared:315
> > > > > -
> > > > > binfmt_misc binfmt_misc rw
> > > > 
> > > > Yes, I think this looks normal.
> > > > 
> > > > > $ sudo unshare -m -p --fork --mount-proc sh
> > > > > # cat /proc/self/mountinfo | grep "/proc"
> > > > > 828 809 0:23 / /proc rw,nosuid,nodev,noexec,relatime - proc
> > > > > proc
> > > > > rw
> > > > > 829 828 0:36 / /proc/sys/fs/binfmt_misc rw,relatime - autofs
> > > > > systemd-
> > > > > 1 r

Re: [RFC PATCH] autofs: find_autofs_mount overmounted parent support

2021-03-06 Thread Ian Kent
On Fri, 2021-03-05 at 14:55 +0300, Alexander Mikhalitsyn wrote:
> On Fri, 05 Mar 2021 18:10:02 +0800
> Ian Kent  wrote:
> 
> > On Thu, 2021-03-04 at 13:11 +0300, Alexander Mikhalitsyn wrote:
> > > On Thu, 04 Mar 2021 14:54:11 +0800
> > > Ian Kent  wrote:
> > > 
> > > > On Wed, 2021-03-03 at 18:28 +0300, Alexander Mikhalitsyn wrote:
> > > > > It was discovered that the find_autofs_mount() function
> > > > > in autofs does not support cases where the autofs mount's
> > > > > parent is overmounted. In this case the function will
> > > > > always return -ENOENT.
> > > > 
> > > > Ok, I get this shouldn't happen.
> > > > 
> > > > > Real-life reproducer is fairly simple.
> > > > > Consider the following mounts on root mntns:
> > > > > --
> > > > > 35 24 0:36 / /proc/sys/fs/binfmt_misc ... shared:16 - autofs
> > > > > systemd-
> > > > > 1 ...
> > > > > 654 35 0:57 / /proc/sys/fs/binfmt_misc ... shared:322 -
> > > > > binfmt_misc
> > > > > ...
> > > > > --
> > > > > and some process which calls
> > > > > ioctl(AUTOFS_DEV_IOCTL_OPENMOUNT)
> > > > > $ unshare -m -p --fork --mount-proc ./process-bin
> > > > > 
> > > > > Due to "mount-proc" /proc will be overmounted and
> > > > > ioctl() will fail with -ENOENT
> > > > 
> > > > I think I need a better explanation ...
> > > 
> > > Thank you for the quick reply, Ian.
> > > I'm sorry if my patch description was not sufficiently clear and
> > > detailed.
> > > 
> > > That problem is connected with the CRIU (Checkpoint-Restore in
> > > Userspace) project.
> > > In CRIU we have support for autofs mount C/R. To achieve that we
> > > need to use ioctls from /dev/autofs to get data about mounts,
> > > restore a mount as catatonic (if needed), change the pipe fd and
> > > so on. But the problem is that during a CRIU dump we may meet a
> > > situation where the VFS subtree in which the autofs mount is
> > > present has been overmounted as a whole.
> > > 
> > > The simplest example is /proc/sys/fs/binfmt_misc. This mount is
> > > present on most GNU/Linux distributions by default. For instance
> > > on my Fedora 33:
> > 
> > Yes, I don't know why systemd uses this direct mount, there must
> > have been a reason for it.
> > 
> > > trigger automount of binfmt_misc
> > > $ ls /proc/sys/fs/binfmt_misc
> > > 
> > > $ cat /proc/1/mountinfo | grep binfmt
> > > 35 24 0:36 / /proc/sys/fs/binfmt_misc rw,relatime shared:16 -
> > > autofs
> > > systemd-1 rw,...,direct,pipe_ino=223
> > > 632 35 0:56 / /proc/sys/fs/binfmt_misc rw,...,relatime shared:315
> > > -
> > > binfmt_misc binfmt_misc rw
> > 
> > Yes, I think this looks normal.
> > 
> > > $ sudo unshare -m -p --fork --mount-proc sh
> > > # cat /proc/self/mountinfo | grep "/proc"
> > > 828 809 0:23 / /proc rw,nosuid,nodev,noexec,relatime - proc proc
> > > rw
> > > 829 828 0:36 / /proc/sys/fs/binfmt_misc rw,relatime - autofs
> > > systemd-
> > > 1 rw,...,direct,pipe_ino=223
> > > 943 829 0:56 / /proc/sys/fs/binfmt_misc rw,...,relatime -
> > > binfmt_misc
> > > binfmt_misc rw
> > > 949 828 0:57 / /proc rw...,relatime - proc proc rw
> > 
> > Isn't this screwed up, /proc is on top of the binfmt_misc mount ...
> > 
> > Is this what's seen from the root namespace?
> 
> No-no, after issuing
> $ sudo unshare -m -p --fork --mount-proc sh
> 
> we enter to the pid+mount namespace and:
> 
> # cat /proc/self/mountinfo | grep "/proc"
> 
> So, it's picture from inside namespaces.

Ok, so potentially some of those have been propagated from the
original mount namespace.

It seems to me the sensible thing would be for those mounts not to
propagate when a new proc has been requested. It doesn't make sense
to me to carry around mounts that are inaccessible because of
something requested by the mount namespace creator.

But that's nothing new and isn't likely to change any time soon.

> 
> > > As we can see now autofs mount /proc/sys/fs/binfmt_misc is
> > > inaccessible.
> > > If we do something like:
> > > 
> > > struct autofs_dev_ioctl *param;
> > > param = malloc(...);
> >

Re: [RFC PATCH] autofs: find_autofs_mount overmounted parent support

2021-03-05 Thread Ian Kent
On Thu, 2021-03-04 at 13:11 +0300, Alexander Mikhalitsyn wrote:
> On Thu, 04 Mar 2021 14:54:11 +0800
> Ian Kent  wrote:
> 
> > On Wed, 2021-03-03 at 18:28 +0300, Alexander Mikhalitsyn wrote:
> > > It was discovered that the find_autofs_mount() function
> > > in autofs does not support cases where the autofs mount's
> > > parent is overmounted. In this case the function will
> > > always return -ENOENT.
> > 
> > Ok, I get this shouldn't happen.
> > 
> > > Real-life reproducer is fairly simple.
> > > Consider the following mounts on root mntns:
> > > --
> > > 35 24 0:36 / /proc/sys/fs/binfmt_misc ... shared:16 - autofs
> > > systemd-
> > > 1 ...
> > > 654 35 0:57 / /proc/sys/fs/binfmt_misc ... shared:322 -
> > > binfmt_misc
> > > ...
> > > --
> > > and some process which calls ioctl(AUTOFS_DEV_IOCTL_OPENMOUNT)
> > > $ unshare -m -p --fork --mount-proc ./process-bin
> > > 
> > > Due to "mount-proc" /proc will be overmounted and
> > > ioctl() will fail with -ENOENT
> > 
> > I think I need a better explanation ...
> 
> Thank you for the quick reply, Ian.
> I'm sorry if my patch description was not sufficiently clear and
> detailed.
> 
> That problem is connected with the CRIU (Checkpoint-Restore in
> Userspace) project.
> In CRIU we have support for autofs mount C/R. To achieve that we
> need to use ioctls from /dev/autofs to get data about mounts,
> restore a mount as catatonic (if needed), change the pipe fd and so
> on. But the problem is that during a CRIU dump we may meet a
> situation where the VFS subtree in which the autofs mount is
> present has been overmounted as a whole.
> 
> Simplest example is /proc/sys/fs/binfmt_misc. This mount is present
> on most
> GNU/Linux distributions by default. For instance on my Fedora 33:

Yes, I don't know why systemd uses this direct mount, there must
have been a reason for it.

> 
> trigger automount of binfmt_misc
> $ ls /proc/sys/fs/binfmt_misc
> 
> $ cat /proc/1/mountinfo | grep binfmt
> 35 24 0:36 / /proc/sys/fs/binfmt_misc rw,relatime shared:16 - autofs
> systemd-1 rw,...,direct,pipe_ino=223
> 632 35 0:56 / /proc/sys/fs/binfmt_misc rw,...,relatime shared:315 -
> binfmt_misc binfmt_misc rw

Yes, I think this looks normal.

> 
> $ sudo unshare -m -p --fork --mount-proc sh
> # cat /proc/self/mountinfo | grep "/proc"
> 828 809 0:23 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 829 828 0:36 / /proc/sys/fs/binfmt_misc rw,relatime - autofs systemd-
> 1 rw,...,direct,pipe_ino=223
> 943 829 0:56 / /proc/sys/fs/binfmt_misc rw,...,relatime - binfmt_misc
> binfmt_misc rw
> 949 828 0:57 / /proc rw...,relatime - proc proc rw

Isn't this screwed up, /proc is on top of the binfmt_misc mount ...

Is this what's seen from the root namespace?

> 
> As we can see now autofs mount /proc/sys/fs/binfmt_misc is
> inaccessible.
> If we do something like:
> 
> struct autofs_dev_ioctl *param;
> param = malloc(...);
> devfd = open("/dev/autofs", O_RDONLY);
> init_autofs_dev_ioctl(param);
> param->size = size;
> strcpy(param->path, "/proc/sys/fs/binfmt_misc");
> param->openmount.devid = 36;
> err = ioctl(devfd, AUTOFS_DEV_IOCTL_OPENMOUNT, param)
> 
> now we get err = -ENOENT.

Maybe that should be EINVAL, not sure about cases though.

> 
> > What's being said here?
> > 
> > For a start you're talking about direct mounts, I'm pretty sure this
> > use case can't occur with indirect mounts in the sense that the
> > indirect mount base should/must never be over mounted and IIRC that
> > base can't be /proc (but maybe that's just mounts inside proc ...),
> > can't remember now but from a common sense POV an indirect mount
> > won't/can't be on /proc.
> > 
> > And why is this ioctl being called?
> 
> We call this ioctl during criu dump stage to open fd from autofs
> mount dentry. This fd is used later to call
> ioctl(AUTOFS_IOC_CATATONIC)
> (we do that on criu dump if we see that control process of autofs
> mount
> is dead or pipe is dead).

Right, so your usage "is" the way it's intended. ;)

> 
> > If the mount is over mounted should that prevent expiration of the
> > over mounted /proc anyway, so maybe the return is correct ... or
> > not ...
> 
> I agree that the case of an overmounted subtree with an autofs mount
> is a weird one.
> But it may be easily created by user and we in CRIU try to handle
> that.

I'm not yet ready to make a call on how I think this should
be done.

Since you seem to be clear on what this should be used for I'll
need to look more closely at the patch.

But, at fir

Re: [RFC PATCH] autofs: find_autofs_mount overmounted parent support

2021-03-03 Thread Ian Kent
On Wed, 2021-03-03 at 18:28 +0300, Alexander Mikhalitsyn wrote:
> It was discovered that the find_autofs_mount() function
> in autofs does not support cases where the autofs mount's
> parent is overmounted. In this case the function will
> always return -ENOENT.

Ok, I get this shouldn't happen.

> 
> Real-life reproducer is fairly simple.
> Consider the following mounts on root mntns:
> --
> 35 24 0:36 / /proc/sys/fs/binfmt_misc ... shared:16 - autofs systemd-
> 1 ...
> 654 35 0:57 / /proc/sys/fs/binfmt_misc ... shared:322 - binfmt_misc
> ...
> --
> and some process which calls ioctl(AUTOFS_DEV_IOCTL_OPENMOUNT)
> $ unshare -m -p --fork --mount-proc ./process-bin
> 
> Due to "mount-proc" /proc will be overmounted and
> ioctl() will fail with -ENOENT

I think I need a better explanation ...

What's being said here?

For a start you're talking about direct mounts, I'm pretty sure this
use case can't occur with indirect mounts in the sense that the
indirect mount base should/must never be over mounted and IIRC that
base can't be /proc (but maybe that's just mounts inside proc ...),
can't remember now but from a common sense POV an indirect mount
won't/can't be on /proc.

And why is this ioctl being called?

If the mount is over mounted should that prevent expiration of the
over mounted /proc anyway, so maybe the return is correct ... or
not ...

I get that the mount namespaces should be independent and intuitively
this is a bug, but what is the actual use and expected result?

But anyway, aren't you saying that the VFS path walk isn't handling
mount namespaces properly or are you saying that a process outside
this new mount namespace becomes broken because of it?

Either way the solution looks more complicated than I'd expect so
some explanation along these lines would be good.

Ian
> 
> Cc: Matthew Wilcox 
> Cc: Al Viro 
> Cc: Pavel Tikhomirov 
> Cc: Kirill Tkhai 
> Cc: aut...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Alexander Mikhalitsyn <
> alexander.mikhalit...@virtuozzo.com>
> ---
>  fs/autofs/dev-ioctl.c | 127 +---
> --
>  fs/namespace.c|  44 +++
>  include/linux/mount.h |   5 ++
>  3 files changed, 162 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c
> index 5bf781ea6d67..55edd3eba8ce 100644
> --- a/fs/autofs/dev-ioctl.c
> +++ b/fs/autofs/dev-ioctl.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "autofs_i.h"
>  
> @@ -179,32 +180,130 @@ static int autofs_dev_ioctl_protosubver(struct file *fp,
>   return 0;
>  }
>  
> +struct filter_autofs_data {
> + char *pathbuf;
> + const char *fpathname;
> + int (*test)(const struct path *path, void *data);
> + void *data;
> +};
> +
> +static int filter_autofs(const struct path *path, void *p)
> +{
> + struct filter_autofs_data *data = p;
> + char *name;
> + int err;
> +
> + if (path->mnt->mnt_sb->s_magic != AUTOFS_SUPER_MAGIC)
> + return 0;
> +
> + name = d_path(path, data->pathbuf, PATH_MAX);
> + if (IS_ERR(name)) {
> + err = PTR_ERR(name);
> + pr_err("d_path failed, errno %d\n", err);
> + return 0;
> + }
> +
> + if (strncmp(data->fpathname, name, PATH_MAX))
> + return 0;
> +
> + if (!data->test(path, data->data))
> + return 0;
> +
> + return 1;
> +}
> +
>  /* Find the topmost mount satisfying test() */
>  static int find_autofs_mount(const char *pathname,
>struct path *res,
>int test(const struct path *path, void *data),
>void *data)
>  {
> - struct path path;
> + struct filter_autofs_data mdata = {
> + .pathbuf = NULL,
> + .test = test,
> + .data = data,
> + };
> + struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns;
> + struct path path = {};
> + char *fpathbuf = NULL;
>   int err;
>  
> + /*
> +  * In most cases user will provide full path to autofs mount point
> +  * as it is in /proc/X/mountinfo. But if not, then we need to
> +  * open provided relative path and calculate full path.
> +  * It will not work in case when parent mount of autofs mount
> +  * is overmounted:
> +  * cd /root
> +  * ./autofs_mount /root/autofs_yard/mnt
> +  * mount -t tmpfs tmpfs /root/autofs_yard/mnt
> +  * mount -t tmpfs tmpfs /root/autofs_yard
> +  * ./call_ioctl /root/autofs_yard/mnt <- all fine here because we
> +  *   have full path and don't
> +  *   need to call kern_path()
> +  *   and d_path()
> +  * ./call_ioctl autofs_yard/mnt <- will fail because kern_path()
> +  *   can't lookup /root/autofs_yard/mnt
> +  *

[ANNOUNCE] autofs 5.1.7 release

2021-01-26 Thread Ian Kent
Hi all,

It's time for a release, autofs-5.1.7.

As with autofs-5.1.6 work to resolve difficulties using very large
direct mount maps has continued but there have been some
difficulties.

Trying to get back to the situation that existed before symlinking
of the mount table can't be done in libmount because of the way in
which some packages (in particular systemd) use libmount for reading
the mount table.

The approach of using an autofs pseudo mount option "ignore" has been
done in glibc and an autofs configuration option to enable the use of
the option has been added, so it can be used by enabling the setting
in the autofs configuration.

But this approach can't be used for libmount mount table accesses
because systemd needs to see the entire mount table at shutdown so a
user controlled setting isn't the right way to do it.

However, autofs expire is now completely independent of the system
mount table and expire operations for very large direct mount maps are
independent of the size of the direct map itself. This is now dependent
only on the number of active mounts at expire.

The problem of mount activity affecting other system applications still
exists, basically because kernel mount table access is by whole file
and applications that monitor changes to the file need to re-read the
entire file on every change notification to process changes.

Consequently starting autofs with a very large direct mount map will
still cause significant resource usage for a number of processes such as
systemd, udisksd and others. Also busy sites with a lot of automount
activity can cause similar problems but there has to be quite a lot of
activity for it to be a problem.

There is a fair bit of improvement to the autofs sss interface module.

This was done because of improvements that were made in sss to improve
the error communication between autofs and sss and it quickly became
clear that there was potential to significantly improve the autofs
module.

For a start autofs wasn't fully utilizing the existing returns which
was fixed.

Also the ability to identify that a backend host is not available was added.
And this meant that the autofs module needed quite a bit of change to
take advantage of this new sss functionality.

One consequence of this is that there can be somewhat longer delays if
a backend host is down, including for the interactive key lookup case.

But waiting on these accesses was considered acceptable because the
sss caching is very effective so that this case should be encountered
only very rarely.

Note that if sss does not include these improvements autofs should
continue to behave as it previously did because the ability for autofs
to detect the presence of the enhancements is part of the sss change.

There are also quite a number of bug fixes and other minor
improvements.

autofs
==

The package can be found at:
https://www.kernel.org/pub/linux/daemons/autofs/v5/

It is autofs-5.1.7.tar.[gz|xz]

No source rpm is there as it can be produced by using:

rpmbuild -ts autofs-5.1.7.tar.gz

and the binary rpm by using:

rpmbuild -tb autofs-5.1.7.tar.gz

Here are the entries from the CHANGELOG which outline the updates:

25/01/2021 autofs-5.1.7
- make bind mounts propagation slave by default.
- update ldap READMEs and schema definitions.
- fix program map multi-mount lookup after mount fail.
- fix browse dir not re-created on symlink expire.
- fix a regression with map instance lookup.
- correct fsf address.
- samples: fix Makefile targets' directory dependencies
- remove intr hosts map mount option.
- fix trailing dollar sun entry expansion.
- initialize struct addrinfo for getaddrinfo() calls.
- fix quoted string length calc in expandsunent().
- fix autofs mount options construction.
- mount_nfs.c fix local rdma share not mounting.
- configure.in: Remove unneeded second call to PKG_PROG_PKG_CONFIG.
- configure.in: Do not append parentheses to PKG_PROG_PKG_CONFIG.
- Use PKG_CHECK_MODULES to detect the libxml2 library.
- fix ldap sasl reconnect problem.
- samples/ldap.schema fix.
- fix configure force shutdown check.
- fix crash in sun_mount().
- fix lookup_nss_read_master() nsswicth check return.
- fix typo in open_sss_lib().
- fix sss_master_map_wait timing.
- add sss ECONREFUSED return handling.
- use mapname in sss context for setautomntent().
- add support for new sss autofs proto version call.
- fix retries check in setautomntent_wait().
- refactor sss setautomntent().
- improve sss setautomntent() error handling.
- refactor sss getautomntent().
- improve sss getautomntent() error handling.
- sss introduce calculate_retry_count() function.
- move readall into struct master.
- sss introduce a flag to indicate map being read.
- update sss timeout documentation.
- refactor sss getautomntbyname().
- improve sss getautomntbyname() error handling.
- use a valid timeout in lookup_prune_one_cache().
- dont prune offset map entries.
- simplify sss source stale check.
- include linux/nfs.h directly in 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-21 Thread Ian Kent
On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote:
> On Sat, Dec 19, 2020 at 8:53 AM Ian Kent  wrote:
> > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent 
> > > wrote:
> > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent 
> > > > > wrote:
> > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > > What could be done is to make the kernfs node
> > > > > > > > > attr_mutex
> > > > > > > > > a pointer and dynamically allocate it but even that
> > > > > > > > > is
> > > > > > > > > too
> > > > > > > > > costly a size addition to the kernfs node structure
> > > > > > > > > as
> > > > > > > > > Tejun has said.
> > > > > > > > 
> > > > > > > > I guess the question to ask is, is there really a need
> > > > > > > > to
> > > > > > > > call kernfs_refresh_inode() from functions that are
> > > > > > > > usually
> > > > > > > > reading/checking functions.
> > > > > > > > 
> > > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > > write/set
> > > > > > > > operations in (if there's any) places where things like
> > > > > > > > setattr_copy() is not already called?
> > > > > > > > 
> > > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > > > 
> > > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > > sysfs
> > > > > > > namespace is
> > > > > > > implemented, so I don't think there's an easy way around
> > > > > > > that.
> > > > > > > The
> > > > > > > only
> > > > > > > thing I
> > > > > > > can think of is embedding the lock into attrs and doing
> > > > > > > xchg
> > > > > > > dance
> > > > > > > when
> > > > > > > attaching it.
> > > > > > 
> > > > > > Sounds like you're saying it would be ok to add a lock to the
> > > > > > attrs structure, am I correct?
> > > > > > 
> > > > > > Assuming it is then, to keep things simple, use two locks.
> > > > > > 
> > > > > > One global lock for the allocation and an attrs lock for
> > > > > > all
> > > > > > the
> > > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > > update.
> > > > > > 
> > > > > > The critical section for the global lock could be reduced
> > > > > > and
> > > > > > it
> > > > > > changed to a spin lock.
> > > > > > 
> > > > > > In __kernfs_iattrs() we would have something like:
> > > > > > 
> > > > > > take the allocation lock
> > > > > > do the allocated checks
> > > > > >   assign if existing attrs
> > > > > >   release the allocation lock
> > > > > >   return existing if found
> > > > > > otherwise
> > > > > >   release the allocation lock
> > > > > > 
> > > > > > allocate and initialize attrs
> > > > > > 
> > > > > > take the allocation lock
> > > > > > check if someone beat us to it
> > > > > >   free and grab existing attrs
> > > > > > otherwise
> > > > > >   assign the new attrs
> > > > > > release the allocation lock
> > > > > > return attrs
> > > > > > 
> > > > > > Add a spinlock to the attrs struct and use it everywhere
> > > > > > for
> > > > > > field updates.
> > > > > > 
> > > > > > Am I on the right track or can you see problems with this

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-19 Thread Ian Kent
On Sun, 2020-12-20 at 07:52 +0800, Ian Kent wrote:
> On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> > Hello,
> > 
> > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > > And looking further I see there's a race that kernfs can't do
> > > anything
> > > about between kernfs_refresh_inode() and
> > > fs/inode.c:update_times().
> > 
> > Do kernfs files end up calling into that path tho? Doesn't look
> > like
> > it to
> > me but if so yeah we'd need to override the update_time for kernfs.

You are correct, update_time() will only be called during symlink
following and only to update atime.

So this isn't sufficient to update the inode attributes to reflect
changes make by things like kernfs_setattr() or when the directory
link count changes ...

Sigh!



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-19 Thread Ian Kent
On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> Hello,
> 
> On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > And looking further I see there's a race that kernfs can't do
> > anything
> > about between kernfs_refresh_inode() and fs/inode.c:update_times().
> 
> Do kernfs files end up calling into that path tho? Doesn't look like
> it to
> me but if so yeah we'd need to override the update_time for kernfs.

Sorry, the below was very hastily done and not what I would actually
propose.

The main point of it was the question

+   /* Which kernfs node attributes should be updated from
+* time?
+*/

but looking at it again this morning I think the node iattr fields
that might need to be updated would be atime, ctime and mtime only,
maybe not ctime ... not sure.

What do you think?

Also, if kn->attr == NULL it should fall back to what the VFS
currently does.

The update_times() function is one of the few places where the
VFS updates the inode times.

The idea is that the reason kernfs needs to overwrite the inode
attributes is to reset what the VFS might have done but if kernfs
has this inode operation they won't need to be overwritten since
they won't have changed.

There may be other places where the attributes (or an attribute)
are set by the VFS, I haven't finished checking that yet so my
suggestion might not be entirely valid.

What I need to do is work out what kernfs node attributes, if any,
should be updated by .update_times(). If I go by what
kernfs_refresh_inode() does now then that would be none, but shouldn't
atime at least be updated in the node iattr?

> > +static int kernfs_iop_update_time(struct inode *inode, struct timespec64 *time, int flags)
> >  {
> > -   struct inode *inode = d_inode(path->dentry);
> > struct kernfs_node *kn = inode->i_private;
> > +   struct kernfs_iattrs *attrs;
> >  
> > mutex_lock(&kernfs_mutex);
> > +   attrs = kernfs_iattrs(kn);
> > +   if (!attrs) {
> > +   mutex_unlock(&kernfs_mutex);
> > +   return -ENOMEM;
> > +   }
> > +
> > +   /* Which kernfs node attributes should be updated from
> > +* time?
> > +*/
> > +
> > kernfs_refresh_inode(kn, inode);
> > mutex_unlock(&kernfs_mutex);
> 
> I don't see how this would reflect the changes from kernfs_setattr()
> into
> the attached inode. This would actually make the attr updates
> obviously racy
> - the userland visible attrs would be stale until the inode gets
> reclaimed
> and then when it gets reinstantiated it'd show the latest
> information.

Right, I will have to think about that, but as I say above this
isn't really what I would propose.

If .update_times() sticks strictly to what kernfs_refresh_inode()
does now then it would set the inode attributes from the node iattr
only.

> 
> That said, if you wanna take the direction where attr updates are
> reflected
> to the associated inode when the change occurs, which makes sense,
> the right
> thing to do would be making kernfs_setattr() update the associated
> inode if
> existent.

Mmm, that's a good point but it looks like the inode isn't available
there.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Ian Kent
On Fri, 2020-12-18 at 09:59 -0500, Tejun Heo wrote:
> Hello,
> 
> On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote:
> > Sounds like you're saying it would be ok to add a lock to the
> > attrs structure, am I correct?
> 
> Yeah, adding a lock to attrs is a lot less of a problem and it looks
> like
> it's gonna have to be either that or hashed locks, which might
> actually make
> sense if we're worried about the size of attrs (I don't think we need
> to).

Maybe that isn't needed.

And looking further I see there's a race that kernfs can't do anything
about between kernfs_refresh_inode() and fs/inode.c:update_times().

kernfs could avoid fighting with the VFS to keep the attributes set to
those of the kernfs node by using the inode operation .update_times()
and, if it makes sense, the kernfs node attributes that it wants to be
updated on file system activity could also be updated here.

I can't find any reason why this shouldn't be done but kernfs is
fairly widely used in other kernel subsystems so what does everyone
think of this patch, updated to set kernfs node attributes that
should be updated of course, see comment in the patch?

kernfs: fix attributes update race

From: Ian Kent 

kernfs uses kernfs_refresh_inode() (called from kernfs_iop_getattr()
and kernfs_iop_permission()) to keep the inode attributes set to the
attributes of the kernfs node.

But there is no way for kernfs to prevent racing with the function
fs/inode.c:update_times().

The better choice is to use the inode operation .update_times() and
just let the VFS use the generic functions for .getattr() and
.permission().

Signed-off-by: Ian Kent 
---
 fs/kernfs/inode.c   |   37 ++---
 fs/kernfs/kernfs-internal.h |4 +---
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fc2469a20fed..51780329590c 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -24,9 +24,8 @@ static const struct address_space_operations kernfs_aops = {
 };
 
 static const struct inode_operations kernfs_iops = {
-   .permission = kernfs_iop_permission,
+   .update_time= kernfs_update_time,
.setattr= kernfs_iop_setattr,
-   .getattr= kernfs_iop_getattr,
.listxattr  = kernfs_iop_listxattr,
 };
 
@@ -183,18 +182,26 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
set_nlink(inode, kn->dir.subdirs + 2);
 }
 
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-  u32 request_mask, unsigned int query_flags)
+static int kernfs_iop_update_time(struct inode *inode, struct timespec64 *time, int flags)
 {
-   struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
+   struct kernfs_iattrs *attrs;
 
mutex_lock(&kernfs_mutex);
+   attrs = kernfs_iattrs(kn);
+   if (!attrs) {
+   mutex_unlock(&kernfs_mutex);
+   return -ENOMEM;
+   }
+
+   /* Which kernfs node attributes should be updated from
+* time?
+*/
+
kernfs_refresh_inode(kn, inode);
mutex_unlock(&kernfs_mutex);
 
-   generic_fillattr(inode, stat);
-   return 0;
+   return 0;
 }
 
 static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
@@ -272,22 +279,6 @@ void kernfs_evict_inode(struct inode *inode)
kernfs_put(kn);
 }
 
-int kernfs_iop_permission(struct inode *inode, int mask)
-{
-   struct kernfs_node *kn;
-
-   if (mask & MAY_NOT_BLOCK)
-   return -ECHILD;
-
-   kn = inode->i_private;
-
-   mutex_lock(&kernfs_mutex);
-   kernfs_refresh_inode(kn, inode);
-   mutex_unlock(&kernfs_mutex);
-
-   return generic_permission(inode, mask);
-}
-
 int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
 void *value, size_t size)
 {
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7ee97ef59184..98d08b928f93 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -89,10 +89,8 @@ extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
  */
 extern const struct xattr_handler *kernfs_xattr_handlers[];
 void kernfs_evict_inode(struct inode *inode);
-int kernfs_iop_permission(struct inode *inode, int mask);
+int kernfs_update_time(struct inode *inode, struct timespec64 *time, int flags);
 int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr);
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-  u32 request_mask, unsigned int query_flags);
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
 int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
 



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Ian Kent
On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> On Fri, Dec 18, 2020 at 7:21 PM Ian Kent  wrote:
> > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent 
> > > wrote:
> > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > > a pointer and dynamically allocate it but even that is
> > > > > > > too
> > > > > > > costly a size addition to the kernfs node structure as
> > > > > > > Tejun has said.
> > > > > > 
> > > > > > I guess the question to ask is, is there really a need to
> > > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > > reading/checking functions.
> > > > > > 
> > > > > > Would it be sufficient to refresh the inode in the
> > > > > > write/set
> > > > > > operations in (if there's any) places where things like
> > > > > > setattr_copy() is not already called?
> > > > > > 
> > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > 
> > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > sysfs
> > > > > namespace is
> > > > > implemented, so I don't think there's an easy way around that.
> > > > > The
> > > > > only
> > > > > thing I
> > > > > can think of is embedding the lock into attrs and doing xchg
> > > > > dance
> > > > > when
> > > > > attaching it.
> > > > 
> > > > Sounds like you're saying it would be ok to add a lock to the
> > > > attrs structure, am I correct?
> > > > 
> > > > Assuming it is then, to keep things simple, use two locks.
> > > > 
> > > > One global lock for the allocation and an attrs lock for all
> > > > the
> > > > attrs field updates including the kernfs_refresh_inode()
> > > > update.
> > > > 
> > > > The critical section for the global lock could be reduced and
> > > > it
> > > > changed to a spin lock.
> > > > 
> > > > In __kernfs_iattrs() we would have something like:
> > > > 
> > > > take the allocation lock
> > > > do the allocated checks
> > > >   assign if existing attrs
> > > >   release the allocation lock
> > > >   return existing if found
> > > > otherwise
> > > >   release the allocation lock
> > > > 
> > > > allocate and initialize attrs
> > > > 
> > > > take the allocation lock
> > > > check if someone beat us to it
> > > >   free and grab existing attrs
> > > > otherwise
> > > >   assign the new attrs
> > > > release the allocation lock
> > > > return attrs
> > > > 
> > > > Add a spinlock to the attrs struct and use it everywhere for
> > > > field updates.
> > > > 
> > > > Am I on the right track or can you see problems with this?
> > > > 
> > > > Ian
> > > > 
> > > 
> > > umm, we update the inode in kernfs_refresh_inode, right??  So I
> > > guess
> > > the problem is how can we protect the inode when
> > > kernfs_refresh_inode
> > > is called, not the attrs??
> > 
> > But the attrs (which is what's copied from) were protected by the
> > mutex lock (IIUC) so dealing with the inode attributes implies
> > dealing with the kernfs node attrs too.
> > 
> > For example in kernfs_iop_setattr() the call to setattr_copy()
> > copies
> > the node attrs to the inode under the same mutex lock. So, if a
> > read
> > lock is used the copy in kernfs_refresh_inode() is no longer
> > protected,
> > it needs to be protected in a different way.
> > 
> 
> Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for
> .setattr but
>  no lock for .getattr (misdocumented?? sometimes they have as you've
> found out)?
> What does it protect against?? Because .permission does a similar
> thing
> here -- updating inode attributes, the goal is to provide the same
> protection lev

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Ian Kent
On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> On Fri, Dec 18, 2020 at 3:36 PM Ian Kent  wrote:
> > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > What could be done is to make the kernfs node attr_mutex
> > > > > a pointer and dynamically allocate it but even that is too
> > > > > costly a size addition to the kernfs node structure as
> > > > > Tejun has said.
> > > > 
> > > > I guess the question to ask is, is there really a need to
> > > > call kernfs_refresh_inode() from functions that are usually
> > > > reading/checking functions.
> > > > 
> > > > Would it be sufficient to refresh the inode in the write/set
> > > > operations in (if there's any) places where things like
> > > > setattr_copy() is not already called?
> > > > 
> > > > Perhaps GKH or Tejun could comment on this?
> > > 
> > > My memory is a bit hazy but invalidations on reads is how sysfs
> > > namespace is
> > > implemented, so I don't think there's an easy way around that. The
> > > only
> > > thing I
> > > can think of is embedding the lock into attrs and doing xchg
> > > dance
> > > when
> > > attaching it.
> > 
> > Sounds like you're saying it would be ok to add a lock to the
> > attrs structure, am I correct?
> > 
> > Assuming it is then, to keep things simple, use two locks.
> > 
> > One global lock for the allocation and an attrs lock for all the
> > attrs field updates including the kernfs_refresh_inode() update.
> > 
> > The critical section for the global lock could be reduced and it
> > changed to a spin lock.
> > 
> > In __kernfs_iattrs() we would have something like:
> > 
> > take the allocation lock
> > do the allocated checks
> >   assign if existing attrs
> >   release the allocation lock
> >   return existing if found
> > otherwise
> >   release the allocation lock
> > 
> > allocate and initialize attrs
> > 
> > take the allocation lock
> > check if someone beat us to it
> >   free and grab existing attrs
> > otherwise
> >   assign the new attrs
> > release the allocation lock
> > return attrs
> > 
> > Add a spinlock to the attrs struct and use it everywhere for
> > field updates.
> > 
> > Am I on the right track or can you see problems with this?
> > 
> > Ian
> > 
> 
> umm, we update the inode in kernfs_refresh_inode, right??  So I guess
> the problem is how can we protect the inode when kernfs_refresh_inode
> is called, not the attrs??

But the attrs (which is what's copied from) were protected by the
mutex lock (IIUC) so dealing with the inode attributes implies
dealing with the kernfs node attrs too.

For example in kernfs_iop_setattr() the call to setattr_copy() copies
the node attrs to the inode under the same mutex lock. So, if a read
lock is used the copy in kernfs_refresh_inode() is no longer protected,
it needs to be protected in a different way.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> Hello,
> 
> On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > What could be done is to make the kernfs node attr_mutex
> > > a pointer and dynamically allocate it but even that is too
> > > costly a size addition to the kernfs node structure as
> > > Tejun has said.
> > 
> > I guess the question to ask is, is there really a need to
> > call kernfs_refresh_inode() from functions that are usually
> > reading/checking functions.
> > 
> > Would it be sufficient to refresh the inode in the write/set
> > operations in (if there's any) places where things like
> > setattr_copy() is not already called?
> > 
> > Perhaps GKH or Tejun could comment on this?
> 
> My memory is a bit hazy but invalidations on reads is how sysfs
> namespace is
> implemented, so I don't think there's an easy around that. The only
> thing I
> can think of is embedding the lock into attrs and doing xchg dance
> when
> attaching it.

Sounds like you're saying it would be ok to add a lock to the
attrs structure, am I correct?

Assuming it is then, to keep things simple, use two locks.

One global lock for the allocation and an attrs lock for all the
attrs field updates including the kernfs_refresh_inode() update.

The critical section for the global lock could be reduced and it
changed to a spin lock.

In __kernfs_iattrs() we would have something like:

take the allocation lock
do the allocated checks
  assign if existing attrs
  release the allocation lock
  return existing if found
otherwise
  release the allocation lock

allocate and initialize attrs

take the allocation lock
check if someone beat us to it
  free and grab existing attrs
otherwise
  assign the new attrs
release the allocation lock
return attrs

Add a spinlock to the attrs struct and use it everywhere for
field updates.

Am I on the right track or can you see problems with this?

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 19:09 +0800, Ian Kent wrote:
> On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote:
> > On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> > > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent 
> > > wrote:
> > > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent 
> > > > > > wrote:
> > > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <
> > > > > > > > ra...@themaw.net
> > > > > > > > wrote:
> > > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > > > For the patches, there is a mutex_lock in
> > > > > > > > > > > > kn->attr_mutex, as
> > > > > > > > > > > > Tejun
> > > > > > > > > > > > mentioned here
> > > > > > > > > > > > (
> > > > > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > > > > > ),
> > > > > > > > > > > > maybe a global
> > > > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > > > 
> > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could
> > > > > > > > > > > be
> > > > > > > > > > > used
> > > > > > > > > > > around
> > > > > > > > > > > the
> > > > > > > > > > > initial check and checked again at the end which
> > > > > > > > > > > would
> > > > > > > > > > > probably
> > > > > > > > > > > have
> > > > > > > > > > > been much faster but much less conservative and a
> > > > > > > > > > > bit
> > > > > > > > > > > more
> > > > > > > > > > > ugly
> > > > > > > > > > > so
> > > > > > > > > > > I just went the conservative path since there was
> > > > > > > > > > > so
> > > > > > > > > > > much
> > > > > > > > > > > change
> > > > > > > > > > > already.
> > > > > > > > > > 
> > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > > > didn't
> > > > > > > > > > remember
> > > > > > > > > > it.
> > > > > > > > > > 
> > > > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > > > work.
> > > > > > > > > 
> > > > > > > > > Those attribute handling patches were meant to allow
> > > > > > > > > taking
> > > > > > > > > the
> > > > > > > > > rw
> > > > > > > > > sem read lock instead of the write lock for
> > > > > > > > > kernfs_refresh_inode()
> > > > > > > > > updates, with the added locking to protect the inode
> > > > > > > > > attributes
> > > > > > > > > update since it's called from the VFS both with and
> > > > > > > > > without
> > > > > > > > > the
> > > > > > > > > inode lock.
> > > > > > > > 
> > > > > > > > Oh, understood. I was asking also because lock on
> > > > > > > > kn->attr_mutex drags
> > > > > > > > concurrent performance.
> > > > > > > > 
> > > > > > > > > Looking around it looks lik

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote:
> On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent  wrote:
> > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent 
> > > > > wrote:
> > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <
> > > > > > > ra...@themaw.net
> > > > > > > wrote:
> > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > > For the patches, there is a mutex_lock in
> > > > > > > > > > > kn->attr_mutex, as
> > > > > > > > > > > Tejun
> > > > > > > > > > > mentioned here
> > > > > > > > > > > (
> > > > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > > > > ),
> > > > > > > > > > > maybe a global
> > > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > > 
> > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be
> > > > > > > > > > used
> > > > > > > > > > around
> > > > > > > > > > the
> > > > > > > > > > initial check and checked again at the end which
> > > > > > > > > > would
> > > > > > > > > > probably
> > > > > > > > > > have
> > > > > > > > > > been much faster but much less conservative and a
> > > > > > > > > > bit
> > > > > > > > > > more
> > > > > > > > > > ugly
> > > > > > > > > > so
> > > > > > > > > > I just went the conservative path since there was
> > > > > > > > > > so
> > > > > > > > > > much
> > > > > > > > > > change
> > > > > > > > > > already.
> > > > > > > > > 
> > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > > didn't
> > > > > > > > > remember
> > > > > > > > > it.
> > > > > > > > > 
> > > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > > work.
> > > > > > > > 
> > > > > > > > Those attribute handling patches were meant to allow
> > > > > > > > taking
> > > > > > > > the
> > > > > > > > rw
> > > > > > > > sem read lock instead of the write lock for
> > > > > > > > kernfs_refresh_inode()
> > > > > > > > updates, with the added locking to protect the inode
> > > > > > > > attributes
> > > > > > > > update since it's called from the VFS both with and
> > > > > > > > without
> > > > > > > > the
> > > > > > > > inode lock.
> > > > > > > 
> > > > > > > Oh, understood. I was asking also because lock on
> > > > > > > kn->attr_mutex drags
> > > > > > > concurrent performance.
> > > > > > > 
> > > > > > > > Looking around it looks like kernfs_iattrs() is called
> > > > > > > > from
> > > > > > > > multiple
> > > > > > > > places without a node database lock at all.
> > > > > > > > 
> > > > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > > > forward
> > > > > > > > and on topic, I should just leave
> > > > > > &

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> On Thu, Dec 17, 2020 at 12:46 PM Ian Kent  wrote:
> > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent 
> > > > wrote:
> > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent  > > > > > >
> > > > > > wrote:
> > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > For the patches, there is a mutex_lock in
> > > > > > > > > > kn->attr_mutex, as
> > > > > > > > > > Tejun
> > > > > > > > > > mentioned here
> > > > > > > > > > (
> > > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > > > ),
> > > > > > > > > > maybe a global
> > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > 
> > > > > > > > > I wasn't sure about that, IIRC a spin lock could be
> > > > > > > > > used
> > > > > > > > > around
> > > > > > > > > the
> > > > > > > > > initial check and checked again at the end which
> > > > > > > > > would
> > > > > > > > > probably
> > > > > > > > > have
> > > > > > > > > been much faster but much less conservative and a bit
> > > > > > > > > more
> > > > > > > > > ugly
> > > > > > > > > so
> > > > > > > > > I just went the conservative path since there was so
> > > > > > > > > much
> > > > > > > > > change
> > > > > > > > > already.
> > > > > > > > 
> > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > didn't
> > > > > > > > remember
> > > > > > > > it.
> > > > > > > > 
> > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > work.
> > > > > > > 
> > > > > > > Those attribute handling patches were meant to allow
> > > > > > > taking
> > > > > > > the
> > > > > > > rw
> > > > > > > sem read lock instead of the write lock for
> > > > > > > kernfs_refresh_inode()
> > > > > > > updates, with the added locking to protect the inode
> > > > > > > attributes
> > > > > > > update since it's called from the VFS both with and
> > > > > > > without
> > > > > > > the
> > > > > > > inode lock.
> > > > > > 
> > > > > > Oh, understood. I was asking also because lock on
> > > > > > kn->attr_mutex drags
> > > > > > concurrent performance.
> > > > > > 
> > > > > > > Looking around it looks like kernfs_iattrs() is called
> > > > > > > from
> > > > > > > multiple
> > > > > > > places without a node database lock at all.
> > > > > > > 
> > > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > > forward
> > > > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > > > taking
> > > > > > > the node db write lock for now and consider the
> > > > > > > attributes
> > > > > > > handling
> > > > > > > as a separate change. Once that's done we could
> > > > > > > reconsider
> > > > > > > what's
> > > > > > > needed to use the node db read lock in
> > > > > > > kernfs_refresh_inode().
> > > >

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-16 Thread Ian Kent
On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent  wrote:
> > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent 
> > > > wrote:
> > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > For the patches, there is a mutex_lock in
> > > > > > > > kn->attr_mutex, as
> > > > > > > > Tejun
> > > > > > > > mentioned here
> > > > > > > > (
> > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > ),
> > > > > > > > maybe a global
> > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > 
> > > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > > around
> > > > > > > the
> > > > > > > initial check and checked again at the end which would
> > > > > > > probably
> > > > > > > have
> > > > > > > been much faster but much less conservative and a bit
> > > > > > > more
> > > > > > > ugly
> > > > > > > so
> > > > > > > I just went the conservative path since there was so much
> > > > > > > change
> > > > > > > already.
> > > > > > 
> > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > > remember
> > > > > > it.
> > > > > > 
> > > > > > Based on what Tejun said it sounds like that needs work.
> > > > > 
> > > > > Those attribute handling patches were meant to allow taking
> > > > > the
> > > > > rw
> > > > > sem read lock instead of the write lock for
> > > > > kernfs_refresh_inode()
> > > > > updates, with the added locking to protect the inode
> > > > > attributes
> > > > > update since it's called from the VFS both with and without
> > > > > the
> > > > > inode lock.
> > > > 
> > > > Oh, understood. I was asking also because lock on
> > > > kn->attr_mutex drags
> > > > concurrent performance.
> > > > 
> > > > > Looking around it looks like kernfs_iattrs() is called from
> > > > > multiple
> > > > > places without a node database lock at all.
> > > > > 
> > > > > I'm thinking that, to keep my proposed change straight
> > > > > forward
> > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > taking
> > > > > the node db write lock for now and consider the attributes
> > > > > handling
> > > > > as a separate change. Once that's done we could reconsider
> > > > > what's
> > > > > needed to use the node db read lock in
> > > > > kernfs_refresh_inode().
> > > > 
> > > > You meant taking write lock of kernfs_rwsem for
> > > > kernfs_refresh_inode()??
> > > > It may be a lot slower in my benchmark, let me test it.
> > > 
> > > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > > not the read lock.
> > > 
> > > That's a mistake I had initially?
> > > 
> > > Still, that attributes handling is, I think, sufficient to
> > > warrant
> > > a separate change since it looks like it might need work, the
> > > kernfs
> > > node db probably should be kept stable for those attribute
> > > updates
> > > but equally the existence of an instantiated dentry might
> > > mitigate
> > > it.
> > > 
> > > Some people might just know whether it's ok or not but I would
> > > like
> > > to check the callers to work out what's going on.
> > > 
> > > In any case it's academic if GKH isn't willing to consider the
> > > series
> > > for review and possible merge.
> > > 
> > Hi Ian
> > 
> > I removed kn->attr_mutex and chang

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-15 Thread Ian Kent
On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> On Mon, Dec 14, 2020 at 9:30 PM Ian Kent  wrote:
> > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent 
> > > wrote:
> > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, 
> > > > > > > as
> > > > > > > Tejun
> > > > > > > mentioned here
> > > > > > > (
> > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > ),
> > > > > > > maybe a global
> > > > > > > rwsem for kn->iattr will be better??
> > > > > > 
> > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > around
> > > > > > the
> > > > > > initial check and checked again at the end which would
> > > > > > probably
> > > > > > have
> > > > > > been much faster but much less conservative and a bit more
> > > > > > ugly
> > > > > > so
> > > > > > I just went the conservative path since there was so much
> > > > > > change
> > > > > > already.
> > > > > 
> > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > remember
> > > > > it.
> > > > > 
> > > > > Based on what Tejun said it sounds like that needs work.
> > > > 
> > > > Those attribute handling patches were meant to allow taking the
> > > > rw
> > > > sem read lock instead of the write lock for
> > > > kernfs_refresh_inode()
> > > > updates, with the added locking to protect the inode attributes
> > > > update since it's called from the VFS both with and without the
> > > > inode lock.
> > > 
> > > Oh, understood. I was asking also because lock on kn->attr_mutex
> > > drags
> > > concurrent performance.
> > > 
> > > > Looking around it looks like kernfs_iattrs() is called from
> > > > multiple
> > > > places without a node database lock at all.
> > > > 
> > > > I'm thinking that, to keep my proposed change straight forward
> > > > and on topic, I should just leave kernfs_refresh_inode() taking
> > > > the node db write lock for now and consider the attributes
> > > > handling
> > > > as a separate change. Once that's done we could reconsider
> > > > what's
> > > > needed to use the node db read lock in kernfs_refresh_inode().
> > > 
> > > You meant taking write lock of kernfs_rwsem for
> > > kernfs_refresh_inode()??
> > > It may be a lot slower in my benchmark, let me test it.
> > 
> > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > not the read lock.
> > 
> > That's a mistake I had initially?
> > 
> > Still, that attributes handling is, I think, sufficient to warrant
> > a separate change since it looks like it might need work, the
> > kernfs
> > node db probably should be kept stable for those attribute updates
> > but equally the existence of an instantiated dentry might mitigate
> > it.
> > 
> > Some people might just know whether it's ok or not but I would like
> > to check the callers to work out what's going on.
> > 
> > In any case it's academic if GKH isn't willing to consider the
> > series
> > for review and possible merge.
> > 
> Hi Ian
> 
> I removed kn->attr_mutex and changed read lock to write lock for
> kernfs_refresh_inode
> 
> down_write(&kernfs_rwsem);
> kernfs_refresh_inode(kn, inode);
> up_write(&kernfs_rwsem);
> 
> 
> Unfortunate, changes in this way make things worse,  my benchmark
> runs
> 100% slower than upstream sysfs.  :(
> open+read+close a sysfs file concurrently took 1000us. (Currently,
> sysfs with a big mutex kernfs_mutex only takes ~500us
> for one open+read+close operation concurrently)

Right, so it does need attention nowish.

I'll have a look at it in a while, I really need to get a new autofs
release out, and there are quite a few changes, and testing is seeing
a number of errors, some old, some newly introduced. It's proving
difficult.

> 
> --45.93%--kernfs_iop_permission
>    |
>     --22.55%--down_write
>       |
>        --20.69%--rwsem_down_write_slowpath
>          |
>           --8.89%--schedule
> 
> perf showed most of the time had been spent on kernfs_iop_permission
> 
> 
> thanks,
> fox



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-14 Thread Ian Kent
On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> On Sun, Dec 13, 2020 at 11:46 AM Ian Kent  wrote:
> > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > > > Tejun
> > > > > mentioned here
> > > > > (
> > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > ),
> > > > > maybe a global
> > > > > rwsem for kn->iattr will be better??
> > > > 
> > > > I wasn't sure about that, IIRC a spin lock could be used around
> > > > the
> > > > initial check and checked again at the end which would probably
> > > > have
> > > > been much faster but much less conservative and a bit more ugly
> > > > so
> > > > I just went the conservative path since there was so much
> > > > change
> > > > already.
> > > 
> > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > remember
> > > it.
> > > 
> > > Based on what Tejun said it sounds like that needs work.
> > 
> > Those attribute handling patches were meant to allow taking the rw
> > sem read lock instead of the write lock for kernfs_refresh_inode()
> > updates, with the added locking to protect the inode attributes
> > update since it's called from the VFS both with and without the
> > inode lock.
> 
> Oh, understood. I was asking also because lock on kn->attr_mutex
> drags
> concurrent performance.
> 
> > Looking around it looks like kernfs_iattrs() is called from
> > multiple
> > places without a node database lock at all.
> > 
> > I'm thinking that, to keep my proposed change straight forward
> > and on topic, I should just leave kernfs_refresh_inode() taking
> > the node db write lock for now and consider the attributes handling
> > as a separate change. Once that's done we could reconsider what's
> > needed to use the node db read lock in kernfs_refresh_inode().
> 
> You meant taking write lock of kernfs_rwsem for
> kernfs_refresh_inode()??
> It may be a lot slower in my benchmark, let me test it.

Yes, but make sure the write lock of kernfs_rwsem is being taken
not the read lock.

That's a mistake I had initially?

Still, that attributes handling is, I think, sufficient to warrant
a separate change since it looks like it might need work, the kernfs
node db probably should be kept stable for those attribute updates
but equally the existence of an instantiated dentry might mitigate
it.

Some people might just know whether it's ok or not but I would like
to check the callers to work out what's going on.

In any case it's academic if GKH isn't willing to consider the series
for review and possible merge.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-12 Thread Ian Kent
On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > Tejun
> > > mentioned here 
> > > (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/),
> > > maybe a global 
> > > rwsem for kn->iattr will be better??
> > 
> > I wasn't sure about that, IIRC a spin lock could be used around the
> > initial check and checked again at the end which would probably
> > have
> > been much faster but much less conservative and a bit more ugly so
> > I just went the conservative path since there was so much change
> > already.
> 
> Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
> it.
> 
> Based on what Tejun said it sounds like that needs work.

Those attribute handling patches were meant to allow taking the rw
sem read lock instead of the write lock for kernfs_refresh_inode()
updates, with the added locking to protect the inode attributes
update since it's called from the VFS both with and without the
inode lock.

Looking around it looks like kernfs_iattrs() is called from multiple
places without a node database lock at all.

I'm thinking that, to keep my proposed change straight forward
and on topic, I should just leave kernfs_refresh_inode() taking
the node db write lock for now and consider the attributes handling
as a separate change. Once that's done we could reconsider what's
needed to use the node db read lock in kernfs_refresh_inode().

It will reduce the effectiveness of the series but it would make
this change much more complicated, and is somewhat off-topic, and
could hamper the chances of reviewers spotting problem with it.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-10 Thread Ian Kent
On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> 
> > For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> > mentioned here 
> > (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/),
> > maybe a global 
> > rwsem for kn->iattr will be better??
> 
> I wasn't sure about that, IIRC a spin lock could be used around the
> initial check and checked again at the end which would probably have
> been much faster but much less conservative and a bit more ugly so
> I just went the conservative path since there was so much change
> already.

Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
it.

Based on what Tejun said it sounds like that needs work.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-10 Thread Ian Kent
On Thu, 2020-12-10 at 16:44 +, Fox Chen wrote:
> Hi,
> 
> I found this series of patches solves exact the problem I am trying
> to solve.
> https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlc...@gmail.com/

Right.

> 
> The problem is reported by Brice Goglin on thread:
> Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
> https://lore.kernel.org/lkml/x60dvjot4furc...@kroah.com/
> 
> I independently comfirmed that on a 96-core AWS c5.metal server.
> Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id
> 1000 times.
> With a single thread it takes ~2.5 us for each open+read+close.
> With one thread per core, 96 threads running simultaneously takes 540
> us 
> for each of the same operation (without much variation) -- 200x
> slower than the 
> single thread one. 

Right, interesting that it's actually a problem on such
small system configurations.

I didn't think it would be evident on anything but much larger
hardware configurations.

> 
> My Benchmark code is here:
> https://github.com/foxhlchen/sysfs_benchmark
> 
> The problem can only be observed in large machines (>=16 cores).
> The more cores you have the slower it can be.
> 
> Perf shows that CPUs spend most of the time (>80%) waiting on mutex
> locks in 
> kernfs_iop_permission and kernfs_dop_revalidate.
> 
> After applying this, performance gets huge boost -- with the fastest
> one at ~30 us 
> to the worst at ~180 us (most of on spin_locks, the delay just
> stacking up, very
> similar to the performance on ext4). 

That's the problem isn't it.

Unfortunately we don't get large improvements for nothing so I
was constantly thinking, what have I done here that isn't ok ...
and I don't have an answer for that.

The series needs review from others for that but we didn't get
that far.

> 
> I hope this problem can justifies this series of patches. A big mutex
> in kernfs
> is really not nice. Due to this BIG LOCK, concurrency in kernfs is
> almost NONE,
> even though you do operations on different files, they are
> contentious.

Well, as much as I don't like to admit it, Greg (and Tejun) do have
a point about the number of sysfs files used when there is a very
large amount of RAM. But IIUC the suggestion of altering the sysfs
representation for this device's memory would introduce all sorts
of problems, it then being different from all other device memory
representations (systemd udev coldplug, for example).

But I think you're saying there are also visible improvements elsewhere
too, which is to be expected of course.

Let's not forget that, as the maintainer, Greg has every right to
be reluctant to take changes because he is likely to end up owning
and maintaining the changes. That can lead to considerable overhead
and frustration if the change isn't quite right and it's very hard
to be sure there aren't hidden problems with it.

Fact is that, due to Greg's rejection, there was much more focus
by the reporter on fixing it at the source, but I have no control
over that; I only know that it helped to get things moving.

Given the above, I was considering posting the series again and
asking for the series to be re-considered but since I annoyed
Greg so much the first time around I've been reluctant to do so.

But now is a good time I guess, Greg, please, would you re-consider
possibly accepting these patches?

I would really like some actual review of what I have done from
people like yourself and Al. Ha, after that they might well not
be ok anyway!

> 
> As we get more and more cores on normal machines and because sysfs
> provides such
> important information, this problem should be fix. So please
> reconsider accepting
> the patches.
> 
> For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> mentioned here 
> (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/),
> maybe a global 
> rwsem for kn->iattr will be better??

I wasn't sure about that, IIRC a spin lock could be used around the
initial check and checked again at the end which would probably have
been much faster but much less conservative and a bit more ugly so
I just went the conservative path since there was so much change
already.

Ian



Re: 5.9.0-next-20201015: autofs oops in update-binfmts

2020-10-16 Thread Ian Kent
On Fri, 2020-10-16 at 14:35 +0200, Pavel Machek wrote:
> Hi!
> 
> I'm getting this during boot: 32-bit thinkpad x60.

This is very odd.

The change in next is essentially a revert of a change; maybe I'm
missing something and the revert isn't quite a revert, although
there was one difference.

I'll check for other revert differences too.

Are you in a position to check a kernel without the 5.9 change
if I send you a patch?

And we should check if that difference to what was originally
there is the source of the problem, so probably two things to
follow up on, reverting that small difference first would be
the way to go.

Are you able to reliably reproduce it?

> 
> [   10.718377] BUG: kernel NULL pointer dereference, address:
> 
> [   10.721848] #PF: supervisor read access in kernel mode
> [   10.722763] #PF: error_code(0x) - not-present page
> [   10.726759] *pdpt = 0339e001 *pde =  
> [   10.730793] Oops:  [#1] PREEMPT SMP PTI
> [   10.736201] CPU: 1 PID: 2762 Comm: update-binfmts Not tainted
> 5.9.0-next-20201015+ #152
> [   10.738769] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW
> (2.19 ) 03/31/2011
> [   10.742769] EIP: __kernel_write+0xd4/0x230
> [   10.746769] Code: 89 d6 64 8b 15 b4 77 4c c5 8b 8a 38 0b 00 00 31
> d2 85 c9 74 04 0f b7 51 30 66 89 75 e8 8b 75 ac 8d 4d b0 89 45 e4 66
> 89 55 ea <8b> 06 8b 56 04 57 6a 01 89 45 d4 8d 45 b8 89 55 d8 ba 01
> 00 00 00
> [   10.758762] EAX: 0002 EBX: c1922a40 ECX: c33cdad0 EDX:
> 
> [   10.762791] ESI:  EDI: 012c EBP: c33cdb20 ESP:
> c33cdacc
> [   10.766766] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS:
> 00010286
> [   10.770762] CR0: 80050033 CR2:  CR3: 033d CR4:
> 06b0
> [   10.770762] Call Trace:
> [   10.770762]  ? __mutex_unlock_slowpath+0x2b/0x2c0
> [   10.770762]  ? dma_direct_map_sg+0x13a/0x320
> [   10.770762]  autofs_notify_daemon+0x14d/0x2b0
> [   10.770762]  autofs_wait+0x4cd/0x770
> [   10.793051]  ? autofs_d_automount+0xd6/0x1e0
> [   10.793051]  autofs_mount_wait+0x43/0xe0
> [   10.797808]  autofs_d_automount+0xdf/0x1e0
> [   10.797808]  __traverse_mounts+0x85/0x200
> [   10.797808]  step_into+0x368/0x620
> [   10.797808]  ? proc_setup_thread_self+0x110/0x110
> [   10.797808]  walk_component+0x58/0x190
> [   10.811838]  link_path_walk.part.0+0x245/0x360
> [   10.811838]  path_lookupat.isra.0+0x31/0x130
> [   10.811838]  filename_lookup+0x8d/0x130
> [   10.818749]  ? cache_alloc_debugcheck_after+0x151/0x180
> [   10.818749]  ? getname_flags+0x1f/0x160
> [   10.818749]  ? kmem_cache_alloc+0x75/0x100
> [   10.818749]  user_path_at_empty+0x25/0x30
> [   10.818749]  vfs_statx+0x63/0x100
> [   10.831022]  ? _raw_spin_unlock+0x18/0x30
> [   10.831022]  ? replace_page_cache_page+0x160/0x160
> [   10.831022]  __do_sys_stat64+0x36/0x60
> [   10.831022]  ? exit_to_user_mode_prepare+0x35/0xe0
> [   10.831022]  ? irqentry_exit_to_user_mode+0x8/0x20
> [   10.838773]  ? irqentry_exit+0x55/0x70
> [   10.838773]  ? exc_page_fault+0x228/0x3c0
> [   10.838773]  __ia32_sys_stat64+0xd/0x10
> [   10.838773]  do_int80_syscall_32+0x2c/0x40
> [   10.848561]  entry_INT80_32+0x111/0x111
> [   10.848561] EIP: 0xb7ee2092
> [   10.848561] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30
> 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00
> 00 cd 80  8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d
> b4 26 00
> [   10.848561] EAX: ffda EBX: 00468490 ECX: bfbce6ec EDX:
> 00467348
> [   10.848561] ESI:  EDI: 00468490 EBP: bfbce6ec ESP:
> bfbce6c4
> [   10.848561] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS:
> 0292
> [   10.848561] Modules linked in:
> [   10.848561] CR2: 
> [   10.851552] ---[ end trace d01bd7323c2317a5 ]---
> [   10.851558] EIP: __kernel_write+0xd4/0x230
> [   10.851561] Code: 89 d6 64 8b 15 b4 77 4c c5 8b 8a 38 0b 00 00 31
> d2 85 c9 74 04 0f b7 51 30 66 89 75 e8 8b 75 ac 8d 4d b0 89 45 e4 66
> 89 55 ea <8b> 06 8b 56 04 57 6a 01 89 45 d4 8d 45 b8 89 55 d8 ba 01
> 00 00 00
> [   10.851563] EAX: 0002 EBX: c1922a40 ECX: c33cdad0 EDX:
> 
> [   10.851565] ESI:  EDI: 012c EBP: c33cdb20 ESP:
> c33cdacc
> [   10.851568] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS:
> 00010286
> [   10.851570] CR0: 80050033 CR2: 004a700e CR3: 033d CR4:
> 06b0
> [   11.803128] systemd-journald[2514]: Received request to flush
> runtime journal from PID 1
> [   26.113941] iwl3945 :03:00.0: loaded firmware version
> 15.32.2.9
> [   59.809322] traps: clock-applet[3636] trap int3 ip:b724ffc0
> sp:bf879b90 error:0 in libglib-2.0.so.0.5000.3[b7203000+12a000]
> [   59.812036] traps: mateweather-app[3638] trap int3 ip:b7283fc0
> sp:bfb65760 error:0 in libglib-2.0.so.0.5000.3[b7237000+12a000]
> [   64.628401] wlan0: authenticate with 5c:f4:ab:10:d2:bb
> 
-- 
Ian Kent 



Re: [PATCH] Harden autofs ioctl table

2020-08-19 Thread Ian Kent
On Tue, 2020-08-18 at 13:22 +0100, Matthew Wilcox wrote:
> The table of ioctl functions should be marked const in order to put
> them
> in read-only memory, and we should use array_index_nospec() to avoid
> speculation disclosing the contents of kernel memory to userspace.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Acked-by: Ian Kent 

Thanks Matthew, I certainly want to know about changes to autofs
made by others so thanks for sending this to me.

At the same time I need to send my patches via someone else as
Linus asked me to do ages ago now.

So, once again Andrew, if you would be so kind as to include this
in your tree please.

Ian

> 
> diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c
> index 75105f45c51a..322b7dfb4ea0 100644
> --- a/fs/autofs/dev-ioctl.c
> +++ b/fs/autofs/dev-ioctl.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "autofs_i.h"
>  
> @@ -563,7 +564,7 @@ static int autofs_dev_ioctl_ismountpoint(struct
> file *fp,
>  
>  static ioctl_fn lookup_dev_ioctl(unsigned int cmd)
>  {
> - static ioctl_fn _ioctls[] = {
> + static const ioctl_fn _ioctls[] = {
>   autofs_dev_ioctl_version,
>   autofs_dev_ioctl_protover,
>   autofs_dev_ioctl_protosubver,
> @@ -581,7 +582,10 @@ static ioctl_fn lookup_dev_ioctl(unsigned int
> cmd)
>   };
>   unsigned int idx = cmd_idx(cmd);
>  
> - return (idx >= ARRAY_SIZE(_ioctls)) ? NULL : _ioctls[idx];
> + if (idx >= ARRAY_SIZE(_ioctls))
> + return NULL;
> + idx = array_index_nospec(idx, ARRAY_SIZE(_ioctls));
> + return _ioctls[idx];
>  }
>  
>  /* ioctl dispatcher */



Re: file metadata via fs API

2020-08-12 Thread Ian Kent
On Wed, 2020-08-12 at 12:50 -0700, Linus Torvalds wrote:
> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <
> swhit...@redhat.com> wrote:
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
> 
> We haven't had that before, I don't see why it's suddenly such a big
> deal.

Because there's a trend occurring in user space where there are
frequent and persistent mount changes that cause high overhead.

I've seen a number of problems building up over the last few months
that are essentially the same problem I wanted to resolve, all
related to side effects of autofs using a large number of mounts.

The problems are real.

> 
> The notification side I understand. Polling /proc files is not the
> answer.

Yep, that's one aspect, getting the information about a mount without
reading the entire mount table seems like the sensible thing to do to
allow for a more efficient notification mechanism.

> 
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
> 
> It really smells like "do it because we can, not because we must".
> 
> Who the hell cares about monitoring mounts at a kHz frequencies? If
> this is for MIS use, you want a nice GUI and not wasting CPU time
> polling.

That part of the problem still remains.

The kernel sending a continuous stream of wake-ups under load also
introduces a resource problem, but that's probably something to
handle in user space.

> 
> I'm starting to ignore the pull requests from David Howells, because
> by now they have had the same pattern for a couple of years now:
> esoteric new interfaces that seem overdesigned for corner-cases that
> I'm not seeing people clamoring for.
> 
> I need (a) proof this is actually something real users care about and
> (b) way more open discussion and implementation from multiple
> parties.
> 
> Because right now it looks like a small in-cabal of a couple of
> people
> who have wild ideas but I'm not seeing the wider use of it.
> 
> Convince me otherwise. AGAIN. This is the exact same issue I had with
> the notification queues that I really wanted actual use-cases for,
> and
> feedback from actual outside users.
> 
> I really think this is engineering for its own sake, rather than
> responding to actual user concerns.
> 
>Linus



Re: file metadata via fs API

2020-08-12 Thread Ian Kent
On Wed, 2020-08-12 at 14:06 +0100, David Howells wrote:
> Miklos Szeredi  wrote:
> 
> > That presumably means the mount ID <-> mount path mapping already
> > exists, which means it's just possible to use the open(mount_path,
> > O_PATH) to obtain the base fd.
> 
> No, you can't.  A path may correspond to multiple mounts stacked on
> top of
> each other, e.g.:
> 
>   mount -t tmpfs none /mnt
>   mount -t tmpfs none /mnt
>   mount -t tmpfs none /mnt
> 
> Now you have three co-located mounts and you can't use the path to
> differentiate them.  I think this might be an issue in autofs, but
> Ian would
> need to comment on that.

It is a problem for autofs, direct mounts in particular, but also
for mount ordering when umounting a tree of mounts where mounts are
covered, or at shutdown.

Ian



Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

2020-08-11 Thread Ian Kent
On Tue, 2020-08-11 at 21:39 +0200, Christian Brauner wrote:
> On Tue, Aug 11, 2020 at 09:05:22AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi 
> > wrote:
> > > What's the disadvantage of doing it with a single lookup WITH an
> > > enabling flag?
> > > 
> > > It's definitely not going to break anything, so no backward
> > > compatibility issues whatsoever.
> > 
> > No backwards compatibility issues for existing programs, no.
> > 
> > But your suggestion is fundamentally ambiguous, and you most
> > definitely *can* hit that if people start using this in new
> > programs.
> > 
> > Where does that "unified" pathname come from? It will be generated
> > from "base filename + metadata name" in user space, and
> > 
> >  (a) the base filename might have double or triple slashes in it
> > for
> > whatever reasons.
> > 
> > This is not some "made-up gotcha" thing - I see double slashes
> > *all*
> > the time when we have things like Makefiles doing
> > 
> > srctree=../../src/
> > 
> > and then people do "$(srctree)/". If you haven't seen that kind of
> > pattern where the pathname has two (or sometimes more!) slashes in
> > the
> > middle, you've led a very sheltered life.
> > 
> >  (b) even if the new user space were to think about that, and
> > remove
> > those (hah! when have you ever seen user space do that?), as Al
> > mentioned, the user *filesystem* might have pathnames with double
> > slashes as part of symlinks.
> > 
> > So now we'd have to make sure that when we traverse symlinks, that
> > O_ALT gets cleared. Which means that it's not a unified namespace
> > after all, because you can't make symlinks point to metadata.
> > 
> > Or we'd retroactively change the semantics of a symlink, and that
> > _is_
> > a backwards compatibility issue. Not with old software, no, but it
> > changes the meaning of old symlinks!
> > 
> > So no, I don't think a unified namespace ends up working.
> > 
> > And I say that as somebody who actually loves the concept. Ask Al:
> > I
> > have a few times pushed for "let's allow directory behavior on
> > regular
> > files", so that you could do things like a tar-filesystem, and
> > access
> > the contents of a tar-file by just doing
> > 
> > cat my-file.tar/inside/the/archive.c
> > 
> > or similar.
> > 
> > Al has convinced me it's a horrible idea (and there you have a
> > non-ambiguous marker: the slash at the end of a pathname that
> > otherwise looks and acts as a non-directory)
> > 
> 
> Putting my kernel hat down, putting my userspace hat on.
> 
> I'm looking at this from a potential user of this interface.
> I'm not a huge fan of the metadata fd approach I'd much rather have a
> dedicated system call rather than opening a side-channel metadata fd
> that I can read binary data from. Maybe I'm alone in this but I was
> under the impression that other users including Ian, Lennart, and
> Karel
> have said on-list in some form that they would prefer this approach.
> There are even patches for systemd and libmount, I thought?

Not quite sure what you mean here.

Karel (with some contributions by me) has implemented the interfaces
for David's mount notifications and fsinfo() call in libmount. We
still have a little more to do on that.

I also have a systemd implementation that uses these libmount features
for mount table handling that works quite well, with a couple more
things to do to complete it, that Lennart has done an initial review
for.

It's no secret that I don't like the proc file system in general
but it is really useful for many things, that's just the way it
is.

Ian



Re: [PATCH] fs: autofs: delete repeated words in comments

2020-08-11 Thread Ian Kent
On Tue, 2020-08-11 at 07:42 -0700, Randy Dunlap wrote:
> On 8/11/20 1:36 AM, Ian Kent wrote:
> > On Mon, 2020-08-10 at 19:18 -0700, Randy Dunlap wrote:
> > > Drop duplicated words {the, at} in comments.
> > > 
> > > Signed-off-by: Randy Dunlap 
> > > Cc: Ian Kent 
> > > Cc: aut...@vger.kernel.org
> > 
> > Acked-by: Ian Kent 
> 
> Hi Ian,
> 
> Since you are the listed maintainer of this file, does this mean
> that you will be merging it?

I could but that would mean double handling since I would send it
to Andrew anyway.

Andrew, could you take this one please?

> 
> thanks.
> 
> > > ---
> > >  fs/autofs/dev-ioctl.c |4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- linux-next-20200807.orig/fs/autofs/dev-ioctl.c
> > > +++ linux-next-20200807/fs/autofs/dev-ioctl.c
> > > @@ -20,7 +20,7 @@
> > >   * another mount. This situation arises when starting
> > > automount(8)
> > >   * or other user space daemon which uses direct mounts or offset
> > >   * mounts (used for autofs lazy mount/umount of nested mount
> > > trees),
> > > - * which have been left busy at at service shutdown.
> > > + * which have been left busy at service shutdown.
> > >   */
> > >  
> > >  typedef int (*ioctl_fn)(struct file *, struct autofs_sb_info *,
> > > @@ -496,7 +496,7 @@ static int autofs_dev_ioctl_askumount(st
> > >   * located path is the root of a mount we return 1 along with
> > >   * the super magic of the mount or 0 otherwise.
> > >   *
> > > - * In both cases the the device number (as returned by
> > > + * In both cases the device number (as returned by
> > >   * new_encode_dev()) is also returned.
> > >   */
> > >  static int autofs_dev_ioctl_ismountpoint(struct file *fp,



Re: [PATCH] fs: autofs: delete repeated words in comments

2020-08-11 Thread Ian Kent
On Mon, 2020-08-10 at 19:18 -0700, Randy Dunlap wrote:
> Drop duplicated words {the, at} in comments.
> 
> Signed-off-by: Randy Dunlap 
> Cc: Ian Kent 
> Cc: aut...@vger.kernel.org

Acked-by: Ian Kent 

> ---
>  fs/autofs/dev-ioctl.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-next-20200807.orig/fs/autofs/dev-ioctl.c
> +++ linux-next-20200807/fs/autofs/dev-ioctl.c
> @@ -20,7 +20,7 @@
>   * another mount. This situation arises when starting automount(8)
>   * or other user space daemon which uses direct mounts or offset
>   * mounts (used for autofs lazy mount/umount of nested mount trees),
> - * which have been left busy at at service shutdown.
> + * which have been left busy at service shutdown.
>   */
>  
>  typedef int (*ioctl_fn)(struct file *, struct autofs_sb_info *,
> @@ -496,7 +496,7 @@ static int autofs_dev_ioctl_askumount(st
>   * located path is the root of a mount we return 1 along with
>   * the super magic of the mount or 0 otherwise.
>   *
> - * In both cases the the device number (as returned by
> + * In both cases the device number (as returned by
>   * new_encode_dev()) is also returned.
>   */
>  static int autofs_dev_ioctl_ismountpoint(struct file *fp,



Re: [PATCH 06/18] fsinfo: Add a uniquifier ID to struct mount [ver #21]

2020-08-05 Thread Ian Kent
On Wed, 2020-08-05 at 20:33 +0100, Matthew Wilcox wrote:
> On Wed, Aug 05, 2020 at 04:30:10PM +0100, David Howells wrote:
> > Miklos Szeredi  wrote:
> > 
> > > idr_alloc_cyclic() seems to be a good template for doing the
> > > lower
> > > 32bit allocation, and we can add code to increment the high 32bit
> > > on
> > > wraparound.
> > > 
> > > Lots of code uses idr_alloc_cyclic() so I guess it shouldn't be
> > > too
> > > bad in terms of memory use or performance.
> > 
> > It's optimised for shortness of path and trades memory for
> > performance.  It's
> > currently implemented using an xarray, so memory usage is dependent
> > on the
> > sparseness of the tree.  Each node in the tree is 576 bytes and in
> > the worst
> > case, each one node will contain one mount - and then you have to
> > backfill the
> > ancestry, though for lower memory costs.
> > 
> > Systemd makes life more interesting since it sets up a whole load
> > of
> > propagations.  Each mount you make may cause several others to be
> > created, but
> > that would likely make the tree more efficient.
> 
> I would recommend using xa_alloc and ignoring the ID assigned from
> xa_alloc.  Looking up by unique ID is then a matter of iterating
> every
> mount (xa_for_each()) looking for a matching unique ID in the mount
> struct.  That's O(n) search, but it's faster than a linked list, and
> we
> don't have that many mounts in a system.

How many is not many, 5000, 1, I agree that 3 plus is fairly
rare, even for the autofs direct mount case I hope the implementation
here will help to fix.

Ian



Re: [PATCH 10/18] fsinfo: Provide notification overrun handling support [ver #21]

2020-08-05 Thread Ian Kent
On Wed, 2020-08-05 at 13:27 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 1:23 PM Ian Kent  wrote:
> > On Wed, 2020-08-05 at 09:45 +0200, Miklos Szeredi wrote:
> > > Hmm, what's the other possibility for lost notifications?
> > 
> > In user space that is:
> > 
> > Multi-threaded application races, single threaded applications and
> > signal processing races, other bugs ...
> 
> Okay, let's fix the bugs then.

It's the bugs you don't know about that get you; in this case
the world "is" actually out to get you, ;)

Ian



Re: [PATCH 10/18] fsinfo: Provide notification overrun handling support [ver #21]

2020-08-05 Thread Ian Kent
On Wed, 2020-08-05 at 09:45 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 4:46 AM Ian Kent  wrote:
> > Coming back to an actual use case.
> > 
> > What I said above is one aspect but, since I'm looking at this
> > right
> > now with systemd, and I do have the legacy code to fall back to,
> > the
> > "just reset everything" suggestion does make sense.
> > 
> > But I'm struggling to see how I can identify notification buffer
> > overrun in libmount, and overrun is just one possibility for lost
> > notifications, so I like the idea that, as a library user, I can
> > work out that I need to take action based on what I have in the
> > notifications themselves.
> 
> Hmm, what's the other possibility for lost notifications?

In user space that is:

Multi-threaded application races, single threaded applications and
signal processing races, other bugs ...

For example systemd has its own event handling sub-system and handles
half a dozen or so event types, of which mount changes are one. It's
fairly complex so I find myself wondering if I can trust it and
wondering if there are undiscovered bugs in it. The answer to the
former is probably yes, but the answer to the latter is also probably
yes.

Maybe I'm just paranoid!
Ian




Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]

2020-08-05 Thread Ian Kent
On Wed, 2020-08-05 at 09:43 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 3:54 AM Ian Kent  wrote:
> > > > It's way more useful to have these in the notification than
> > > > obtainable
> > > > via fsinfo() IMHO.
> > > 
> > > What is it useful for?
> > 
> > Only to verify that you have seen all the notifications.
> > 
> > If you have to grab that info with a separate call then the count
> > isn't necessarily consistent because other notifications can occur
> > while you grab it.
> 
> No, no no.   The watch queue will signal an overflow, without any
> additional overhead for the normal case.  If you think of this as a
> protocol stack, then the overflow detection happens on the transport
> layer, instead of the application layer.  The application layer is
> responsible for restoring state in case of a transport layer error,
> but detection of that error is not the responsibility of the
> application layer.

I can see in the kernel code that an error is returned if the message
buffer is full when trying to add a message, I just can't see where
to get it in the libmount code.

That's not really a communication protocol problem.

Still I need to work out how to detect it, maybe it is seen by
the code in libmount already and I simply can't see what I need
to do to recognise it ...

So I'm stuck wanting to verify I have got everything that was
sent and am having trouble moving on from that.

Ian



Re: [GIT PULL] Filesystem Information

2020-08-05 Thread Ian Kent
On Wed, 2020-08-05 at 10:00 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 3:33 AM Ian Kent  wrote:
> > On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> > > And notice how similar the above interface is to getxattr(), or
> > > the
> > > proposed readfile().  Where has the "everything is  a file"
> > > philosophy
> > > gone?
> > 
> > Maybe, but that philosophy (in a roundabout way) is what's resulted
> > in some of the problems we now have. Granted it's blind application
> > of that philosophy rather than the philosophy itself but that is
> > what happens.
> 
> Agree.   What people don't seem to realize, even though there are
> blindingly obvious examples, is that binary interfaces like the proposed
> fsinfo(2) syscall can also result in a multitude of problems at the
> same time as solving some others.
> 
> There's no magic solution in API design, it's not black and white.
> We just need to strive for a good enough solution.  The problem seems
> to be that trying to discuss the merits of other approaches seems to
> hit a brick wall.  We just see repeated pull requests from David,
> without any real discussion of the proposed alternatives.
> 
> > I get that your comments are driven by the way that philosophy
> > should
> > be applied which is more of a "if it works best doing it that way
> > then
> > do it that way, and that's usually a file".
> > 
> > In this case there is a logical division of various types of file
> > system information and the underlying suggestion is maybe it's time
> > to move away from the "everything is a file" hard and fast rule,
> > and get rid of some of the problems that have resulted from it.
> > 
> > Notifications are an example: yes, the delivery mechanism is
> > a "file" but the design of the queueing mechanism makes a lot of
> > sense for the throughput that's going to be needed as time marches
> > on. Then there's different sub-systems each with unique information
> > that needs to be deliverable some other way because delivering
> > "all"
> > the information via the notification would be just plain wrong so
> > a multi-faceted information delivery mechanism makes the most
> > sense to allow specific targeted retrieval of individual items of
> > information.
> > 
> > But that also supposes your at least open to the idea that "maybe
> > not everything should be a file".
> 
> Sure.  I've learned pragmatism, although idealist at heart.  And I'm
> not saying all API's from David are shit.  statx(2) is doing fine.
> It's a simple binary interface that does its job well.   Compare the
> header files for statx and fsinfo, though, and maybe you'll see what
> I'm getting at...

Yeah, but I'm biased so not much joy there ... ;)

Ian



Re: [PATCH 10/18] fsinfo: Provide notification overrun handling support [ver #21]

2020-08-04 Thread Ian Kent
On Wed, 2020-08-05 at 10:05 +0800, Ian Kent wrote:
> On Tue, 2020-08-04 at 15:56 +0200, Miklos Szeredi wrote:
> > On Mon, Aug 03, 2020 at 02:37:50PM +0100, David Howells wrote:
> > > Provide support for the handling of an overrun in a watch
> > > queue.  In the
> > > event that an overrun occurs, the watcher needs to be able to
> > > find
> > > out what
> > > it was that they missed.  To this end, previous patches added
> > > event
> > > counters to struct mount.
> > 
> > So this is optimizing the buffer overrun case?
> > 
> > Shouldn't we just make sure that the likelihood of overruns is low
> > and
> > if it
> > happens, just reinitialize everything from scratch (shouldn't be
> > *that*
> > expensive).
> 
> But maybe not possible if you are using notifications for tracking
> state in user space, you need to know when the thing you have needs
> to be synced because you missed something and it's during the
> notification processing you actually have the object that may need
> to be refreshed.
> 
> > Trying to find out what was missed seems like just adding
> > complexity
> > for no good
> > reason.

Coming back to an actual use case.

What I said above is one aspect but, since I'm looking at this right
now with systemd, and I do have the legacy code to fall back to, the
"just reset everything" suggestion does make sense.

But I'm struggling to see how I can identify notification buffer
overrun in libmount, and overrun is just one possibility for lost
notifications, so I like the idea that, as a library user, I can
work out that I need to take action based on what I have in the
notifications themselves.

Ian



Re: [PATCH 10/18] fsinfo: Provide notification overrun handling support [ver #21]

2020-08-04 Thread Ian Kent
On Tue, 2020-08-04 at 15:56 +0200, Miklos Szeredi wrote:
> On Mon, Aug 03, 2020 at 02:37:50PM +0100, David Howells wrote:
> > Provide support for the handling of an overrun in a watch
> > queue.  In the
> > event that an overrun occurs, the watcher needs to be able to find
> > out what
> > it was that they missed.  To this end, previous patches added event
> > counters to struct mount.
> 
> So this is optimizing the buffer overrun case?
> 
> Shouldn't we just make sure that the likelihood of overruns is low and
> if it
> happens, just reinitialize everything from scratch (shouldn't be
> *that*
> expensive).

But maybe not possible if you are using notifications for tracking
state in user space, you need to know when the thing you have needs
to be synced because you missed something and it's during the
notification processing you actually have the object that may need
to be refreshed.

> 
> Trying to find out what was missed seems like just adding complexity
> for no good
> reason.
> 



Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]

2020-08-04 Thread Ian Kent
On Tue, 2020-08-04 at 15:19 +0200, Miklos Szeredi wrote:
> On Tue, Aug 4, 2020 at 1:39 PM Ian Kent  wrote:
> > On Mon, 2020-08-03 at 11:29 +0200, Miklos Szeredi wrote:
> > > On Thu, Jul 23, 2020 at 12:48 PM David Howells <dhowe...@redhat.com> wrote:
> > > 
> > > > > > __u32   topology_changes;
> > > > > > __u32   attr_changes;
> > > > > > __u32   aux_topology_changes;
> > > > > 
> > > > > Being 32bit this introduces wraparound effects.  Is that
> > > > > really
> > > > > worth it?
> > > > 
> > > > You'd have to make 2 billion changes without whoever's
> > > > monitoring
> > > > getting a
> > > > chance to update their counters.  But maybe it's not worth it
> > > > putting them
> > > > here.  If you'd prefer, I can make the counters all 64-bit and
> > > > just
> > > > retrieve
> > > > them with fsinfo().
> > > 
> > > Yes, I think that would be preferable.
> > 
> > I think this is the source of the recommendation for removing the
> > change counters from the notification message, correct?
> > 
> > While it looks like I may not need those counters for systemd
> > message
> > buffer overflow handling myself I think removing them from the
> > notification message isn't a sensible thing to do.
> > 
> > If you need to detect missing messages, perhaps due to message
> > buffer
> > overflow, then you need change counters that are relevant to the
> > notification message itself. That's so the next time you get a
> > message
> > for that object you can be sure that change counter comparisons you
> > you make relate to object notifications you have processed.
> 
> I don't quite get it.  Change notification is just that: a
> notification.   You need to know what object that notification
> relates
> to, to be able to retrieve the up to date attributes of said object.
> 
> What happens if you get a change counter N in the notification
> message, then get a change counter N + 1 in the attribute retrieval?
> You know that another change happened, and you haven't yet processed
> the notification yet.  So when the notification with N + 1 comes in,
> you can optimize away the attribute retrieve.
> 
> Nice optimization, but it's optimizing a race condition, and I don't
> think that's warranted.  I don't see any other use for the change
> counter in the notification message.
> 
> 
> > Yes, I know it isn't quite that simple, but tallying up what you
> > have
> > processed in the current batch of messages (or in multiple batches
> > of
> > messages if more than one read has been possible) to perform the
> > check
> > is a user space responsibility. And it simply can't be done if the
> > counters consistency is in question which it would be if you need
> > to
> > perform another system call to get it.
> > 
> > It's way more useful to have these in the notification than
> > obtainable
> > via fsinfo() IMHO.
> 
> What is it useful for?

Only to verify that you have seen all the notifications.

If you have to grab that info with a separate call then the count
isn't necessarily consistent because other notifications can occur
while you grab it.

My per-object rant isn't quite right; what's needed is a consistent
way to verify you have seen everything you were supposed to.

I think your point is that if you grab the info in another call and
it doesn't match you need to refresh, and that's fine, but I think it's
better to be able to verify you have got everything that was sent as
you go, and avoid the need for the refresh more often.

> 
> If the notification itself would contain the list of updated
> attributes and their new values, then yes, this would make sense.  If
> the notification just tells us that the object was modified, but not
> the modifications themselves, then I don't see how the change counter
> in itself could add any information (other than optimizing the race
> condition above).
> 
> Thanks,
> Miklos
> 
> 
> 
> 
> > > > > > n->watch.info & NOTIFY_MOUNT_IS_RECURSIVE if true
> > > > > > indicates that
> > > > > > the notifcation was generated by an event (eg.
> > > > > > SETATTR)
> > > > > > that was
> > > > > > applied recursively.  The notification is only
> > > > > > generated for the
> > > > > > object that initially triggered it.
> > > > > 
> > > > > Unused in this patchset.  Please don't add things to the API
> > > > > which are not
> > > > > used.
> > > > 
> > > > Christian Brauner has patches for mount_setattr() that will
> > > > need to
> > > > use this.
> > > 
> > > Fine, then that patch can add the flag.
> > > 
> > > Thanks,
> > > Miklos



Re: [GIT PULL] Filesystem Information

2020-08-04 Thread Ian Kent
On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> On Tue, Aug 4, 2020 at 4:15 AM Ian Kent  wrote:
> > On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> > > On Mon, Aug 3, 2020 at 5:50 PM David Howells wrote:
> > > > Hi Linus,
> > > > 
> > > > Here's a set of patches that adds a system call, fsinfo(), that
> > > > allows
> > > > information about the VFS, mount topology, superblock and files
> > > > to
> > > > be
> > > > retrieved.
> > > > 
> > > > The patchset is based on top of the mount notifications
> > > > patchset so
> > > > that
> > > > the mount notification mechanism can be hooked to provide event
> > > > counters
> > > > that can be retrieved with fsinfo(), thereby making it a lot
> > > > faster
> > > > to work
> > > > out which mounts have changed.
> > > > 
> > > > Note that there was a last minute change requested by Miklós:
> > > > the
> > > > event
> > > > counter bits got moved from the mount notification patchset to
> > > > this
> > > > one.
> > > > The counters got made atomic_long_t inside the kernel and __u64
> > > > in
> > > > the
> > > > UAPI.  The aggregate changes can be assessed by comparing pre-
> > > > change tag,
> > > > fsinfo-core-20200724 to the requested pull tag.
> > > > 
> > > > Karel Zak has created preliminary patches that add support to
> > > > libmount[*]
> > > > and Ian Kent has started working on making systemd use these
> > > > and
> > > > mount
> > > > notifications[**].
> > > 
> > > So why are you asking to pull at this stage?
> > > 
> > > Has anyone done a review of the patchset?
> > 
> > I have been working with the patch set as it has evolved for quite
> > a
> > while now.
> > 
> > I've been reading the kernel code quite a bit and forwarded
> > questions
> > and minor changes to David as they arose.
> > 
> > As for a review, not specifically, but while the series implements
> > a
> > rather large change it's surprisingly straightforward to read.
> > 
> > In the time I have been working with it I haven't noticed any
> > problems
> > except for those few minor things that I reported to David early on
> > (in
> > some cases accompanied by simple patches).
> > 
> > And more recently (obviously) I've been working with the mount
> > notifications changes and, from a readability POV, I find it's the
> > same as the fsinfo() code.
> > 
> > > I think it's obvious that this API needs more work.  The
> > > integration
> > > work done by Ian is a good direction, but it's not quite the full
> > > validation and review that a complex new API needs.
> > 
> > Maybe but the system call is fundamental to making notifications
> > useful
> > and, as I say, after working with it for quite a while I don't fell
> > there's missing features (that David hasn't added along the way)
> > and
> > have found it provides what's needed for what I'm doing (for mount
> > notifications at least).
> 
> Apart from the various issues related to the various mount ID's and
> their sizes, my general comment is (and was always): why are we
> adding
> a multiplexer for retrieval of mostly unrelated binary structures?
> 
> <linux/fsinfo.h> is 345 lines.  This is not a simple and clean API.
> 
> A simple and clean replacement API would be:
> 
> int get_mount_attribute(int dfd, const char *path, const char
> *attr_name, char *value_buf, size_t buf_size, int flags);
> 
> No header file needed with dubiously sized binary values.
> 
> The only argument was performance, but apart from purely synthetic
> microbenchmarks that hasn't been proven to be an issue.
> 
> And notice how similar the above interface is to getxattr(), or the
> proposed readfile().  Where has the "everything is  a file"
> philosophy
> gone?

Maybe, but that philosophy (in a roundabout way) is what has led to
some of the problems we now have. Granted, it's the blind application
of that philosophy rather than the philosophy itself, but that is
what happens.

I get that your comments are driven by the way that philosophy should
be applied, which is more of a "if it works best doing it that way then
do it that way, and that's usually a file".

In this case there is a logical division of various types of file

Re: [PATCH 15/18] fsinfo: Add an attribute that lists all the visible mounts in a namespace [ver #21]

2020-08-04 Thread Ian Kent
On Tue, 2020-08-04 at 16:05 +0200, Miklos Szeredi wrote:
> On Mon, Aug 03, 2020 at 02:38:34PM +0100, David Howells wrote:
> > Add a filesystem attribute that exports a list of all the visible
> > mounts in
> > a namespace, given the caller's chroot setting.  The returned list
> > is an
> > array of:
> > 
> > struct fsinfo_mount_child {
> > __u64   mnt_unique_id;
> > __u32   mnt_id;
> > __u32   parent_id;
> > __u32   mnt_notify_sum;
> > __u32   sb_notify_sum;
> > };
> > 
> > where each element contains a once-in-a-system-lifetime unique ID,
> > the
> > mount ID (which may get reused), the parent mount ID and sums of
> > the
> > notification/change counters for the mount and its superblock.
> 
> The change counters are currently conditional on
> CONFIG_MOUNT_NOTIFICATIONS.
> Is this is intentional?
> 
> > This works with a read lock on the namespace_sem, but ideally would
> > do it
> > under the RCU read lock only.
> > 
> > Signed-off-by: David Howells 
> > ---
> > 
> >  fs/fsinfo.c |1 +
> >  fs/internal.h   |1 +
> >  fs/namespace.c  |   37
> > +
> >  include/uapi/linux/fsinfo.h |4 
> >  samples/vfs/test-fsinfo.c   |   22 ++
> >  5 files changed, 65 insertions(+)
> > 
> > diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> > index 0540cce89555..f230124ffdf5 100644
> > --- a/fs/fsinfo.c
> > +++ b/fs/fsinfo.c
> > @@ -296,6 +296,7 @@ static const struct fsinfo_attribute
> > fsinfo_common_attributes[] = {
> > FSINFO_STRING   (FSINFO_ATTR_MOUNT_POINT,   fsinfo_gene
> > ric_mount_point),
> > FSINFO_STRING   (FSINFO_ATTR_MOUNT_POINT_FULL,  fsinfo_gene
> > ric_mount_point_full),
> > FSINFO_LIST (FSINFO_ATTR_MOUNT_CHILDREN,fsinfo_generic_moun
> > t_children),
> > +   FSINFO_LIST (FSINFO_ATTR_MOUNT_ALL, fsinfo_generic_moun
> > t_all),
> > {}
> >  };
> >  
> > diff --git a/fs/internal.h b/fs/internal.h
> > index cb5edcc7125a..267b4aaf0271 100644
> > --- a/fs/internal.h
> > +++ b/fs/internal.h
> > @@ -102,6 +102,7 @@ extern int fsinfo_generic_mount_topology(struct
> > path *, struct fsinfo_context *)
> >  extern int fsinfo_generic_mount_point(struct path *, struct
> > fsinfo_context *);
> >  extern int fsinfo_generic_mount_point_full(struct path *, struct
> > fsinfo_context *);
> >  extern int fsinfo_generic_mount_children(struct path *, struct
> > fsinfo_context *);
> > +extern int fsinfo_generic_mount_all(struct path *, struct
> > fsinfo_context *);
> >  
> >  /*
> >   * fs_struct.c
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 122c12f9512b..1f2e06507244 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -4494,4 +4494,41 @@ int fsinfo_generic_mount_children(struct
> > path *path, struct fsinfo_context *ctx)
> > return ctx->usage;
> >  }
> >  
> > +/*
> > + * Return information about all the mounts in the namespace
> > referenced by the
> > + * path.
> > + */
> > +int fsinfo_generic_mount_all(struct path *path, struct
> > fsinfo_context *ctx)
> > +{
> > +   struct mnt_namespace *ns;
> > +   struct mount *m, *p;
> > +   struct path chroot;
> > +   bool allow;
> > +
> > +   m = real_mount(path->mnt);
> > +   ns = m->mnt_ns;
> > +
> > +   get_fs_root(current->fs, &chroot);
> > +   rcu_read_lock();
> > +   allow = are_paths_connected(&chroot, path) ||
> > capable(CAP_SYS_ADMIN);
> > +   rcu_read_unlock();
> > +   path_put(&chroot);
> > +   if (!allow)
> > +   return -EPERM;
> > +
> > +   down_read(&namespace_sem);
> > +
> > +   list_for_each_entry(p, &ns->list, mnt_list) {
> 
> This is missing locking and check added by commit 9f6c61f96f2d
> ("proc/mounts:
> add cursor").

That's a good catch Miklos.

Yes, the extra lock and the cursor check that's now needed.

> 
> > +   struct path mnt_root;
> > +
> > +   mnt_root.mnt = &p->mnt;
> > +   mnt_root.dentry = p->mnt.mnt_root;
> > +   if (are_paths_connected(path, &mnt_root))
> > +   fsinfo_store_mount(ctx, p, p == m);
> > +   }
> > +
> > +   up_read(&namespace_sem);
> > +   return ctx->usage;
> > +}
> > +
> >  #endif /* CONFIG_FSINFO */
> > diff --git a/include/uapi/linux/fsinfo.h
> > b/include/uapi/linux/fsinfo.h
> > index 81329de6905e..e40192d98648 100644
> > --- a/include/uapi/linux/fsinfo.h
> > +++ b/include/uapi/linux/fsinfo.h
> > @@ -37,6 +37,7 @@
> >  #define FSINFO_ATTR_MOUNT_POINT_FULL   0x203   /* Absolute
> > path of mount (string) */
> >  #define FSINFO_ATTR_MOUNT_TOPOLOGY 0x204   /* Mount object
> > topology */
> >  #define FSINFO_ATTR_MOUNT_CHILDREN 0x205   /* Children of this
> > mount (list) */
> > +#define FSINFO_ATTR_MOUNT_ALL  0x206   /* List all
> > mounts in a namespace (list) */
> >  
> >  #define FSINFO_ATTR_AFS_CELL_NAME  0x300   /* AFS cell name
> > (string) */
> >  #define FSINFO_ATTR_AFS_SERVER_NAME0x301   /* Name of
> > the Nth server (string) */
> > @@ -128,6 +129,8 @@ struct 

Re: [PATCH 06/18] fsinfo: Add a uniquifier ID to struct mount [ver #21]

2020-08-04 Thread Ian Kent
On Tue, 2020-08-04 at 12:41 +0200, Miklos Szeredi wrote:
> On Mon, Aug 03, 2020 at 02:37:16PM +0100, David Howells wrote:
> > Add a uniquifier ID to struct mount that is effectively unique over
> > the
> > kernel lifetime to deal around mnt_id values being reused.  This
> > can then
> > be exported through fsinfo() to allow detection of replacement
> > mounts that
> > happen to end up with the same mount ID.
> > 
> > The normal mount handle is still used for referring to a particular
> > mount.
> > 
> > The mount notification is then changed to convey these unique mount
> > IDs
> > rather than the mount handle.
> > 
> > Signed-off-by: David Howells 
> > ---
> > 
> >  fs/mount.h|3 +++
> >  fs/mount_notify.c |4 ++--
> >  fs/namespace.c|3 +++
> >  3 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/mount.h b/fs/mount.h
> > index 85456a5f5a3a..1037781be055 100644
> > --- a/fs/mount.h
> > +++ b/fs/mount.h
> > @@ -79,6 +79,9 @@ struct mount {
> > int mnt_expiry_mark;/* true if marked for
> > expiry */
> > struct hlist_head mnt_pins;
> > struct hlist_head mnt_stuck_children;
> > +#ifdef CONFIG_FSINFO
> > +   u64 mnt_unique_id;  /* ID unique over lifetime of
> > kernel */
> > +#endif
> 
> Not sure if it's worth making conditional.
> 
> >  #ifdef CONFIG_MOUNT_NOTIFICATIONS
> > struct watch_list *mnt_watchers; /* Watches on dentries within
> > this mount */
> >  #endif
> > diff --git a/fs/mount_notify.c b/fs/mount_notify.c
> > index 44f570e4cebe..d8ba66ed5f77 100644
> > --- a/fs/mount_notify.c
> > +++ b/fs/mount_notify.c
> > @@ -90,7 +90,7 @@ void notify_mount(struct mount *trigger,
> > n.watch.type= WATCH_TYPE_MOUNT_NOTIFY;
> > n.watch.subtype = subtype;
> > n.watch.info= info_flags | watch_sizeof(n);
> > -   n.triggered_on  = trigger->mnt_id;
> > +   n.triggered_on  = trigger->mnt_unique_id;
> >  
> > switch (subtype) {
> > case NOTIFY_MOUNT_EXPIRY:
> > @@ -102,7 +102,7 @@ void notify_mount(struct mount *trigger,
> > case NOTIFY_MOUNT_UNMOUNT:
> > case NOTIFY_MOUNT_MOVE_FROM:
> > case NOTIFY_MOUNT_MOVE_TO:
> > -   n.auxiliary_mount   = aux->mnt_id;
> > +   n.auxiliary_mount = aux->mnt_unique_id;
> 
> Hmm, so we now have two ID's:
> 
>  - one can be used to look up the mount
>  - one is guaranteed to be unique
> 
> With this change the mount cannot be looked up with
> FSINFO_FLAGS_QUERY_MOUNT,
> right?
> 
> Should we be merging the two ID's into a single one which has both
> properties?

I'd been thinking we would probably need to change to 64 bit ids
for a while now and I thought that was what was going to happen.

We'll need to change libmount and current code but better early
on than later.

Ian

> 
> > break;
> >  
> > default:
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index b2b9920ffd3c..1db8a64cd76f 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -115,6 +115,9 @@ static int mnt_alloc_id(struct mount *mnt)
> > if (res < 0)
> > return res;
> > mnt->mnt_id = res;
> > +#ifdef CONFIG_FSINFO
> > +   mnt->mnt_unique_id = atomic64_inc_return(&mnt_unique_counter);
> > +#endif
> > return 0;
> >  }
> >  
> > 
> > 



Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]

2020-08-04 Thread Ian Kent
On Mon, 2020-08-03 at 11:29 +0200, Miklos Szeredi wrote:
> On Thu, Jul 23, 2020 at 12:48 PM David Howells 
> wrote:
> 
> > > > __u32   topology_changes;
> > > > __u32   attr_changes;
> > > > __u32   aux_topology_changes;
> > > 
> > > Being 32bit this introduces wraparound effects.  Is that really
> > > worth it?
> > 
> > You'd have to make 2 billion changes without whoever's monitoring
> > getting a
> > chance to update their counters.  But maybe it's not worth it
> > putting them
> > here.  If you'd prefer, I can make the counters all 64-bit and just
> > retrieve
> > them with fsinfo().
> 
> Yes, I think that would be preferable.

I think this is the source of the recommendation for removing the
change counters from the notification message, correct?

While it looks like I may not need those counters for systemd message
buffer overflow handling myself I think removing them from the
notification message isn't a sensible thing to do.

If you need to detect missing messages, perhaps due to message buffer
overflow, then you need change counters that are relevant to the
notification message itself. That's so the next time you get a message
for that object you can be sure that the change counter comparisons
you make relate to object notifications you have processed.

Yes, I know it isn't quite that simple, but tallying up what you have
processed in the current batch of messages (or in multiple batches of
messages if more than one read has been possible) to perform the check
is a user space responsibility. And it simply can't be done if the
counters' consistency is in question, which it would be if you need to
perform another system call to get it.

It's way more useful to have these in the notification than obtainable
via fsinfo() IMHO.

> 
> > > > n->watch.info & NOTIFY_MOUNT_IS_RECURSIVE if true
> > > > indicates that
> > > > the notification was generated by an event (eg. SETATTR)
> > > > that was
> > > > applied recursively.  The notification is only
> > > > generated for the
> > > > object that initially triggered it.
> > > 
> > > Unused in this patchset.  Please don't add things to the API
> > > which are not
> > > used.
> > 
> > Christian Brauner has patches for mount_setattr() that will need to
> > use this.
> 
> Fine, then that patch can add the flag.
> 
> Thanks,
> Miklos



Re: [GIT PULL] Filesystem Information

2020-08-03 Thread Ian Kent
On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> On Mon, Aug 3, 2020 at 5:50 PM David Howells 
> wrote:
> > 
> > Hi Linus,
> > 
> > Here's a set of patches that adds a system call, fsinfo(), that
> > allows
> > information about the VFS, mount topology, superblock and files to
> > be
> > retrieved.
> > 
> > The patchset is based on top of the mount notifications patchset so
> > that
> > the mount notification mechanism can be hooked to provide event
> > counters
> > that can be retrieved with fsinfo(), thereby making it a lot faster
> > to work
> > out which mounts have changed.
> > 
> > Note that there was a last minute change requested by Miklós: the
> > event
> > counter bits got moved from the mount notification patchset to this
> > one.
> > The counters got made atomic_long_t inside the kernel and __u64 in
> > the
> > UAPI.  The aggregate changes can be assessed by comparing pre-
> > change tag,
> > fsinfo-core-20200724 to the requested pull tag.
> > 
> > Karel Zak has created preliminary patches that add support to
> > libmount[*]
> > and Ian Kent has started working on making systemd use these and
> > mount
> > notifications[**].
> 
> So why are you asking to pull at this stage?
> 
> Has anyone done a review of the patchset?

I have been working with the patch set as it has evolved for quite a
while now.

I've been reading the kernel code quite a bit and forwarded questions
and minor changes to David as they arose.

As for a review, not specifically, but while the series implements a
rather large change it's surprisingly straightforward to read.

In the time I have been working with it I haven't noticed any problems
except for those few minor things that I reported to David early on (in
some cases accompanied by simple patches).

And more recently (obviously) I've been working with the mount
notifications changes and, from a readability POV, I find it's the
same as the fsinfo() code.

> 
> I think it's obvious that this API needs more work.  The integration
> work done by Ian is a good direction, but it's not quite the full
> validation and review that a complex new API needs.

Maybe but the system call is fundamental to making notifications useful
and, as I say, after working with it for quite a while I don't feel
there are features missing (that David hasn't added along the way) and
have found it provides what's needed for what I'm doing (for mount
notifications at least).

I'll be posting a github PR for systemd for discussion soon while I
get on with completing the systemd change, like overflow handling and
meson build system changes to allow building with and without the
util-linux libmount changes.

So, ideally, I'd like to see the series merged, we've been working on
it for quite a considerable time now.

Ian



Re: [GIT PULL] Mount notifications

2020-08-03 Thread Ian Kent
On Mon, 2020-08-03 at 16:27 +0100, David Howells wrote:
> Hi Linus,
> 
> Here's a set of patches to add notifications for mount topology
> events,
> such as mounting, unmounting, mount expiry, mount reconfiguration.
> 
> The first patch in the series adds a hard limit on the number of
> watches
> that any particular user can add.  The RLIMIT_NOFILE value for the
> process
> adding a watch is used as the limit.  Even if you don't take the rest
> of
> the series, can you at least take this one?
> 
> An LSM hook is included for an LSM to rule on whether or not a mount
> watch
> may be set on a particular path.
> 
> This series is intended to be taken in conjunction with the fsinfo
> series
> which I'll post a pull request for shortly and which is dependent on
> it.
> 
> Karel Zak[*] has created preliminary patches that add support to
> libmount
> and Ian Kent has started working on making systemd use them.
> 
> [*] https://github.com/karelzak/util-linux/commits/topic/fsinfo
> 
> Note that there have been some last minute changes to the patchset:
> you
> wanted something adding and Miklós wanted some bits taking
> out/changing.
> I've placed a tag, fsinfo-core-20200724 on the aggregate of these two
> patchsets that can be compared to fsinfo-core-20200803.
> 
> To summarise the changes: I added the limiter that you wanted;
> removed an
> unused symbol; made the mount ID fields in the notification 64-bit
> (the
> fsinfo patchset has a change to convey the mount uniquifier instead
> of the
> mount ID); removed the event counters from the mount notification and
> moved
> the event counters into the fsinfo patchset.

I've pushed my systemd changes to a github repo.
I haven't yet updated it with the changes above but will get to it.

They can be found at:
https://github.com/raven-au/systemd.git branch notifications-devel

> 
> 
> 
> WHY?
> 
> 
> Why do we want mount notifications?  Whilst /proc/mounts can be
> polled, it
> only tells you that something changed in your namespace.  To find
> out, you
> have to trawl /proc/mounts or similar to work out what changed in the
> mount
> object attributes and mount topology.  I'm told that the proc file
> holding
> the namespace_sem is a point of contention, especially as the process
> of
> generating the text descriptions of the mounts/superblocks can be
> quite
> involved.
> 
> The notification generated here directly indicates the mounts
> involved in
> any particular event and gives an idea of what the change was.
> 
> This is combined with a new fsinfo() system call that allows, amongst
> other
> things, the ability to retrieve in one go an { id, change_counter }
> tuple
> from all the children of a specified mount, allowing buffer overruns
> to be
> dealt with quickly.
> 
> This is of use to systemd to improve efficiency:
> 
>   
> https://lore.kernel.org/linux-fsdevel/20200227151421.3u74ijhqt6ekb...@ws.net.home/
> 
> And it's not just Red Hat that's potentially interested in this:
> 
>   
> https://lore.kernel.org/linux-fsdevel/293c9bd3-f530-d75e-c353-ddeabac27...@6wind.com/
> 
> 
> David
> ---
> The following changes since commit
> ba47d845d715a010f7b51f6f89bae32845e6acb7:
> 
>   Linux 5.8-rc6 (2020-07-19 15:41:18 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git 
> tags/mount-notifications-20200803
> 
> for you to fetch changes up to
> 841a0dfa511364fa9a8d67512e0643669f1f03e3:
> 
>   watch_queue: sample: Display mount tree change notifications (2020-
> 08-03 12:15:38 +0100)
> 
> 
> Mount notifications
> 
> 
> David Howells (5):
>   watch_queue: Limit the number of watches a user can hold
>   watch_queue: Make watch_sizeof() check record size
>   watch_queue: Add security hooks to rule on setting mount
> watches
>   watch_queue: Implement mount topology and attribute change
> notifications
>   watch_queue: sample: Display mount tree change notifications
> 
>  Documentation/watch_queue.rst   |  12 +-
>  arch/alpha/kernel/syscalls/syscall.tbl  |   1 +
>  arch/arm/tools/syscall.tbl  |   1 +
>  arch/arm64/include/asm/unistd.h |   2 +-
>  arch/arm64/include/asm/unistd32.h   |   2 +
>  arch/ia64/kernel/syscalls/syscall.tbl   |   1 +
>  arch/m68k/kernel/syscalls/syscall.tbl   |   1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
>  ar

Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]

2020-08-03 Thread Ian Kent
On Mon, 2020-08-03 at 13:31 +0100, David Howells wrote:
> Ian Kent  wrote:
> 
> > > I'm changing it so that the fields are 64-bit, but initialised
> > > with the
> > > existing mount ID in the notifications set.  The fsinfo set
> > > changes that
> > > to a unique ID.  I'm tempted to make the unique IDs start at
> > > UINT_MAX+1 to
> > > disambiguate them.
> > 
> > Mmm ... so what would I use as a mount id that's not used, like
> > NULL
> > for strings?
> 
> Zero is skipped, so you could use that.
> 
> > I'm using -1 now but changing this will mean I need something
> > different.
> 
> It's 64-bits, so you're not likely to see it reach -1, even if it
> does start
> at UINT_MAX+1.

Ha, either one; I don't think it will be a problem. There are
bound to be a few changes, so the components using this will
need to change a bit before it's finalized. That shouldn't be a
big deal, at least not for me, and shouldn't be much for
libmount either.

Ian



Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]

2020-08-03 Thread Ian Kent
On Mon, 2020-08-03 at 12:49 +0100, David Howells wrote:
> Miklos Szeredi  wrote:
> 
> > OTOH mount notification is way smaller and IMO a more mature
> > interface.  So just picking the unique ID patch into this set might
> > make sense.
> 
> But userspace can't retrieve the unique ID without fsinfo() as things
> stand.
> 
> I'm changing it so that the fields are 64-bit, but initialised with
> the
> existing mount ID in the notifications set.  The fsinfo set changes
> that to a
> unique ID.  I'm tempted to make the unique IDs start at UINT_MAX+1 to
> disambiguate them.

Mmm ... so what would I use as a mount id that's not used, like NULL
for strings?

I'm using -1 now but changing this will mean I need something
different.

Could we set aside a mount id that will never be used so it can be
used for this case?

Maybe mount ids should start at 1 instead of zero ...

Ian




Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]

2020-07-24 Thread Ian Kent
On Fri, 2020-07-24 at 11:19 +0100, David Howells wrote:
> David Howells  wrote:
> 
> > > What guarantees that mount_id is going to remain a 32bit entity?
> > 
> > You think it likely we'd have >4 billion concurrent mounts on a
> > system?  That
> > would require >1.2TiB of RAM just for the struct mount allocations.
> > 
> > But I can expand it to __u64.
> 
> That said, sys_name_to_handle_at() assumes it's a 32-bit signed
> integer, so
> we're currently limited to ~2 billion concurrent mounts:-/

I was wondering about id re-use.

Assuming that ids returned to the idr db are re-used, what would
the chance be that a recently freed id would end up being used
again?

Would that chance increase as ids are consumed and freed over
time?

Yeah, it's one of those questions ... ;)

Ian



Re: [PATCH] autofs: fix doubled word

2020-07-15 Thread Ian Kent
On Wed, 2020-07-15 at 18:28 -0700, Randy Dunlap wrote:
> From: Randy Dunlap 
> 
> Change doubled word "is" to "it is".
> 
> Signed-off-by: Randy Dunlap 
> Cc: Ian Kent 
> Cc: aut...@vger.kernel.org
> Cc: Andrew Morton 

Acked-by: Ian Kent 

> ---
>  include/uapi/linux/auto_dev-ioctl.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-next-20200714.orig/include/uapi/linux/auto_dev-ioctl.h
> +++ linux-next-20200714/include/uapi/linux/auto_dev-ioctl.h
> @@ -82,7 +82,7 @@ struct args_ismountpoint {
>  /*
>   * All the ioctls use this structure.
>   * When sending a path size must account for the total length
> - * of the chunk of memory otherwise is is the size of the
> + * of the chunk of memory otherwise it is the size of the
>   * structure.
>   */
>  
> 



Re: [PATCH 01/10] Documentation: filesystems: autofs-mount-control: drop doubled words

2020-07-05 Thread Ian Kent
On Fri, 2020-07-03 at 14:43 -0700, Randy Dunlap wrote:
> Drop the doubled words "the" and "and".
> 
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Ian Kent 

Acked-by: Ian Kent 

> Cc: aut...@vger.kernel.org
> ---
>  Documentation/filesystems/autofs-mount-control.rst |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> --- linux-next-20200701.orig/Documentation/filesystems/autofs-mount-
> control.rst
> +++ linux-next-20200701/Documentation/filesystems/autofs-mount-
> control.rst
> @@ -391,7 +391,7 @@ variation uses the path and optionally i
>  set to an autofs mount type. The call returns 1 if this is a mount
> point
>  and sets out.devid field to the device number of the mount and
> out.magic
>  field to the relevant super block magic number (described below) or
> 0 if
> -it isn't a mountpoint. In both cases the the device number (as
> returned
> +it isn't a mountpoint. In both cases the device number (as returned
>  by new_encode_dev()) is returned in out.devid field.
>  
>  If supplied with a file descriptor we're looking for a specific
> mount,
> @@ -399,12 +399,12 @@ not necessarily at the top of the mounte
>  the descriptor corresponds to is considered a mountpoint if it is
> itself
>  a mountpoint or contains a mount, such as a multi-mount without a
> root
>  mount. In this case we return 1 if the descriptor corresponds to a
> mount
> -point and and also returns the super magic of the covering mount if
> there
> +point and also returns the super magic of the covering mount if
> there
>  is one or 0 if it isn't a mountpoint.
>  
>  If a path is supplied (and the ioctlfd field is set to -1) then the
> path
>  is looked up and is checked to see if it is the root of a mount. If
> a
>  type is also given we are looking for a particular autofs mount and
> if
> -a match isn't found a fail is returned. If the the located path is
> the
> +a match isn't found a fail is returned. If the located path is the
>  root of a mount 1 is returned along with the super magic of the
> mount
>  or 0 otherwise.



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-25 Thread Ian Kent
On Thu, 2020-06-25 at 11:43 +0200, Greg Kroah-Hartman wrote:
> On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> > On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > > Hello, Rick.
> > > 
> > > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > > I don't know. The above highlights the absurdity of the
> > > > > approach
> > > > > itself to
> > > > > me. You seem to be aware of it too in writing: 250,000
> > > > > "devices".
> > > > 
> > > > Just because it is absurd doesn't mean it wasn't built that way
> > > > :)
> > > > 
> > > > I agree, and I'm trying to influence the next hardware design.
> > > > However,
> > > 
> > > I'm not saying that the hardware should not segment things into
> > > however many
> > > pieces that it wants / needs to. That part is fine.
> > > 
> > > > what's already out there is memory units that must be accessed
> > > > in
> > > > 256MB
> > > > blocks. If you want to remove/add a GB, that's really 4 blocks
> > > > of
> > > > memory
> > > > you're manipulating, to the hardware. Those blocks have to be
> > > > registered
> > > > and recognized by the kernel for that to work.
> > > 
> > > The problem is fitting that into an interface which wholly
> > > doesn't
> > > fit that
> > > particular requirement. It's not that difficult to imagine
> > > different
> > > ways to
> > > represent however many memory slots, right? It'd take work to
> > > make
> > > sure that
> > > integrates well with whatever tooling or use cases but once done
> > > this
> > > particular problem will be resolved permanently and the whole
> > > thing
> > > will
> > > look a lot less silly. Wouldn't that be better?
> > 
> > Well, no, I am finding it difficult to imagine different ways to
> > represent this but perhaps that's because I'm blinkered about what
> > a solution might look like because of my file system focus.
> > 
> > Can "anyone" throw out some ideas with a little more detail than we
> > have had so far so we can maybe start to formulate an actual plan
> > of
> > what needs to be done.
> 
> I think both Tejun and I have provided a number of alternatives for
> you
> all to look into, and yet you all keep saying that those are
> impossible
> for some unknown reason.

Yes, those comments are a starting point to be sure,
but continuing on that path isn't helping anyone.

That's why I'm asking for your input on what a solution you
would see as adequate might look like to you (and Tejun).

> 
> It's not up to me to tell you what to do to fix your broken
> interfaces
> as only you all know who is using this and how to handle those
> changes.

But it would be useful to go into a little more detail, based on
your own experience, about what you think a suitable solution might
be.

That surely needs to be taken into account and used to guide the
direction of our investigation of what we do.

> 
> It is up to me to say "don't do that!" and to refuse patches that
> don't
> solve the root problem here.  I'll review these later on (I have
> 1500+
> patches to review at the moment) as these are a nice
> micro-optimization...

Sure, and I get the "I don't want another post-and-run set of
patches that I have to maintain forever that don't fully solve
the problem" view. Any ideas, and perhaps a little more detail
on where we might go with this, would be very much appreciated.

> 
> And as this conversation seems to just going in circles, I think this
> is
> going to be my last response to it...

Which is why I'm asking this, I really would like to see this
discussion change course and become useful.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-25 Thread Ian Kent
On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> Hello, Rick.
> 
> On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > I don't know. The above highlights the absurdity of the approach
> > > itself to
> > > me. You seem to be aware of it too in writing: 250,000 "devices".
> > 
> > Just because it is absurd doesn't mean it wasn't built that way :)
> > 
> > I agree, and I'm trying to influence the next hardware design.
> > However,
> 
> I'm not saying that the hardware should not segment things into
> however many
> pieces that it wants / needs to. That part is fine.
> 
> > what's already out there is memory units that must be accessed in
> > 256MB
> > blocks. If you want to remove/add a GB, that's really 4 blocks of
> > memory
> > you're manipulating, to the hardware. Those blocks have to be
> > registered
> > and recognized by the kernel for that to work.
> 
> The problem is fitting that into an interface which wholly doesn't
> fit that
> particular requirement. It's not that difficult to imagine different
> ways to
> represent however many memory slots, right? It'd take work to make
> sure that
> integrates well with whatever tooling or use cases but once done this
> particular problem will be resolved permanently and the whole thing
> will
> look a lot less silly. Wouldn't that be better?

Well, no, I am finding it difficult to imagine different ways to
represent this but perhaps that's because I'm blinkered about what
a solution might look like because of my file system focus.

Can "anyone" throw out some ideas with a little more detail than we
have had so far so we can maybe start to formulate an actual plan of
what needs to be done.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Ian Kent
On Tue, 2020-06-23 at 02:33 -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
> 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Well, not your platform perhaps, but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs.  So in that sense,
> any patches would need to go thru you.  So, your opinions do matter.
> 
>   
> > Anyway, as I have said before, my first guesses would be:
> > - increase the granularity size of the "memory chunks",
> > reducing
> >   the number of devices you create.
> 
> This would mean finding every utility that relies on this
> behavior.  That may be possible, although not easy, for distro or
> platform software, but it's hard to guess what user-related utilities
> may have been created by other consumers of those distros or that
> platform.  In any case, removing an interface without warning is a
> hanging offense in many Linux circles.
> 
> > - delay creating the devices until way after booting, or do it
> >   on a totally different path/thread/workqueue/whatever to
> >   prevent delay at booting
> 
> This has been considered, but it again requires a full list of
> utilities relying on this interface and determining which of them may
> want to run before the devices are "loaded" at boot time.  It may be
> few, or even zero, but it would be a much more disruptive change in
> the boot process than what we are suggesting.
> 
> > And then there's always:
> > - don't create them at all, only only do so if userspace asks
> >   you to.
> 
> If they are done in parallel on demand, you'll see the same problem
> (load average of 1000+, contention in the same spot.)  You obviously
> won't hold up the boot, of course, but your utility and anything else
> running on the machine will take an unexpected pause ... for
> somewhere between 30 and 90 minutes.  Seems equally unfriendly.
> 
> A variant of this, which does have a positive effect, is to observe
> that coldplug during initramfs does seem to load up the memory device
> tree without incident.  We do a second coldplug after we switch roots
> and this is the one that runs into timer issues.  I have asked "those
> that should know" why there is a second coldplug.  I can guess but
> would prefer to know to avoid that screaming option.  If that second
> coldplug is unnecessary for the kernfs memory interfaces to work
> correctly, then that is an alternate, and perhaps even better
> solution.  (It wouldn't change the fact that kernfs was not built for
> speed and this problem remains below the surface to trip up another.)

We might still need the patches here for that on-demand mechanism
to be feasible.

For example, for an ls of the node directory it should be doable to
enumerate the nodes in readdir without creating dentries, but the
inevitable stat() of each path that follows would probably lead to
similar contention.

And changing the division of the entries into sub-directories would
inevitably break anything that does actually need to access them.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Ian Kent
On Tue, 2020-06-23 at 16:01 +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > > 
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The
> > > > > > > > size of
> > > > > > > > the
> > > > > > > > segments (and thus the number of them) is dictated by
> > > > > > > > the
> > > > > > > > underlying
> > > > > > > > hardware.
> > > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that,
> > > > > > > right?
> > > > > > 
> > > > > > I'm still struggling a bit to grasp what you're getting at
> > > > > > but
> > > > > > ...
> > > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-
> > > > > object
> > > > > directory isn't the right interface for this sort of usage at
> > > > > all.
> > > > > Are these
> > > > > even real hardware pieces which can be plugged in and out?
> > > > > While
> > > > > being a
> > > > > discrete piece of hardware isn't a requirement to be a device
> > > > > model
> > > > > device,
> > > > > the whole thing is designed with such use cases in mind. It
> > > > > definitely isn't
> > > > > the right design for representing six digit number of logical
> > > > > entities.
> > > > > 
> > > > > It should be obvious that representing each consecutive
> > > > > memory
> > > > > range with a
> > > > > separate directory entry is far from an optimal way of
> > > > > representing
> > > > > something like this. It's outright silly.
> > > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem
> > > > down
> > > > the
> > > > road" if we accept these patches.  Please fix this up properly
> > > > so
> > > > that
> > > > this interface is correctly fixed to not do looney things like
> > > > this.
> > > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to accept a change to mitigate it because that
> > > could
> > > mean no further discussion on it and no further work toward
> > > solving
> > > it.
> > > 
> > > But it seems to me a "proper" solution to this will cross a
> > > number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's
> > > likely to become increasingly problematic over time.
> > > 
> > > So what are your ideas and recommendations on how to handle
> > > hotplug
> > > memory at this granularity for this much RAM (and larger
> > > amounts)?
> > 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Sorry, but I don't think it's funny at all.
> 
> It's not "my platform" either, I'm just the poor old soul that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
> 
> I don't see how asking for your advice is out of order at all.
> 
> > Anyway, as I have said before, my first guesses would be:
> > - increase the granularity size of the "memory chunks",
> > reducing
> >   the number of devices you create.
> 
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
> 
> I'll try and find someone appropriate to consult about that and
> see where it goes.
> 
> > - delay creating the devices until way after booting, or do it
> >   on a totally different path/thread/workqueue/whatever to
> >   prevent delay at booting
> 
> When you first said 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Ian Kent
On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > Hello, Ian.
> > > > 
> > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > The
> > > > > > > size of
> > > > > > > the
> > > > > > > segments (and thus the number of them) is dictated by the
> > > > > > > underlying
> > > > > > > hardware.
> > > > > > 
> > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > that,
> > > > > > right?
> > > > > 
> > > > > I'm still struggling a bit to grasp what you're getting at but
> > > > > ...
> > > > 
> > > > I was more trying to say that the sysfs device interface with
> > > > per-
> > > > object
> > > > directory isn't the right interface for this sort of usage at
> > > > all.
> > > > Are these
> > > > even real hardware pieces which can be plugged in and out?
> > > > While
> > > > being a
> > > > discrete piece of hardware isn't a requirement to be a device
> > > > model
> > > > device,
> > > > the whole thing is designed with such use cases in mind. It
> > > > definitely isn't
> > > > the right design for representing six digit number of logical
> > > > entities.
> > > > 
> > > > It should be obvious that representing each consecutive memory
> > > > range with a
> > > > separate directory entry is far from an optimal way of
> > > > representing
> > > > something like this. It's outright silly.
> > > 
> > > I agree.  And again, Ian, you are just "kicking the problem down
> > > the
> > > road" if we accept these patches.  Please fix this up properly so
> > > that
> > > this interface is correctly fixed to not do looney things like
> > > this.
> > 
> > Fine, mitigating this problem isn't the end of the story, and you
> > don't want to accept a change to mitigate it because that could
> > mean no further discussion on it and no further work toward solving
> > it.
> > 
> > But it seems to me a "proper" solution to this will cross a number
> > of areas so this isn't just "my" problem and, as you point out,
> > it's
> > likely to become increasingly problematic over time.
> > 
> > So what are your ideas and recommendations on how to handle hotplug
> > memory at this granularity for this much RAM (and larger amounts)?
> 
> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Sorry, but I don't think it's funny at all.

It's not "my platform" either, I'm just the poor old soul that
took this on because, on the face of it, it's a file system
problem as claimed by others that looked at it and promptly
washed their hands of it.

I don't see how asking for your advice is out of order at all.

> 
> Anyway, as I have said before, my first guesses would be:
>   - increase the granularity size of the "memory chunks",
> reducing
> the number of devices you create.

Yes, I didn't get that from your initial comments but you've said
it a couple of times recently and I do get it now.

I'll try and find someone appropriate to consult about that and
see where it goes.

>   - delay creating the devices until way after booting, or do it
> on a totally different path/thread/workqueue/whatever to
> prevent delay at booting

When you first said this it sounded like an ugly workaround to me.
But perhaps it isn't (I'm not really convinced it is TBH), so it's
probably worth trying to follow up on too.

> 
> And then there's always:
>   - don't create them at all, only do so if userspace asks
> you to.

At first glance the impression I get from this is that it's an even
uglier workaround than delaying it, but it might actually be the most
sensible way to handle this, as it's been called, silliness.

We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
the dentry op ->d_automount() to be called on access so, from a path
walk perspective, the dentries could just appear when needed.

The question I'd need to answer is whether the kernfs nodes exist so
that ->d_automount() can discover if the node lookup is valid, and I think
the answer might be yes (but we would need to suppress udev
notifications for S_AUTOMOUNT nodes).

The catch will be that this is "not" mounting per se, so anything
I do would probably be seen as an ugly hack that subverts the VFS
automount support.

If I could find a way to reconcile that I could probably do this.

Al, what say you on this?

> 
> You all have the userspace tools/users for this interface and know it
> best to know what will work for them.  If you don't, then hey, let's
> just delete the whole thing and see who screams :)

Please, no joking, I'm finding it hard enough to cope with this
disappointment as it is. ;)

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Ian Kent
On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > Hello, Ian.
> > 
> > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > They are used for hotplugging and partitioning memory. The
> > > > > size of
> > > > > the
> > > > > segments (and thus the number of them) is dictated by the
> > > > > underlying
> > > > > hardware.
> > > > 
> > > > This sounds so bad. There gotta be a better interface for that,
> > > > right?
> > > 
> > > > I'm still struggling a bit to grasp what you're getting at but ...
> > 
> > I was more trying to say that the sysfs device interface with per-
> > object
> > directory isn't the right interface for this sort of usage at all.
> > Are these
> > even real hardware pieces which can be plugged in and out? While
> > being a
> > discrete piece of hardware isn't a requirement to be a device model
> > device,
> > the whole thing is designed with such use cases in mind. It
> > definitely isn't
> > the right design for representing six digit number of logical
> > entities.
> > 
> > It should be obvious that representing each consecutive memory
> > range with a
> > separate directory entry is far from an optimal way of representing
> > something like this. It's outright silly.
> 
> I agree.  And again, Ian, you are just "kicking the problem down the
> road" if we accept these patches.  Please fix this up properly so
> that
> this interface is correctly fixed to not do looney things like this.

Fine, mitigating this problem isn't the end of the story, and you
don't want to accept a change to mitigate it because that could
mean no further discussion on it and no further work toward solving
it.

But it seems to me a "proper" solution to this will cross a number
of areas so this isn't just "my" problem and, as you point out, it's
likely to become increasingly problematic over time.

So what are your ideas and recommendations on how to handle hotplug
memory at this granularity for this much RAM (and larger amounts)?

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-20 Thread Ian Kent
On Fri, 2020-06-19 at 18:23 -0400, Tejun Heo wrote:
> On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> > On 6/19/20 8:38 AM, Tejun Heo wrote:
> > 
> > > I don't have strong objections to the series but the rationales
> > > don't seem
> > > particularly strong. It's solving a suspected problem but only
> > > half way. It
> > > isn't clear whether this can be the long term solution for the
> > > problem
> > > machine and whether it will benefit anyone else in a meaningful
> > > way either.
> > 
> > I don't understand your statement about solving the problem
> > halfway. Could
> > you elaborate?
> 
> Spending 5 minutes during boot creating sysfs objects doesn't seem
> like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving
> the
> scalability of kernfs code but the use case seems a bit out there.
> 
> > > I think Greg already asked this but how are the 100,000+ memory
> > > objects
> > > used? Is that justified in the first place?
> > 
> > They are used for hotplugging and partitioning memory. The size of
> > the
> > segments (and thus the number of them) is dictated by the
> > underlying
> > hardware.
> 
> This sounds so bad. There gotta be a better interface for that,
> right?

I'm still struggling a bit to grasp what you're getting at but ...

Maybe you're talking about the underlying notification system, where
a notification is sent for every event.

There's nothing new about that problem and it's becoming increasingly
clear that existing kernel notification sub-systems don't scale well.

Mount handling is a current example which is one of the areas David
Howells is trying to improve and that's taken years now to get as
far as it has.

It seems to me that any improvements in the area here would have a
different solution, perhaps something along the lines of multiple
notification merging, increased context carried in notifications,
or the like. Something like the notification merging to reduce
notification volume might eventually be useful for David's
notifications sub-system too (and, I think the design of that
sub-system could probably accommodate that sort of change away
from the problematic anonymous notification sub-systems we have
now).

But it's taken a long time to get that far with that project and
the case here would have a far more significant impact on a fairly
large number of sub-systems, both kernel and user space, so all I
can hope for with this discussion is to raise awareness of the need
so that it's recognised and thought about, and approaches to improving
it can be worked on.

So, while the questions you ask are valid and your concerns real,
it's unrealistic to think there's a simple solution that can be
implemented in short order. Problem awareness is all that can be done
now so that fundamental, and probably widespread, improvements can
be implemented over time.

But if I misunderstand your thinking on this please elaborate further.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-20 Thread Ian Kent
On Fri, 2020-06-19 at 11:38 -0400, Tejun Heo wrote:
> Hello, Ian.
> 
> On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> > The series here tries to reduce the locking needed during path
> > walks
> > based on the assumption that there are many path walks with a
> > fairly
> > large portion of those for non-existent paths, as described above.
> > 
> > That was done by adding kernfs negative dentry caching (non-
> > existent
> > paths) to avoid continual alloc/free cycle of dentries and a
> > read/write
> > semaphore introduced to increase kernfs concurrency during path
> > walks.
> > 
> > With these changes we still need kernel parameters of
> > udev.children-max=2048
> > and systemd.default_timeout_start_sec=300 for the fastest boot
> > times of
> > under 5 minutes.
> 
> I don't have strong objections to the series but the rationales don't
> seem
> particularly strong. It's solving a suspected problem but only half
> way. It
> isn't clear whether this can be the long term solution for the
> problem
> machine and whether it will benefit anyone else in a meaningful way
> either.
> 
> I think Greg already asked this but how are the 100,000+ memory
> objects
> used? Is that justified in the first place?

The problem is real enough, however, whether improvements can be made
in other areas flowing on from the arch-specific device creation
notifications is not clear cut.

There's no question that there is very high contention between the
VFS and sysfs and that's all the series is trying to improve, nothing
more.

What both you and Greg have raised are good questions but are
unfortunately very difficult to answer.

I tried to add some discussion about it, to the extent that I could,
in the cover letter.

Basically the division of memory into 256M chunks is something that's
needed to provide flexibility for arbitrary partition creation (a set
of hardware allocations that's used for, essentially, a bare metal OS
install). Whether that's many small partitions for load-balanced server
farms (or whatever) or much larger partitions for demanding
applications, such as Oracle systems, is not something that can be
known in advance.

So the division into small memory chunks can't change.

The question of sysfs node creation, what uses them and when they
are used is much harder.

I'm not able to find that out and I doubt even IBM would know whether
their customers use applications that need to consult the sysfs
file system for this information, or when it's needed, if it's needed
at all. So I'm stuck on this question.

One thing is for sure though, it would be (at the very least) risky
for a vendor to assume they either aren't needed or aren't needed early
during system start up.

OTOH I've looked at what gets invoked on udev notifications (which
is the source of the heavy path walk activity, I admit I need to
dig deeper) and that doesn't appear to be doing anything obviously
wrong so that far seems ok.

For my part, as long as the series proves to be sound, why not?
It does substantially reduce contention between the VFS and sysfs
in the face of heavy sysfs path walk activity, so I think that
stands alone as sufficient to consider the change worthwhile.

Ian



[PATCH v2 4/6] kernfs: use revision to identify directory node changes

2020-06-17 Thread Ian Kent
If a kernfs directory node hasn't changed there's no need to search for
an added (or removed) child dentry.

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   17 +++--
 fs/kernfs/kernfs-internal.h |   24 
 include/linux/kernfs.h  |5 +
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f4943329e578..03f4f179bbc4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -383,6 +383,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
/* successfully added, account subdir number */
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs++;
+   kernfs_inc_rev(kn->parent);
 
return 0;
 }
@@ -405,6 +406,7 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
 
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs--;
+   kernfs_inc_rev(kn->parent);
 
	rb_erase(&kn->rb, &kn->parent->dir.children);
	RB_CLEAR_NODE(&kn->rb);
@@ -1044,9 +1046,16 @@ struct kernfs_node *kernfs_create_empty_dir(struct 
kernfs_node *parent,
 
 static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 {
+   struct kernfs_node *parent;
struct kernfs_node *kn;
 
if (flags & LOOKUP_RCU) {
+   /* Directory node changed? */
+   parent = kernfs_dentry_node(dentry->d_parent);
+
+   if (!kernfs_dir_changed(parent, dentry))
+   return 1;
+
kn = kernfs_dentry_node(dentry);
if (!kn) {
/* Negative hashed dentry, tell the VFS to switch to
@@ -1093,8 +1102,6 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
 
kn = kernfs_dentry_node(dentry);
if (!kn) {
-   struct kernfs_node *parent;
-
/* If the kernfs node can be found this is a stale negative
 * hashed dentry so it must be discarded and the lookup redone.
 */
@@ -1102,6 +1109,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
if (parent) {
const void *ns = NULL;
 
+   /* Directory node changed? */
+   if (kernfs_dir_changed(parent, dentry))
+   goto out_bad;
+
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_parent->d_sb)->ns;
kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
@@ -1156,6 +1167,8 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 
	down_read(&kernfs_rwsem);
 
+   kernfs_set_rev(dentry, parent);
+
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 097c1a989aa4..a7b0e2074260 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -82,6 +82,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct 
dentry *dentry)
return d_inode(dentry)->i_private;
 }
 
+static inline void kernfs_set_rev(struct dentry *dentry,
+ struct kernfs_node *kn)
+{
+   dentry->d_time = kn->dir.rev;
+}
+
+static inline void kernfs_inc_rev(struct kernfs_node *kn)
+{
+   if (kernfs_type(kn) == KERNFS_DIR) {
+   if (!++kn->dir.rev)
+   kn->dir.rev++;
+   }
+}
+
+static inline bool kernfs_dir_changed(struct kernfs_node *kn,
+ struct dentry *dentry)
+{
+   if (kernfs_type(kn) == KERNFS_DIR) {
+   if (kn->dir.rev != dentry->d_time)
+   return true;
+   }
+   return false;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 89f6a4214a70..74727d98e380 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -98,6 +98,11 @@ struct kernfs_elem_dir {
 * better directly in kernfs_node but is here to save space.
 */
struct kernfs_root  *root;
+   /*
+* Monotonic revision counter, used to identify if a directory
+* node has changed during revalidation.
+*/
+   unsigned long rev;
 };
 
 struct kernfs_elem_symlink {




[PATCH v2 2/6] kernfs: move revalidate to be near lookup

2020-06-17 Thread Ian Kent
While the dentry operation kernfs_dop_revalidate() is grouped with
dentry-ish functions, it also has a strong affinity to the inode
operation ->lookup(). And when the path walk improvements are applied
it will need to call kernfs_find_ns(), so move it to be near
kernfs_iop_lookup() to avoid the need for a forward declaration.

There's no functional change from this patch.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   86 ---
 1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index d8213fc65eba..9b315f3b20ee 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -559,49 +559,6 @@ void kernfs_put(struct kernfs_node *kn)
 }
 EXPORT_SYMBOL_GPL(kernfs_put);
 
-static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
-{
-   struct kernfs_node *kn;
-
-   if (flags & LOOKUP_RCU)
-   return -ECHILD;
-
-   /* Always perform fresh lookup for negatives */
-   if (d_really_is_negative(dentry))
-   goto out_bad_unlocked;
-
-   kn = kernfs_dentry_node(dentry);
	down_read(&kernfs_rwsem);
-
-   /* The kernfs node has been deactivated */
-   if (!kernfs_active_read(kn))
-   goto out_bad;
-
-   /* The kernfs node has been moved? */
-   if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
-   goto out_bad;
-
-   /* The kernfs node has been renamed */
-   if (strcmp(dentry->d_name.name, kn->name) != 0)
-   goto out_bad;
-
-   /* The kernfs node has been moved to a different namespace */
-   if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-   kernfs_info(dentry->d_sb)->ns != kn->ns)
-   goto out_bad;
-
	up_read(&kernfs_rwsem);
-   return 1;
-out_bad:
	up_read(&kernfs_rwsem);
-out_bad_unlocked:
-   return 0;
-}
-
-const struct dentry_operations kernfs_dops = {
-   .d_revalidate   = kernfs_dop_revalidate,
-};
-
 /**
  * kernfs_node_from_dentry - determine kernfs_node associated with a dentry
  * @dentry: the dentry in question
@@ -1085,6 +1042,49 @@ struct kernfs_node *kernfs_create_empty_dir(struct 
kernfs_node *parent,
return ERR_PTR(rc);
 }
 
+static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct kernfs_node *kn;
+
+   if (flags & LOOKUP_RCU)
+   return -ECHILD;
+
+   /* Always perform fresh lookup for negatives */
+   if (d_really_is_negative(dentry))
+   goto out_bad_unlocked;
+
+   kn = kernfs_dentry_node(dentry);
+   down_read(&kernfs_rwsem);
+
+   /* The kernfs node has been deactivated */
+   if (!kernfs_active_read(kn))
+   goto out_bad;
+
+   /* The kernfs node has been moved? */
+   if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+   goto out_bad;
+
+   /* The kernfs node has been renamed */
+   if (strcmp(dentry->d_name.name, kn->name) != 0)
+   goto out_bad;
+
+   /* The kernfs node has been moved to a different namespace */
+   if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+   kernfs_info(dentry->d_sb)->ns != kn->ns)
+   goto out_bad;
+
+   up_read(&kernfs_rwsem);
+   return 1;
+out_bad:
+   up_read(&kernfs_rwsem);
+out_bad_unlocked:
+   return 0;
+}
+
+const struct dentry_operations kernfs_dops = {
+   .d_revalidate   = kernfs_dop_revalidate,
+};
+
 static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)




[PATCH v2 6/6] kernfs: make attr_mutex a local kernfs node lock

2020-06-17 Thread Ian Kent
The global mutex attr_mutex is used to protect the update of inode
attributes in kernfs_refresh_inode() (as well as kernfs node attribute
structure creation) and this function is called by the inode operation
.permission().

Since .permission() is called quite frequently during path walks it
can lead to contention when the number of concurrent path walks is
high.

This mutex is used for kernfs node objects only so make it local to
the kernfs node to reduce the impact of this type of contention.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c|1 +
 fs/kernfs/inode.c  |   12 ++--
 include/linux/kernfs.h |2 ++
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 03f4f179bbc4..3233e01651e4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -597,6 +597,7 @@ static struct kernfs_node *__kernfs_new_node(struct 
kernfs_root *root,
kn = kmem_cache_zalloc(kernfs_node_cache, GFP_KERNEL);
if (!kn)
goto err_out1;
+   mutex_init(&kn->attr_mutex);
 
idr_preload(GFP_KERNEL);
spin_lock(_idr_lock);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 5c3fac356ce0..5eb11094bb2e 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -36,7 +36,7 @@ static struct kernfs_iattrs *__kernfs_iattrs(struct 
kernfs_node *kn, int alloc)
 {
struct kernfs_iattrs *iattr = NULL;
 
-   mutex_lock(&attr_mutex);
+   mutex_lock(&kn->attr_mutex);
if (kn->iattr || !alloc) {
iattr = kn->iattr;
goto out_unlock;
@@ -59,7 +59,7 @@ static struct kernfs_iattrs *__kernfs_iattrs(struct 
kernfs_node *kn, int alloc)
	atomic_set(&iattr->user_xattr_size, 0);
kn->iattr = iattr;
 out_unlock:
-   mutex_unlock(&attr_mutex);
+   mutex_unlock(&kn->attr_mutex);
return iattr;
 }
 
@@ -192,9 +192,9 @@ int kernfs_iop_getattr(const struct path *path, struct 
kstat *stat,
struct kernfs_node *kn = inode->i_private;
 
	down_read(&kernfs_rwsem);
-   mutex_lock(&attr_mutex);
+   mutex_lock(&kn->attr_mutex);
	kernfs_refresh_inode(kn, inode);
-   mutex_unlock(&attr_mutex);
+   mutex_unlock(&kn->attr_mutex);
	up_read(&kernfs_rwsem);
 
generic_fillattr(inode, stat);
@@ -286,9 +286,9 @@ int kernfs_iop_permission(struct inode *inode, int mask)
kn = inode->i_private;
 
	down_read(&kernfs_rwsem);
-   mutex_lock(&attr_mutex);
+   mutex_lock(&kn->attr_mutex);
	kernfs_refresh_inode(kn, inode);
-   mutex_unlock(&attr_mutex);
+   mutex_unlock(&kn->attr_mutex);
	up_read(&kernfs_rwsem);
 
return generic_permission(inode, mask);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 74727d98e380..8669f65d5a39 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -142,6 +142,8 @@ struct kernfs_node {
 
struct rb_node  rb;
 
+   struct mutexattr_mutex; /* protect attr updates */
+
const void  *ns;/* namespace tag */
unsigned inthash;   /* ns + name hash */
union {




[PATCH v2 1/6] kernfs: switch kernfs to use an rwsem

2020-06-17 Thread Ian Kent
The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |  119 +++
 fs/kernfs/file.c|4 +
 fs/kernfs/inode.c   |   16 +++---
 fs/kernfs/kernfs-internal.h |5 +-
 fs/kernfs/mount.c   |   12 ++--
 fs/kernfs/symlink.c |4 +
 6 files changed, 86 insertions(+), 74 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..d8213fc65eba 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@
 
 #include "kernfs-internal.h"
 
-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
 static DEFINE_SPINLOCK(kernfs_rename_lock);/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];  /* protected by rename_lock */
 static DEFINE_SPINLOCK(kernfs_idr_lock);   /* root->ino_idr */
@@ -26,10 +26,21 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);/* 
root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-   lockdep_assert_held(&kernfs_mutex);
	return atomic_read(&kn->active) >= 0;
 }
 
+static bool kernfs_active_write(struct kernfs_node *kn)
+{
+   lockdep_assert_held_write(&kernfs_rwsem);
+   return kernfs_active(kn);
+}
+
+static bool kernfs_active_read(struct kernfs_node *kn)
+{
+   lockdep_assert_held_read(&kernfs_rwsem);
+   return kernfs_active(kn);
+}
+
 static bool kernfs_lockdep(struct kernfs_node *kn)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -340,7 +351,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
  * @kn->parent->dir.children.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem write lock
  *
  * RETURNS:
  * 0 on susccess -EEXIST on failure.
@@ -385,7 +396,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
  * removed, %false if @kn wasn't on the rbtree.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem write lock
  */
 static bool kernfs_unlink_sibling(struct kernfs_node *kn)
 {
@@ -455,14 +466,14 @@ void kernfs_put_active(struct kernfs_node *kn)
  * return after draining is complete.
  */
 static void kernfs_drain(struct kernfs_node *kn)
-   __releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+   __releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
 {
struct kernfs_root *root = kernfs_root(kn);
 
-   lockdep_assert_held(&kernfs_mutex);
+   lockdep_assert_held_write(&kernfs_rwsem);
WARN_ON_ONCE(kernfs_active(kn));
 
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
 
if (kernfs_lockdep(kn)) {
	rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -481,7 +492,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
kernfs_drain_open_files(kn);
 
-   mutex_lock(&kernfs_mutex);
+   down_write(&kernfs_rwsem);
 }
 
 /**
@@ -560,10 +571,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
goto out_bad_unlocked;
 
kn = kernfs_dentry_node(dentry);
-   mutex_lock(&kernfs_mutex);
+   down_read(&kernfs_rwsem);
 
/* The kernfs node has been deactivated */
-   if (!kernfs_active(kn))
+   if (!kernfs_active_read(kn))
goto out_bad;
 
/* The kernfs node has been moved? */
@@ -579,10 +590,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
 
-   mutex_unlock(&kernfs_mutex);
+   up_read(&kernfs_rwsem);
	return 1;
 out_bad:
-   mutex_unlock(&kernfs_mutex);
+   up_read(&kernfs_rwsem);
 out_bad_unlocked:
return 0;
 }
@@ -764,7 +775,7 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;
 
-   mutex_lock(&kernfs_mutex);
+   down_write(&kernfs_rwsem);
 
ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -779,7 +790,7 @@ int kernfs_add_one(struct kernfs_node *kn)
if (parent->flags & KERNFS_EMPTY_DIR)
goto out_unlock;
 
-   if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
+   if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active_write(parent))
goto out_unlock;
 
kn->hash = kernfs_name_hash(kn->name, kn->ns);
@@ -795,7 +806,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
 
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
 
/*
 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -809,7 +820,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;
 
 out_unlock:
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
return ret;
 }
 
@@ -830,7 +841,7 @@ static struct kernfs_node *kernfs_find_ns(struct 
kernfs_node *parent,
   

[PATCH v2 3/6] kernfs: improve kernfs path resolution

2020-06-17 Thread Ian Kent
Now that an rwsem is used by kernfs, take advantage of it to reduce
lookup overhead.

If there are many lookups (possibly many negative ones) there can
be a lot of overhead during path walks.

To reduce lookup overhead avoid allocating a new dentry where possible.

To do this stay in rcu-walk mode where possible and use the dentry cache
handling of negative hashed dentries to avoid allocating (and freeing
shortly after) new dentries on every negative lookup.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   87 ++-
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9b315f3b20ee..f4943329e578 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1046,15 +1046,75 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
 {
struct kernfs_node *kn;
 
-   if (flags & LOOKUP_RCU)
+   if (flags & LOOKUP_RCU) {
+   kn = kernfs_dentry_node(dentry);
+   if (!kn) {
+   /* Negative hashed dentry, tell the VFS to switch to
+* ref-walk mode and call us again so that node
+* existence can be checked.
+*/
+   if (!d_unhashed(dentry))
+   return -ECHILD;
+
+   /* Negative unhashed dentry, this shouldn't happen
+* because this case occurs in rcu-walk mode after
+* dentry allocation which is followed by a call
+* to ->lookup(). But if it does happen the dentry
+* is surely invalid.
+*/
+   return 0;
+   }
+
+   /* Since the dentry is positive (we got the kernfs node) a
+* kernfs node reference was held at the time. Now if the
+* dentry reference count is still greater than 0 it's still
+* positive so take a reference to the node to perform an
+* active check.
+*/
+   if (d_count(dentry) <= 0 || !atomic_inc_not_zero(&kn->count))
+   return -ECHILD;
+
+   /* The kernfs node reference count was greater than 0, if
+* it's active continue in rcu-walk mode.
+*/
+   if (kernfs_active_read(kn)) {
+   kernfs_put(kn);
+   return 1;
+   }
+
+   /* Otherwise, just tell the VFS to switch to ref-walk mode
+* and call us again so the kernfs node can be validated.
+*/
+   kernfs_put(kn);
return -ECHILD;
+   }
 
-   /* Always perform fresh lookup for negatives */
-   if (d_really_is_negative(dentry))
-   goto out_bad_unlocked;
+   down_read(&kernfs_rwsem);
 
kn = kernfs_dentry_node(dentry);
-   down_read(&kernfs_rwsem);
+   if (!kn) {
+   struct kernfs_node *parent;
+
+   /* If the kernfs node can be found this is a stale negative
+* hashed dentry so it must be discarded and the lookup redone.
+*/
+   parent = kernfs_dentry_node(dentry->d_parent);
+   if (parent) {
+   const void *ns = NULL;
+
+   if (kernfs_ns_enabled(parent))
+   ns = kernfs_info(dentry->d_parent->d_sb)->ns;
+   kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+   if (kn)
+   goto out_bad;
+   }
+
+   /* The kernfs node doesn't exist, leave the dentry negative
+* and return success.
+*/
+   goto out;
+   }
+
 
/* The kernfs node has been deactivated */
if (!kernfs_active_read(kn))
@@ -1072,12 +1132,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
if (kn->parent && kernfs_ns_enabled(kn->parent) &&
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
-
+out:
	up_read(&kernfs_rwsem);
return 1;
 out_bad:
	up_read(&kernfs_rwsem);
-out_bad_unlocked:
return 0;
 }
 
@@ -1092,7 +1151,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *ret;
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
-   struct inode *inode;
+   struct inode *inode = NULL;
const void *ns = NULL;
 
	down_read(&kernfs_rwsem);
@@ -1102,11 +1161,9 @@ static struct dentry *kernfs_iop_lookup(struct inode 
*dir,
 
kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
 
-   /* no such entry */
-   if (!kn || !kernfs_active(kn)) {
-   ret = NULL;
- 

[PATCH v2 5/6] kernfs: refactor attr locking

2020-06-17 Thread Ian Kent
The inode operations .permission() and .getattr() use the kernfs node
write lock but all that's needed is to keep the rb tree stable while
copying the node attributes. And .permission() is called frequently
during path walks so it can cause quite a bit of contention between
kernfs node operations and path walks when the number of concurrent
walks is high.

Ideally the inode mutex would protect the inode update but .permission()
may be called both with and without holding the inode mutex so there's no
way for kernfs .permission() to know if it is the holder of the mutex
which means it could be released during the update.

So refactor __kernfs_iattrs() by moving the static mutex declaration out
of the function and changing the function itself a little. And also use
the mutex to protect the inode attribute fields updated by .permission()
and .getattr() calls to kernfs_refresh_inode().

Using the attr mutex to protect two different things, the node
attributes as well as the copy of them to the inode is not ideal. But
the only other choice is to use two locks which seems like excessive
overhead when the attr mutex is so closely related to the inode fields
it's protecting.

Signed-off-by: Ian Kent 
---
 fs/kernfs/inode.c |   50 --
 1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 23a7996d06a9..5c3fac356ce0 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -17,6 +17,8 @@
 
 #include "kernfs-internal.h"
 
+static DEFINE_MUTEX(attr_mutex);
+
 static const struct address_space_operations kernfs_aops = {
.readpage   = simple_readpage,
.write_begin= simple_write_begin,
@@ -32,33 +34,33 @@ static const struct inode_operations kernfs_iops = {
 
 static struct kernfs_iattrs *__kernfs_iattrs(struct kernfs_node *kn, int alloc)
 {
-   static DEFINE_MUTEX(iattr_mutex);
-   struct kernfs_iattrs *ret;
-
-   mutex_lock(&iattr_mutex);
+   struct kernfs_iattrs *iattr = NULL;
 
-   if (kn->iattr || !alloc)
+   mutex_lock(&attr_mutex);
+   if (kn->iattr || !alloc) {
+   iattr = kn->iattr;
goto out_unlock;
+   }
 
-   kn->iattr = kmem_cache_zalloc(kernfs_iattrs_cache, GFP_KERNEL);
-   if (!kn->iattr)
+   iattr = kmem_cache_zalloc(kernfs_iattrs_cache, GFP_KERNEL);
+   if (!iattr)
goto out_unlock;
 
/* assign default attributes */
-   kn->iattr->ia_uid = GLOBAL_ROOT_UID;
-   kn->iattr->ia_gid = GLOBAL_ROOT_GID;
+   iattr->ia_uid = GLOBAL_ROOT_UID;
+   iattr->ia_gid = GLOBAL_ROOT_GID;
 
-   ktime_get_real_ts64(&kn->iattr->ia_atime);
-   kn->iattr->ia_mtime = kn->iattr->ia_atime;
-   kn->iattr->ia_ctime = kn->iattr->ia_atime;
+   ktime_get_real_ts64(&iattr->ia_atime);
+   iattr->ia_mtime = iattr->ia_atime;
+   iattr->ia_ctime = iattr->ia_atime;
 
-   simple_xattrs_init(&kn->iattr->xattrs);
-   atomic_set(&kn->iattr->nr_user_xattrs, 0);
-   atomic_set(&kn->iattr->user_xattr_size, 0);
+   simple_xattrs_init(&iattr->xattrs);
+   atomic_set(&iattr->nr_user_xattrs, 0);
+   atomic_set(&iattr->user_xattr_size, 0);
+   kn->iattr = iattr;
 out_unlock:
-   ret = kn->iattr;
-   mutex_unlock(&iattr_mutex);
-   return ret;
+   mutex_unlock(&attr_mutex);
+   return iattr;
 }
 
 static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn)
@@ -189,9 +191,11 @@ int kernfs_iop_getattr(const struct path *path, struct 
kstat *stat,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
 
-   down_write(&kernfs_rwsem);
+   down_read(&kernfs_rwsem);
+   mutex_lock(&attr_mutex);
	kernfs_refresh_inode(kn, inode);
-   up_write(&kernfs_rwsem);
+   mutex_unlock(&attr_mutex);
+   up_read(&kernfs_rwsem);
 
generic_fillattr(inode, stat);
return 0;
@@ -281,9 +285,11 @@ int kernfs_iop_permission(struct inode *inode, int mask)
 
kn = inode->i_private;
 
-   down_write(&kernfs_rwsem);
+   down_read(&kernfs_rwsem);
+   mutex_lock(&attr_mutex);
	kernfs_refresh_inode(kn, inode);
-   up_write(&kernfs_rwsem);
+   mutex_unlock(&attr_mutex);
+   up_read(&kernfs_rwsem);
 
return generic_permission(inode, mask);
 }




Re: [PATCH v2 1/6] kernfs: switch kernfs to use an rwsem

2020-06-17 Thread Ian Kent
On Wed, 2020-06-17 at 15:30 +0800, Ian Kent wrote:
> The kernfs global lock restricts the ability to perform kernfs node
> lookup operations in parallel.
> 
> Change the kernfs mutex to an rwsem so that, when opportunity arises,
> node searches can be done in parallel.

Please ignore, false start with a missing option!

> 
> Signed-off-by: Ian Kent 
> ---
>  fs/kernfs/dir.c |  119 +++
> 
>  fs/kernfs/file.c|4 +
>  fs/kernfs/inode.c   |   16 +++---
>  fs/kernfs/kernfs-internal.h |5 +-
>  fs/kernfs/mount.c   |   12 ++--
>  fs/kernfs/symlink.c |4 +
>  6 files changed, 86 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 9aec80b9d7c6..d8213fc65eba 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -17,7 +17,7 @@
>  
>  #include "kernfs-internal.h"
>  
> -DEFINE_MUTEX(kernfs_mutex);
> +DECLARE_RWSEM(kernfs_rwsem);
>  static DEFINE_SPINLOCK(kernfs_rename_lock);  /* kn->parent and
> ->name */
>  static char kernfs_pr_cont_buf[PATH_MAX];/* protected by
> rename_lock */
>  static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
> @@ -26,10 +26,21 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);  /*
> root->ino_idr */
>  
>  static bool kernfs_active(struct kernfs_node *kn)
>  {
> - lockdep_assert_held(&kernfs_mutex);
>   return atomic_read(&kn->active) >= 0;
>  }
>  
> +static bool kernfs_active_write(struct kernfs_node *kn)
> +{
> + lockdep_assert_held_write(&kernfs_rwsem);
> + return kernfs_active(kn);
> +}
> +
> +static bool kernfs_active_read(struct kernfs_node *kn)
> +{
> + lockdep_assert_held_read(&kernfs_rwsem);
> + return kernfs_active(kn);
> +}
> +
>  static bool kernfs_lockdep(struct kernfs_node *kn)
>  {
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> @@ -340,7 +351,7 @@ static int kernfs_sd_compare(const struct
> kernfs_node *left,
>   *   @kn->parent->dir.children.
>   *
>   *   Locking:
> - *   mutex_lock(kernfs_mutex)
> + *   kernfs_rwsem write lock
>   *
>   *   RETURNS:
> *   0 on success, -EEXIST on failure.
> @@ -385,7 +396,7 @@ static int kernfs_link_sibling(struct kernfs_node
> *kn)
>   *   removed, %false if @kn wasn't on the rbtree.
>   *
>   *   Locking:
> - *   mutex_lock(kernfs_mutex)
> + *   kernfs_rwsem write lock
>   */
>  static bool kernfs_unlink_sibling(struct kernfs_node *kn)
>  {
> @@ -455,14 +466,14 @@ void kernfs_put_active(struct kernfs_node *kn)
>   * return after draining is complete.
>   */
>  static void kernfs_drain(struct kernfs_node *kn)
> - __releases(&kernfs_mutex) __acquires(&kernfs_mutex)
> + __releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
>  {
>   struct kernfs_root *root = kernfs_root(kn);
>  
> - lockdep_assert_held(&kernfs_mutex);
> + lockdep_assert_held_write(&kernfs_rwsem);
>   WARN_ON_ONCE(kernfs_active(kn));
>  
> - mutex_unlock(&kernfs_mutex);
> + up_write(&kernfs_rwsem);
>  
>   if (kernfs_lockdep(kn)) {
>   rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
> @@ -481,7 +492,7 @@ static void kernfs_drain(struct kernfs_node *kn)
>  
>   kernfs_drain_open_files(kn);
>  
> - mutex_lock(&kernfs_mutex);
> + down_write(&kernfs_rwsem);
>  }
>  
>  /**
> @@ -560,10 +571,10 @@ static int kernfs_dop_revalidate(struct dentry
> *dentry, unsigned int flags)
>   goto out_bad_unlocked;
>  
>   kn = kernfs_dentry_node(dentry);
> - mutex_lock(&kernfs_mutex);
> + down_read(&kernfs_rwsem);
>  
>   /* The kernfs node has been deactivated */
> - if (!kernfs_active(kn))
> + if (!kernfs_active_read(kn))
>   goto out_bad;
>  
>   /* The kernfs node has been moved? */
> @@ -579,10 +590,10 @@ static int kernfs_dop_revalidate(struct dentry
> *dentry, unsigned int flags)
>   kernfs_info(dentry->d_sb)->ns != kn->ns)
>   goto out_bad;
>  
> - mutex_unlock(&kernfs_mutex);
> + up_read(&kernfs_rwsem);
>   return 1;
>  out_bad:
> - mutex_unlock(&kernfs_mutex);
> + up_read(&kernfs_rwsem);
>  out_bad_unlocked:
>   return 0;
>  }
> @@ -764,7 +775,7 @@ int kernfs_add_one(struct kernfs_node *kn)
>   bool has_ns;
>   int ret;
>  
> - mutex_lock(&kernfs_mutex);
> + down_write(&kernfs_rwsem);
>  
>   ret = -EINVAL;
>   has_ns = kernfs_ns_enabled(parent);
> @@ -779,7 +790,7 @@ int kernfs_add_one(struct kernfs_node *kn)
>   if (parent->flags & KERNFS_EMPTY_DIR)
>   goto out_unlock;
>  
> - if ((parent->flags & KERNFS_ACTIVATED) &&
> !kernfs_active(parent))
> + if ((parent->flag

[PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-17 Thread Ian Kent
For very large IBM Power mainframe systems with hundreds of CPUs and TBs
of RAM booting can take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require kernel
parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600
to prevent dropping into emergency mode.

Gathering information about what's happening during the boot is a bit
challenging but two main issues appeared to be: a large number of path
lookups for non-existent files, and very high lock contention in the VFS
during path walks particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration as the hardware
divides the memory into 256MB logical blocks. This is believed to be due
to either IBM Power hardware design or a requirement of the mainframe
software used to create logical partitions (LPARs, that are used to
install an operating system to provide services), since these can be made
up of a wide range of resources, CPU, Memory, disks, etc.

It's unclear yet whether the creation of sysfs nodes for these memory
devices can be postponed or spread out over a larger amount of time.
That's because the high overhead looks to be due to notifications received
by udev which invokes a systemd program for them and attempts by systemd
folks to improve this have not focused on changing the handling of these
notifications, possibly because of difficulties with doing so. This
remains an avenue of investigation.

Kernel traces show there are many path walks with a fairly large portion
of those for non-existent paths. However, looking at the systemd code
invoked by the udev action it appears there's only one additional lookup
for each invocation so the large number of negative lookups is most likely
due to the large number of notifications rather than a fault with the
systemd program.

The series here tries to reduce the locking needed during path walks
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (non-existent
paths) to avoid continual alloc/free cycle of dentries and a read/write
semaphore introduced to increase kernfs concurrency during path walks.

With these changes we still need kernel parameters of udev.children-max=2048
and systemd.default_timeout_start_sec=300 for the fastest boot times of
under 5 minutes.

There may be opportunities for further improvements but the series here
has seen a fair amount of testing and thinking about what else these could
be. Discussing it with Rick Lindsay, I suspect improvements will get more
difficult to implement for somewhat less improvement so I think what we
have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (6):
  kernfs: switch kernfs to use an rwsem
  kernfs: move revalidate to be near lookup
  kernfs: improve kernfs path resolution
  kernfs: use revision to identify directory node changes
  kernfs: refactor attr locking
  kernfs: make attr_mutex a local kernfs node lock


 fs/kernfs/dir.c |  284 ---
 fs/kernfs/file.c|4 -
 fs/kernfs/inode.c   |   58 +
 fs/kernfs/kernfs-internal.h |   29 
 fs/kernfs/mount.c   |   12 +-
 fs/kernfs/symlink.c |4 -
 include/linux/kernfs.h  |7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian



[PATCH v2 1/6] kernfs: switch kernfs to use an rwsem

2020-06-17 Thread Ian Kent
The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |  119 +++
 fs/kernfs/file.c|4 +
 fs/kernfs/inode.c   |   16 +++---
 fs/kernfs/kernfs-internal.h |5 +-
 fs/kernfs/mount.c   |   12 ++--
 fs/kernfs/symlink.c |4 +
 6 files changed, 86 insertions(+), 74 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..d8213fc65eba 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@
 
 #include "kernfs-internal.h"
 
-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
 static DEFINE_SPINLOCK(kernfs_rename_lock);/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];  /* protected by rename_lock */
 static DEFINE_SPINLOCK(kernfs_idr_lock);   /* root->ino_idr */
@@ -26,10 +26,21 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);/* 
root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-   lockdep_assert_held(&kernfs_mutex);
	return atomic_read(&kn->active) >= 0;
 }
 
+static bool kernfs_active_write(struct kernfs_node *kn)
+{
+   lockdep_assert_held_write(&kernfs_rwsem);
+   return kernfs_active(kn);
+}
+
+static bool kernfs_active_read(struct kernfs_node *kn)
+{
+   lockdep_assert_held_read(&kernfs_rwsem);
+   return kernfs_active(kn);
+}
+
 static bool kernfs_lockdep(struct kernfs_node *kn)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -340,7 +351,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
  * @kn->parent->dir.children.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem write lock
  *
  * RETURNS:
 * 0 on success, -EEXIST on failure.
@@ -385,7 +396,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
  * removed, %false if @kn wasn't on the rbtree.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem write lock
  */
 static bool kernfs_unlink_sibling(struct kernfs_node *kn)
 {
@@ -455,14 +466,14 @@ void kernfs_put_active(struct kernfs_node *kn)
  * return after draining is complete.
  */
 static void kernfs_drain(struct kernfs_node *kn)
-   __releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+   __releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
 {
struct kernfs_root *root = kernfs_root(kn);
 
-   lockdep_assert_held(&kernfs_mutex);
+   lockdep_assert_held_write(&kernfs_rwsem);
	WARN_ON_ONCE(kernfs_active(kn));
 
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
 
if (kernfs_lockdep(kn)) {
		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -481,7 +492,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
kernfs_drain_open_files(kn);
 
-   mutex_lock(&kernfs_mutex);
+   down_write(&kernfs_rwsem);
 }
 
 /**
@@ -560,10 +571,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
goto out_bad_unlocked;
 
kn = kernfs_dentry_node(dentry);
-   mutex_lock(&kernfs_mutex);
+   down_read(&kernfs_rwsem);
 
/* The kernfs node has been deactivated */
-   if (!kernfs_active(kn))
+   if (!kernfs_active_read(kn))
goto out_bad;
 
/* The kernfs node has been moved? */
@@ -579,10 +590,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
 
-   mutex_unlock(&kernfs_mutex);
+   up_read(&kernfs_rwsem);
	return 1;
 out_bad:
-   mutex_unlock(&kernfs_mutex);
+   up_read(&kernfs_rwsem);
 out_bad_unlocked:
return 0;
 }
@@ -764,7 +775,7 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;
 
-   mutex_lock(&kernfs_mutex);
+   down_write(&kernfs_rwsem);
 
ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -779,7 +790,7 @@ int kernfs_add_one(struct kernfs_node *kn)
if (parent->flags & KERNFS_EMPTY_DIR)
goto out_unlock;
 
-   if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
+   if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active_write(parent))
goto out_unlock;
 
kn->hash = kernfs_name_hash(kn->name, kn->ns);
@@ -795,7 +806,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
 
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
 
/*
 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -809,7 +820,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;
 
 out_unlock:
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
return ret;
 }
 
@@ -830,7 +841,7 @@ static struct kernfs_node *kernfs_find_ns(struct 
kernfs_node *parent,
   

Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]

2020-06-13 Thread Ian Kent
On Thu, 2020-04-02 at 17:19 +0200, Miklos Szeredi wrote:
> 
> > Firstly, a watch queue needs to be created:
> > 
> > pipe2(fds, O_NOTIFICATION_PIPE);
> > ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256);
> > 
> > then a notification can be set up to report notifications via that
> > queue:
> > 
> > struct watch_notification_filter filter = {
> > .nr_filters = 1,
> > .filters = {
> > [0] = {
> > .type = WATCH_TYPE_MOUNT_NOTIFY,
> > .subtype_filter[0] = UINT_MAX,
> > },
> > },
> > };
> > ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
> > watch_mount(AT_FDCWD, "/", 0, fds[1], 0x02);
> > 
> > In this case, it would let me monitor the mount topology subtree
> > rooted at
> > "/" for events.  Mount notifications propagate up the tree towards
> > the
> > root, so a watch will catch all of the events happening in the
> > subtree
> > rooted at the watch.
> 
> Does it make sense to watch a single mount?  A set of mounts?   A
> subtree with an exclusion list (subtrees, types, ???)?

Yes, filtering, perhaps, I'm not sure a single mount is useful
as changes generally need to be monitored for a set of mounts.

Monitoring a subtree is obviously possible because the monitor
path doesn't need to be "/".

Or am I misunderstanding what you're trying to get at?

The notion of filtering types and other things is interesting
but what I've seen that doesn't fit in the current implementation
so far probably isn't appropriate for kernel implementation.

There's a special case of acquiring a list of mounts where the
path is not a mount point itself but you need all mounts below
that path prefix.

In this case you get all mounts, including the mounts of the mount
containing the path, so you still need to traverse the list to match
the prefix and that can easily mean the whole list of mounts in the
system.

Point is it leads to multiple traversals of a larger than needed list
of mounts, one to get the list of mounts to check, and one to filter
on the prefix.

I've seen this use case with fsinfo() and that's where it's needed
although it may be useful to carry it through to notifications as
well.

While this sounds like it isn't such a big deal it can sometimes
make a considerable difference to the number of mounts you need
to traverse when there are a large number of mounts in the system.

I didn't consider it appropriate for kernel implementation but
since you asked here it is. OTOH were checking for connectedness
in fsinfo() anyway so maybe this is something that could be done
without undue overhead.

But that's all I've seen so far.

Ian



Re: [kernfs] ea7c5fc39a: stress-ng.stream.ops_per_sec 11827.2% improvement

2020-06-10 Thread Ian Kent
On Thu, 2020-06-11 at 10:06 +0800, kernel test robot wrote:
> On Sun, Jun 07, 2020 at 09:13:08AM +0800, Ian Kent wrote:
> > On Sat, 2020-06-06 at 20:18 +0200, Greg Kroah-Hartman wrote:
> > > On Sat, Jun 06, 2020 at 11:52:16PM +0800, kernel test robot
> > > wrote:
> > > > Greeting,
> > > > 
> > > > FYI, we noticed a 11827.2% improvement of stress-
> > > > ng.stream.ops_per_sec due to commit:
> > > > 
> > > > 
> > > > commit: ea7c5fc39ab005b501e0c7666c29db36321e4f74 ("[PATCH 1/4]
> > > > kernfs: switch kernfs to use an rwsem")
> > > > url: 
> > > > https://github.com/0day-ci/linux/commits/Ian-Kent/kernfs-proposed-locking-and-concurrency-improvement/20200525-134849
> > > > 
> > > 
> > > Seriously?  That's a huge performance increase, and one that
> > > feels
> > > really odd.  Why would a stress-ng test be touching sysfs?
> > 
> > That is unusually high even if there's a lot of sysfs or kernfs
> > activity and that patch shouldn't improve VFS path walk contention
> > very much even if it is present.
> > 
> > Maybe I've missed something, and the information provided doesn't
> > seem to be quite enough to even make a start on it.
> > 
> > That's going to need some analysis which, for my part, will need to
> > wait probably until around rc1 time frame to allow me to get
> > through
> > the push down stack (reactive, postponed due to other priorities)
> > of
> > jobs I have in order to get back to the fifo queue (longer term
> > tasks,
> > of which this is one) list of jobs I need to do as well, ;)
> > 
> > Please, kernel test robot, more information about this test and
> > what
> > it's doing.
> > 
> 
> Hi Ian,
> 
> We increased the timeout of stress-ng from 1s to 32s, and there's
> only
> 3% improvement of stress-ng.stream.ops_per_sec:
> 
> fefcfc968723caf9  ea7c5fc39ab005b501e0c7666c  testcase/testparams/testbox
> ----------------  --------------------------  ---------------------------
>          %stddev      change  %stddev
>            10686          3%    11037  stress-ng/cpu-cache-performance-1HDD-100%-32s-ucode=0x52c/lkp-csl-2sp5
>            10686          3%    11037  GEO-MEAN stress-ng.stream.ops_per_sec
> 
> It seems the result of stress-ng is inaccurate if test time too
> short, we'll increase the test time to avoid unreasonable results,
> sorry for the inconvenience.

Haha, I was worried there wasn't anything that could be done to
work out what was wrong.

I had tried to reproduce it, and failed since the job file specifies
a host config that I simply don't have, and I don't get how to alter
the job to suit, or how to specify a host definition file.

I also couldn't work out what parameters where used in running the
test so I was about to ask on the lkp list after working through
this in a VM.

So your timing on looking into this is fortunate, for sure.
Thank you very much for that.

Now, Greg, there's that locking I changed around kernfs_refresh_inode()
that I need to fix which I re-considered as a result of this, so that's
a plus for the testing because it's certainly wrong.

I'll have another look at that and boot test it on a couple of systems
then post a v2 for you to consider. What I've done might offend your
sensibilities as it does mine, or perhaps not so much.

Ian



Re: [PATCH 1/4] kernfs: switch kernfs to use an rwsem

2020-06-08 Thread Ian Kent
On Sun, 2020-06-07 at 16:40 +0800, Ian Kent wrote:
> Hi Greg,
> 
> On Mon, 2020-05-25 at 13:47 +0800, Ian Kent wrote:
> > @@ -189,9 +189,9 @@ int kernfs_iop_getattr(const struct path *path,
> > struct kstat *stat,
> > struct inode *inode = d_inode(path->dentry);
> > struct kernfs_node *kn = inode->i_private;
> >  
> > -   mutex_lock(&kernfs_mutex);
> > +   down_read(&kernfs_rwsem);
> > kernfs_refresh_inode(kn, inode);
> > -   mutex_unlock(&kernfs_mutex);
> > +   up_read(&kernfs_rwsem);
> >  
> > generic_fillattr(inode, stat);
> > return 0;
> > @@ -281,9 +281,9 @@ int kernfs_iop_permission(struct inode *inode,
> > int mask)
> >  
> > kn = inode->i_private;
> >  
> > -   mutex_lock(&kernfs_mutex);
> > +   down_read(&kernfs_rwsem);
> > kernfs_refresh_inode(kn, inode);
> > -   mutex_unlock(&kernfs_mutex);
> > +   up_read(&kernfs_rwsem);
> >  
> > return generic_permission(inode, mask);
> >  }
> 
> I changed these from a write lock to a read lock late in the
> development.
> 
> But kernfs_refresh_inode() modifies the inode so I think I should
> have taken the inode lock as well as taking the read lock.
> 
> I'll look again but a second opinion (anyone) would be welcome.

I had a look at this today and came up with a couple of patches
to fix it, I don't particularly like to have to do what I did
but I don't think there's any other choice. That's because the
rb tree locking is under significant contention and changing
this back to use the write lock will adversely affect that.

But unless I can find out more about the anomalous kernel test
robot result I can't do anything!

Providing a job.yaml to reproduce it with the hardware specification
of the lkp machine it was run on and no guidelines on what that test
does and what the test needs so it can actually be reproduced isn't
that useful.

Ian



Re: [GIT PULL] General notification queue and key notifications

2020-06-07 Thread Ian Kent
On Wed, 2020-06-03 at 10:15 +0800, Ian Kent wrote:
> On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> > [[ With regard to the mount/sb notifications and fsinfo(), Karel
> > Zak
> > and
> >Ian Kent have been working on making libmount use them,
> > preparatory to
> >working on systemd:
> > 
> > https://github.com/karelzak/util-linux/commits/topic/fsinfo
> > 
> > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
> > 
> >Development has stalled briefly due to other commitments, so I'm
> > not
> >sure I can ask you to pull those parts of the series for
> > now.  Christian
> >Brauner would like to use them in lxc, but hasn't started.
> >]]
> 
> Linus,
> 
> Just so your aware of what has been done and where we are at here's
> a summary.
> 
> Karel has done quite a bit of work on libmount (at this stage it's
> getting hold of the mount information, aka. fsinfo()) and most of
> what I have done is included in that too which you can see in Karel's
> repo above). You can see a couple of bug fixes and a little bit of
> new code present in my repo which hasn't been sent over to Karel
> yet.
> 
> This infrastructure is essential before notifications work is started
> which is where we will see the most improvement.
> 
> It turns out that while systemd uses libmount it has it's own
> notifications handling sub-system as it deals with several event
> types, not just mount information, in the same area. So,
> unfortunately,
> changes will need to be made there as well as in libmount, more so
> than the trivial changes to use fsinfo() via libmount.
> 
> That's where we are at the moment and I will get back to it once
> I've dealt with a few things I postponed to work on libmount.
> 
> If you would like a more detailed account of what we have found I
> can provide that too.
> 
> Is there anything else you would like from me or Karel?

I think there's a bit more I should say about this.

One reason work hasn't progressed further on this is I spent
quite a bit of time looking at the affects of using fsinfo().

My testing was done by using a large autofs direct mount map of
2 entries which means that at autofs startup 2 autofs
mounts must be done and at autofs shutdown those 2 mounts
must be umounted. Not very scientific but something to use to
get a feel for the affect of our changes.

Initially just using fsinfo() to load all the mount entries was
done to see how that would perform. This was done in a way that
required no modifications to library user code but didn't get
much improvement.

Next loading all the mount ids (alone) for mount entry traversal
was done and the various fields retrieved on-demand (implemented
by Karel).

Loading the entire mount table and then traversing the entries
means the mount table is always possibly out of date. And loading
the ids and getting the fields on-demand might have made that
problem worse. But loading only the mount ids and using an
on-demand method to get needed fields worked surprisingly well.

The main issue is a mount going away while getting the fields.
Testing showed that simply checking the field is valid and
ignoring the entry if it isn't is enough to handle that case.

Also, a mount going away after the needed fields have been
retrieved must be handled by callers of libmount, as mounts
can just as easily go away after reading the proc-based tables.

The case of the underlying mount information changing needs to
be considered too. We will need to do better on that in the
future, but it is a problem with the proc table handling too,
and AFAIK no problems have been logged against libmount for it.

So, all in all, this approach worked pretty well as libmount
users do use the getter access methods to retrieve the mount
entry fields (which is required for the on-demand method to
work). Certainly systemd always uses them (and it looks like
udisks2 does too).

Unfortunately, using the libmount on-demand implementation
requires that library user code be modified (only a little in
the systemd case).

Testing showed a 10-15% reduction in overhead, although CPU
usage remained high.

I think processing large numbers of mounts is simply a lot
of work, and there are particular cases that will remain that
require the use of the load and traverse method. For example,
matching all mounts with a given prefix string (one of the
systemd use cases).

It's hard to get information about this but I can say that
running perf during the autofs start and stop shows the bulk
of the counter hits on the fsinfo() table construction code
so that has to be where the overhead is.

The unavoidable conclusion is that the load and traverse method
that's been imposed on us for so long (even before libmount)
for mount h

Re: [PATCH 1/4] kernfs: switch kernfs to use an rwsem

2020-06-07 Thread Ian Kent
Hi Greg,

On Mon, 2020-05-25 at 13:47 +0800, Ian Kent wrote:
> @@ -189,9 +189,9 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
>   struct inode *inode = d_inode(path->dentry);
>   struct kernfs_node *kn = inode->i_private;
>  
> - mutex_lock(&kernfs_mutex);
> + down_read(&kernfs_rwsem);
>   kernfs_refresh_inode(kn, inode);
> - mutex_unlock(&kernfs_mutex);
> + up_read(&kernfs_rwsem);
>  
>   generic_fillattr(inode, stat);
>   return 0;
> @@ -281,9 +281,9 @@ int kernfs_iop_permission(struct inode *inode, int mask)
>  
>   kn = inode->i_private;
>  
> - mutex_lock(&kernfs_mutex);
> + down_read(&kernfs_rwsem);
>   kernfs_refresh_inode(kn, inode);
> - mutex_unlock(&kernfs_mutex);
> + up_read(&kernfs_rwsem);
>  
>   return generic_permission(inode, mask);
>  }

I changed these from a write lock to a read lock late in the
development.

But kernfs_refresh_inode() modifies the inode so I think I should
have taken the inode lock as well as taking the read lock.

I'll look again but a second opinion (anyone) would be welcome.

Ian



Re: [kernfs] ea7c5fc39a: stress-ng.stream.ops_per_sec 11827.2% improvement

2020-06-06 Thread Ian Kent
On Sat, 2020-06-06 at 20:18 +0200, Greg Kroah-Hartman wrote:
> On Sat, Jun 06, 2020 at 11:52:16PM +0800, kernel test robot wrote:
> > Greeting,
> > 
> > FYI, we noticed a 11827.2% improvement of stress-
> > ng.stream.ops_per_sec due to commit:
> > 
> > 
> > commit: ea7c5fc39ab005b501e0c7666c29db36321e4f74 ("[PATCH 1/4]
> > kernfs: switch kernfs to use an rwsem")
> > url: 
> > https://github.com/0day-ci/linux/commits/Ian-Kent/kernfs-proposed-locking-and-concurrency-improvement/20200525-134849
> > 
> 
> Seriously?  That's a huge performance increase, and one that feels
> really odd.  Why would a stress-ng test be touching sysfs?

That is unusually high even if there's a lot of sysfs or kernfs
activity, and that patch shouldn't improve VFS path walk contention
very much even if it is present.

Maybe I've missed something, and the information provided doesn't
seem to be quite enough to even make a start on it.

That's going to need some analysis which, for my part, will probably
need to wait until around the rc1 time frame, to allow me to get through
the push down stack (reactive, postponed due to other priorities) of
jobs I have in order to get back to the fifo queue (longer term tasks,
of which this is one) of jobs I need to do as well ;)

Please, kernel test robot, could you provide more information about
this test and what it's doing?

Ian



Re: [GIT PULL] General notification queue and key notifications

2020-06-02 Thread Ian Kent
On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> 
> [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak
> and
>Ian Kent have been working on making libmount use them,
> preparatory to
>working on systemd:
> 
>   https://github.com/karelzak/util-linux/commits/topic/fsinfo
>   
> https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
> 
>Development has stalled briefly due to other commitments, so I'm
> not
>sure I can ask you to pull those parts of the series for
> now.  Christian
>Brauner would like to use them in lxc, but hasn't started.
>]]

Linus,

Just so you're aware of what has been done and where we're at, here's
a summary.

Karel has done quite a bit of work on libmount (at this stage it's
getting hold of the mount information, aka. fsinfo()) and most of
what I have done is included in that too (which you can see in Karel's
repo above). You can see a couple of bug fixes and a little bit of
new code present in my repo which hasn't been sent over to Karel
yet.

This infrastructure is essential before notifications work is started
which is where we will see the most improvement.

It turns out that while systemd uses libmount it has its own
notifications handling sub-system as it deals with several event
types, not just mount information, in the same area. So, unfortunately,
changes will need to be made there as well as in libmount, more so
than the trivial changes to use fsinfo() via libmount.

That's where we are at the moment and I will get back to it once
I've dealt with a few things I postponed to work on libmount.

If you would like a more detailed account of what we have found I
can provide that too.

Is there anything else you would like from me or Karel?

Ian



Re: [PATCH 0/4] kernfs: proposed locking and concurrency improvement

2020-05-25 Thread Ian Kent
On Mon, 2020-05-25 at 08:16 +0200, Greg Kroah-Hartman wrote:
> On Mon, May 25, 2020 at 01:46:59PM +0800, Ian Kent wrote:
> > For very large systems with hundreds of CPUs and TBs of RAM booting
> > can
> > take a very long time.
> > 
> > Initial reports showed that booting a configuration of several
> > hundred
> > CPUs and 64TB of RAM would take more than 30 minutes and require
> > kernel
> > parameters of udev.children-max=1024
> > systemd.default_timeout_start_sec=3600
> > to prevent dropping into emergency mode.
> > 
> > Gathering information about what's happening during the boot is a
> > bit
> > challenging. But two main issues appeared to be, a large number of
> > path
> > lookups for non-existent files, and high lock contention in the VFS
> > during
> > path walks particularly in the dentry allocation code path.
> > 
> > The underlying cause of this was believed to be the sheer number of
> > sysfs
> > memory objects, 100,000+ for a 64TB memory configuration.
> 
> Independent of your kernfs changes, why do we really need to
> represent
> all of this memory with that many different "memory objects"?  What
> is
> that providing to userspace?
> 
> I remember Ben Herrenschmidt did a lot of work on some of the kernfs
> and
> other functions to make large-memory systems boot faster to remove
> some
> of the complexity in our functions, but that too did not look into
> why
> we needed to create so many objects in the first place.
> 
> Perhaps you might want to look there instead?

I presumed it was a hardware design requirement or IBM VM design
requirement.

Perhaps Rick can find out more on that question.

Ian



[PATCH 0/4] kernfs: proposed locking and concurrency improvement

2020-05-24 Thread Ian Kent
For very large systems with hundreds of CPUs and TBs of RAM booting can
take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require kernel
parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600
to prevent dropping into emergency mode.

Gathering information about what's happening during the boot is a bit
challenging. But two main issues appeared to be a large number of path
lookups for non-existent files, and high lock contention in the VFS during
path walks, particularly in the dentry allocation code path.

The underlying cause of this was believed to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration.

This patch series tries to reduce the locking needed during path walks
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths.

This was done by adding kernfs negative dentry caching (for non-existent
paths) to avoid a continual alloc/free cycle of dentries, and by
introducing a read/write semaphore to increase kernfs concurrency
during path walks.

With these changes the kernel parameters of udev.children-max=2048 and
systemd.default_timeout_start_sec=300 are still needed to get the
fastest boot times and result in a boot time of under 5 minutes.

There may be opportunities for further improvements but the series here
has seen a fair amount of testing. And thinking about what else could be
done, and discussing it with Rick Lindsay, I suspect further improvements
will get more difficult to implement for progressively smaller gains, so
I think what we have here is a good start for now.

I think what's needed now is patch review, and if we can get through
that, send them via linux-next for broader exposure and hopefully have
them merged into mainline.
---

Ian Kent (4):
  kernfs: switch kernfs to use an rwsem
  kernfs: move revalidate to be near lookup
  kernfs: improve kernfs path resolution
  kernfs: use revision to identify directory node changes


 fs/kernfs/dir.c |  283 ---
 fs/kernfs/file.c|4 -
 fs/kernfs/inode.c   |   16 +-
 fs/kernfs/kernfs-internal.h |   29 
 fs/kernfs/mount.c   |   12 +-
 fs/kernfs/symlink.c |4 -
 include/linux/kernfs.h  |5 +
 7 files changed, 232 insertions(+), 121 deletions(-)

--
Ian



[PATCH 1/4] kernfs: switch kernfs to use an rwsem

2020-05-24 Thread Ian Kent
The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |  119 +++
 fs/kernfs/file.c|4 +
 fs/kernfs/inode.c   |   16 +++---
 fs/kernfs/kernfs-internal.h |5 +-
 fs/kernfs/mount.c   |   12 ++--
 fs/kernfs/symlink.c |4 +
 6 files changed, 86 insertions(+), 74 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..d8213fc65eba 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@
 
 #include "kernfs-internal.h"
 
-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
 static DEFINE_SPINLOCK(kernfs_rename_lock);/* kn->parent and ->name */
 static char kernfs_pr_cont_buf[PATH_MAX];  /* protected by rename_lock */
 static DEFINE_SPINLOCK(kernfs_idr_lock);   /* root->ino_idr */
@@ -26,10 +26,21 @@ static DEFINE_SPINLOCK(kernfs_idr_lock);/* root->ino_idr */
 
 static bool kernfs_active(struct kernfs_node *kn)
 {
-   lockdep_assert_held(&kernfs_mutex);
	return atomic_read(&kn->active) >= 0;
 }
 
+static bool kernfs_active_write(struct kernfs_node *kn)
+{
+   lockdep_assert_held_write(&kernfs_rwsem);
+   return kernfs_active(kn);
+}
+
+static bool kernfs_active_read(struct kernfs_node *kn)
+{
+   lockdep_assert_held_read(&kernfs_rwsem);
+   return kernfs_active(kn);
+}
+
 static bool kernfs_lockdep(struct kernfs_node *kn)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -340,7 +351,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
  * @kn->parent->dir.children.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem write lock
  *
  * RETURNS:
 * 0 on success, -EEXIST on failure.
@@ -385,7 +396,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
  * removed, %false if @kn wasn't on the rbtree.
  *
  * Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem write lock
  */
 static bool kernfs_unlink_sibling(struct kernfs_node *kn)
 {
@@ -455,14 +466,14 @@ void kernfs_put_active(struct kernfs_node *kn)
  * return after draining is complete.
  */
 static void kernfs_drain(struct kernfs_node *kn)
-   __releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+   __releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
 {
struct kernfs_root *root = kernfs_root(kn);
 
-   lockdep_assert_held(&kernfs_mutex);
+   lockdep_assert_held_write(&kernfs_rwsem);
WARN_ON_ONCE(kernfs_active(kn));
 
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
 
if (kernfs_lockdep(kn)) {
		rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -481,7 +492,7 @@ static void kernfs_drain(struct kernfs_node *kn)
 
kernfs_drain_open_files(kn);
 
-   mutex_lock(&kernfs_mutex);
+   down_write(&kernfs_rwsem);
 }
 
 /**
@@ -560,10 +571,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad_unlocked;
 
kn = kernfs_dentry_node(dentry);
-   mutex_lock(&kernfs_mutex);
+   down_read(&kernfs_rwsem);
 
/* The kernfs node has been deactivated */
-   if (!kernfs_active(kn))
+   if (!kernfs_active_read(kn))
goto out_bad;
 
/* The kernfs node has been moved? */
@@ -579,10 +590,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
 
-   mutex_unlock(&kernfs_mutex);
+   up_read(&kernfs_rwsem);
	return 1;
 out_bad:
-   mutex_unlock(&kernfs_mutex);
+   up_read(&kernfs_rwsem);
 out_bad_unlocked:
return 0;
 }
@@ -764,7 +775,7 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;
 
-   mutex_lock(&kernfs_mutex);
+   down_write(&kernfs_rwsem);
 
ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -779,7 +790,7 @@ int kernfs_add_one(struct kernfs_node *kn)
if (parent->flags & KERNFS_EMPTY_DIR)
goto out_unlock;
 
-   if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent))
+   if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active_write(parent))
goto out_unlock;
 
kn->hash = kernfs_name_hash(kn->name, kn->ns);
@@ -795,7 +806,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
 
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
 
/*
 * Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -809,7 +820,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;
 
 out_unlock:
-   mutex_unlock(&kernfs_mutex);
+   up_write(&kernfs_rwsem);
return ret;
 }
 
@@ -830,7 +841,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
   

[PATCH 4/4] kernfs: use revision to identify directory node changes

2020-05-24 Thread Ian Kent
If a kernfs directory node hasn't changed there's no need to search for
an added (or removed) child dentry.

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   17 +++--
 fs/kernfs/kernfs-internal.h |   24 
 include/linux/kernfs.h  |5 +
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f4943329e578..03f4f179bbc4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -383,6 +383,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
/* successfully added, account subdir number */
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs++;
+   kernfs_inc_rev(kn->parent);
 
return 0;
 }
@@ -405,6 +406,7 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
 
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs--;
+   kernfs_inc_rev(kn->parent);
 
	rb_erase(&kn->rb, &kn->parent->dir.children);
	RB_CLEAR_NODE(&kn->rb);
@@ -1044,9 +1046,16 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 {
+   struct kernfs_node *parent;
struct kernfs_node *kn;
 
if (flags & LOOKUP_RCU) {
+   /* Directory node changed? */
+   parent = kernfs_dentry_node(dentry->d_parent);
+
+   if (!kernfs_dir_changed(parent, dentry))
+   return 1;
+
kn = kernfs_dentry_node(dentry);
if (!kn) {
/* Negative hashed dentry, tell the VFS to switch to
@@ -1093,8 +1102,6 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 
kn = kernfs_dentry_node(dentry);
if (!kn) {
-   struct kernfs_node *parent;
-
/* If the kernfs node can be found this is a stale negative
 * hashed dentry so it must be discarded and the lookup redone.
 */
@@ -1102,6 +1109,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (parent) {
const void *ns = NULL;
 
+   /* Directory node changed? */
+   if (kernfs_dir_changed(parent, dentry))
+   goto out_bad;
+
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_parent->d_sb)->ns;
kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
@@ -1156,6 +1167,8 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 
	down_read(&kernfs_rwsem);
 
+   kernfs_set_rev(dentry, parent);
+
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 097c1a989aa4..a7b0e2074260 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -82,6 +82,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
return d_inode(dentry)->i_private;
 }
 
+static inline void kernfs_set_rev(struct dentry *dentry,
+ struct kernfs_node *kn)
+{
+   dentry->d_time = kn->dir.rev;
+}
+
+static inline void kernfs_inc_rev(struct kernfs_node *kn)
+{
+   if (kernfs_type(kn) == KERNFS_DIR) {
+   if (!++kn->dir.rev)
+   kn->dir.rev++;
+   }
+}
+
+static inline bool kernfs_dir_changed(struct kernfs_node *kn,
+ struct dentry *dentry)
+{
+   if (kernfs_type(kn) == KERNFS_DIR) {
+   if (kn->dir.rev != dentry->d_time)
+   return true;
+   }
+   return false;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 89f6a4214a70..74727d98e380 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -98,6 +98,11 @@ struct kernfs_elem_dir {
 * better directly in kernfs_node but is here to save space.
 */
struct kernfs_root  *root;
+   /*
+* Monotonic revision counter, used to identify if a directory
+* node has changed during revalidation.
+*/
+   unsigned long rev;
 };
 
 struct kernfs_elem_symlink {




[PATCH 3/4] kernfs: improve kernfs path resolution

2020-05-24 Thread Ian Kent
Now that an rwsem is used by kernfs, take advantage of it to reduce
lookup overhead.

If there are many lookups (possibly many negative ones) there can
be a lot of overhead during path walks.

To reduce lookup overhead avoid allocating a new dentry where possible.

To do this stay in rcu-walk mode where possible and use the dentry cache
handling of negative hashed dentries to avoid allocating (and freeing
shortly after) new dentries on every negative lookup.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   87 ++-
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9b315f3b20ee..f4943329e578 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1046,15 +1046,75 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 {
struct kernfs_node *kn;
 
-   if (flags & LOOKUP_RCU)
+   if (flags & LOOKUP_RCU) {
+   kn = kernfs_dentry_node(dentry);
+   if (!kn) {
+   /* Negative hashed dentry, tell the VFS to switch to
+* ref-walk mode and call us again so that node
+* existence can be checked.
+*/
+   if (!d_unhashed(dentry))
+   return -ECHILD;
+
+   /* Negative unhashed dentry, this shouldn't happen
+* because this case occurs in rcu-walk mode after
+* dentry allocation which is followed by a call
+* to ->lookup(). But if it does happen the dentry
+* is surely invalid.
+*/
+   return 0;
+   }
+
+   /* Since the dentry is positive (we got the kernfs node) a
+* kernfs node reference was held at the time. Now if the
+* dentry reference count is still greater than 0 it's still
+* positive so take a reference to the node to perform an
+* active check.
+*/
+   if (d_count(dentry) <= 0 || !atomic_inc_not_zero(&kn->count))
+   return -ECHILD;
+
+   /* The kernfs node reference count was greater than 0, if
+* it's active continue in rcu-walk mode.
+*/
+   if (kernfs_active_read(kn)) {
+   kernfs_put(kn);
+   return 1;
+   }
+
+   /* Otherwise, just tell the VFS to switch to ref-walk mode
+* and call us again so the kernfs node can be validated.
+*/
+   kernfs_put(kn);
return -ECHILD;
+   }
 
-   /* Always perform fresh lookup for negatives */
-   if (d_really_is_negative(dentry))
-   goto out_bad_unlocked;
+   down_read(&kernfs_rwsem);
 
	kn = kernfs_dentry_node(dentry);
-   down_read(&kernfs_rwsem);
+   if (!kn) {
+   struct kernfs_node *parent;
+
+   /* If the kernfs node can be found this is a stale negative
+* hashed dentry so it must be discarded and the lookup redone.
+*/
+   parent = kernfs_dentry_node(dentry->d_parent);
+   if (parent) {
+   const void *ns = NULL;
+
+   if (kernfs_ns_enabled(parent))
+   ns = kernfs_info(dentry->d_parent->d_sb)->ns;
+   kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+   if (kn)
+   goto out_bad;
+   }
+
+   /* The kernfs node doesn't exist, leave the dentry negative
+* and return success.
+*/
+   goto out;
+   }
+
 
/* The kernfs node has been deactivated */
if (!kernfs_active_read(kn))
@@ -1072,12 +1132,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (kn->parent && kernfs_ns_enabled(kn->parent) &&
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
-
+out:
	up_read(&kernfs_rwsem);
return 1;
 out_bad:
	up_read(&kernfs_rwsem);
-out_bad_unlocked:
return 0;
 }
 
@@ -1092,7 +1151,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *ret;
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
-   struct inode *inode;
+   struct inode *inode = NULL;
const void *ns = NULL;
 
	down_read(&kernfs_rwsem);
@@ -1102,11 +1161,9 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
 
kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
 
-   /* no such entry */
-   if (!kn || !kernfs_active(kn)) {
-   ret = NULL;
- 

[PATCH 2/4] kernfs: move revalidate to be near lookup

2020-05-24 Thread Ian Kent
While the dentry operation kernfs_dop_revalidate() is grouped with
dentry'ish functions it also has a strong affinity to the inode
operation ->lookup(). And when path walk improvements are applied
it will need to call kernfs_find_ns() so move it to be near
kernfs_iop_lookup() to avoid the need for a forward declaration.

There's no functional change from this patch.

Signed-off-by: Ian Kent 
---
 fs/kernfs/dir.c |   86 ---
 1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index d8213fc65eba..9b315f3b20ee 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -559,49 +559,6 @@ void kernfs_put(struct kernfs_node *kn)
 }
 EXPORT_SYMBOL_GPL(kernfs_put);
 
-static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
-{
-   struct kernfs_node *kn;
-
-   if (flags & LOOKUP_RCU)
-   return -ECHILD;
-
-   /* Always perform fresh lookup for negatives */
-   if (d_really_is_negative(dentry))
-   goto out_bad_unlocked;
-
-   kn = kernfs_dentry_node(dentry);
-   down_read(&kernfs_rwsem);
-
-   /* The kernfs node has been deactivated */
-   if (!kernfs_active_read(kn))
-   goto out_bad;
-
-   /* The kernfs node has been moved? */
-   if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
-   goto out_bad;
-
-   /* The kernfs node has been renamed */
-   if (strcmp(dentry->d_name.name, kn->name) != 0)
-   goto out_bad;
-
-   /* The kernfs node has been moved to a different namespace */
-   if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-   kernfs_info(dentry->d_sb)->ns != kn->ns)
-   goto out_bad;
-
-   up_read(&kernfs_rwsem);
-   return 1;
-out_bad:
-   up_read(&kernfs_rwsem);
-out_bad_unlocked:
-   return 0;
-}
-
-const struct dentry_operations kernfs_dops = {
-   .d_revalidate   = kernfs_dop_revalidate,
-};
-
 /**
  * kernfs_node_from_dentry - determine kernfs_node associated with a dentry
  * @dentry: the dentry in question
@@ -1085,6 +1042,49 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
return ERR_PTR(rc);
 }
 
+static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct kernfs_node *kn;
+
+   if (flags & LOOKUP_RCU)
+   return -ECHILD;
+
+   /* Always perform fresh lookup for negatives */
+   if (d_really_is_negative(dentry))
+   goto out_bad_unlocked;
+
+   kn = kernfs_dentry_node(dentry);
+   down_read(&kernfs_rwsem);
+
+   /* The kernfs node has been deactivated */
+   if (!kernfs_active_read(kn))
+   goto out_bad;
+
+   /* The kernfs node has been moved? */
+   if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+   goto out_bad;
+
+   /* The kernfs node has been renamed */
+   if (strcmp(dentry->d_name.name, kn->name) != 0)
+   goto out_bad;
+
+   /* The kernfs node has been moved to a different namespace */
+   if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+   kernfs_info(dentry->d_sb)->ns != kn->ns)
+   goto out_bad;
+
+   up_read(&kernfs_rwsem);
+   return 1;
+out_bad:
+   up_read(&kernfs_rwsem);
+out_bad_unlocked:
+   return 0;
+}
+
+const struct dentry_operations kernfs_dops = {
+   .d_revalidate   = kernfs_dop_revalidate,
+};
+
 static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)




Re: [PATCH 02/14] autofs: switch to kernel_write

2020-05-14 Thread Ian Kent
On Wed, 2020-05-13 at 08:56 +0200, Christoph Hellwig wrote:
> While pipes don't really need sb_writers projection, __kernel_write
> is an
> interface better kept private, and the additional rw_verify_area does
> not
> hurt here.
> 
> Signed-off-by: Christoph Hellwig 

Right, should be fine AFAICS.
Acked-by: Ian Kent 

> ---
>  fs/autofs/waitq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/autofs/waitq.c b/fs/autofs/waitq.c
> index b04c528b19d34..74c886f7c51cb 100644
> --- a/fs/autofs/waitq.c
> +++ b/fs/autofs/waitq.c
> @@ -53,7 +53,7 @@ static int autofs_write(struct autofs_sb_info *sbi,
>  
>   mutex_lock(&sbi->pipe_mutex);
>   while (bytes) {
> - wr = __kernel_write(file, data, bytes, &file->f_pos);
> + wr = kernel_write(file, data, bytes, &file->f_pos);
>   if (wr <= 0)
>   break;
>   data += wr;



Re: mount on tmpfs failing to parse context option

2019-10-08 Thread Ian Kent
On Tue, 2019-10-08 at 20:38 +0800, Ian Kent wrote:
> On Mon, 2019-10-07 at 17:50 -0700, Hugh Dickins wrote:
> > On Mon, 7 Oct 2019, Laura Abbott wrote:
> > > On 9/30/19 12:07 PM, Laura Abbott wrote:
> > > > Hi,
> > > > 
> > > > Fedora got a bug report 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1757104
> > > > of a failure to parse options with the context mount option.
> > > > From
> > the
> > > > reporter:
> > > > 
> > > > 
> > > > $ unshare -rm mount -t tmpfs tmpfs /tmp -o
> > > > 'context="system_u:object_r:container_file_t:s0:c475,c690"'
> > > > mount: /tmp: wrong fs type, bad option, bad superblock on
> > > > tmpfs,
> > missing
> > > > codepage or helper program, or other error.
> > > > 
> > > > 
> > > > Sep 30 16:50:42 kernel: tmpfs: Unknown parameter 'c690"'
> > > > 
> > > > I haven't asked the reporter to bisect yet but I'm suspecting
> > > > one
> > of the
> > > > conversion to the new mount API:
> > > > 
> > > > $ git log --oneline v5.3..origin/master mm/shmem.c
> > > > edf445ad7c8d Merge branch 'hugepage-fallbacks' (hugepatch
> > > > patches
> > from
> > > > David Rientjes)
> > > > 19deb7695e07 Revert "Revert "Revert "mm, thp: consolidate THP
> > > > gfp
> > handling
> > > > into alloc_hugepage_direct_gfpmask""
> > > > 28eb3c808719 shmem: fix obsolete comment in shmem_getpage_gfp()
> > > > 4101196b19d7 mm: page cache: store only head pages in i_pages
> > > > d8c6546b1aea mm: introduce compound_nr()
> > > > f32356261d44 vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs
> > to use the
> > > > new mount API
> > > > 626c3920aeb4 shmem_parse_one(): switch to use of fs_parse()
> > > > e04dc423ae2c shmem_parse_options(): take handling a single
> > > > option
> > into a
> > > > helper
> > > > f6490b7fbb82 shmem_parse_options(): don't bother with mpol in
> > separate
> > > > variable
> > > > 0b5071dd323d shmem_parse_options(): use a separate structure to
> > keep the
> > > > results
> > > > 7e30d2a5eb0b make shmem_fill_super() static
> > > > 
> > > > 
> > > > I didn't find another report or a fix yet. Is it worth asking
> > > > the
> > reporter
> > > > to bisect?
> > > > 
> > > > Thanks,
> > > > Laura
> > > 
> > > Ping again, I never heard anything back and I didn't see anything
> > come in
> > > with -rc2
> > 
> > Sorry for not responding sooner, Laura, I was travelling: and
> > dearly
> > hoping that David or Al would take it.  I'm afraid this is rather
> > beyond
> > my capability (can I admit that it's the first time I even heard of
> > the
> > "context" mount option? and grepping for "context" has not yet
> > shown
> > me
> > at what level it is handled; and I've no idea of what a valid
> > "context"
> > is for my own tmpfs mounts, to start playing around with its
> > parsing).
> > 
> > Yes, I think we can assume that this bug comes from f32356261d44
> > ("vfs:
> > Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount
> > API")
> > or one of shmem_parse ones associated with it; but I'm pretty sure
> > that
> > it's not worth troubling the reporter to bisect.  I expect David
> > and
> > Al
> > are familiar with "context", and can go straight to where it's
> > handled,
> > and see what's up.
> > 
> > (tmpfs, very tiresomely, supports a NUMA "mpol" mount option which
> > can
> > have commas in it e.g "mpol=bind:0,2": which makes all its comma
> > parsing
> > awkward.  I assume that where the new mount API commits bend over
> > to
> > accommodate that peculiarity, they end up mishandling the comma in
> > the context string above.)
> > 
> > And since we're on the subject of new mount API breakage in tmpfs,
> > I'll
> > take the liberty of repeating this different case, reported earlier
> > and
> > still broken in rc2: again something that I'd be hard-pressed to
> > fix
> > myself, without endangering some other filesystem's mount parsing:-
> > 
> > My /etc/fstab has a line in for one of my test 

Re: mount on tmpfs failing to parse context option

2019-10-08 Thread Ian Kent
On Mon, 2019-10-07 at 17:50 -0700, Hugh Dickins wrote:
> On Mon, 7 Oct 2019, Laura Abbott wrote:
> > On 9/30/19 12:07 PM, Laura Abbott wrote:
> > > Hi,
> > > 
> > > Fedora got a bug report 
> https://bugzilla.redhat.com/show_bug.cgi?id=1757104
> > > of a failure to parse options with the context mount option. From
> the
> > > reporter:
> > > 
> > > 
> > > $ unshare -rm mount -t tmpfs tmpfs /tmp -o
> > > 'context="system_u:object_r:container_file_t:s0:c475,c690"'
> > > mount: /tmp: wrong fs type, bad option, bad superblock on tmpfs,
> missing
> > > codepage or helper program, or other error.
> > > 
> > > 
> > > Sep 30 16:50:42 kernel: tmpfs: Unknown parameter 'c690"'
> > > 
> > > I haven't asked the reporter to bisect yet but I'm suspecting one
> of the
> > > conversion to the new mount API:
> > > 
> > > $ git log --oneline v5.3..origin/master mm/shmem.c
> > > edf445ad7c8d Merge branch 'hugepage-fallbacks' (hugepatch patches
> from
> > > David Rientjes)
> > > 19deb7695e07 Revert "Revert "Revert "mm, thp: consolidate THP gfp
> handling
> > > into alloc_hugepage_direct_gfpmask""
> > > 28eb3c808719 shmem: fix obsolete comment in shmem_getpage_gfp()
> > > 4101196b19d7 mm: page cache: store only head pages in i_pages
> > > d8c6546b1aea mm: introduce compound_nr()
> > > f32356261d44 vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs
> to use the
> > > new mount API
> > > 626c3920aeb4 shmem_parse_one(): switch to use of fs_parse()
> > > e04dc423ae2c shmem_parse_options(): take handling a single option
> into a
> > > helper
> > > f6490b7fbb82 shmem_parse_options(): don't bother with mpol in
> separate
> > > variable
> > > 0b5071dd323d shmem_parse_options(): use a separate structure to
> keep the
> > > results
> > > 7e30d2a5eb0b make shmem_fill_super() static
> > > 
> > > 
> > > I didn't find another report or a fix yet. Is it worth asking the
> reporter
> > > to bisect?
> > > 
> > > Thanks,
> > > Laura
> > 
> > Ping again, I never heard anything back and I didn't see anything
> come in
> > with -rc2
> 
> Sorry for not responding sooner, Laura, I was travelling: and dearly
> hoping that David or Al would take it.  I'm afraid this is rather
> beyond
> my capability (can I admit that it's the first time I even heard of
> the
> "context" mount option? and grepping for "context" has not yet shown
> me
> at what level it is handled; and I've no idea of what a valid
> "context"
> is for my own tmpfs mounts, to start playing around with its
> parsing).
> 
> Yes, I think we can assume that this bug comes from f32356261d44
> ("vfs:
> Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount
> API")
> or one of shmem_parse ones associated with it; but I'm pretty sure
> that
> it's not worth troubling the reporter to bisect.  I expect David and
> Al
> are familiar with "context", and can go straight to where it's
> handled,
> and see what's up.
> 
> (tmpfs, very tiresomely, supports a NUMA "mpol" mount option which
> can
> have commas in it e.g "mpol=bind:0,2": which makes all its comma
> parsing
> awkward.  I assume that where the new mount API commits bend over to
> accommodate that peculiarity, they end up mishandling the comma in
> the context string above.)
> 
> And since we're on the subject of new mount API breakage in tmpfs,
> I'll
> take the liberty of repeating this different case, reported earlier
> and
> still broken in rc2: again something that I'd be hard-pressed to fix
> myself, without endangering some other filesystem's mount parsing:-
> 
> My /etc/fstab has a line in for one of my test mounts:
> tmpfs/tlo tmpfs 
> size=4G   0 0
> and that "size=4G" is what causes the problem: because each time
> shmem_parse_options(fc, data) is called for a remount, data (that is,
> options) points to a string starting with "size=4G,", followed by
> what's actually been asked for in the remount options.
> 
> So if I try
> mount -o remount,size=0 /tlo
> that succeeds, setting the filesystem size to 0 meaning unlimited.
> So if then as a test I try
> mount -o remount,size=1M /tlo
> that correctly fails with "Cannot retroactively limit size".
> But then when I try
> mount -o remount,nr_inodes=0 /tlo
> I again get "Cannot retroactively limit size",
> when it should have succeeded (again, 0 here meaning unlimited).
> 
> That's because the options in shmem_parse_options() are
> "size=4G,nr_inodes=0", which indeed looks like an attempt to
> retroactively limit size; but the user never asked "size=4G" there.

I believe that's mount(8) doing that.
I don't think it's specific to the new mount api.

AFAIK it's not new, but it does mean that things that come
through that have been found in mtab by mount(8) need to be
checked against the current value before failing, or ignored if
changing them is not allowed.

I wonder if the problem has been present for quite a while but
gone unnoticed perhaps.

IIUC the order should always be command line options last and it
must be that way 

[ANNOUNCE] autofs 5.1.6 release

2019-10-06 Thread Ian Kent
Hi all,

It's time for a release, autofs-5.1.6.

This is an important release because it marks the beginning of
work toward resolving the very long standing problem of using
very large direct mount maps in autofs, along with the needed
mitigation of the effects of those large mount tables in user
space.

The first thing that needs to be done is for autofs to get back
to what it was before the symlinking of the mount table to the
proc file system. From experience, having a large number of (largely
not useful) autofs mount entries showing up in the mount table
makes system administration frustrating and is quite annoying.

To do this I'm using the same approach used in other SysV autofs
implementations of providing an autofs pseudo mount option "ignore"
that can be used by user space as a hint to ignore these mount
entries.

This will require fairly straight forward changes to glibc and
libmount at least. The glibc change has been accepted and I plan
on submitting a change for libmount when I get a chance.

A configuration option, use_ignore_mount_option, has been added
to autofs that is initially disabled to allow people to enable it
when they are confident that there won't be unexpected problems.

The side effects of very large mount tables in user space is
somewhat difficult to mitigate.

First, to achieve this autofs needs to not use the system mount
table for expiration "at all". Not using the mount table for expires
has proven very difficult to do and initial attempts resulted in
changes that didn't fit in well at all.

The changes to clean up the mount table listing amounted to making
autofs use its own getmntent(3) implementation (borrowed from glibc)
but quite a bit of that change was re-factoring toward eliminating
the need to use the mount table during expires. I had trouble getting
that to work, let alone stable, but the approach will fit in well
with the current design so it's progress.

Then there's the effect of very large mount tables on other user
space applications.

For example, under rapid mount activity we see several user space
processes, systemd, udisks2, et al., growing to consume all available
CPU and a couple of others performing poorly simply because the
mount table is large.

I had planned on using the fsinfo() system call being proposed by
David Howells for initial mount handling improvements in libmount,
and later David's related kernel notifications proposal for further
libmount mount table handling improvements, but those proposals have
seen some challenges being accepted so we will have to wait and see
how things go before working out how to achieve this rather difficult
goal.

So there's a long way to go but progress is being made!

Additionally there are a number of bug fixes and other minor
improvements.

autofs
==

The package can be found at:
https://www.kernel.org/pub/linux/daemons/autofs/v5/

It is autofs-5.1.6.tar.[gz|xz]

No source rpm is there as it can be produced by using:

rpmbuild -ts autofs-5.1.6.tar.gz

and the binary rpm by using:

rpmbuild -tb autofs-5.1.6.tar.gz

Here are the entries from the CHANGELOG which outline the updates:

07/10/2019 autofs-5.1.6
- support strictexpire mount option.
- fix hesiod string check in master_parse().
- add NULL check for get_addr_string() return.
- use malloc(3) in spawn.c.
- add mount_verbose configuration option.
- optionally log mount requestor process info.
- log mount call arguments if mount_verbose is set.
- Fix NFS mount from IPv6 addresses.
- make expire remaining log level debug.
- allow period following macro in selector value.
- fix macro expansion in selector values.
- fix typing errors.
- Explain /etc/auto.master.d usage.
- plus map includes are only allowed in file sources.
- Update README.
- fix additional typing errors.
- update autofs(8) offset map entry update description.
- increase group buffer size geometrically.
- also use strictexpire for offsets.
- remove unused function has_fstab_option().
- remove unused function reverse_mnt_list().
- remove a couple of old debug messages.
- fix amd entry memory leak.
- fix unlink_mount_tree() not umounting mounts.
- use ignore option for offset mounts as well.
- add config option for "ignore" mount option
- use bit flags for autofs mount types in mnt_list.
- use mp instead of path in mnt_list entries.
- always use PROC_MOUNTS to make mount lists.
- add glibc getmntent_r().
- use local getmntent_r in table_is_mounted().
- refactor unlink_active_mounts() in direct.c.
- don't use tree_is_mounted() for mounted checks.
- use single unlink_umount_tree() for both direct and indirect mounts.
- move unlink_mount_tree() to lib/mounts.c.
- use local_getmntent_r() for unlink_mount_tree().
- use local getmntent_r() in get_mnt_list().
- use local getmntent_r() in tree_make_mnt_list().
- fix missing initialization of autofs_point flags.

Ian



Re: [PATCH 4/6] vfs: Allow mount information to be queried by fsinfo() [ver #15]

2019-07-02 Thread Ian Kent
On Wed, 2019-07-03 at 09:24 +0800, Ian Kent wrote:
> On Wed, 2019-07-03 at 09:09 +0800, Ian Kent wrote:
> > Hi Christian,
> > 
> > About the propagation attributes you mentioned ...
> 
> Umm ... how did you work out if a mount is unbindable from proc
> mountinfo?
> 
> I didn't notice anything that could be used for that when I was
> looking at this.

Oh wait, fs/proc_namespace.c:show_mountinfo() has:
	if (IS_MNT_UNBINDABLE(r))
		seq_puts(m, " unbindable");

I missed that, probably because I didn't have any unbindable mounts
at the time I was looking at it, oops!

That's missing and probably should be added too.

> 
> > On Fri, 2019-06-28 at 16:47 +0100, David Howells wrote:
> > 
> > snip ...
> > 
> > > +
> > > +#ifdef CONFIG_FSINFO
> > > +int fsinfo_generic_mount_info(struct path *path, struct fsinfo_kparams
> > > *params)
> > > +{
> > > + struct fsinfo_mount_info *p = params->buffer;
> > > + struct super_block *sb;
> > > + struct mount *m;
> > > + struct path root;
> > > + unsigned int flags;
> > > +
> > > + if (!path->mnt)
> > > + return -ENODATA;
> > > +
> > > + m = real_mount(path->mnt);
> > > + sb = m->mnt.mnt_sb;
> > > +
> > > + p->f_sb_id  = sb->s_unique_id;
> > > + p->mnt_id   = m->mnt_id;
> > > + p->parent_id= m->mnt_parent->mnt_id;
> > > + p->change_counter   = atomic_read(&m->mnt_change_counter);
> > > +
> > > + get_fs_root(current->fs, &root);
> > > + if (path->mnt == root.mnt) {
> > > + p->parent_id = p->mnt_id;
> > > + } else {
> > > + rcu_read_lock();
> > > + if (!are_paths_connected(&root, path))
> > > + p->parent_id = p->mnt_id;
> > > + rcu_read_unlock();
> > > + }
> > > + if (IS_MNT_SHARED(m))
> > > + p->group_id = m->mnt_group_id;
> > > + if (IS_MNT_SLAVE(m)) {
> > > + int master = m->mnt_master->mnt_group_id;
> > > + int dom = get_dominating_id(m, &root);
> > > + p->master_id = master;
> > > + if (dom && dom != master)
> > > + p->from_id = dom;
> > 
> > This provides information about mount propagation (well mostly).
> > 
> > My understanding of this was that:
> > "If a mount is propagation private (or slave) the group_id will
> > be zero otherwise it's propagation shared and it's group id will
> > be non-zero.
> > 
> > If a mount is propagation slave and propagation peers exist then
> > the mount field mnt_master will be non-NULL. Then mnt_master
> > (slave's master) can be used to set master_id. If the group id
> > of the propagation source is not that of the master then set
> > the from_id group as well."
> > 
> > This parallels the way in which these values are reported in
> > the proc pseudo file system.
> > 
> > Perhaps adding flags as well as setting the fields would be
> > useful too, since interpreting the meaning of the structure
> > fields isn't obvious, ;)
> > 
> > David, Al, thoughts?
> > 
> > Ian



Re: [PATCH 4/6] vfs: Allow mount information to be queried by fsinfo() [ver #15]

2019-07-02 Thread Ian Kent
On Wed, 2019-07-03 at 09:09 +0800, Ian Kent wrote:
> Hi Christian,
> 
> About the propagation attributes you mentioned ...

Umm ... how did you work out if a mount is unbindable from proc
mountinfo?

I didn't notice anything that could be used for that when I was
looking at this.

> 
> On Fri, 2019-06-28 at 16:47 +0100, David Howells wrote:
> 
> snip ...
> 
> > +
> > +#ifdef CONFIG_FSINFO
> > +int fsinfo_generic_mount_info(struct path *path, struct fsinfo_kparams
> > *params)
> > +{
> > +   struct fsinfo_mount_info *p = params->buffer;
> > +   struct super_block *sb;
> > +   struct mount *m;
> > +   struct path root;
> > +   unsigned int flags;
> > +
> > +   if (!path->mnt)
> > +   return -ENODATA;
> > +
> > +   m = real_mount(path->mnt);
> > +   sb = m->mnt.mnt_sb;
> > +
> > +   p->f_sb_id  = sb->s_unique_id;
> > +   p->mnt_id   = m->mnt_id;
> > +   p->parent_id= m->mnt_parent->mnt_id;
> > +   p->change_counter   = atomic_read(&m->mnt_change_counter);
> > +
> > +   get_fs_root(current->fs, &root);
> > +   if (path->mnt == root.mnt) {
> > +   p->parent_id = p->mnt_id;
> > +   } else {
> > +   rcu_read_lock();
> > +   if (!are_paths_connected(&root, path))
> > +   p->parent_id = p->mnt_id;
> > +   rcu_read_unlock();
> > +   }
> > +   if (IS_MNT_SHARED(m))
> > +   p->group_id = m->mnt_group_id;
> > +   if (IS_MNT_SLAVE(m)) {
> > +   int master = m->mnt_master->mnt_group_id;
> > +   int dom = get_dominating_id(m, &root);
> > +   p->master_id = master;
> > +   if (dom && dom != master)
> > +   p->from_id = dom;
> 
> This provides information about mount propagation (well mostly).
> 
> My understanding of this was that:
> "If a mount is propagation private (or slave) the group_id will
> be zero otherwise it's propagation shared and it's group id will
> be non-zero.
> 
> If a mount is propagation slave and propagation peers exist then
> the mount field mnt_master will be non-NULL. Then mnt_master
> (slave's master) can be used to set master_id. If the group id
> of the propagation source is not that of the master then set
> the from_id group as well."
> 
> This parallels the way in which these values are reported in
> the proc pseudo file system.
> 
> Perhaps adding flags as well as setting the fields would be
> useful too, since interpreting the meaning of the structure
> fields isn't obvious, ;)
> 
> David, Al, thoughts?
> 
> Ian



Re: [PATCH 4/6] vfs: Allow mount information to be queried by fsinfo() [ver #15]

2019-07-02 Thread Ian Kent
Hi Christian,

About the propagation attributes you mentioned ...

On Fri, 2019-06-28 at 16:47 +0100, David Howells wrote:

snip ...

> +
> +#ifdef CONFIG_FSINFO
> +int fsinfo_generic_mount_info(struct path *path, struct fsinfo_kparams
> *params)
> +{
> + struct fsinfo_mount_info *p = params->buffer;
> + struct super_block *sb;
> + struct mount *m;
> + struct path root;
> + unsigned int flags;
> +
> + if (!path->mnt)
> + return -ENODATA;
> +
> + m = real_mount(path->mnt);
> + sb = m->mnt.mnt_sb;
> +
> + p->f_sb_id  = sb->s_unique_id;
> + p->mnt_id   = m->mnt_id;
> + p->parent_id= m->mnt_parent->mnt_id;
> + p->change_counter   = atomic_read(&m->mnt_change_counter);
> +
> + get_fs_root(current->fs, &root);
> + if (path->mnt == root.mnt) {
> + p->parent_id = p->mnt_id;
> + } else {
> + rcu_read_lock();
> + if (!are_paths_connected(&root, path))
> + p->parent_id = p->mnt_id;
> + rcu_read_unlock();
> + }
> + if (IS_MNT_SHARED(m))
> + p->group_id = m->mnt_group_id;
> + if (IS_MNT_SLAVE(m)) {
> + int master = m->mnt_master->mnt_group_id;
> + int dom = get_dominating_id(m, &root);
> + p->master_id = master;
> + if (dom && dom != master)
> + p->from_id = dom;

This provides information about mount propagation (well mostly).

My understanding of this was that:
"If a mount is propagation private (or slave) the group_id will
be zero otherwise it's propagation shared and it's group id will
be non-zero.

If a mount is propagation slave and propagation peers exist then
the mount field mnt_master will be non-NULL. Then mnt_master
(slave's master) can be used to set master_id. If the group id
of the propagation source is not that of the master then set
the from_id group as well."

This parallels the way in which these values are reported in
the proc pseudo file system.

Perhaps adding flags as well as setting the fields would be
useful too, since interpreting the meaning of the structure
fields isn't obvious, ;)

David, Al, thoughts?

Ian



Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]

2019-06-26 Thread Ian Kent
On Wed, 2019-06-26 at 12:47 +0200, Christian Brauner wrote:
> On Wed, Jun 26, 2019 at 06:42:51PM +0800, Ian Kent wrote:
> > On Wed, 2019-06-26 at 12:05 +0200, Christian Brauner wrote:
> > > On Mon, Jun 24, 2019 at 03:08:45PM +0100, David Howells wrote:
> > > > Hi Al,
> > > > 
> > > > Here are a set of patches that adds a syscall, fsinfo(), that allows
> > > > attributes of a filesystem/superblock to be queried.  Attribute values
> > > > are
> > > > of four basic types:
> > > > 
> > > >  (1) Version dependent-length structure (size defined by type).
> > > > 
> > > >  (2) Variable-length string (up to PAGE_SIZE).
> > > > 
> > > >  (3) Array of fixed-length structures (up to INT_MAX size).
> > > > 
> > > >  (4) Opaque blob (up to INT_MAX size).
> > > > 
> > > > Attributes can have multiple values in up to two dimensions and all the
> > > > values of a particular attribute must have the same type.
> > > > 
> > > > Note that the attribute values *are* allowed to vary between dentries
> > > > within a single superblock, depending on the specific dentry that you're
> > > > looking at.
> > > > 
> > > > I've tried to make the interface as light as possible, so integer/enum
> > > > attribute selector rather than string and the core does all the
> > > > allocation
> > > > and extensibility support work rather than leaving that to the
> > > > filesystems.
> > > > That means that for the first two attribute types, sb->s_op->fsinfo()
> > > > may
> > > > assume that the provided buffer is always present and always big enough.
> > > > 
> > > > Further, this removes the possibility of the filesystem gaining access
> > > > to
> > > > the
> > > > userspace buffer.
> > > > 
> > > > 
> > > > fsinfo() allows a variety of information to be retrieved about a
> > > > filesystem
> > > > and the mount topology:
> > > > 
> > > >  (1) General superblock attributes:
> > > > 
> > > >   - The amount of space/free space in a filesystem (as statfs()).
> > > >   - Filesystem identifiers (UUID, volume label, device numbers, ...)
> > > >   - The limits on a filesystem's capabilities
> > > >   - Information on supported statx fields and attributes and IOC
> > > > flags.
> > > >   - A variety single-bit flags indicating supported capabilities.
> > > >   - Timestamp resolution and range.
> > > >   - Sources (as per mount(2), but fsconfig() allows multiple
> > > > sources).
> > > >   - In-filesystem filename format information.
> > > >   - Filesystem parameters ("mount -o xxx"-type things).
> > > >   - LSM parameters (again "mount -o xxx"-type things).
> > > > 
> > > >  (2) Filesystem-specific superblock attributes:
> > > > 
> > > >   - Server names and addresses.
> > > >   - Cell name.
> > > > 
> > > >  (3) Filesystem configuration metadata attributes:
> > > > 
> > > >   - Filesystem parameter type descriptions.
> > > >   - Name -> parameter mappings.
> > > >   - Simple enumeration name -> value mappings.
> > > > 
> > > >  (4) Mount topology:
> > > > 
> > > >   - General information about a mount object.
> > > >   - Mount device name(s).
> > > >   - Children of a mount object and their relative paths.
> > > > 
> > > >  (5) Information about what the fsinfo() syscall itself supports,
> > > > including
> > > >  the number of attibutes supported and the number of capability bits
> > > >  supported.
> > > 
> > > Phew, this patchset is a lot. It's good of course but can we please cut
> > > some of the more advanced features such as querying by mount id,
> > > submounts etc. pp. for now?
> > 
> > Did you mean the "vfs: Allow fsinfo() to look up a mount object by ID"
> > patch?
> > 
> > We would need to be very careful what was dropped.
> 
> Not dropped as in never implement but rather defer it by one merge
> window to give us a) more time to review and settle the interface while
> b) not stalling the overall patch.

Sure,

Re: [PATCH 00/25] VFS: Introduce filesystem information query syscall [ver #14]

2019-06-26 Thread Ian Kent
On Wed, 2019-06-26 at 12:05 +0200, Christian Brauner wrote:
> On Mon, Jun 24, 2019 at 03:08:45PM +0100, David Howells wrote:
> > Hi Al,
> > 
> > Here are a set of patches that adds a syscall, fsinfo(), that allows
> > attributes of a filesystem/superblock to be queried.  Attribute values are
> > of four basic types:
> > 
> >  (1) Version dependent-length structure (size defined by type).
> > 
> >  (2) Variable-length string (up to PAGE_SIZE).
> > 
> >  (3) Array of fixed-length structures (up to INT_MAX size).
> > 
> >  (4) Opaque blob (up to INT_MAX size).
> > 
> > Attributes can have multiple values in up to two dimensions and all the
> > values of a particular attribute must have the same type.
> > 
> > Note that the attribute values *are* allowed to vary between dentries
> > within a single superblock, depending on the specific dentry that you're
> > looking at.
> > 
> > I've tried to make the interface as light as possible, so integer/enum
> > attribute selector rather than string and the core does all the allocation
> > and extensibility support work rather than leaving that to the filesystems.
> > That means that for the first two attribute types, sb->s_op->fsinfo() may
> > assume that the provided buffer is always present and always big enough.
> > 
> > Further, this removes the possibility of the filesystem gaining access to
> > the
> > userspace buffer.
> > 
> > 
> > fsinfo() allows a variety of information to be retrieved about a filesystem
> > and the mount topology:
> > 
> >  (1) General superblock attributes:
> > 
> >   - The amount of space/free space in a filesystem (as statfs()).
> >   - Filesystem identifiers (UUID, volume label, device numbers, ...)
> >   - The limits on a filesystem's capabilities
> >   - Information on supported statx fields and attributes and IOC flags.
> >   - A variety single-bit flags indicating supported capabilities.
> >   - Timestamp resolution and range.
> >   - Sources (as per mount(2), but fsconfig() allows multiple sources).
> >   - In-filesystem filename format information.
> >   - Filesystem parameters ("mount -o xxx"-type things).
> >   - LSM parameters (again "mount -o xxx"-type things).
> > 
> >  (2) Filesystem-specific superblock attributes:
> > 
> >   - Server names and addresses.
> >   - Cell name.
> > 
> >  (3) Filesystem configuration metadata attributes:
> > 
> >   - Filesystem parameter type descriptions.
> >   - Name -> parameter mappings.
> >   - Simple enumeration name -> value mappings.
> > 
> >  (4) Mount topology:
> > 
> >   - General information about a mount object.
> >   - Mount device name(s).
> >   - Children of a mount object and their relative paths.
> > 
> >  (5) Information about what the fsinfo() syscall itself supports, including
> >  the number of attibutes supported and the number of capability bits
> >  supported.
> 
> Phew, this patchset is a lot. It's good of course but can we please cut
> some of the more advanced features such as querying by mount id,
> submounts etc. pp. for now?

Did you mean the "vfs: Allow fsinfo() to look up a mount object by ID"
patch?

We would need to be very careful what was dropped.

For example, I've found that the patch above is pretty much essential
for fsinfo() to be useful from user space.

> I feel this would help with review and since your interface is
> extensible it's really not a big deal if we defer fancy features to
> later cycles after people had more time to review and the interface has
> seen some exposure.
> 
> The mount api changes over the last months have honestly been so huge
> that any chance to make the changes smaller and easier to digest we
> should take. (I'm really not complaining. Good that the work is done and
> it's entirely ok that it's a lot of code.)
> 
> It would also be great if after you have dropped some stuff from this
> patchset and gotten an Ack we could stuff it into linux-next for some
> time because it hasn't been so far...
> 
> Christian



Re: [PATCH 05/13] vfs: don't parse "silent" option

2019-06-19 Thread Ian Kent
On Wed, 2019-06-19 at 14:30 +0200, Miklos Szeredi wrote:
> While this is a standard option as documented in mount(8), it is ignored by
> most filesystems.  So reject, unless filesystem explicitly wants to handle
> it.
> 
> The exception is unconverted filesystems, where it is unknown if the
> filesystem handles this or not.
> 
> Any implementation, such as mount(8) that needs to parse this option
> without failing should simply ignore the return value from fsconfig().

In theory this is fine but every time someone has attempted
to change the handling of this in the past autofs has had
problems so I'm a bit wary of the change.

It was originally meant to tell the file system to ignore
invalid options, such as those found in automount maps that
are used with multiple OS implementations that have differences
in their options.

That was, IIRC, primarily NFS although NFS should handle most
(if not all of those) cases these days.

Nevertheless I'm a bit nervous about it, ;)

> 
> Signed-off-by: Miklos Szeredi 
> ---
>  fs/fs_context.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index 49636e541293..c26b353aa858 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -51,7 +51,6 @@ static const struct constant_table common_clear_sb_flag[] =
> {
>   { "nolazytime", SB_LAZYTIME },
>   { "nomand", SB_MANDLOCK },
>   { "rw", SB_RDONLY },
> - { "silent", SB_SILENT },
>  };
>  
>  /*
> @@ -535,6 +534,9 @@ static int legacy_parse_param(struct fs_context *fc,
> struct fs_parameter *param)
>   if (ret != -ENOPARAM)
>   return ret;
>  
> + if (strcmp(param->key, "silent") == 0)
> + fc->sb_flags |= SB_SILENT;
> +
>   if (strcmp(param->key, "source") == 0) {
>   if (param->type != fs_value_is_string)
>   return invalf(fc, "VFS: Legacy: Non-string source");



[PATCH] vfs: update d_make_root() description

2019-04-10 Thread Ian Kent
Clarify d_make_root() usage, error handling and cleanup
requirements.

Signed-off-by: Ian Kent 
---
 Documentation/filesystems/porting |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/porting 
b/Documentation/filesystems/porting
index cf43bc4dbf31..1ebc1c6eb64b 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -428,8 +428,19 @@ release it yourself.
 --
 [mandatory]
d_alloc_root() is gone, along with a lot of bugs caused by code
-misusing it.  Replacement: d_make_root(inode).  The difference is,
-d_make_root() drops the reference to inode if dentry allocation fails.  
+misusing it.  Replacement: d_make_root(inode).  On success d_make_root(inode)
+allocates and returns a new dentry instantiated with the passed in inode.
+On failure NULL is returned and the passed in inode is dropped so failure
+handling need not do any cleanup for the inode. If d_make_root(inode)
+is passed a NULL inode it returns NULL and also requires no further
+error handling. Typical usage is:
+
+   inode = foofs_new_inode();
+   s->s_root = d_make_root(inode);
+   if (!s->s_root)
+   /* Nothing needed for the inode cleanup */
+   return -ENOMEM;
+   ...
 
 --
 [mandatory]



Re: kernel BUG at fs/inode.c:LINE!

2019-04-10 Thread Ian Kent
On Wed, 2019-04-10 at 14:41 +0200, Dmitry Vyukov wrote:
> On Wed, Apr 10, 2019 at 2:12 PM Al Viro  wrote:
> > 
> > On Wed, Apr 10, 2019 at 08:07:15PM +0800, Ian Kent wrote:
> > 
> > > > I'm unable to find a branch matching the line numbers.
> > > > 
> > > > Given that, on the face of it, the scenario is impossible I'm
> > > > seeking clarification on what linux-next to look at for the
> > > > sake of accuracy.
> > > > 
> > > > So I'm wondering if this testing done using the master branch
> > > > or one of the daily branches one would use to check for conflicts
> > > > before posting?
> > > 
> > > Sorry those are tags not branches.
> > 
> > FWIW, that's next-20181214; it is what master had been in mid-December
> > and master is rebased every day.  Can it be reproduced with the current
> > tree?
> 
> From the info on the dashboard we know that it happened only once on
> d14b746c (the second one is result of reproducing the first one). So
> it was either fixed or just hard to trigger.

Looking at the source of tag next-20181214 in linux-next-history I
see this is a mistake I made due to incorrect error handling, which
I fixed soon after (there was in fact a double iput()).

I'm pretty sure this never made it to a released kernel so unless
there's a report of this in a stable released kernel I'm going to
move on.

Thanks
Ian



Re: kernel BUG at fs/inode.c:LINE!

2019-04-10 Thread Ian Kent
On Wed, 2019-04-10 at 19:57 +0800, Ian Kent wrote:
> On Wed, 2019-04-10 at 13:40 +0200, Dmitry Vyukov wrote:
> > On Wed, Apr 10, 2019 at 12:35 PM Ian Kent  wrote:
> > > 
> > > On Wed, 2019-04-10 at 10:27 +0200, Dmitry Vyukov wrote:
> > > > On Wed, Apr 10, 2019 at 2:26 AM Al Viro  wrote:
> > > > > 
> > > > > On Tue, Apr 09, 2019 at 07:36:00AM -0700, syzbot wrote:
> > > > > > Bisection is inconclusive: the first bad commit could be any of:
> > > > > 
> > > > > [snip the useless pile]
> > > > > 
> > > > > > bisection log:
> > > > > > https://syzkaller.appspot.com/x/bisect.txt?x=15e1fc2b20
> > > > > > start commit:   [unknown
> > > > > > git tree:   linux-next
> > > > > > dashboard link:
> > > > > > https://syzkaller.appspot.com/bug?extid=5399ed0832693e29f392
> > > > > > syz repro:  
> > > > > > https://syzkaller.appspot.com/x/repro.syz?x=101032b340
> > > > > > C reproducer:   
> > > > > > https://syzkaller.appspot.com/x/repro.c?x=1653406340
> > > > > > 
> > > > > > For information about bisection process see:
> > > > > > https://goo.gl/tpsmEJ#bisection
> > > > > 
> > > > > If I'm not misreading the "crash report" there, it has injected an
> > > > > allocation
> > > > > failure in dentry allocation in d_make_root() from autofs_fill_super()
> > > > > (
Re: kernel BUG at fs/inode.c:LINE!

2019-04-10 Thread Ian Kent
On Wed, 2019-04-10 at 13:40 +0200, Dmitry Vyukov wrote:
> On Wed, Apr 10, 2019 at 12:35 PM Ian Kent  wrote:
> > 
> > On Wed, 2019-04-10 at 10:27 +0200, Dmitry Vyukov wrote:
> > > On Wed, Apr 10, 2019 at 2:26 AM Al Viro  wrote:
> > > > 
> > > > On Tue, Apr 09, 2019 at 07:36:00AM -0700, syzbot wrote:
> > > > > Bisection is inconclusive: the first bad commit could be any of:
> > > > 
> > > > [snip the useless pile]
> > > > 
> > > > > bisection log:
> > > > > https://syzkaller.appspot.com/x/bisect.txt?x=15e1fc2b20
> > > > > start commit:   [unknown
> > > > > git tree:   linux-next
> > > > > dashboard link:
> > > > > https://syzkaller.appspot.com/bug?extid=5399ed0832693e29f392
> > > > > syz repro:  
> > > > > https://syzkaller.appspot.com/x/repro.syz?x=101032b340
> > > > > C reproducer:   
> > > > > https://syzkaller.appspot.com/x/repro.c?x=1653406340
> > > > > 
> > > > > For information about bisection process see:
> > > > > https://goo.gl/tpsmEJ#bisection
> > > > 
> > > > If I'm not misreading the "crash report" there, it has injected an
> > > > allocation
> > > > failure in dentry allocation in d_make_root() from autofs_fill_super() (
> > > > root_inode = autofs_get_inode(s, S_IFDIR | 0755);
> > > > root = d_make_root(root_inode);
> > > > ) which has triggered iput() on the inode passed to d_make_root() (as it
> > > > ought
> > > > to).  At which point it stepped into some BUG_ON() in fs/inode.c, but
> > > > I've
> > > > no idea which one it is - line numbers do not match anything in linux-
> > > > next
> > > > or in mainline.  Reported line 1566 is
> > > > if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> > > > in all of them; as the matter of fact, the diff in fs/inode.c between
> > > > -next and mainline is empty.
> > > > 
> > > > There is a BUG_ON() several lines prior, and in 4.20 it used to be line
> > > > 1566,
> > > > so _probably_ that's what it is.  With that assumption, it's
> > > > BUG_ON(inode->i_state & I_CLEAR);
> > > > IOW, we'd got I_CLEAR in the inode passed to d_make_root() there.  Which
> > > > should not happen - the inode must have come from new_inode(), which
> > > > gets it from new_inode_pseudo(), which zeroes ->i_state.  And I_CLEAR
> > > > is set only in clear_inode().  For autofs inodes that can come only
> > > > from autofs_evict_inode(), called as ->evict() from evict_inode().
> > > > Which should never ever be called for inode with positive ->i_count...
> > > > 
> > > > It might be memory corruption; it might be a dangling inode pointer
> > > > somewhere, it might be something else.
> > > > 
> > > > To get any further we really need a confirmation of the identity of
> > > > triggered BUG_ON().
> > > > 
> > > > As an aside, your "sample crash reports" would've been much more useful
> > > > if
> > > > they went with commit SHA1 in question, especially when they contain
> > > > line
> > > > numbers.
> > > 
> > > Hi Al,
> > > 
> > > This is the commit for matching lines:
> > > 
> > > > HEAD commit: d14b746c6c1c Add linux-next specific files for 20181214
> > 
> > Are you sure?
> > what does 20181214 mean?
> 
> Yes, I just copy-pasted from the report. "d14b746c6c1c" is the commit
> hash. "Add linux-next specific files for 20181214" is the commit
> subject.
> 
> 
> > Looking at current next code (and several branches) it doesn't
> > appear the problem is possible?
> > 
> > > > git tree:   linux-next
> > 
> > But which branch is it really, master (which doesn't match the
> > line numbers btw)?
> 
> This is d14b746c6c1c commit hash. I don't know if there is a branch
> with HEAD pointing to this commit or not, but it seems unimportant.
> Tree+commit is the identity of code state.
> 
> 
> > > fs/inode.c:1566 points to:
> > > 
> > > void iput(struct inode *inode)
> > > {
> > > ...
> > > BUG_ON(inode->i_state & I_CLEAR);
> > > 
> > > 
> > > The dashboard page provides kernel git repository and commit for each
> > > crash.
> > 
> > Those links don't seem to make sense to me ...
> > 
> > Help me out here!
> 
> There is git repo name provided and commit hash. It's meant to be
> self-explanatory. What exactly is unclear?

I'm unable to find a branch matching the line numbers.

Given that, on the face of it, the scenario is impossible I'm
seeking clarification on what linux-next to look at for the
sake of accuracy.

So I'm wondering if this testing was done using the master branch
or one of the daily branches one would use to check for conflicts
before posting?

Or perhaps the master branch has been updated and the testing was
done on something different.

Ian



Re: kernel BUG at fs/inode.c:LINE!

2019-04-10 Thread Ian Kent
On Wed, 2019-04-10 at 10:27 +0200, Dmitry Vyukov wrote:
> On Wed, Apr 10, 2019 at 2:26 AM Al Viro  wrote:
> > 
> > On Tue, Apr 09, 2019 at 07:36:00AM -0700, syzbot wrote:
> > > Bisection is inconclusive: the first bad commit could be any of:
> > 
> > [snip the useless pile]
> > 
> > > bisection log:  
> > > https://syzkaller.appspot.com/x/bisect.txt?x=15e1fc2b20
> > > start commit:   [unknown
> > > git tree:   linux-next
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=5399ed0832693e29f392
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=101032b340
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1653406340
> > > 
> > > For information about bisection process see: 
> > > https://goo.gl/tpsmEJ#bisection
> > 
> > If I'm not misreading the "crash report" there, it has injected an
> > allocation
> > failure in dentry allocation in d_make_root() from autofs_fill_super() (
> > root_inode = autofs_get_inode(s, S_IFDIR | 0755);
> > root = d_make_root(root_inode);
> > ) which has triggered iput() on the inode passed to d_make_root() (as it
> > ought
> > to).  At which point it stepped into some BUG_ON() in fs/inode.c, but I've
> > no idea which one it is - line numbers do not match anything in linux-next
> > or in mainline.  Reported line 1566 is
> > if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> > in all of them; as the matter of fact, the diff in fs/inode.c between
> > -next and mainline is empty.
> > 
> > There is a BUG_ON() several lines prior, and in 4.20 it used to be line
> > 1566,
> > so _probably_ that's what it is.  With that assumption, it's
> > BUG_ON(inode->i_state & I_CLEAR);
> > IOW, we'd got I_CLEAR in the inode passed to d_make_root() there.  Which
> > should not happen - the inode must have come from new_inode(), which
> > gets it from new_inode_pseudo(), which zeroes ->i_state.  And I_CLEAR
> > is set only in clear_inode().  For autofs inodes that can come only
> > from autofs_evict_inode(), called as ->evict() from evict_inode().
> > Which should never ever be called for inode with positive ->i_count...
> > 
> > It might be memory corruption; it might be a dangling inode pointer
> > somewhere, it might be something else.
> > 
> > To get any further we really need a confirmation of the identity of
> > triggered BUG_ON().
> > 
> > As an aside, your "sample crash reports" would've been much more useful if
> > they went with commit SHA1 in question, especially when they contain line
> > numbers.
> 
> Hi Al,
> 
> This is the commit for matching lines:
> 
> > HEAD commit: d14b746c6c1c Add linux-next specific files for 20181214

Are you sure?
what does 20181214 mean?

Looking at current next code (and several branches) it doesn't
appear the problem is possible?

> > git tree:   linux-next

But which branch is it really, master (which doesn't match the
line numbers btw)?

> 
> fs/inode.c:1566 points to:
> 
> void iput(struct inode *inode)
> {
> ...
> BUG_ON(inode->i_state & I_CLEAR);
> 
> 
> The dashboard page provides kernel git repository and commit for each crash.

Those links don't seem to make sense to me ...

Help me out here!
Ian



