Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-21 Thread Ian Kent
On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote:
> On Sat, Dec 19, 2020 at 8:53 AM Ian Kent  wrote:
> > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent 
> > > wrote:
> > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent 
> > > > > wrote:
> > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > > What could be done is to make the kernfs node
> > > > > > > > > attr_mutex
> > > > > > > > > a pointer and dynamically allocate it but even that
> > > > > > > > > is
> > > > > > > > > too
> > > > > > > > > costly a size addition to the kernfs node structure
> > > > > > > > > as
> > > > > > > > > Tejun has said.
> > > > > > > > 
> > > > > > > > I guess the question to ask is, is there really a need
> > > > > > > > to
> > > > > > > > call kernfs_refresh_inode() from functions that are
> > > > > > > > usually
> > > > > > > > reading/checking functions.
> > > > > > > > 
> > > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > > write/set
> > > > > > > > operations in (if there's any) places where things like
> > > > > > > > setattr_copy() is not already called?
> > > > > > > > 
> > > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > > > 
> > > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > > sysfs
> > > > > > > namespace is
> > > > > > > implemented, so I don't think there's an easy around
> > > > > > > that.
> > > > > > > The
> > > > > > > only
> > > > > > > thing I
> > > > > > > can think of is embedding the lock into attrs and doing
> > > > > > > xchg
> > > > > > > dance
> > > > > > > when
> > > > > > > attaching it.
> > > > > > 
> > > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > > attrs structure, am I correct?
> > > > > > 
> > > > > > Assuming it is then, to keep things simple, use two locks.
> > > > > > 
> > > > > > One global lock for the allocation and an attrs lock for
> > > > > > all
> > > > > > the
> > > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > > update.
> > > > > > 
> > > > > > The critical section for the global lock could be reduced
> > > > > > and
> > > > > > it
> > > > > > changed to a spin lock.
> > > > > > 
> > > > > > In __kernfs_iattrs() we would have something like:
> > > > > > 
> > > > > > take the allocation lock
> > > > > > do the allocated checks
> > > > > >   assign if existing attrs
> > > > > >   release the allocation lock
> > > > > >   return existing if found
> > > > > > othewise
> > > > > >   release the allocation lock
> > > > > > 
> > > > > > allocate and initialize attrs
> > > > > > 
> > > > > > take the allocation lock
> > > > > > check if someone beat us to it
> > > > > >   free and grab exiting attrs
> > > > > > otherwise
> > > > > >   assign the new attrs
> > > > > > release the allocation lock
> > > > > > return attrs
> > > > > > 
> > > > > > Add a spinlock to the attrs struct and use it everywhere
> > > > > > for
> > > > > > field updates.
> > > > > > 
> > > > > > Am I on the right track or can you see problems with this?
> > > > > > 
> > > > > > Ian
> > > > > > 
> > > > > 
> > > > > umm, we update the inode in kernfs_refresh_inode, right??  So
> > > > > I
> > > > > guess
> > > > > the problem is how can we protect the inode when
> > > > > kernfs_refresh_inode
> > > > > is called, not the attrs??
> > > > 
> > > > But the attrs (which is what's copied from) were protected by
> > > > the
> > > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > > dealing with the kernfs node attrs too.
> > > > 
> > > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > > copies
> > > > the node attrs to the inode under the same mutex lock. So, if a
> > > > read
> > > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > > protected,
> > > > it needs to be protected in a different way.
> > > > 
> > > 
> > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem
> > > for
> > > .setattr but
> > >  no lock for .getattr (misdocumented?? sometimes they have as
> > > you've
> > > found out)?
> > > What does it protect against?? Because .permission does a similar
> > > thing
> > > here -- updating inode attributes, the goal is to provide the
> > > same
> > > protection level
> > > for .permission as for .setattr, am I right???
> > 
> > As far as the documentation goes that's probably my
> > misunderstanding
> > of it.
> > 
> > It does happen that the VFS makes assumptions about how call backs
> > are meant to be used.
> > 
> > Read like call backs, like .getattr() and .permission() are meant
> > to
> > be used, well, like read like functions so the VFS should be ok to
> > take locks or not based on the operation context at hand.
> > 
> > So it's not about the locking for these call 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-21 Thread Fox Chen
On Sun, Dec 20, 2020 at 7:52 AM Ian Kent  wrote:
>
> On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> > Hello,
> >
> > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > > And looking further I see there's a race that kernfs can't do
> > > anything
> > > about between kernfs_refresh_inode() and fs/inode.c:update_times().
> >
> > Do kernfs files end up calling into that path tho? Doesn't look like
> > it to
> > me but if so yeah we'd need to override the update_time for kernfs.
>
> Sorry, the below was very hastily done and not what I would actually
> propose.
>
> The main point of it was the question
>
> +   /* Which kernfs node attributes should be updated from
> +* time?
> +*/
>
> but looking at it again this morning I think the node iattr fields
> that might need to be updated would be atime, ctime and mtime only,
> maybe not ctime ... not sure.
>
> What do you think?
>
> Also, if kn->attr == NULL it should fall back to what the VFS
> currently does.
>
> The update_times() function is one of the few places where the
> VFS updates the inode times.
>
> The idea is that the reason kernfs needs to overwrite the inode
> attributes is to reset what the VFS might have done but if kernfs
> has this inode operation they won't need to be overwritten since
> they won't have changed.
>
> There may be other places where the attributes (or an attribute)
> are set by the VFS, I haven't finished checking that yet so my
> suggestion might not be entirely valid.
>
> What I need to do is work out what kernfs node attributes, if any,
> should be updated by .update_times(). If I go by what
> kernfs_refresh_inode() does now then that would be none but shouldn't
> atime at least be updated in the node iattr.
>
> > > +static int kernfs_iop_update_time(struct inode *inode, struct
> > > timespec64 *time, int flags)
> > >  {
> > > -   struct inode *inode = d_inode(path->dentry);
> > > struct kernfs_node *kn = inode->i_private;
> > > +   struct kernfs_iattrs *attrs;
> > >
> > > mutex_lock(_mutex);
> > > +   attrs = kernfs_iattrs(kn);
> > > +   if (!attrs) {
> > > +   mutex_unlock(_mutex);
> > > +   return -ENOMEM;
> > > +   }
> > > +
> > > +   /* Which kernfs node attributes should be updated from
> > > +* time?
> > > +*/
> > > +
> > > kernfs_refresh_inode(kn, inode);
> > > mutex_unlock(_mutex);
> >
> > I don't see how this would reflect the changes from kernfs_setattr()
> > into
> > the attached inode. This would actually make the attr updates
> > obviously racy
> > - the userland visible attrs would be stale until the inode gets
> > reclaimed
> > and then when it gets reinstantiated it'd show the latest
> > information.
>
> Right, I will have to think about that, but as I say above this
> isn't really what I would propose.
>
> If .update_times() sticks strictly to what kernfs_refresh_inode()
> does now then it would set the inode attributes from the node iattr
> only.
>
> >
> > That said, if you wanna take the direction where attr updates are
> > reflected
> > to the associated inode when the change occurs, which makes sense,
> > the right
> > thing to do would be making kernfs_setattr() update the associated
> > inode if
> > existent.
>
> Mmm, that's a good point but it looks like the inode isn't available
> there.
>
Is it possible to embed super block somewhere, then we can call
kernfs_get_inode to get inode in kernfs_setattr???


thanks,
fox


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-19 Thread Ian Kent
On Sun, 2020-12-20 at 07:52 +0800, Ian Kent wrote:
> On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> > Hello,
> > 
> > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > > And looking further I see there's a race that kernfs can't do
> > > anything
> > > about between kernfs_refresh_inode() and
> > > fs/inode.c:update_times().
> > 
> > Do kernfs files end up calling into that path tho? Doesn't look
> > like
> > it to
> > me but if so yeah we'd need to override the update_time for kernfs.

You are correct, update_time() will only be called during symlink
following and only to update atime.

So this isn't sufficient to update the inode attributes to reflect
changes make by things like kernfs_setattr() or when the directory
link count changes ...

Sigh!



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-19 Thread Ian Kent
On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> Hello,
> 
> On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > And looking further I see there's a race that kernfs can't do
> > anything
> > about between kernfs_refresh_inode() and fs/inode.c:update_times().
> 
> Do kernfs files end up calling into that path tho? Doesn't look like
> it to
> me but if so yeah we'd need to override the update_time for kernfs.

Sorry, the below was very hastily done and not what I would actually
propose.

The main point of it was the question

+   /* Which kernfs node attributes should be updated from
+* time?
+*/

but looking at it again this morning I think the node iattr fields
that might need to be updated would be atime, ctime and mtime only,
maybe not ctime ... not sure.

What do you think?

Also, if kn->attr == NULL it should fall back to what the VFS
currently does.

The update_times() function is one of the few places where the
VFS updates the inode times.

The idea is that the reason kernfs needs to overwrite the inode
attributes is to reset what the VFS might have done but if kernfs
has this inode operation they won't need to be overwritten since
they won't have changed.

There may be other places where the attributes (or an attribute)
are set by the VFS, I haven't finished checking that yet so my
suggestion might not be entirely valid.

What I need to do is work out what kernfs node attributes, if any,
should be updated by .update_times(). If I go by what
kernfs_refresh_inode() does now then that would be none but shouldn't
atime at least be updated in the node iattr.

> > +static int kernfs_iop_update_time(struct inode *inode, struct
> > timespec64 *time, int flags)
> >  {
> > -   struct inode *inode = d_inode(path->dentry);
> > struct kernfs_node *kn = inode->i_private;
> > +   struct kernfs_iattrs *attrs;
> >  
> > mutex_lock(_mutex);
> > +   attrs = kernfs_iattrs(kn);
> > +   if (!attrs) {
> > +   mutex_unlock(_mutex);
> > +   return -ENOMEM;
> > +   }
> > +
> > +   /* Which kernfs node attributes should be updated from
> > +* time?
> > +*/
> > +
> > kernfs_refresh_inode(kn, inode);
> > mutex_unlock(_mutex);
> 
> I don't see how this would reflect the changes from kernfs_setattr()
> into
> the attached inode. This would actually make the attr updates
> obviously racy
> - the userland visible attrs would be stale until the inode gets
> reclaimed
> and then when it gets reinstantiated it'd show the latest
> information.

Right, I will have to think about that, but as I say above this
isn't really what I would propose.

If .update_times() sticks strictly to what kernfs_refresh_inode()
does now then it would set the inode attributes from the node iattr
only.

> 
> That said, if you wanna take the direction where attr updates are
> reflected
> to the associated inode when the change occurs, which makes sense,
> the right
> thing to do would be making kernfs_setattr() update the associated
> inode if
> existent.

Mmm, that's a good point but it looks like the inode isn't available
there.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-19 Thread Tejun Heo
Hello,

On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> And looking further I see there's a race that kernfs can't do anything
> about between kernfs_refresh_inode() and fs/inode.c:update_times().

Do kernfs files end up calling into that path tho? Doesn't look like it to
me but if so yeah we'd need to override the update_time for kernfs.

> +static int kernfs_iop_update_time(struct inode *inode, struct timespec64 
> *time, int flags)
>  {
> - struct inode *inode = d_inode(path->dentry);
>   struct kernfs_node *kn = inode->i_private;
> + struct kernfs_iattrs *attrs;
>  
>   mutex_lock(_mutex);
> + attrs = kernfs_iattrs(kn);
> + if (!attrs) {
> + mutex_unlock(_mutex);
> + return -ENOMEM;
> + }
> +
> + /* Which kernfs node attributes should be updated from
> +  * time?
> +  */
> +
>   kernfs_refresh_inode(kn, inode);
>   mutex_unlock(_mutex);

I don't see how this would reflect the changes from kernfs_setattr() into
the attached inode. This would actually make the attr updates obviously racy
- the userland visible attrs would be stale until the inode gets reclaimed
and then when it gets reinstantiated it'd show the latest information.

That said, if you wanna take the direction where attr updates are reflected
to the associated inode when the change occurs, which makes sense, the right
thing to do would be making kernfs_setattr() update the associated inode if
existent.

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Fox Chen
On Sat, Dec 19, 2020 at 8:53 AM Ian Kent  wrote:
>
> On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent  wrote:
> > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent 
> > > > wrote:
> > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > > > a pointer and dynamically allocate it but even that is
> > > > > > > > too
> > > > > > > > costly a size addition to the kernfs node structure as
> > > > > > > > Tejun has said.
> > > > > > >
> > > > > > > I guess the question to ask is, is there really a need to
> > > > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > > > reading/checking functions.
> > > > > > >
> > > > > > > Would it be sufficient to refresh the inode in the
> > > > > > > write/set
> > > > > > > operations in (if there's any) places where things like
> > > > > > > setattr_copy() is not already called?
> > > > > > >
> > > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > >
> > > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > > sysfs
> > > > > > namespace is
> > > > > > implemented, so I don't think there's an easy around that.
> > > > > > The
> > > > > > only
> > > > > > thing I
> > > > > > can think of is embedding the lock into attrs and doing xchg
> > > > > > dance
> > > > > > when
> > > > > > attaching it.
> > > > >
> > > > > Sounds like your saying it would be ok to add a lock to the
> > > > > attrs structure, am I correct?
> > > > >
> > > > > Assuming it is then, to keep things simple, use two locks.
> > > > >
> > > > > One global lock for the allocation and an attrs lock for all
> > > > > the
> > > > > attrs field updates including the kernfs_refresh_inode()
> > > > > update.
> > > > >
> > > > > The critical section for the global lock could be reduced and
> > > > > it
> > > > > changed to a spin lock.
> > > > >
> > > > > In __kernfs_iattrs() we would have something like:
> > > > >
> > > > > take the allocation lock
> > > > > do the allocated checks
> > > > >   assign if existing attrs
> > > > >   release the allocation lock
> > > > >   return existing if found
> > > > > othewise
> > > > >   release the allocation lock
> > > > >
> > > > > allocate and initialize attrs
> > > > >
> > > > > take the allocation lock
> > > > > check if someone beat us to it
> > > > >   free and grab exiting attrs
> > > > > otherwise
> > > > >   assign the new attrs
> > > > > release the allocation lock
> > > > > return attrs
> > > > >
> > > > > Add a spinlock to the attrs struct and use it everywhere for
> > > > > field updates.
> > > > >
> > > > > Am I on the right track or can you see problems with this?
> > > > >
> > > > > Ian
> > > > >
> > > >
> > > > umm, we update the inode in kernfs_refresh_inode, right??  So I
> > > > guess
> > > > the problem is how can we protect the inode when
> > > > kernfs_refresh_inode
> > > > is called, not the attrs??
> > >
> > > But the attrs (which is what's copied from) were protected by the
> > > mutex lock (IIUC) so dealing with the inode attributes implies
> > > dealing with the kernfs node attrs too.
> > >
> > > For example in kernfs_iop_setattr() the call to setattr_copy()
> > > copies
> > > the node attrs to the inode under the same mutex lock. So, if a
> > > read
> > > lock is used the copy in kernfs_refresh_inode() is no longer
> > > protected,
> > > it needs to be protected in a different way.
> > >
> >
> > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for
> > .setattr but
> >  no lock for .getattr (misdocumented?? sometimes they have as you've
> > found out)?
> > What does it protect against?? Because .permission does a similar
> > thing
> > here -- updating inode attributes, the goal is to provide the same
> > protection level
> > for .permission as for .setattr, am I right???
>
> As far as the documentation goes that's probably my misunderstanding
> of it.
>
> It does happen that the VFS makes assumptions about how call backs
> are meant to be used.
>
> Read like call backs, like .getattr() and .permission() are meant to
> be used, well, like read like functions so the VFS should be ok to
> take locks or not based on the operation context at hand.
>
> So it's not about the locking for these call backs per se, it's about
> the context in which they are called.
>
> For example, in link_path_walk(), at the beginning of the component
> lookup loop (essentially for the containing directory at that point),
> may_lookup() is called which leads to a call to .permission() without
> any inode lock held at that point.
>
> But file opens (possibly following a path walk to resolve a path)
> are different.
>
> For example, do_filp_open() calls path_openat() which leads to a
> call to open_last_lookups(), which leads 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Ian Kent
On Fri, 2020-12-18 at 09:59 -0500, Tejun Heo wrote:
> Hello,
> 
> On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote:
> > Sounds like your saying it would be ok to add a lock to the
> > attrs structure, am I correct?
> 
> Yeah, adding a lock to attrs is a lot less of a problem and it looks
> like
> it's gonna have to be either that or hashed locks, which might
> actually make
> sense if we're worried about the size of attrs (I don't think we need
> to).

Maybe that isn't needed.

And looking further I see there's a race that kernfs can't do anything
about between kernfs_refresh_inode() and fs/inode.c:update_times().

kernfs could avoid fighting with the VFS to keep the attributes set to
those of the kernfs node by using the inode operation .update_times()
and, if it makes sense, the kernfs node attributes that it wants to be
updated on file system activity could also be updated here.

I can't find any reason why this shouldn't be done but kernfs is
fairly widely used in other kernel subsystems so what does everyone
think of this patch, updated to set kernfs node attributes that
should be updated of course, see comment in the patch?

kernfs: fix attributes update race

From: Ian Kent 

kernfs uses kernfs_refresh_inode() (called from kernfs_iop_getattr()
and kernfs_iop_permission()) to keep the inode attributes set to the
attibutes of the kernfs node.

But there is no way for kernfs to prevent racing with the function
fs/inode.c:update_times().

The better choice is to use the inode operation .update_times() and
just let the VFS use the generic functions for .getattr() and
.permission().

Signed-off-by: Ian Kent 
---
 fs/kernfs/inode.c   |   37 ++---
 fs/kernfs/kernfs-internal.h |4 +---
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fc2469a20fed..51780329590c 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -24,9 +24,8 @@ static const struct address_space_operations kernfs_aops = {
 };
 
 static const struct inode_operations kernfs_iops = {
-   .permission = kernfs_iop_permission,
+   .update_time= kernfs_update_time,
.setattr= kernfs_iop_setattr,
-   .getattr= kernfs_iop_getattr,
.listxattr  = kernfs_iop_listxattr,
 };
 
@@ -183,18 +182,26 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, 
struct inode *inode)
set_nlink(inode, kn->dir.subdirs + 2);
 }
 
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-  u32 request_mask, unsigned int query_flags)
+static int kernfs_iop_update_time(struct inode *inode, struct timespec64 
*time, int flags)
 {
-   struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
+   struct kernfs_iattrs *attrs;
 
mutex_lock(_mutex);
+   attrs = kernfs_iattrs(kn);
+   if (!attrs) {
+   mutex_unlock(_mutex);
+   return -ENOMEM;
+   }
+
+   /* Which kernfs node attributes should be updated from
+* time?
+*/
+
kernfs_refresh_inode(kn, inode);
mutex_unlock(_mutex);
 
-   generic_fillattr(inode, stat);
-   return 0;
+   return 0
 }
 
 static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
@@ -272,22 +279,6 @@ void kernfs_evict_inode(struct inode *inode)
kernfs_put(kn);
 }
 
-int kernfs_iop_permission(struct inode *inode, int mask)
-{
-   struct kernfs_node *kn;
-
-   if (mask & MAY_NOT_BLOCK)
-   return -ECHILD;
-
-   kn = inode->i_private;
-
-   mutex_lock(_mutex);
-   kernfs_refresh_inode(kn, inode);
-   mutex_unlock(_mutex);
-
-   return generic_permission(inode, mask);
-}
-
 int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
 void *value, size_t size)
 {
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7ee97ef59184..98d08b928f93 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -89,10 +89,8 @@ extern struct kmem_cache *kernfs_node_cache, 
*kernfs_iattrs_cache;
  */
 extern const struct xattr_handler *kernfs_xattr_handlers[];
 void kernfs_evict_inode(struct inode *inode);
-int kernfs_iop_permission(struct inode *inode, int mask);
+int kernfs_update_time(struct inode *inode, struct timespec64 *time, int 
flags);
 int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr);
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-  u32 request_mask, unsigned int query_flags);
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
 int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
 



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Ian Kent
On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote:
> On Fri, Dec 18, 2020 at 7:21 PM Ian Kent  wrote:
> > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent 
> > > wrote:
> > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > > a pointer and dynamically allocate it but even that is
> > > > > > > too
> > > > > > > costly a size addition to the kernfs node structure as
> > > > > > > Tejun has said.
> > > > > > 
> > > > > > I guess the question to ask is, is there really a need to
> > > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > > reading/checking functions.
> > > > > > 
> > > > > > Would it be sufficient to refresh the inode in the
> > > > > > write/set
> > > > > > operations in (if there's any) places where things like
> > > > > > setattr_copy() is not already called?
> > > > > > 
> > > > > > Perhaps GKH or Tejun could comment on this?
> > > > > 
> > > > > My memory is a bit hazy but invalidations on reads is how
> > > > > sysfs
> > > > > namespace is
> > > > > implemented, so I don't think there's an easy around that.
> > > > > The
> > > > > only
> > > > > thing I
> > > > > can think of is embedding the lock into attrs and doing xchg
> > > > > dance
> > > > > when
> > > > > attaching it.
> > > > 
> > > > Sounds like your saying it would be ok to add a lock to the
> > > > attrs structure, am I correct?
> > > > 
> > > > Assuming it is then, to keep things simple, use two locks.
> > > > 
> > > > One global lock for the allocation and an attrs lock for all
> > > > the
> > > > attrs field updates including the kernfs_refresh_inode()
> > > > update.
> > > > 
> > > > The critical section for the global lock could be reduced and
> > > > it
> > > > changed to a spin lock.
> > > > 
> > > > In __kernfs_iattrs() we would have something like:
> > > > 
> > > > take the allocation lock
> > > > do the allocated checks
> > > >   assign if existing attrs
> > > >   release the allocation lock
> > > >   return existing if found
> > > > othewise
> > > >   release the allocation lock
> > > > 
> > > > allocate and initialize attrs
> > > > 
> > > > take the allocation lock
> > > > check if someone beat us to it
> > > >   free and grab exiting attrs
> > > > otherwise
> > > >   assign the new attrs
> > > > release the allocation lock
> > > > return attrs
> > > > 
> > > > Add a spinlock to the attrs struct and use it everywhere for
> > > > field updates.
> > > > 
> > > > Am I on the right track or can you see problems with this?
> > > > 
> > > > Ian
> > > > 
> > > 
> > > umm, we update the inode in kernfs_refresh_inode, right??  So I
> > > guess
> > > the problem is how can we protect the inode when
> > > kernfs_refresh_inode
> > > is called, not the attrs??
> > 
> > But the attrs (which is what's copied from) were protected by the
> > mutex lock (IIUC) so dealing with the inode attributes implies
> > dealing with the kernfs node attrs too.
> > 
> > For example in kernfs_iop_setattr() the call to setattr_copy()
> > copies
> > the node attrs to the inode under the same mutex lock. So, if a
> > read
> > lock is used the copy in kernfs_refresh_inode() is no longer
> > protected,
> > it needs to be protected in a different way.
> > 
> 
> Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for
> .setattr but
>  no lock for .getattr (misdocumented?? sometimes they have as you've
> found out)?
> What does it protect against?? Because .permission does a similar
> thing
> here -- updating inode attributes, the goal is to provide the same
> protection level
> for .permission as for .setattr, am I right???

As far as the documentation goes that's probably my misunderstanding
of it.

It does happen that the VFS makes assumptions about how call backs
are meant to be used.

Read like call backs, like .getattr() and .permission() are meant to
be used, well, like read like functions so the VFS should be ok to
take locks or not based on the operation context at hand.

So it's not about the locking for these call backs per se, it's about
the context in which they are called.

For example, in link_path_walk(), at the beginning of the component
lookup loop (essentially for the containing directory at that point),
may_lookup() is called which leads to a call to .permission() without
any inode lock held at that point.

But file opens (possibly following a path walk to resolve a path)
are different.

For example, do_filp_open() calls path_openat() which leads to a
call to open_last_lookups(), which leads to a call to .permission()
along the way. And in this case there are two contexts, an open()
create or one without create, the former needing the exclusive inode
lock and the later able to use the shared lock.

So it's about the locking needed for the encompassing operation that
is 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Tejun Heo
Hello,

On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote:
> Sounds like your saying it would be ok to add a lock to the
> attrs structure, am I correct?

Yeah, adding a lock to attrs is a lot less of a problem and it looks like
it's gonna have to be either that or hashed locks, which might actually make
sense if we're worried about the size of attrs (I don't think we need to).

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Fox Chen
On Fri, Dec 18, 2020 at 7:21 PM Ian Kent  wrote:
>
> On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent  wrote:
> > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > > What could be done is to make the kernfs node attr_mutex
> > > > > > a pointer and dynamically allocate it but even that is too
> > > > > > costly a size addition to the kernfs node structure as
> > > > > > Tejun has said.
> > > > >
> > > > > I guess the question to ask is, is there really a need to
> > > > > call kernfs_refresh_inode() from functions that are usually
> > > > > reading/checking functions.
> > > > >
> > > > > Would it be sufficient to refresh the inode in the write/set
> > > > > operations in (if there's any) places where things like
> > > > > setattr_copy() is not already called?
> > > > >
> > > > > Perhaps GKH or Tejun could comment on this?
> > > >
> > > > My memory is a bit hazy but invalidations on reads is how sysfs
> > > > namespace is
> > > > implemented, so I don't think there's an easy around that. The
> > > > only
> > > > thing I
> > > > can think of is embedding the lock into attrs and doing xchg
> > > > dance
> > > > when
> > > > attaching it.
> > >
> > > Sounds like your saying it would be ok to add a lock to the
> > > attrs structure, am I correct?
> > >
> > > Assuming it is then, to keep things simple, use two locks.
> > >
> > > One global lock for the allocation and an attrs lock for all the
> > > attrs field updates including the kernfs_refresh_inode() update.
> > >
> > > The critical section for the global lock could be reduced and it
> > > changed to a spin lock.
> > >
> > > In __kernfs_iattrs() we would have something like:
> > >
> > > take the allocation lock
> > > do the allocated checks
> > >   assign if existing attrs
> > >   release the allocation lock
> > >   return existing if found
> > > othewise
> > >   release the allocation lock
> > >
> > > allocate and initialize attrs
> > >
> > > take the allocation lock
> > > check if someone beat us to it
> > >   free and grab exiting attrs
> > > otherwise
> > >   assign the new attrs
> > > release the allocation lock
> > > return attrs
> > >
> > > Add a spinlock to the attrs struct and use it everywhere for
> > > field updates.
> > >
> > > Am I on the right track or can you see problems with this?
> > >
> > > Ian
> > >
> >
> > umm, we update the inode in kernfs_refresh_inode, right??  So I guess
> > the problem is how can we protect the inode when kernfs_refresh_inode
> > is called, not the attrs??
>
> But the attrs (which is what's copied from) were protected by the
> mutex lock (IIUC) so dealing with the inode attributes implies
> dealing with the kernfs node attrs too.
>
> For example in kernfs_iop_setattr() the call to setattr_copy() copies
> the node attrs to the inode under the same mutex lock. So, if a read
> lock is used the copy in kernfs_refresh_inode() is no longer protected,
> it needs to be protected in a different way.
>

Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for .setattr but
 no lock for .getattr (misdocumented?? sometimes they have as you've found out)?
What does it protect against?? Because .permission does a similar thing
here -- updating inode attributes, the goal is to provide the same
protection level
for .permission as for .setattr, am I right???


thanks,
fox


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Ian Kent
On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote:
> On Fri, Dec 18, 2020 at 3:36 PM Ian Kent  wrote:
> > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > > What could be done is to make the kernfs node attr_mutex
> > > > > a pointer and dynamically allocate it but even that is too
> > > > > costly a size addition to the kernfs node structure as
> > > > > Tejun has said.
> > > > 
> > > > I guess the question to ask is, is there really a need to
> > > > call kernfs_refresh_inode() from functions that are usually
> > > > reading/checking functions.
> > > > 
> > > > Would it be sufficient to refresh the inode in the write/set
> > > > operations in (if there's any) places where things like
> > > > setattr_copy() is not already called?
> > > > 
> > > > Perhaps GKH or Tejun could comment on this?
> > > 
> > > My memory is a bit hazy but invalidations on reads is how sysfs
> > > namespace is
> > > implemented, so I don't think there's an easy around that. The
> > > only
> > > thing I
> > > can think of is embedding the lock into attrs and doing xchg
> > > dance
> > > when
> > > attaching it.
> > 
> > Sounds like your saying it would be ok to add a lock to the
> > attrs structure, am I correct?
> > 
> > Assuming it is then, to keep things simple, use two locks.
> > 
> > One global lock for the allocation and an attrs lock for all the
> > attrs field updates including the kernfs_refresh_inode() update.
> > 
> > The critical section for the global lock could be reduced and it
> > changed to a spin lock.
> > 
> > In __kernfs_iattrs() we would have something like:
> > 
> > take the allocation lock
> > do the allocated checks
> >   assign if existing attrs
> >   release the allocation lock
> >   return existing if found
> > othewise
> >   release the allocation lock
> > 
> > allocate and initialize attrs
> > 
> > take the allocation lock
> > check if someone beat us to it
> >   free and grab exiting attrs
> > otherwise
> >   assign the new attrs
> > release the allocation lock
> > return attrs
> > 
> > Add a spinlock to the attrs struct and use it everywhere for
> > field updates.
> > 
> > Am I on the right track or can you see problems with this?
> > 
> > Ian
> > 
> 
> umm, we update the inode in kernfs_refresh_inode, right??  So I guess
> the problem is how can we protect the inode when kernfs_refresh_inode
> is called, not the attrs??

But the attrs (which is what's copied from) were protected by the
mutex lock (IIUC) so dealing with the inode attributes implies
dealing with the kernfs node attrs too.

For example in kernfs_iop_setattr() the call to setattr_copy() copies
the node attrs to the inode under the same mutex lock. So, if a read
lock is used the copy in kernfs_refresh_inode() is no longer protected,
it needs to be protected in a different way.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Fox Chen
On Fri, Dec 18, 2020 at 3:36 PM Ian Kent  wrote:
>
> On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > > What could be done is to make the kernfs node attr_mutex
> > > > a pointer and dynamically allocate it but even that is too
> > > > costly a size addition to the kernfs node structure as
> > > > Tejun has said.
> > >
> > > I guess the question to ask is, is there really a need to
> > > call kernfs_refresh_inode() from functions that are usually
> > > reading/checking functions.
> > >
> > > Would it be sufficient to refresh the inode in the write/set
> > > operations in (if there's any) places where things like
> > > setattr_copy() is not already called?
> > >
> > > Perhaps GKH or Tejun could comment on this?
> >
> > My memory is a bit hazy but invalidations on reads is how sysfs
> > namespace is
> > implemented, so I don't think there's an easy around that. The only
> > thing I
> > can think of is embedding the lock into attrs and doing xchg dance
> > when
> > attaching it.
>
> Sounds like your saying it would be ok to add a lock to the
> attrs structure, am I correct?
>
> Assuming it is then, to keep things simple, use two locks.
>
> One global lock for the allocation and an attrs lock for all the
> attrs field updates including the kernfs_refresh_inode() update.
>
> The critical section for the global lock could be reduced and it
> changed to a spin lock.
>
> In __kernfs_iattrs() we would have something like:
>
> take the allocation lock
> do the allocated checks
>   assign if existing attrs
>   release the allocation lock
>   return existing if found
> othewise
>   release the allocation lock
>
> allocate and initialize attrs
>
> take the allocation lock
> check if someone beat us to it
>   free and grab exiting attrs
> otherwise
>   assign the new attrs
> release the allocation lock
> return attrs
>
> Add a spinlock to the attrs struct and use it everywhere for
> field updates.
>
> Am I on the right track or can you see problems with this?
>
> Ian
>

umm, we update the inode in kernfs_refresh_inode, right??  So I guess
the problem is how can we protect the inode when kernfs_refresh_inode
is called, not the attrs??


thanks,
fox


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> Hello,
> 
> On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > What could be done is to make the kernfs node attr_mutex
> > > a pointer and dynamically allocate it but even that is too
> > > costly a size addition to the kernfs node structure as
> > > Tejun has said.
> > 
> > I guess the question to ask is, is there really a need to
> > call kernfs_refresh_inode() from functions that are usually
> > reading/checking functions.
> > 
> > Would it be sufficient to refresh the inode in the write/set
> > operations in (if there's any) places where things like
> > setattr_copy() is not already called?
> > 
> > Perhaps GKH or Tejun could comment on this?
> 
> My memory is a bit hazy but invalidations on reads is how sysfs
> namespace is
> implemented, so I don't think there's an easy around that. The only
> thing I
> can think of is embedding the lock into attrs and doing xchg dance
> when
> attaching it.

Sounds like your saying it would be ok to add a lock to the
attrs structure, am I correct?

Assuming it is then, to keep things simple, use two locks.

One global lock for the allocation and an attrs lock for all the
attrs field updates including the kernfs_refresh_inode() update.

The critical section for the global lock could be reduced and it
changed to a spin lock.

In __kernfs_iattrs() we would have something like:

take the allocation lock
do the allocated checks
  assign if existing attrs
  release the allocation lock
  return existing if found
othewise
  release the allocation lock

allocate and initialize attrs

take the allocation lock
check if someone beat us to it
  free and grab exiting attrs
otherwise
  assign the new attrs
release the allocation lock
return attrs

Add a spinlock to the attrs struct and use it everywhere for
field updates.

Am I on the right track or can you see problems with this?

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Tejun Heo
Hello,

On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > What could be done is to make the kernfs node attr_mutex
> > a pointer and dynamically allocate it but even that is too
> > costly a size addition to the kernfs node structure as
> > Tejun has said.
> 
> I guess the question to ask is, is there really a need to
> call kernfs_refresh_inode() from functions that are usually
> reading/checking functions.
> 
> Would it be sufficient to refresh the inode in the write/set
> operations in (if there's any) places where things like
> setattr_copy() is not already called?
> 
> Perhaps GKH or Tejun could comment on this?

My memory is a bit hazy but invalidations on reads is how sysfs namespace is
implemented, so I don't think there's an easy around that. The only thing I
can think of is embedding the lock into attrs and doing xchg dance when
attaching it.

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 19:09 +0800, Ian Kent wrote:
> On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote:
> > On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> > > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent 
> > > wrote:
> > > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent 
> > > > > > wrote:
> > > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <
> > > > > > > > ra...@themaw.net
> > > > > > > > wrote:
> > > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > > > > > attr_mutex,
> > > > > > > > > > > > as
> > > > > > > > > > > > Tejun
> > > > > > > > > > > > mentioned here
> > > > > > > > > > > > (
> > > > > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > > > > > ),
> > > > > > > > > > > > maybe a global
> > > > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > > > 
> > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could
> > > > > > > > > > > be
> > > > > > > > > > > used
> > > > > > > > > > > around
> > > > > > > > > > > the
> > > > > > > > > > > initial check and checked again at the end which
> > > > > > > > > > > would
> > > > > > > > > > > probably
> > > > > > > > > > > have
> > > > > > > > > > > been much faster but much less conservative and a
> > > > > > > > > > > bit
> > > > > > > > > > > more
> > > > > > > > > > > ugly
> > > > > > > > > > > so
> > > > > > > > > > > I just went the conservative path since there was
> > > > > > > > > > > so
> > > > > > > > > > > much
> > > > > > > > > > > change
> > > > > > > > > > > already.
> > > > > > > > > > 
> > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > > > didn't
> > > > > > > > > > remember
> > > > > > > > > > it.
> > > > > > > > > > 
> > > > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > > > work.
> > > > > > > > > 
> > > > > > > > > Those attribute handling patches were meant to allow
> > > > > > > > > taking
> > > > > > > > > the
> > > > > > > > > rw
> > > > > > > > > sem read lock instead of the write lock for
> > > > > > > > > kernfs_refresh_inode()
> > > > > > > > > updates, with the added locking to protect the inode
> > > > > > > > > attributes
> > > > > > > > > update since it's called from the VFS both with and
> > > > > > > > > without
> > > > > > > > > the
> > > > > > > > > inode lock.
> > > > > > > > 
> > > > > > > > Oh, understood. I was asking also because lock on kn-
> > > > > > > > > attr_mutex
> > > > > > > > drags
> > > > > > > > concurrent performance.
> > > > > > > > 
> > > > > > > > > Looking around it looks like kernfs_iattrs() is
> > > > > > > > > called
> > > > > > > > > from
> > > > > > > > > multiple
> > > > > > > > > places without a node database lock at all.
> > > > > > > > > 
> > > > > > > > > I'm thinking that, to keep my proposed change
> > > > > > > > > straight
> > > > > > > > > forward
> > > > > > > > > and on topic, I should just leave
> > > > > > > > > kernfs_refresh_inode()
> > > > > > > > > taking
> > > > > > > > > the node db write lock for now and consider the
> > > > > > > > > attributes
> > > > > > > > > handling
> > > > > > > > > as a separate change. Once that's done we could
> > > > > > > > > reconsider
> > > > > > > > > what's
> > > > > > > > > needed to use the node db read lock in
> > > > > > > > > kernfs_refresh_inode().
> > > > > > > > 
> > > > > > > > You meant taking write lock of kernfs_rwsem for
> > > > > > > > kernfs_refresh_inode()??
> > > > > > > > It may be a lot slower in my benchmark, let me test it.
> > > > > > > 
> > > > > > > Yes, but make sure the write lock of kernfs_rwsem is
> > > > > > > being
> > > > > > > taken
> > > > > > > not the read lock.
> > > > > > > 
> > > > > > > That's a mistake I had initially?
> > > > > > > 
> > > > > > > Still, that attributes handling is, I think, sufficient
> > > > > > > to
> > > > > > > warrant
> > > > > > > a separate change since it looks like it might need work,
> > > > > > > the
> > > > > > > kernfs
> > > > > > > node db probably should be kept stable for those
> > > > > > > attribute
> > > > > > > updates
> > > > > > > but equally the existence of an instantiated dentry might
> > > > > > > mitigate
> > > > > > > the it.
> > > > > > > 
> > > > > > > Some people might just know whether it's ok or not but I
> > > > > > > would
> > > > > > > like
> > > > > > > to check the callers to work out what's going on.
> > > > > > > 
> > > > > > > In any case it's academic if GCH isn't willing to
> > > > > > > consider
> > > > > > > the
> > > > > > > series
> > > > > > > for review and possible merge.
> > > > > > > 
> > > > > > Hi Ian
> > > > > > 
> > > > > > 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote:
> On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent  wrote:
> > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent 
> > > > > wrote:
> > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <
> > > > > > > ra...@themaw.net
> > > > > > > wrote:
> > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > > > > attr_mutex,
> > > > > > > > > > > as
> > > > > > > > > > > Tejun
> > > > > > > > > > > mentioned here
> > > > > > > > > > > (
> > > > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > > > > ),
> > > > > > > > > > > maybe a global
> > > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > > 
> > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be
> > > > > > > > > > used
> > > > > > > > > > around
> > > > > > > > > > the
> > > > > > > > > > initial check and checked again at the end which
> > > > > > > > > > would
> > > > > > > > > > probably
> > > > > > > > > > have
> > > > > > > > > > been much faster but much less conservative and a
> > > > > > > > > > bit
> > > > > > > > > > more
> > > > > > > > > > ugly
> > > > > > > > > > so
> > > > > > > > > > I just went the conservative path since there was
> > > > > > > > > > so
> > > > > > > > > > much
> > > > > > > > > > change
> > > > > > > > > > already.
> > > > > > > > > 
> > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > > didn't
> > > > > > > > > remember
> > > > > > > > > it.
> > > > > > > > > 
> > > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > > work.
> > > > > > > > 
> > > > > > > > Those attribute handling patches were meant to allow
> > > > > > > > taking
> > > > > > > > the
> > > > > > > > rw
> > > > > > > > sem read lock instead of the write lock for
> > > > > > > > kernfs_refresh_inode()
> > > > > > > > updates, with the added locking to protect the inode
> > > > > > > > attributes
> > > > > > > > update since it's called from the VFS both with and
> > > > > > > > without
> > > > > > > > the
> > > > > > > > inode lock.
> > > > > > > 
> > > > > > > Oh, understood. I was asking also because lock on kn-
> > > > > > > > attr_mutex
> > > > > > > drags
> > > > > > > concurrent performance.
> > > > > > > 
> > > > > > > > Looking around it looks like kernfs_iattrs() is called
> > > > > > > > from
> > > > > > > > multiple
> > > > > > > > places without a node database lock at all.
> > > > > > > > 
> > > > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > > > forward
> > > > > > > > and on topic, I should just leave
> > > > > > > > kernfs_refresh_inode()
> > > > > > > > taking
> > > > > > > > the node db write lock for now and consider the
> > > > > > > > attributes
> > > > > > > > handling
> > > > > > > > as a separate change. Once that's done we could
> > > > > > > > reconsider
> > > > > > > > what's
> > > > > > > > needed to use the node db read lock in
> > > > > > > > kernfs_refresh_inode().
> > > > > > > 
> > > > > > > You meant taking write lock of kernfs_rwsem for
> > > > > > > kernfs_refresh_inode()??
> > > > > > > It may be a lot slower in my benchmark, let me test it.
> > > > > > 
> > > > > > Yes, but make sure the write lock of kernfs_rwsem is being
> > > > > > taken
> > > > > > not the read lock.
> > > > > > 
> > > > > > That's a mistake I had initially?
> > > > > > 
> > > > > > Still, that attributes handling is, I think, sufficient to
> > > > > > warrant
> > > > > > a separate change since it looks like it might need work,
> > > > > > the
> > > > > > kernfs
> > > > > > node db probably should be kept stable for those attribute
> > > > > > updates
> > > > > > but equally the existence of an instantiated dentry might
> > > > > > mitigate
> > > > > > the it.
> > > > > > 
> > > > > > Some people might just know whether it's ok or not but I
> > > > > > would
> > > > > > like
> > > > > > to check the callers to work out what's going on.
> > > > > > 
> > > > > > In any case it's academic if GCH isn't willing to consider
> > > > > > the
> > > > > > series
> > > > > > for review and possible merge.
> > > > > > 
> > > > > Hi Ian
> > > > > 
> > > > > I removed kn->attr_mutex and changed read lock to write lock
> > > > > for
> > > > > kernfs_refresh_inode
> > > > > 
> > > > > down_write(_rwsem);
> > > > > kernfs_refresh_inode(kn, inode);
> > > > > up_write(_rwsem);
> > > > > 
> > > > > 
> > > > > Unfortunate, changes in this way make things worse,  my
> > > > > benchmark
> > > > > runs
> > > > > 100% slower than upstream sysfs.  :(
> > > > > open+read+close a 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Ian Kent
On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote:
> On Thu, Dec 17, 2020 at 12:46 PM Ian Kent  wrote:
> > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent 
> > > > wrote:
> > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent  > > > > > >
> > > > > > wrote:
> > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > > > attr_mutex,
> > > > > > > > > > as
> > > > > > > > > > Tejun
> > > > > > > > > > mentioned here
> > > > > > > > > > (
> > > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > > > ),
> > > > > > > > > > maybe a global
> > > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > > > 
> > > > > > > > > I wasn't sure about that, IIRC a spin lock could be
> > > > > > > > > used
> > > > > > > > > around
> > > > > > > > > the
> > > > > > > > > initial check and checked again at the end which
> > > > > > > > > would
> > > > > > > > > probably
> > > > > > > > > have
> > > > > > > > > been much faster but much less conservative and a bit
> > > > > > > > > more
> > > > > > > > > ugly
> > > > > > > > > so
> > > > > > > > > I just went the conservative path since there was so
> > > > > > > > > much
> > > > > > > > > change
> > > > > > > > > already.
> > > > > > > > 
> > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH
> > > > > > > > didn't
> > > > > > > > remember
> > > > > > > > it.
> > > > > > > > 
> > > > > > > > Based on what Tejun said it sounds like that needs
> > > > > > > > work.
> > > > > > > 
> > > > > > > Those attribute handling patches were meant to allow
> > > > > > > taking
> > > > > > > the
> > > > > > > rw
> > > > > > > sem read lock instead of the write lock for
> > > > > > > kernfs_refresh_inode()
> > > > > > > updates, with the added locking to protect the inode
> > > > > > > attributes
> > > > > > > update since it's called from the VFS both with and
> > > > > > > without
> > > > > > > the
> > > > > > > inode lock.
> > > > > > 
> > > > > > Oh, understood. I was asking also because lock on kn-
> > > > > > > attr_mutex
> > > > > > drags
> > > > > > concurrent performance.
> > > > > > 
> > > > > > > Looking around it looks like kernfs_iattrs() is called
> > > > > > > from
> > > > > > > multiple
> > > > > > > places without a node database lock at all.
> > > > > > > 
> > > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > > forward
> > > > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > > > taking
> > > > > > > the node db write lock for now and consider the
> > > > > > > attributes
> > > > > > > handling
> > > > > > > as a separate change. Once that's done we could
> > > > > > > reconsider
> > > > > > > what's
> > > > > > > needed to use the node db read lock in
> > > > > > > kernfs_refresh_inode().
> > > > > > 
> > > > > > You meant taking write lock of kernfs_rwsem for
> > > > > > kernfs_refresh_inode()??
> > > > > > It may be a lot slower in my benchmark, let me test it.
> > > > > 
> > > > > Yes, but make sure the write lock of kernfs_rwsem is being
> > > > > taken
> > > > > not the read lock.
> > > > > 
> > > > > That's a mistake I had initially?
> > > > > 
> > > > > Still, that attributes handling is, I think, sufficient to
> > > > > warrant
> > > > > a separate change since it looks like it might need work, the
> > > > > kernfs
> > > > > node db probably should be kept stable for those attribute
> > > > > updates
> > > > > but equally the existence of an instantiated dentry might
> > > > > mitigate
> > > > > the it.
> > > > > 
> > > > > Some people might just know whether it's ok or not but I
> > > > > would
> > > > > like
> > > > > to check the callers to work out what's going on.
> > > > > 
> > > > > In any case it's academic if GCH isn't willing to consider
> > > > > the
> > > > > series
> > > > > for review and possible merge.
> > > > > 
> > > > Hi Ian
> > > > 
> > > > I removed kn->attr_mutex and changed read lock to write lock
> > > > for
> > > > kernfs_refresh_inode
> > > > 
> > > > down_write(_rwsem);
> > > > kernfs_refresh_inode(kn, inode);
> > > > up_write(_rwsem);
> > > > 
> > > > 
> > > > Unfortunate, changes in this way make things worse,  my
> > > > benchmark
> > > > runs
> > > > 100% slower than upstream sysfs.  :(
> > > > open+read+close a sysfs file concurrently took 1000us.
> > > > (Currently,
> > > > sysfs with a big mutex kernfs_mutex only takes ~500us
> > > > for one open+read+close operation concurrently)
> > > 
> > > Right, so it does need attention nowish.
> > > 
> > > I'll have a look at it in a while, I really need to get a new
> > > autofs
> > > release out, and there are quite a few changes, and testing is
> > > 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Fox Chen
On Thu, Dec 17, 2020 at 12:46 PM Ian Kent  wrote:
>
> On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent  wrote:
> > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent 
> > > > > wrote:
> > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > > >attr_mutex,
> > > > > > > > > as
> > > > > > > > > Tejun
> > > > > > > > > mentioned here
> > > > > > > > > (
> > > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > > ),
> > > > > > > > > maybe a global
> > > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > >
> > > > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > > > around
> > > > > > > > the
> > > > > > > > initial check and checked again at the end which would
> > > > > > > > probably
> > > > > > > > have
> > > > > > > > been much faster but much less conservative and a bit
> > > > > > > > more
> > > > > > > > ugly
> > > > > > > > so
> > > > > > > > I just went the conservative path since there was so much
> > > > > > > > change
> > > > > > > > already.
> > > > > > >
> > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > > > remember
> > > > > > > it.
> > > > > > >
> > > > > > > Based on what Tejun said it sounds like that needs work.
> > > > > >
> > > > > > Those attribute handling patches were meant to allow taking
> > > > > > the
> > > > > > rw
> > > > > > sem read lock instead of the write lock for
> > > > > > kernfs_refresh_inode()
> > > > > > updates, with the added locking to protect the inode
> > > > > > attributes
> > > > > > update since it's called from the VFS both with and without
> > > > > > the
> > > > > > inode lock.
> > > > >
> > > > > Oh, understood. I was asking also because lock on kn-
> > > > > >attr_mutex
> > > > > drags
> > > > > concurrent performance.
> > > > >
> > > > > > Looking around it looks like kernfs_iattrs() is called from
> > > > > > multiple
> > > > > > places without a node database lock at all.
> > > > > >
> > > > > > I'm thinking that, to keep my proposed change straight
> > > > > > forward
> > > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > > taking
> > > > > > the node db write lock for now and consider the attributes
> > > > > > handling
> > > > > > as a separate change. Once that's done we could reconsider
> > > > > > what's
> > > > > > needed to use the node db read lock in
> > > > > > kernfs_refresh_inode().
> > > > >
> > > > > You meant taking write lock of kernfs_rwsem for
> > > > > kernfs_refresh_inode()??
> > > > > It may be a lot slower in my benchmark, let me test it.
> > > >
> > > > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > > > not the read lock.
> > > >
> > > > That's a mistake I had initially?
> > > >
> > > > Still, that attributes handling is, I think, sufficient to
> > > > warrant
> > > > a separate change since it looks like it might need work, the
> > > > kernfs
> > > > node db probably should be kept stable for those attribute
> > > > updates
> > > > but equally the existence of an instantiated dentry might
> > > > mitigate
> > > > the it.
> > > >
> > > > Some people might just know whether it's ok or not but I would
> > > > like
> > > > to check the callers to work out what's going on.
> > > >
> > > > In any case it's academic if GCH isn't willing to consider the
> > > > series
> > > > for review and possible merge.
> > > >
> > > Hi Ian
> > >
> > > I removed kn->attr_mutex and changed read lock to write lock for
> > > kernfs_refresh_inode
> > >
> > > down_write(_rwsem);
> > > kernfs_refresh_inode(kn, inode);
> > > up_write(_rwsem);
> > >
> > >
> > > Unfortunate, changes in this way make things worse,  my benchmark
> > > runs
> > > 100% slower than upstream sysfs.  :(
> > > open+read+close a sysfs file concurrently took 1000us. (Currently,
> > > sysfs with a big mutex kernfs_mutex only takes ~500us
> > > for one open+read+close operation concurrently)
> >
> > Right, so it does need attention nowish.
> >
> > I'll have a look at it in a while, I really need to get a new autofs
> > release out, and there are quite a few changes, and testing is seeing
> > a number of errors, some old, some newly introduced. It's proving
> > difficult.
>
> I've taken a breather for the autofs testing and had a look at this.

Thanks. :)

> I think my original analysis of this was wrong.
>
> Could you try this patch please.
> I'm not sure how much difference it will make but, in principle,
> it's much the same as the previous approach except it doesn't
> increase the kernfs node struct size or mess with the other
> attribute handling code.
>
> Note, this is not even compile tested.

I failed to apply this 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-16 Thread Ian Kent
On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote:
> On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent  wrote:
> > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent 
> > > > wrote:
> > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > > For the patches, there is a mutex_lock in kn-
> > > > > > > > >attr_mutex, 
> > > > > > > > as
> > > > > > > > Tejun
> > > > > > > > mentioned here
> > > > > > > > (
> > > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > > ),
> > > > > > > > maybe a global
> > > > > > > > rwsem for kn->iattr will be better??
> > > > > > > 
> > > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > > around
> > > > > > > the
> > > > > > > initial check and checked again at the end which would
> > > > > > > probably
> > > > > > > have
> > > > > > > been much faster but much less conservative and a bit
> > > > > > > more
> > > > > > > ugly
> > > > > > > so
> > > > > > > I just went the conservative path since there was so much
> > > > > > > change
> > > > > > > already.
> > > > > > 
> > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > > remember
> > > > > > it.
> > > > > > 
> > > > > > Based on what Tejun said it sounds like that needs work.
> > > > > 
> > > > > Those attribute handling patches were meant to allow taking
> > > > > the
> > > > > rw
> > > > > sem read lock instead of the write lock for
> > > > > kernfs_refresh_inode()
> > > > > updates, with the added locking to protect the inode
> > > > > attributes
> > > > > update since it's called from the VFS both with and without
> > > > > the
> > > > > inode lock.
> > > > 
> > > > Oh, understood. I was asking also because lock on kn-
> > > > >attr_mutex
> > > > drags
> > > > concurrent performance.
> > > > 
> > > > > Looking around it looks like kernfs_iattrs() is called from
> > > > > multiple
> > > > > places without a node database lock at all.
> > > > > 
> > > > > I'm thinking that, to keep my proposed change straight
> > > > > forward
> > > > > and on topic, I should just leave kernfs_refresh_inode()
> > > > > taking
> > > > > the node db write lock for now and consider the attributes
> > > > > handling
> > > > > as a separate change. Once that's done we could reconsider
> > > > > what's
> > > > > needed to use the node db read lock in
> > > > > kernfs_refresh_inode().
> > > > 
> > > > You meant taking write lock of kernfs_rwsem for
> > > > kernfs_refresh_inode()??
> > > > It may be a lot slower in my benchmark, let me test it.
> > > 
> > > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > > not the read lock.
> > > 
> > > That's a mistake I had initially?
> > > 
> > > Still, that attributes handling is, I think, sufficient to
> > > warrant
> > > a separate change since it looks like it might need work, the
> > > kernfs
> > > node db probably should be kept stable for those attribute
> > > updates
> > > but equally the existence of an instantiated dentry might
> > > mitigate
> > > the it.
> > > 
> > > Some people might just know whether it's ok or not but I would
> > > like
> > > to check the callers to work out what's going on.
> > > 
> > > In any case it's academic if GCH isn't willing to consider the
> > > series
> > > for review and possible merge.
> > > 
> > Hi Ian
> > 
> > I removed kn->attr_mutex and changed read lock to write lock for
> > kernfs_refresh_inode
> > 
> > down_write(_rwsem);
> > kernfs_refresh_inode(kn, inode);
> > up_write(_rwsem);
> > 
> > 
> > Unfortunate, changes in this way make things worse,  my benchmark
> > runs
> > 100% slower than upstream sysfs.  :(
> > open+read+close a sysfs file concurrently took 1000us. (Currently,
> > sysfs with a big mutex kernfs_mutex only takes ~500us
> > for one open+read+close operation concurrently)
> 
> Right, so it does need attention nowish.
> 
> I'll have a look at it in a while, I really need to get a new autofs
> release out, and there are quite a few changes, and testing is seeing
> a number of errors, some old, some newly introduced. It's proving
> difficult.

I've taken a breather for the autofs testing and had a look at this.

I think my original analysis of this was wrong.

Could you try this patch please.
I'm not sure how much difference it will make but, in principle,
it's much the same as the previous approach except it doesn't
increase the kernfs node struct size or mess with the other
attribute handling code.

Note, this is not even compile tested.

kernfs: use kernfs read lock in .getattr() and .permission()

From: Ian Kent 

>From Documenation/filesystems.rst and (slightly outdated) comments
in fs/attr.c the inode i_rwsem is used for attribute handling.

This lock satisfies the requirememnts needed to reduce lock contention,
namely a per-object lock needs to be used 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-15 Thread Ian Kent
On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote:
> On Mon, Dec 14, 2020 at 9:30 PM Ian Kent  wrote:
> > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent 
> > > wrote:
> > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, 
> > > > > > > as
> > > > > > > Tejun
> > > > > > > mentioned here
> > > > > > > (
> > > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > > ),
> > > > > > > maybe a global
> > > > > > > rwsem for kn->iattr will be better??
> > > > > > 
> > > > > > I wasn't sure about that, IIRC a spin lock could be used
> > > > > > around
> > > > > > the
> > > > > > initial check and checked again at the end which would
> > > > > > probably
> > > > > > have
> > > > > > been much faster but much less conservative and a bit more
> > > > > > ugly
> > > > > > so
> > > > > > I just went the conservative path since there was so much
> > > > > > change
> > > > > > already.
> > > > > 
> > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > > remember
> > > > > it.
> > > > > 
> > > > > Based on what Tejun said it sounds like that needs work.
> > > > 
> > > > Those attribute handling patches were meant to allow taking the
> > > > rw
> > > > sem read lock instead of the write lock for
> > > > kernfs_refresh_inode()
> > > > updates, with the added locking to protect the inode attributes
> > > > update since it's called from the VFS both with and without the
> > > > inode lock.
> > > 
> > > Oh, understood. I was asking also because lock on kn->attr_mutex
> > > drags
> > > concurrent performance.
> > > 
> > > > Looking around it looks like kernfs_iattrs() is called from
> > > > multiple
> > > > places without a node database lock at all.
> > > > 
> > > > I'm thinking that, to keep my proposed change straight forward
> > > > and on topic, I should just leave kernfs_refresh_inode() taking
> > > > the node db write lock for now and consider the attributes
> > > > handling
> > > > as a separate change. Once that's done we could reconsider
> > > > what's
> > > > needed to use the node db read lock in kernfs_refresh_inode().
> > > 
> > > You meant taking write lock of kernfs_rwsem for
> > > kernfs_refresh_inode()??
> > > It may be a lot slower in my benchmark, let me test it.
> > 
> > Yes, but make sure the write lock of kernfs_rwsem is being taken
> > not the read lock.
> > 
> > That's a mistake I had initially?
> > 
> > Still, that attributes handling is, I think, sufficient to warrant
> > a separate change since it looks like it might need work, the
> > kernfs
> > node db probably should be kept stable for those attribute updates
> > but equally the existence of an instantiated dentry might mitigate
> > the it.
> > 
> > Some people might just know whether it's ok or not but I would like
> > to check the callers to work out what's going on.
> > 
> > In any case it's academic if GCH isn't willing to consider the
> > series
> > for review and possible merge.
> > 
> Hi Ian
> 
> I removed kn->attr_mutex and changed read lock to write lock for
> kernfs_refresh_inode
> 
> down_write(_rwsem);
> kernfs_refresh_inode(kn, inode);
> up_write(_rwsem);
> 
> 
> Unfortunate, changes in this way make things worse,  my benchmark
> runs
> 100% slower than upstream sysfs.  :(
> open+read+close a sysfs file concurrently took 1000us. (Currently,
> sysfs with a big mutex kernfs_mutex only takes ~500us
> for one open+read+close operation concurrently)

Right, so it does need attention nowish.

I'll have a look at it in a while, I really need to get a new autofs
release out, and there are quite a few changes, and testing is seeing
a number of errors, some old, some newly introduced. It's proving
difficult.

> 
> > --45.93%--kernfs_iop_permission
>   ||
>   |  |  |  |
>   ||
>   |  |  |
> > --22.55%--down_write
>   ||
>   |  |  |  |  |
>   ||
>   |  |  |  |
> --20.69%--rwsem_down_write_slowpath
>   ||
>   |  |  |  |
>   |
>   ||
>   |  |  |  |
>   |--8.89%--schedule
> 
> perf showed most of the time had been spent on kernfs_iop_permission
> 
> 
> thanks,
> fox



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-15 Thread Fox Chen
On Mon, Dec 14, 2020 at 9:30 PM Ian Kent  wrote:
>
> On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent  wrote:
> > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > > > > Tejun
> > > > > > mentioned here
> > > > > > (
> > > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > > ),
> > > > > > maybe a global
> > > > > > rwsem for kn->iattr will be better??
> > > > >
> > > > > I wasn't sure about that, IIRC a spin lock could be used around
> > > > > the
> > > > > initial check and checked again at the end which would probably
> > > > > have
> > > > > been much faster but much less conservative and a bit more ugly
> > > > > so
> > > > > I just went the conservative path since there was so much
> > > > > change
> > > > > already.
> > > >
> > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > > remember
> > > > it.
> > > >
> > > > Based on what Tejun said it sounds like that needs work.
> > >
> > > Those attribute handling patches were meant to allow taking the rw
> > > sem read lock instead of the write lock for kernfs_refresh_inode()
> > > updates, with the added locking to protect the inode attributes
> > > update since it's called from the VFS both with and without the
> > > inode lock.
> >
> > Oh, understood. I was asking also because lock on kn->attr_mutex
> > drags
> > concurrent performance.
> >
> > > Looking around it looks like kernfs_iattrs() is called from
> > > multiple
> > > places without a node database lock at all.
> > >
> > > I'm thinking that, to keep my proposed change straight forward
> > > and on topic, I should just leave kernfs_refresh_inode() taking
> > > the node db write lock for now and consider the attributes handling
> > > as a separate change. Once that's done we could reconsider what's
> > > needed to use the node db read lock in kernfs_refresh_inode().
> >
> > You meant taking write lock of kernfs_rwsem for
> > kernfs_refresh_inode()??
> > It may be a lot slower in my benchmark, let me test it.
>
> Yes, but make sure the write lock of kernfs_rwsem is being taken
> not the read lock.
>
> That's a mistake I had initially?
>
> Still, that attributes handling is, I think, sufficient to warrant
> a separate change since it looks like it might need work, the kernfs
> node db probably should be kept stable for those attribute updates
> but equally the existence of an instantiated dentry might mitigate
> the it.
>
> Some people might just know whether it's ok or not but I would like
> to check the callers to work out what's going on.
>
> In any case it's academic if GCH isn't willing to consider the series
> for review and possible merge.
>
Hi Ian

I removed kn->attr_mutex and changed read lock to write lock for
kernfs_refresh_inode

down_write(_rwsem);
kernfs_refresh_inode(kn, inode);
up_write(_rwsem);


Unfortunate, changes in this way make things worse,  my benchmark runs
100% slower than upstream sysfs.  :(
open+read+close a sysfs file concurrently took 1000us. (Currently,
sysfs with a big mutex kernfs_mutex only takes ~500us
for one open+read+close operation concurrently)

|--45.93%--kernfs_iop_permission
  ||
  |  |  |  |
  ||
  |  |  |
|--22.55%--down_write
  ||
  |  |  |  |  |
  ||
  |  |  |  |
--20.69%--rwsem_down_write_slowpath
  ||
  |  |  |  |
  |
  ||
  |  |  |  |
  |--8.89%--schedule

perf showed most of the time had been spent on kernfs_iop_permission


thanks,
fox


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-14 Thread Ian Kent
On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote:
> On Sun, Dec 13, 2020 at 11:46 AM Ian Kent  wrote:
> > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > > > Tejun
> > > > > mentioned here
> > > > > (
> > > > > https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/
> > > > > ),
> > > > > maybe a global
> > > > > rwsem for kn->iattr will be better??
> > > > 
> > > > I wasn't sure about that, IIRC a spin lock could be used around
> > > > the
> > > > initial check and checked again at the end which would probably
> > > > have
> > > > been much faster but much less conservative and a bit more ugly
> > > > so
> > > > I just went the conservative path since there was so much
> > > > change
> > > > already.
> > > 
> > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't
> > > remember
> > > it.
> > > 
> > > Based on what Tejun said it sounds like that needs work.
> > 
> > Those attribute handling patches were meant to allow taking the rw
> > sem read lock instead of the write lock for kernfs_refresh_inode()
> > updates, with the added locking to protect the inode attributes
> > update since it's called from the VFS both with and without the
> > inode lock.
> 
> Oh, understood. I was asking also because lock on kn->attr_mutex
> drags
> concurrent performance.
> 
> > Looking around it looks like kernfs_iattrs() is called from
> > multiple
> > places without a node database lock at all.
> > 
> > I'm thinking that, to keep my proposed change straight forward
> > and on topic, I should just leave kernfs_refresh_inode() taking
> > the node db write lock for now and consider the attributes handling
> > as a separate change. Once that's done we could reconsider what's
> > needed to use the node db read lock in kernfs_refresh_inode().
> 
> You meant taking write lock of kernfs_rwsem for
> kernfs_refresh_inode()??
> It may be a lot slower in my benchmark, let me test it.

Yes, but make sure the write lock of kernfs_rwsem is being taken
not the read lock.

That's a mistake I had initially?

Still, that attributes handling is, I think, sufficient to warrant
a separate change since it looks like it might need work, the kernfs
node db probably should be kept stable for those attribute updates
but equally the existence of an instantiated dentry might mitigate
the it.

Some people might just know whether it's ok or not but I would like
to check the callers to work out what's going on.

In any case it's academic if GCH isn't willing to consider the series
for review and possible merge.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-13 Thread Fox Chen
On Sun, Dec 13, 2020 at 11:46 AM Ian Kent  wrote:
>
> On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > > Tejun
> > > > mentioned here
> > > > (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/),
> > > > maybe a global
> > > > rwsem for kn->iattr will be better??
> > >
> > > I wasn't sure about that, IIRC a spin lock could be used around the
> > > initial check and checked again at the end which would probably
> > > have
> > > been much faster but much less conservative and a bit more ugly so
> > > I just went the conservative path since there was so much change
> > > already.
> >
> > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
> > it.
> >
> > Based on what Tejun said it sounds like that needs work.
>
> Those attribute handling patches were meant to allow taking the rw
> sem read lock instead of the write lock for kernfs_refresh_inode()
> updates, with the added locking to protect the inode attributes
> update since it's called from the VFS both with and without the
> inode lock.

Oh, understood. I was asking also because lock on kn->attr_mutex drags
concurrent performance.

> Looking around it looks like kernfs_iattrs() is called from multiple
> places without a node database lock at all.
>
> I'm thinking that, to keep my proposed change straight forward
> and on topic, I should just leave kernfs_refresh_inode() taking
> the node db write lock for now and consider the attributes handling
> as a separate change. Once that's done we could reconsider what's
> needed to use the node db read lock in kernfs_refresh_inode().

You meant taking write lock of kernfs_rwsem for kernfs_refresh_inode()??
It may be a lot slower in my benchmark, let me test it.

> It will reduce the effectiveness of the series but it would make
> this change much more complicated, and is somewhat off-topic, and
> could hamper the chances of reviewers spotting problem with it.
>
> Ian
>


thanks,
fox


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-12 Thread Ian Kent
On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > Tejun
> > > mentioned here 
> > > (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/),
> > > maybe a global 
> > > rwsem for kn->iattr will be better??
> > 
> > I wasn't sure about that, IIRC a spin lock could be used around the
> > initial check and checked again at the end which would probably
> > have
> > been much faster but much less conservative and a bit more ugly so
> > I just went the conservative path since there was so much change
> > already.
> 
> Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
> it.
> 
> Based on what Tejun said it sounds like that needs work.

Those attribute handling patches were meant to allow taking the rw
sem read lock instead of the write lock for kernfs_refresh_inode()
updates, with the added locking to protect the inode attributes
update since it's called from the VFS both with and without the
inode lock.

Looking around it looks like kernfs_iattrs() is called from multiple
places without a node database lock at all.

I'm thinking that, to keep my proposed change straight forward
and on topic, I should just leave kernfs_refresh_inode() taking
the node db write lock for now and consider the attributes handling
as a separate change. Once that's done we could reconsider what's
needed to use the node db read lock in kernfs_refresh_inode().

It will reduce the effectiveness of the series but it would make
this change much more complicated, and is somewhat off-topic, and
could hamper the chances of reviewers spotting problem with it.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-10 Thread Ian Kent
On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> 
> > For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> > mentioned here 
> > (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/),
> > maybe a global 
> > rwsem for kn->iattr will be better??
> 
> I wasn't sure about that, IIRC a spin lock could be used around the
> initial check and checked again at the end which would probably have
> been much faster but much less conservative and a bit more ugly so
> I just went the conservative path since there was so much change
> already.

Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
it.

Based on what Tejun said it sounds like that needs work.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-10 Thread Ian Kent
On Thu, 2020-12-10 at 16:44 +, Fox Chen wrote:
> Hi,
> 
> I found this series of patches solves exact the problem I am trying
> to solve.
> https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlc...@gmail.com/

Right.

> 
> The problem is reported by Brice Goglin on thread:
> Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
> https://lore.kernel.org/lkml/x60dvjot4furc...@kroah.com/
> 
> I independently comfirmed that on a 96-core AWS c5.metal server.
> Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id
> 1000 times.
> With a single thread it takes ~2.5 us for each open+read+close.
> With one thread per core, 96 threads running simultaneously takes 540
> us 
> for each of the same operation (without much variation) -- 200x
> slower than the 
> single thread one. 

Right, interesting that the it's actually a problem on such
small system configurations.

I didn't think it would be evident on hardware that doesn't
have a much larger configuration.

> 
> My Benchmark code is here:
> https://github.com/foxhlchen/sysfs_benchmark
> 
> The problem can only be observed in large machines (>=16 cores).
> The more cores you have the slower it can be.
> 
> Perf shows that CPUs spend most of the time (>80%) waiting on mutex
> locks in 
> kernfs_iop_permission and kernfs_dop_revalidate.
> 
> After applying this, performance gets huge boost -- with the fastest
> one at ~30 us 
> to the worst at ~180 us (most of on spin_locks, the delay just
> stacking up, very
> similar to the performance on ext4). 

That's the problem isn't it.

Unfortunately we don't get large improvements for nothing so I
was constantly thinking, what have I done here that isn't ok ...
and I don't have an answer for that.

The series needs review from others for that but we didn't get
that far.

> 
> I hope this problem can justifies this series of patches. A big mutex
> in kernfs
> is really not nice. Due to this BIG LOCK, concurrency in kernfs is
> almost NONE,
> even though you do operations on different files, they are
> contentious.

Well, as much as I don't like to admit it, Greg (and Tejun) do have
a point about the number of sysfs files used when there is a very
large amount of RAM. But IIUC the suggestion of altering the sysfs
representation for this devices memory would introduce all sorts
of problems, it then being different form all device memory
representations (systemd udev coldplug for example).

But I think your saying there are also visible improvements elsewhere
too, which is to be expected of course.

Let's not forget that, as the maintainer, Greg has every right to
be reluctant to take changes because he is likely to end up owning
and maintaining the changes. That can lead to considerable overhead
and frustration if the change isn't quite right and it's very hard
to be sure there aren't hidden problems with it.

Fact is that, due to Greg's rejection, there was much more focus
by the reporter to fix it at the source but I have no control over
that, I only know that it helped to get things moving.

Given the above, I was considering posting the series again and
asking for the series to be re-considered but since I annoyed
Greg so much the first time around I've been reluctant to do so.

But now is a good time I guess, Greg, please, would you re-consider
possibly accepting these patches?

I would really like some actual review of what I have done from
people like yourself and Al. Ha, after that they might well not
be ok anyway!

> 
> As we get more and more cores on normal machines and because sysfs
> provides such
> important information, this problem should be fix. So please
> reconsider accepting
> the patches.
> 
> For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> mentioned here 
> (https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/),
> maybe a global 
> rwsem for kn->iattr will be better??

I wasn't sure about that, IIRC a spin lock could be used around the
initial check and checked again at the end which would probably have
been much faster but much less conservative and a bit more ugly so
I just went the conservative path since there was so much change
already.

Ian



RE:[PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-10 Thread Fox Chen
Hi,

I found this series of patches solves exact the problem I am trying to solve.
https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlc...@gmail.com/

The problem is reported by Brice Goglin on thread:
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
https://lore.kernel.org/lkml/x60dvjot4furc...@kroah.com/

I independently comfirmed that on a 96-core AWS c5.metal server.
Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id 1000 times.
With a single thread it takes ~2.5 us for each open+read+close.
With one thread per core, 96 threads running simultaneously takes 540 us 
for each of the same operation (without much variation) -- 200x slower than the 
single thread one. 

My Benchmark code is here:
https://github.com/foxhlchen/sysfs_benchmark

The problem can only be observed in large machines (>=16 cores).
The more cores you have the slower it can be.

Perf shows that CPUs spend most of the time (>80%) waiting on mutex locks in 
kernfs_iop_permission and kernfs_dop_revalidate.

After applying this, performance gets huge boost -- with the fastest one at ~30 
us 
to the worst at ~180 us (most of on spin_locks, the delay just stacking up, very
similar to the performance on ext4). 

I hope this problem can justifies this series of patches. A big mutex in kernfs
is really not nice. Due to this BIG LOCK, concurrency in kernfs is almost NONE,
even though you do operations on different files, they are contentious.

As we get more and more cores on normal machines and because sysfs provides such
important information, this problem should be fix. So please reconsider 
accepting
the patches.

For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun mentioned 
here 
(https://lore.kernel.org/lkml/x8fe0cmu+aq1g...@mtj.duckdns.org/), maybe a 
global 
rwsem for kn->iattr will be better??



thanks,
fox



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-25 Thread Ian Kent
On Thu, 2020-06-25 at 11:43 +0200, Greg Kroah-Hartman wrote:
> On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> > On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > > Hello, Rick.
> > > 
> > > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > > I don't know. The above highlights the absurdity of the
> > > > > approach
> > > > > itself to
> > > > > me. You seem to be aware of it too in writing: 250,000
> > > > > "devices".
> > > > 
> > > > Just because it is absurd doesn't mean it wasn't built that way
> > > > :)
> > > > 
> > > > I agree, and I'm trying to influence the next hardware design.
> > > > However,
> > > 
> > > I'm not saying that the hardware should not segment things into
> > > however many
> > > pieces that it wants / needs to. That part is fine.
> > > 
> > > > what's already out there is memory units that must be accessed
> > > > in
> > > > 256MB
> > > > blocks. If you want to remove/add a GB, that's really 4 blocks
> > > > of
> > > > memory
> > > > you're manipulating, to the hardware. Those blocks have to be
> > > > registered
> > > > and recognized by the kernel for that to work.
> > > 
> > > The problem is fitting that into an interface which wholly
> > > doesn't
> > > fit that
> > > particular requirement. It's not that difficult to imagine
> > > different
> > > ways to
> > > represent however many memory slots, right? It'd take work to
> > > make
> > > sure that
> > > integrates well with whatever tooling or use cases but once done
> > > this
> > > particular problem will be resolved permanently and the whole
> > > thing
> > > will
> > > look a lot less silly. Wouldn't that be better?
> > 
> > Well, no, I am finding it difficult to imagine different ways to
> > represent this but perhaps that's because I'm blinker eyed on what
> > a solution might look like because of my file system focus.
> > 
> > Can "anyone" throw out some ideas with a little more detail than we
> > have had so far so we can maybe start to formulate an actual plan
> > of
> > what needs to be done.
> 
> I think both Tejun and I have provided a number of alternatives for
> you
> all to look into, and yet you all keep saying that those are
> impossible
> for some unknown reason.

Yes, those comments are a starting point to be sure.
And continuing on that path isn't helping anyone.

That's why I'm asking for your input on what a solution you
would see as adequate might look like to you (and Tejun).

> 
> It's not up to me to tell you what to do to fix your broken
> interfaces
> as only you all know who is using this and how to handle those
> changes.

But it would be useful to go into a little more detail, based on
your own experience, about what you think a suitable solution might
be.

That surely needs to be taken into account and used to guide the
direction of our investigation of what we do.

> 
> It is up to me to say "don't do that!" and to refuse patches that
> don't
> solve the root problem here.  I'll review these later on (I have
> 1500+
> patches to review at the moment) as these are a nice
> micro-optimization...

Sure, and I get the "I don't want another post and run set of
patches that I have to maintain forever that don't fully solve
the problem" view and any ideas and perhaps a little more detail
on where we might go with this would be very much appreciated.

> 
> And as this conversation seems to just going in circles, I think this
> is
> going to be my last response to it...

Which is why I'm asking this, I really would like to see this
discussion change course and become useful.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-25 Thread Greg Kroah-Hartman
On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > Hello, Rick.
> > 
> > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > I don't know. The above highlights the absurdity of the approach
> > > > itself to
> > > > me. You seem to be aware of it too in writing: 250,000 "devices".
> > > 
> > > Just because it is absurd doesn't mean it wasn't built that way :)
> > > 
> > > I agree, and I'm trying to influence the next hardware design.
> > > However,
> > 
> > I'm not saying that the hardware should not segment things into
> > however many
> > pieces that it wants / needs to. That part is fine.
> > 
> > > what's already out there is memory units that must be accessed in
> > > 256MB
> > > blocks. If you want to remove/add a GB, that's really 4 blocks of
> > > memory
> > > you're manipulating, to the hardware. Those blocks have to be
> > > registered
> > > and recognized by the kernel for that to work.
> > 
> > The problem is fitting that into an interface which wholly doesn't
> > fit that
> > particular requirement. It's not that difficult to imagine different
> > ways to
> > represent however many memory slots, right? It'd take work to make
> > sure that
> > integrates well with whatever tooling or use cases but once done this
> > particular problem will be resolved permanently and the whole thing
> > will
> > look a lot less silly. Wouldn't that be better?
> 
> Well, no, I am finding it difficult to imagine different ways to
> represent this but perhaps that's because I'm blinker eyed on what
> a solution might look like because of my file system focus.
> 
> Can "anyone" throw out some ideas with a little more detail than we
> have had so far so we can maybe start to formulate an actual plan of
> what needs to be done.

I think both Tejun and I have provided a number of alternatives for you
all to look into, and yet you all keep saying that those are impossible
for some unknown reason.

It's not up to me to tell you what to do to fix your broken interfaces
as only you all know who is using this and how to handle those changes.

It is up to me to say "don't do that!" and to refuse patches that don't
solve the root problem here.  I'll review these later on (I have 1500+
patches to review at the moment) as these are a nice
micro-optimization...

And as this conversation seems to just going in circles, I think this is
going to be my last response to it...

greg k-h


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-25 Thread Ian Kent
On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> Hello, Rick.
> 
> On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > I don't know. The above highlights the absurdity of the approach
> > > itself to
> > > me. You seem to be aware of it too in writing: 250,000 "devices".
> > 
> > Just because it is absurd doesn't mean it wasn't built that way :)
> > 
> > I agree, and I'm trying to influence the next hardware design.
> > However,
> 
> I'm not saying that the hardware should not segment things into
> however many
> pieces that it wants / needs to. That part is fine.
> 
> > what's already out there is memory units that must be accessed in
> > 256MB
> > blocks. If you want to remove/add a GB, that's really 4 blocks of
> > memory
> > you're manipulating, to the hardware. Those blocks have to be
> > registered
> > and recognized by the kernel for that to work.
> 
> The problem is fitting that into an interface which wholly doesn't
> fit that
> particular requirement. It's not that difficult to imagine different
> ways to
> represent however many memory slots, right? It'd take work to make
> sure that
> integrates well with whatever tooling or use cases but once done this
> particular problem will be resolved permanently and the whole thing
> will
> look a lot less silly. Wouldn't that be better?

Well, no, I am finding it difficult to imagine different ways to
represent this but perhaps that's because I'm blinker eyed on what
a solution might look like because of my file system focus.

Can "anyone" throw out some ideas with a little more detail than we
have had so far so we can maybe start to formulate an actual plan of
what needs to be done.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-24 Thread Tejun Heo
Hello, Rick.

On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:
> In contrast, the provided patch fixes the observed problem with no ripple
> effect to other subsystems or utilities.
> 
> Greg had suggested
> Treat the system as a whole please, don't go for a short-term
> fix that we all know is not solving the real problem here.
> 
> Your solution affects multiple subsystems; this one affects one.  Which is
> the whole system approach in terms of risk?  You mentioned you support 30k
> scsi disks but only because they are slow so the inefficiencies of kernfs
> don't show.  That doesn't bother you?

I suggest putting honest thoughts into finding a long term solution instead
of these rhetorical retorts. If you really can't see how ill-suited the
current use of interface and proposed solution is, I'm not sure how better
to communicate them to you.

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-24 Thread Greg Kroah-Hartman
On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:
> In contrast, the provided patch fixes the observed problem with no ripple
> effect to other subsystems or utilities.

Your patch, as-is, is fine, but to somehow think that this is going to
solve your real problem here is the issue we keep raising.

The real problem is you have way too many devices that somehow need to
all get probed at boot time before you can do anything else.

> Greg had suggested
> Treat the system as a whole please, don't go for a short-term
> fix that we all know is not solving the real problem here.
> 
> Your solution affects multiple subsystems; this one affects one.  Which is
> the whole system approach in terms of risk?  You mentioned you support 30k
> scsi disks but only because they are slow so the inefficiencies of kernfs
> don't show.  That doesn't bother you?

Systems with 30k of devices do not have any problems that I know of
because they do not do foolish things like stall booting before they are
all discovered :)

What's the odds that if we take this patch, you all have to come back in
a few years with some other change to the api due to even larger memory
sizes happening?  What happens if you boot on a system with this change
and with 10x the memory you currently have?  Try simulating that by
creating 10x the number of devices and see what happens.  Does the
bottleneck still remain in kernfs or is it somewhere else?

thanks,

greg k-h


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-24 Thread Rick Lindsley

Thanks, Tejun, appreciate the feedback.

On 6/23/20 4:13 PM, Tejun Heo wrote:


The problem is fitting that into an interface which wholly doesn't fit that
particular requirement. It's not that difficult to imagine different ways to
represent however many memory slots, right? 


Perhaps we have different views of how this is showing up.  systemd is 
the primary instigator of the boot issues.


Systemd runs

/usr/lib/systemd/system/systemd-udev-trigger.service

which does a udev trigger, specifically

/usr/bin/udevadm trigger --type=devices --action=add

as part of its post-initramfs coldplug.  It then waits for that to 
finish, under the watch of a timer.


So, the kernel itself is reporting these devices to systemd. It gets 
that information from talking to the hardware.  That means, then, that 
the obfuscation must either start in the kernel itself (it lies to 
systemd), or start in systemd when it handles the devices it got from 
the kernel.  If the kernel lies, then the actual granularity is not 
available to any user utilities.


Unless you're suggesting a new interface be created that would allow 
utilities to determine the "real" memory addresses available for 
manipulation.  But the changes you describe cannot be limited to the 
unknown number of auxiliary utilities.


Having one subsystem lie to another seems like the start of a bad idea, 
anyway.  When the hardware management console, separate from Linux, 
reports a memory error, or adds or deletes memory in a guest system, 
it's not going to be manipulating spoofed addresses that are only a 
Linux construct.


In contrast, the provided patch fixes the observed problem with no 
ripple effect to other subsystems or utilities.


Greg had suggested
Treat the system as a whole please, don't go for a short-term
fix that we all know is not solving the real problem here.

Your solution affects multiple subsystems; this one affects one.  Which 
is the whole system approach in terms of risk?  You mentioned you 
support 30k scsi disks but only because they are slow so the 
inefficiencies of kernfs don't show.  That doesn't bother you?


Rick


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Tejun Heo
Hello, Rick.

On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > I don't know. The above highlights the absurdity of the approach itself to
> > me. You seem to be aware of it too in writing: 250,000 "devices".
> 
> Just because it is absurd doesn't mean it wasn't built that way :)
> 
> I agree, and I'm trying to influence the next hardware design. However,

I'm not saying that the hardware should not segment things into however many
pieces that it wants / needs to. That part is fine.

> what's already out there is memory units that must be accessed in 256MB
> blocks. If you want to remove/add a GB, that's really 4 blocks of memory
> you're manipulating, to the hardware. Those blocks have to be registered
> and recognized by the kernel for that to work.

The problem is fitting that into an interface which wholly doesn't fit that
particular requirement. It's not that difficult to imagine different ways to
represent however many memory slots, right? It'd take work to make sure that
integrates well with whatever tooling or use cases but once done this
particular problem will be resolved permanently and the whole thing will
look a lot less silly. Wouldn't that be better?

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Rick Lindsley

On 6/23/20 4:45 AM, Greg Kroah-Hartman wrote:


Sure, but "help, I'm abusing your code interface, so fix your code
interface and not my caller code" really isn't the best mantra :)


Well, those are your words, not mine.  What we're saying is, "we've 
identified an interface that doesn't scale in this situation, but we 
have a way to make it scale to all known configurations."


> I am offended as a number of years ago this same user of kernfs/sysfs
> did a lot of work to reduce the number of contentions in kernfs for
> this same reason.  After that work was done, "all was good".  Now
> this comes along again, blaming kernfs/sysfs, not the caller.

Ok. I don't know about the history, but I can tell you "blame" is not 
the word I'd use.  As hardware changes, Linux also changes, and over "a 
number of years" it's not surprising to me if basic assumptions changed 
again and led us back to a place we've been before.  That's not an 
indictment.  It just IS.


> Memory is only going to get bigger over time, you might want to fix it
> this way and then run away.  But we have to maintain this for the next
> 20+ years, and you are not solving the root-problem here.  It will
> come back again, right?

If hardware vendors insist on dealing with small blocks of memory in 
large aggregates, then yes it could.  You'll have to trust that I am 
also in discussion with hardware architects about how that is a very bad 
architecture and it's time to change decades and think bigger.  Separate 
audience, equally contentious discussion.  But the bottom line is, it's 
out there already and can't be walked back.


Your response here seems to center on "kernfs was never designed for 
that."  If so, we're in agreement.   We're suggesting a way it can be 
extended to be more robust, with no (apparent) side effects.  I'd like 
to discuss the merits of the patch itself.


Rick


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Ian Kent
On Tue, 2020-06-23 at 02:33 -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
> 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Wlll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs.  So in that sense,
> any patches would need to go thru you.  So, your opinions do matter.
> 
>   
> > Anyway, as I have said before, my first guesses would be:
> > - increase the granularity size of the "memory chunks",
> > reducing
> >   the number of devices you create.
> 
> This would mean finding every utility that relies on this
> behavior.  That may be possible, although not easy, for distro or
> platform software, but it's hard to guess what user-related utilities
> may have been created by other consumers of those distros or that
> platform.  In any case, removing an interface without warning is a
> hanging offense in many Linux circles.
> 
> > - delay creating the devices until way after booting, or do it
> >   on a totally different path/thread/workqueue/whatever to
> >   prevent delay at booting
> 
> This has been considered, but it again requires a full list of
> utilities relying on this interface and determining which of them may
> want to run before the devices are "loaded" at boot time.  It may be
> few, or even zero, but it would be a much more disruptive change in
> the boot process than what we are suggesting.
> 
> > And then there's always:
> > - don't create them at all, only only do so if userspace asks
> >   you to.
> 
> If they are done in parallel on demand, you'll see the same problem
> (load average of 1000+, contention in the same spot.)  You obviously
> won't hold up the boot, of course, but your utility and anything else
> running on the machine will take an unexpected pause ... for
> somewhere between 30 and 90 minutes.  Seems equally unfriendly.
> 
> A variant of this, which does have a positive effect, is to observe
> that coldplug during initramfs does seem to load up the memory device
> tree without incident.  We do a second coldplug after we switch roots
> and this is the one that runs into timer issues.  I have asked "those
> that should know" why there is a second coldplug.  I can guess but
> would prefer to know to avoid that screaming option.  If that second
> coldplug is unnecessary for the kernfs memory interfaces to work
> correctly, then that is an alternate, and perhaps even better
> solution.  (It wouldn't change the fact that kernfs was not built for
> speed and this problem remains below the surface to trip up another.)

We might still need the patches here for that on-demand mechanism
to be feasible.

For example, for an ls of the node directory it should be doable to
enumerate the nodes in readdir without creating dentries but there's
the inevitable stat() of each path that follows that would probably
lead to similar contention.

And changing the division of the entries into sub-directories would
inevitably break anything that does actually need to access them.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Greg Kroah-Hartman
On Tue, Jun 23, 2020 at 04:01:52PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > > 
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The
> > > > > > > > size of
> > > > > > > > the
> > > > > > > > segments (and thus the number of them) is dictated by the
> > > > > > > > underlying
> > > > > > > > hardware.
> > > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that,
> > > > > > > right?
> > > > > > 
> > > > > > I'm still struggling a bit to grasp what your getting at but
> > > > > > ...
> > > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-
> > > > > object
> > > > > directory isn't the right interface for this sort of usage at
> > > > > all.
> > > > > Are these
> > > > > even real hardware pieces which can be plugged in and out?
> > > > > While
> > > > > being a
> > > > > discrete piece of hardware isn't a requirement to be a device
> > > > > model
> > > > > device,
> > > > > the whole thing is designed with such use cases on mind. It
> > > > > definitely isn't
> > > > > the right design for representing six digit number of logical
> > > > > entities.
> > > > > 
> > > > > It should be obvious that representing each consecutive memory
> > > > > range with a
> > > > > separate directory entry is far from an optimal way of
> > > > > representing
> > > > > something like this. It's outright silly.
> > > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem down
> > > > the
> > > > road" if we accept these patches.  Please fix this up properly so
> > > > that
> > > > this interface is correctly fixed to not do looney things like
> > > > this.
> > > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to do accept a change to mitigate it because that could
> > > mean no further discussion on it and no further work toward solving
> > > it.
> > > 
> > > But it seems to me a "proper" solution to this will cross a number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's
> > > likely to become increasingly problematic over time.
> > > 
> > > So what are your ideas and recommendations on how to handle hotplug
> > > memory at this granularity for this much RAM (and larger amounts)?
> > 
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
> 
> Sorry, but I don't think it's funny at all.
> 
> It's not "my platform" either, I'm just the poor old sole that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
> 
> I don't see how asking for your advice is out of order at all.
> 
> > 
> > Anyway, as I have said before, my first guesses would be:
> > - increase the granularity size of the "memory chunks",
> > reducing
> >   the number of devices you create.
> 
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
> 
> I'll try and find someone appropriate to consult about that and
> see where it goes.
> 
> > - delay creating the devices until way after booting, or do it
> >   on a totally different path/thread/workqueue/whatever to
> >   prevent delay at booting
> 
> When you first said this it sounded like a ugly workaround to me.
> But perhaps it isn't (I'm not really convinced it is TBH), so it's
> probably worth trying to follow up on too.

It's not a workaround, it lets the rest of the system come up and do
useful things while you are still discovering parts of the system that
are not up and running.  We do this all the time for lots of
drivers/devices/subsystems, why is memory any different here?

> > And then there's always:
> > - don't create them at all, only only do so if userspace asks
> >   you to.
> 
> At first glance the impression I get from this is that it's an even
> uglier work around than delaying it but it might actually the most
> sensible way to handle this, as it's been called, silliness.
> 
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
> the dentry op ->d_automount() to be called on access so, from a path
> walk perspective, the dentries could just appear when needed.
> 
> The question I'd need to answer is do the kernfs nodes exist so
> ->d_automount() can discover if the node lookup is valid, and I think
> the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).
> 
> The catch will be that this is 

Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Greg Kroah-Hartman
On Tue, Jun 23, 2020 at 02:33:48AM -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
> 
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
> 
> Wlll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs.  So in that sense,
> any patches would need to go thru you.  So, your opinions do matter.

Sure, but "help, I'm abusing your code interface, so fix your code
interface and not my caller code" really isn't the best mantra :)

> > Anyway, as I have said before, my first guesses would be:
> > - increase the granularity size of the "memory chunks", reducing
> >   the number of devices you create.
> 
> This would mean finding every utility that relies on this behavior.
> That may be possible, although not easy, for distro or platform
> software, but it's hard to guess what user-related utilities may have
> been created by other consumers of those distros or that platform.  In
> any case, removing an interface without warning is a hanging offense
> in many Linux circles.

I agree, so find out who uses it!  You can search all debian tools
easily.  You can ask any closed-source setup tools that are on your
platforms if they use it.  You can "break it and see if anyone notices",
which is what we do all the time.

The "hanging offence" is "I'm breaking this even if you are using it!".

> > - delay creating the devices until way after booting, or do it
> >   on a totally different path/thread/workqueue/whatever to
> >   prevent delay at booting
> 
> This has been considered, but it again requires a full list of utilities 
> relying on this interface and determining which of them may want to run 
> before the devices are "loaded" at boot time.  It may be few, or even zero, 
> but it would be a much more disruptive change in the boot process than what 
> we are suggesting.

Is that really the case?  I strongly suggest you all do some research
here.

Oh, and wrap your email lines please...

> > And then there's always:
> > - don't create them at all, only only do so if userspace asks
> >   you to.
> 
> If they are done in parallel on demand, you'll see the same problem (load 
> average of 1000+, contention in the same spot.)  You obviously won't hold up 
> the boot, of course, but your utility and anything else running on the 
> machine will take an unexpected pause ... for somewhere between 30 and 90 
> minutes.  Seems equally unfriendly.

I agree, but it shouldn't be shutting down the whole box, other stuff
should run just fine, right?  Is this platform really that "weak" that
it can't handle this happening in a single thread/cpu?

> A variant of this, which does have a positive effect, is to observe that 
> coldplug during initramfs does seem to load up the memory device tree without 
> incident.  We do a second coldplug after we switch roots and this is the one 
> that runs into timer issues.  I have asked "those that should know" why there 
> is a second coldplug.  I can guess but would prefer to know to avoid that 
> screaming option.  If that second coldplug is unnecessary for the kernfs 
> memory interfaces to work correctly, then that is an alternate, and perhaps 
> even better solution.  (It wouldn't change the fact that kernfs was not built 
> for speed and this problem remains below the surface to trip up another.)
> 
> However, nobody I've found can say that is safe, and I'm not fond of the 'see 
> who screams' test solution.
> 
> > You all have the userspace tools/users for this interface and know it
> > best to know what will work for them.  If you don't, then hey, let's
> > just delete the whole thing and see who screams :)
> 
> I guess I'm puzzled by why everyone seems offended by suggesting we change a 
> mutex to a rw semaphore.  In a vacuum, sure, but we have before and after 
> numbers.  Wouldn't the same cavalier logic apply?  Why not change it and see 
> who screams?

I am offended as a number of years ago this same user of kernfs/sysfs
did a lot of work to reduce the number of contentions in kernfs for this
same reason.  After that work was done, "all was good".  Now this comes
along again, blaming kernfs/sysfs, not the caller.

Memory is only going to get bigger over time, you might want to fix it
this way and then run away.  But we have to maintain this for the next
20+ years, and you are not solving the root-problem here.  It will come
back again, right?

> I haven't heard any criticism of the patch itself - I'm hearing criticism of 
> the problem.  This problem is not specific to memory devices.  As we get 
> larger systems,  we'll see it elsewhere. We do already see a mild form of 
> this when fibre finds 1000-2000 fibre disks and goes to add them in parallel. 
>  Small memory chunks introduces the problem at a level two orders of 
> magnitude bigger, but eventually other devices will be subject to it too.  
> Why not address this now?


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Rick Lindsley

On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:


First off, this is not my platform, and not my problem, so it's funny
you ask me :)


Wlll, not your platform perhaps but MAINTAINERS does list you first and 
Tejun second as maintainers for kernfs.  So in that sense, any patches would 
need to go thru you.  So, your opinions do matter.

 

Anyway, as I have said before, my first guesses would be:
- increase the granularity size of the "memory chunks", reducing
  the number of devices you create.


This would mean finding every utility that relies on this behavior.  That may 
be possible, although not easy, for distro or platform software, but it's hard 
to guess what user-related utilities may have been created by other consumers 
of those distros or that platform.  In any case, removing an interface without 
warning is a hanging offense in many Linux circles.


- delay creating the devices until way after booting, or do it
  on a totally different path/thread/workqueue/whatever to
  prevent delay at booting


This has been considered, but it again requires a full list of utilities relying on this 
interface and determining which of them may want to run before the devices are 
"loaded" at boot time.  It may be few, or even zero, but it would be a much 
more disruptive change in the boot process than what we are suggesting.


And then there's always:
- don't create them at all, only only do so if userspace asks
  you to.


If they are done in parallel on demand, you'll see the same problem (load 
average of 1000+, contention in the same spot.)  You obviously won't hold up 
the boot, of course, but your utility and anything else running on the machine 
will take an unexpected pause ... for somewhere between 30 and 90 minutes.  
Seems equally unfriendly.

A variant of this, which does have a positive effect, is to observe that coldplug during 
initramfs does seem to load up the memory device tree without incident.  We do a second 
coldplug after we switch roots and this is the one that runs into timer issues.  I have 
asked "those that should know" why there is a second coldplug.  I can guess but 
would prefer to know to avoid that screaming option.  If that second coldplug is 
unnecessary for the kernfs memory interfaces to work correctly, then that is an 
alternate, and perhaps even better solution.  (It wouldn't change the fact that kernfs 
was not built for speed and this problem remains below the surface to trip up another.)

However, nobody I've found can say that is safe, and I'm not fond of the 'see 
who screams' test solution.


You all have the userspace tools/users for this interface and know it
best to know what will work for them.  If you don't, then hey, let's
just delete the whole thing and see who screams :)


I guess I'm puzzled by why everyone seems offended by suggesting we change a 
mutex to a rw semaphore.  In a vacuum, sure, but we have before and after 
numbers.  Wouldn't the same cavalier logic apply?  Why not change it and see 
who screams?

I haven't heard any criticism of the patch itself - I'm hearing criticism of 
the problem.  This problem is not specific to memory devices.  As we get larger 
systems,  we'll see it elsewhere. We do already see a mild form of this when 
fibre finds 1000-2000 fibre disks and goes to add them in parallel.  Small 
memory chunks introduces the problem at a level two orders of magnitude bigger, 
but eventually other devices will be subject to it too.  Why not address this 
now?

'Doctor, it hurts when I do this'
'Then don't do that'

Funny as a joke.  Less funny as a review comment.

Rick


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Ian Kent
On Tue, 2020-06-23 at 16:01 +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > > 
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The
> > > > > > > > size of
> > > > > > > > the
> > > > > > > > segments (and thus the number of them) is dictated by
> > > > > > > > the
> > > > > > > > underlying
> > > > > > > > hardware.
> > > > > > > 
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that,
> > > > > > > right?
> > > > > > 
> > > > > > I'm still struggling a bit to grasp what your getting at
> > > > > > but
> > > > > > ...
> > > > > 
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-
> > > > > object
> > > > > directory isn't the right interface for this sort of usage at
> > > > > all.
> > > > > Are these
> > > > > even real hardware pieces which can be plugged in and out?
> > > > > While
> > > > > being a
> > > > > discrete piece of hardware isn't a requirement to be a device
> > > > > model
> > > > > device,
> > > > > the whole thing is designed with such use cases on mind. It
> > > > > definitely isn't
> > > > > the right design for representing six digit number of logical
> > > > > entities.
> > > > > 
> > > > > It should be obvious that representing each consecutive
> > > > > memory
> > > > > range with a
> > > > > separate directory entry is far from an optimal way of
> > > > > representing
> > > > > something like this. It's outright silly.
> > > > 
> > > > I agree.  And again, Ian, you are just "kicking the problem
> > > > down
> > > > the
> > > > road" if we accept these patches.  Please fix this up properly
> > > > so
> > > > that
> > > > this interface is correctly fixed to not do looney things like
> > > > this.
> > > 
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to do accept a change to mitigate it because that
> > > could
> > > mean no further discussion on it and no further work toward
> > > solving
> > > it.
> > > 
> > > But it seems to me a "proper" solution to this will cross a
> > > number
> > > of areas so this isn't just "my" problem and, as you point out,
> > > it's
> > > likely to become increasingly problematic over time.
> > > 
> > > So what are your ideas and recommendations on how to handle
> > > hotplug
> > > memory at this granularity for this much RAM (and larger
> > > amounts)?
> > 
> > First off, this is not my platform, and not my problem, so it's
> > funny
> > you ask me :)
> 
> Sorry, but I don't think it's funny at all.
> 
> It's not "my platform" either, I'm just the poor old sole that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
> 
> I don't see how asking for your advice is out of order at all.
> 
> > Anyway, as I have said before, my first guesses would be:
> > - increase the granularity size of the "memory chunks",
> > reducing
> >   the number of devices you create.
> 
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
> 
> I'll try and find someone appropriate to consult about that and
> see where it goes.
> 
> > - delay creating the devices until way after booting, or do it
> >   on a totally different path/thread/workqueue/whatever to
> >   prevent delay at booting
> 
> When you first said this it sounded like a ugly workaround to me.
> But perhaps it isn't (I'm not really convinced it is TBH), so it's
> probably worth trying to follow up on too.
> 
> > And then there's always:
> > - don't create them at all, only only do so if userspace asks
> >   you to.
> 
> At first glance the impression I get from this is that it's an even
> uglier work around than delaying it but it might actually the most
> sensible way to handle this, as it's been called, silliness.
> 
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
> the dentry op ->d_automount() to be called on access so, from a path
> walk perspective, the dentries could just appear when needed.
> 
> The question I'd need to answer is do the kernfs nodes exist so
> ->d_automount() can discover if the node lookup is valid, and I think
> the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).

Or maybe taking the notion of on-demand dentry creation further
is "nothing" more than not triggering udev notifications when
nodes are created letting the VFS create dentries on access alone
is all that would be needed.


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Ian Kent
On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > Hello, Ian.
> > > > 
> > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > The
> > > > > > > size of
> > > > > > > the
> > > > > > > segments (and thus the number of them) is dictated by the
> > > > > > > underlying
> > > > > > > hardware.
> > > > > > 
> > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > that,
> > > > > > right?
> > > > > 
> > > > > I'm still struggling a bit to grasp what your getting at but
> > > > > ...
> > > > 
> > > > I was more trying to say that the sysfs device interface with
> > > > per-
> > > > object
> > > > directory isn't the right interface for this sort of usage at
> > > > all.
> > > > Are these
> > > > even real hardware pieces which can be plugged in and out?
> > > > While
> > > > being a
> > > > discrete piece of hardware isn't a requirement to be a device
> > > > model
> > > > device,
> > > > the whole thing is designed with such use cases on mind. It
> > > > definitely isn't
> > > > the right design for representing six digit number of logical
> > > > entities.
> > > > 
> > > > It should be obvious that representing each consecutive memory
> > > > range with a
> > > > separate directory entry is far from an optimal way of
> > > > representing
> > > > something like this. It's outright silly.
> > > 
> > > I agree.  And again, Ian, you are just "kicking the problem down
> > > the
> > > road" if we accept these patches.  Please fix this up properly so
> > > that
> > > this interface is correctly fixed to not do looney things like
> > > this.
> > 
> > Fine, mitigating this problem isn't the end of the story, and you
> > don't want to do accept a change to mitigate it because that could
> > mean no further discussion on it and no further work toward solving
> > it.
> > 
> > But it seems to me a "proper" solution to this will cross a number
> > of areas so this isn't just "my" problem and, as you point out,
> > it's
> > likely to become increasingly problematic over time.
> > 
> > So what are your ideas and recommendations on how to handle hotplug
> > memory at this granularity for this much RAM (and larger amounts)?
> 
> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Sorry, but I don't think it's funny at all.

It's not "my platform" either, I'm just the poor old sole that
took this on because, on the face of it, it's a file system
problem as claimed by others that looked at it and promptly
washed their hands of it.

I don't see how asking for your advice is out of order at all.

> 
> Anyway, as I have said before, my first guesses would be:
>   - increase the granularity size of the "memory chunks",
> reducing
> the number of devices you create.

Yes, I didn't get that from your initial comments but you've said
it a couple of times recently and I do get it now.

I'll try and find someone appropriate to consult about that and
see where it goes.

>   - delay creating the devices until way after booting, or do it
> on a totally different path/thread/workqueue/whatever to
> prevent delay at booting

When you first said this it sounded like a ugly workaround to me.
But perhaps it isn't (I'm not really convinced it is TBH), so it's
probably worth trying to follow up on too.

> 
> And then there's always:
>   - don't create them at all, only only do so if userspace asks
> you to.

At first glance the impression I get from this is that it's an even
uglier work around than delaying it but it might actually the most
sensible way to handle this, as it's been called, silliness.

We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
the dentry op ->d_automount() to be called on access so, from a path
walk perspective, the dentries could just appear when needed.

The question I'd need to answer is do the kernfs nodes exist so
->d_automount() can discover if the node lookup is valid, and I think
the answer might be yes (but we would need to suppress udev
notifications for S_AUTOMOUNT nodes).

The catch will be that this is "not" mounting per-se, so anything
I do would probably be seen as an ugly hack that subverts the VFS
automount support.

If I could find a way to reconcile that I could probably do this.

Al, what say you on this?

> 
> You all have the userspace tools/users for this interface and know it
> best to know what will work for them.  If you don't, then hey, let's
> just delete the whole thing and see who screams :)

Please, no joking, I'm finding it hard enough to cope with this
disappointment as it is, ;)

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-23 Thread Greg Kroah-Hartman
On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > Hello, Ian.
> > > 
> > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > They are used for hotplugging and partitioning memory. The
> > > > > > size of
> > > > > > the
> > > > > > segments (and thus the number of them) is dictated by the
> > > > > > underlying
> > > > > > hardware.
> > > > > 
> > > > > This sounds so bad. There gotta be a better interface for that,
> > > > > right?
> > > > 
> > > > I'm still struggling a bit to grasp what your getting at but ...
> > > 
> > > I was more trying to say that the sysfs device interface with per-
> > > object
> > > directory isn't the right interface for this sort of usage at all.
> > > Are these
> > > even real hardware pieces which can be plugged in and out? While
> > > being a
> > > discrete piece of hardware isn't a requirement to be a device model
> > > device,
> > > the whole thing is designed with such use cases on mind. It
> > > definitely isn't
> > > the right design for representing six digit number of logical
> > > entities.
> > > 
> > > It should be obvious that representing each consecutive memory
> > > range with a
> > > separate directory entry is far from an optimal way of representing
> > > something like this. It's outright silly.
> > 
> > I agree.  And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches.  Please fix this up properly so
> > that
> > this interface is correctly fixed to not do looney things like this.
> 
> Fine, mitigating this problem isn't the end of the story, and you
> don't want to do accept a change to mitigate it because that could
> mean no further discussion on it and no further work toward solving
> it.
> 
> But it seems to me a "proper" solution to this will cross a number
> of areas so this isn't just "my" problem and, as you point out, it's
> likely to become increasingly problematic over time.
> 
> So what are your ideas and recommendations on how to handle hotplug
> memory at this granularity for this much RAM (and larger amounts)?

First off, this is not my platform, and not my problem, so it's funny
you ask me :)

Anyway, as I have said before, my first guesses would be:
- increase the granularity size of the "memory chunks", reducing
  the number of devices you create.
- delay creating the devices until way after booting, or do it
  on a totally different path/thread/workqueue/whatever to
  prevent delay at booting

And then there's always:
- don't create them at all, only only do so if userspace asks
  you to.

You all have the userspace tools/users for this interface and know it
best to know what will work for them.  If you don't, then hey, let's
just delete the whole thing and see who screams :)

thanks,

greg k-h


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Greg Kroah-Hartman
On Mon, Jun 22, 2020 at 02:27:38PM -0700, Rick Lindsley wrote:
> 
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> 
> > It should be obvious that representing each consecutive memory range with a
> > separate directory entry is far from an optimal way of representing
> > something like this. It's outright silly.
> 
> On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:
> 
> > I agree.  And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches.  Please fix this up properly so that
> > this interface is correctly fixed to not do looney things like this.
> 
> Given that we cannot change the underlying machine representation of
> this hardware, what do you (all, not just you Greg) consider to be
> "properly"?

Change the userspace representation of the hardware then.  Why does
userspace care about so many individual blocks, what happens if you
provide them a larger granularity?  I can't imagine userspace really
wants to see 20k devices and manage them individually, where is the code
that does that?

What happens if you delay adding the devices until after booting?
Userspace should be event driven and only handle things after it sees
the devices being present, so try delaying and seeing what happens to
prevent this from keeping boot from progressing.

thanks,

greg k-h


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Ian Kent
On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > Hello, Ian.
> > 
> > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > They are used for hotplugging and partitioning memory. The
> > > > > size of
> > > > > the
> > > > > segments (and thus the number of them) is dictated by the
> > > > > underlying
> > > > > hardware.
> > > > 
> > > > This sounds so bad. There gotta be a better interface for that,
> > > > right?
> > > 
> > > I'm still struggling a bit to grasp what your getting at but ...
> > 
> > I was more trying to say that the sysfs device interface with per-
> > object
> > directory isn't the right interface for this sort of usage at all.
> > Are these
> > even real hardware pieces which can be plugged in and out? While
> > being a
> > discrete piece of hardware isn't a requirement to be a device model
> > device,
> > the whole thing is designed with such use cases on mind. It
> > definitely isn't
> > the right design for representing six digit number of logical
> > entities.
> > 
> > It should be obvious that representing each consecutive memory
> > range with a
> > separate directory entry is far from an optimal way of representing
> > something like this. It's outright silly.
> 
> I agree.  And again, Ian, you are just "kicking the problem down the
> road" if we accept these patches.  Please fix this up properly so
> that
> this interface is correctly fixed to not do looney things like this.

Fine, mitigating this problem isn't the end of the story, and you
don't want to do accept a change to mitigate it because that could
mean no further discussion on it and no further work toward solving
it.

But it seems to me a "proper" solution to this will cross a number
of areas so this isn't just "my" problem and, as you point out, it's
likely to become increasingly problematic over time.

So what are your ideas and recommendations on how to handle hotplug
memory at this granularity for this much RAM (and larger amounts)?

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Rick Lindsley



On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:


It should be obvious that representing each consecutive memory range with a
separate directory entry is far from an optimal way of representing
something like this. It's outright silly.


On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:


I agree.  And again, Ian, you are just "kicking the problem down the
road" if we accept these patches.  Please fix this up properly so that
this interface is correctly fixed to not do looney things like this.


Given that we cannot change the underlying machine representation of this hardware, what 
do you (all, not just you Greg) consider to be "properly"?

Rick



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Rick Lindsley

On 6/22/20 10:53 AM, Tejun Heo wrote:


I don't know. The above highlights the absurdity of the approach itself to
me. You seem to be aware of it too in writing: 250,000 "devices".


Just because it is absurd doesn't mean it wasn't built that way :)

I agree, and I'm trying to influence the next hardware design.  However, what's 
already out there is memory units that must be accessed in 256MB blocks.  If 
you want to remove/add a GB, that's really 4 blocks of memory you're 
manipulating, to the hardware.  Those blocks have to be registered and 
recognized by the kernel for that to work.

Rick



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Greg Kroah-Hartman
On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> Hello, Ian.
> 
> On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > They are used for hotplugging and partitioning memory. The size of
> > > > the
> > > > segments (and thus the number of them) is dictated by the
> > > > underlying
> > > > hardware.
> > > 
> > > This sounds so bad. There gotta be a better interface for that,
> > > right?
> > 
> > I'm still struggling a bit to grasp what your getting at but ...
> 
> I was more trying to say that the sysfs device interface with per-object
> directory isn't the right interface for this sort of usage at all. Are these
> even real hardware pieces which can be plugged in and out? While being a
> discrete piece of hardware isn't a requirement to be a device model device,
> the whole thing is designed with such use cases on mind. It definitely isn't
> the right design for representing six digit number of logical entities.
> 
> It should be obvious that representing each consecutive memory range with a
> separate directory entry is far from an optimal way of representing
> something like this. It's outright silly.

I agree.  And again, Ian, you are just "kicking the problem down the
road" if we accept these patches.  Please fix this up properly so that
this interface is correctly fixed to not do looney things like this.

thanks,

greg k-h


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Tejun Heo
On Fri, Jun 19, 2020 at 07:44:29PM -0700, Rick Lindsley wrote:
> echo 0 > /sys/devices//system/memory/memory10374/online
> 
> and boom - you've taken memory chunk 10374 offline.
> 
> These changes are not just a whim. I used lockstat to measure contention
> during boot. The addition of 250,000 "devices" in parallel created
> tremendous contention on the kernfs_mutex and, it appears, on one of the
> directories within it where memory nodes are created. Using a mutex means
> that the details of that mutex must bounce around all the cpus ... did I
> mention 1500+ cpus? A whole lot of thrash ...

I don't know. The above highlights the absurdity of the approach itself to
me. You seem to be aware of it too in writing: 250,000 "devices".

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-22 Thread Tejun Heo
Hello, Ian.

On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > They are used for hotplugging and partitioning memory. The size of
> > > the
> > > segments (and thus the number of them) is dictated by the
> > > underlying
> > > hardware.
> > 
> > This sounds so bad. There gotta be a better interface for that,
> > right?
> 
> I'm still struggling a bit to grasp what your getting at but ...

I was more trying to say that the sysfs device interface with per-object
directory isn't the right interface for this sort of usage at all. Are these
even real hardware pieces which can be plugged in and out? While being a
discrete piece of hardware isn't a requirement to be a device model device,
the whole thing is designed with such use cases on mind. It definitely isn't
the right design for representing six digit number of logical entities.

It should be obvious that representing each consecutive memory range with a
separate directory entry is far from an optimal way of representing
something like this. It's outright silly.

> Maybe your talking about the underlying notifications system where
> a notification is sent for every event.
> 
> There's nothing new about that problem and it's becoming increasingly
> clear that existing kernel notification sub-systems don't scale well.
> 
> Mount handling is a current example which is one of the areas David
> Howells is trying to improve and that's taken years now to get as
> far as it has.

There sure are scalability issues everywhere that needs to be improved but
there also are cases where the issue is the approach itself rather than
scalability and I'm wondering whether this is the latter.

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-20 Thread Ian Kent
On Fri, 2020-06-19 at 18:23 -0400, Tejun Heo wrote:
> On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> > On 6/19/20 8:38 AM, Tejun Heo wrote:
> > 
> > > I don't have strong objections to the series but the rationales
> > > don't seem
> > > particularly strong. It's solving a suspected problem but only
> > > half way. It
> > > isn't clear whether this can be the long term solution for the
> > > problem
> > > machine and whether it will benefit anyone else in a meaningful
> > > way either.
> > 
> > I don't understand your statement about solving the problem
> > halfway. Could
> > you elaborate?
> 
> Spending 5 minutes during boot creating sysfs objects doesn't seem
> like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving
> the
> scalability of kernfs code but the use case seems a bit out there.
> 
> > > I think Greg already asked this but how are the 100,000+ memory
> > > objects
> > > used? Is that justified in the first place?
> > 
> > They are used for hotplugging and partitioning memory. The size of
> > the
> > segments (and thus the number of them) is dictated by the
> > underlying
> > hardware.
> 
> This sounds so bad. There gotta be a better interface for that,
> right?

I'm still struggling a bit to grasp what your getting at but ...

Maybe your talking about the underlying notifications system where
a notification is sent for every event.

There's nothing new about that problem and it's becoming increasingly
clear that existing kernel notification sub-systems don't scale well.

Mount handling is a current example which is one of the areas David
Howells is trying to improve and that's taken years now to get as
far as it has.

It seems to me that any improvements in the area here would have a
different solution, perhaps something along the lines of multiple
notification merging, increased context carried in notifications,
or the like. Something like the notification merging to reduce
notification volume might eventually be useful for David's
notifications sub-system too (and, I think the design of that
sub-system could probably accommodate that sort of change away
from the problematic anonymous notification sub-systems we have
now).

But it's taken a long time to get that far with that project and
the case here would have a far more significant impact on a fairly
large number of sub-systems, both kernel and user space, so all I
can hope for with this discussion is to raise awareness of the need
so that it's recognised and thought about approaches to improving
it can happen.

So, while the questions you ask are valid and your concerns real,
it's unrealistic to think there's a simple solution that can be
implemented in short order. Problem awareness is all that can be done
now so that fundamental and probably wide spread improvements might
be able to be implemented over time.

But if I misunderstand your thinking on this please elaborate further.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-20 Thread Ian Kent
On Fri, 2020-06-19 at 11:38 -0400, Tejun Heo wrote:
> Hello, Ian.
> 
> On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> > The series here tries to reduce the locking needed during path
> > walks
> > based on the assumption that there are many path walks with a
> > fairly
> > large portion of those for non-existent paths, as described above.
> > 
> > That was done by adding kernfs negative dentry caching (non-
> > existent
> > paths) to avoid continual alloc/free cycle of dentries and a
> > read/write
> > semaphore introduced to increase kernfs concurrency during path
> > walks.
> > 
> > With these changes we still need kernel parameters of
> > udev.children-max=2048
> > and systemd.default_timeout_start_sec=300 for the fastest boot
> > times of
> > under 5 minutes.
> 
> I don't have strong objections to the series but the rationales don't
> seem
> particularly strong. It's solving a suspected problem but only half
> way. It
> isn't clear whether this can be the long term solution for the
> problem
> machine and whether it will benefit anyone else in a meaningful way
> either.
> 
> I think Greg already asked this but how are the 100,000+ memory
> objects
> used? Is that justified in the first place?

The problem is real enough, however, whether improvements can be made
in other areas flowing on from the arch specific device creation
notifications is not clear cut.

There's no question that there is very high contention between the
VFS and sysfs and that's all the series is trying to improve, nothing
more.

What both you and Greg have raised are good questions but are
unfortunately very difficult to answer.

I tried to add some discussion about it, to the extent that I could,
in the cover letter.

Basically the division of memory into 256M chunks is something that's
needed to provide flexibility for arbitrary partition creation (a set
of hardware allocated that's used for, essentially, a bare metal OS
install). Whether that's many small partitions for load balanced server
farms (or whatever) or much larger partitions for for demanding
applications, such as Oracle systems, is not something that can be
known in advance.

So the division into small memory chunks can't change.

The question of sysfs node creation, what uses them and when they
are used is much harder.

I'm not able to find that out and, I doubt even IBM would know, if
their customers use applications that need to consult the sysfs
file system for this information or when it's needed if it is need
at all. So I'm stuck on this question.

One thing is for sure though, it would be (at the very least) risky
for a vendor to assume they either aren't needed or aren't needed early
during system start up.

OTOH I've looked at what gets invoked on udev notifications (which
is the source of the heavy path walk activity, I admit I need to
dig deeper) and that doesn't appear to be doing anything obviously
wrong so that far seems ok.

For my part, as long as the series proves to be sound, why not,
it does substantially reduce contention between the VFS and sysfs
is the face of heavy sysfs path walk activity so I think that
stands alone as sufficient to consider the change worthwhile.

Ian



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-19 Thread Rick Lindsley

On 6/19/20 3:23 PM, Tejun Heo wrote:


Spending 5 minutes during boot creating sysfs objects doesn't seem like a
particularly good solution and I don't know whether anyone else would
experience similar issues. Again, not necessarily against improving the
scalability of kernfs code but the use case seems a bit out there.


Creating sysfs objects is not a new "solution".  We already do it, and it can 
take over an hour on a machine with 64TB of memory.  We're not *adding* this ... we're 
*improving* something that has been around for a long time.


They are used for hotplugging and partitioning memory. The size of the
segments (and thus the number of them) is dictated by the underlying
hardware.


This sounds so bad. There gotta be a better interface for that, right?


Again; this is not new.  Easily changing the state of devices by writing new 
values into kernfs is one of the reasons kernfs exists.

echo 0 > /sys/devices//system/memory/memory10374/online

and boom - you've taken memory chunk 10374 offline.

These changes are not just a whim.  I used lockstat to measure contention during boot.  
The addition of 250,000 "devices" in parallel created tremendous contention on 
the kernfs_mutex and, it appears, on one of the directories within it where memory nodes 
are created.  Using a mutex means that the details of that mutex must bounce around all 
the cpus ... did I mention 1500+ cpus?  A whole lot of thrash ...

These patches turn that mutex into a rw semaphore, and any thread checking for 
the mere existence of something can get by with a read lock. lockstat showed 
that about 90% of the mutex contentions were in a read path and only 10% in a 
write path.  Switching to a rw semaphore meant that 90% of the time there was 
no waiting and the thread proceeded with its caches intact.  Total time spent 
waiting on either read or write decreased by 75%, and should scale much better 
with subsequent hardware upgrades.

With contention on this particular data structure reduced, we saw thrash on 
related control structures decrease markedly too.  The contention for the 
lockref taken in dput dropped 66% and, likely due to reduced thrash, the time 
used waiting for that structure dropped 99%.

Rick


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-19 Thread Tejun Heo
On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> On 6/19/20 8:38 AM, Tejun Heo wrote:
> 
> > I don't have strong objections to the series but the rationales don't seem
> > particularly strong. It's solving a suspected problem but only half way. It
> > isn't clear whether this can be the long term solution for the problem
> > machine and whether it will benefit anyone else in a meaningful way either.
> 
> I don't understand your statement about solving the problem halfway. Could
> you elaborate?

Spending 5 minutes during boot creating sysfs objects doesn't seem like a
particularly good solution and I don't know whether anyone else would
experience similar issues. Again, not necessarily against improving the
scalability of kernfs code but the use case seems a bit out there.

> > I think Greg already asked this but how are the 100,000+ memory objects
> > used? Is that justified in the first place?
> 
> They are used for hotplugging and partitioning memory. The size of the
> segments (and thus the number of them) is dictated by the underlying
> hardware.

This sounds so bad. There gotta be a better interface for that, right?

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-19 Thread Rick Lindsley

On 6/19/20 8:38 AM, Tejun Heo wrote:


I don't have strong objections to the series but the rationales don't seem
particularly strong. It's solving a suspected problem but only half way. It
isn't clear whether this can be the long term solution for the problem
machine and whether it will benefit anyone else in a meaningful way either.


I don't understand your statement about solving the problem halfway.  Could you 
elaborate?


I think Greg already asked this but how are the 100,000+ memory objects
used? Is that justified in the first place?


They are used for hotplugging and partitioning memory.  The size of the 
segments (and thus the number of them) is dictated by the underlying hardware.

Rick



Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-19 Thread Tejun Heo
Hello, Ian.

On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> The series here tries to reduce the locking needed during path walks
> based on the assumption that there are many path walks with a fairly
> large portion of those for non-existent paths, as described above.
> 
> That was done by adding kernfs negative dentry caching (non-existent
> paths) to avoid continual alloc/free cycle of dentries and a read/write
> semaphore introduced to increase kernfs concurrency during path walks.
> 
> With these changes we still need kernel parameters of udev.children-max=2048
> and systemd.default_timeout_start_sec=300 for the fastest boot times of
> under 5 minutes.

I don't have strong objections to the series but the rationales don't seem
particularly strong. It's solving a suspected problem but only half way. It
isn't clear whether this can be the long term solution for the problem
machine and whether it will benefit anyone else in a meaningful way either.

I think Greg already asked this but how are the 100,000+ memory objects
used? Is that justified in the first place?

Thanks.

-- 
tejun


[PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-06-17 Thread Ian Kent
For very large IBM Power mainframe systems with hundreds of CPUs and TBs
of RAM booting can take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require kernel
parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600
to prevent dropping into emergency mode.

Gathering information about what's happening during the boot is a bit
challenging but two main issues appeared to be: a large number of path
lookups for non-existent files, and very high lock contention in the VFS
during path walks particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration as the hardware
divides the memory into 256MB logical blocks. This is believed to be due
to either IBM Power hardware design or a requirement of the mainframe
software used to create logical partitions (LPARs, that are used to
install an operating system to provide services), since these can be made
up of a wide range of resources, CPU, Memory, disks, etc.

It's unclear yet whether the creation of syfs nodes for these memory
devices can be postponed or spread out over a larger amount of time.
That's because the high overhead looks to be due to notifications received
by udev which invokes a systemd program for them and attempts by systemd
folks to improve this have not focused on changing the handling of these
notifications, possibly because of difficulties with doing so. This
remains an avenue of investigation.

Kernel traces show there are many path walks with a fairly large portion
of those for non-existent paths. However, looking at the systemd code
invoked by the udev action it appears there's only one additional lookup
for each invocation so the large number of negative lookups is most likely
due to the large number of notifications rather than a fault with the
systemd program.

The series here tries to reduce the locking needed during path walks
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (non-existent
paths) to avoid continual alloc/free cycle of dentries and a read/write
semaphore introduced to increase kernfs concurrency during path walks.

With these changes we still need kernel parameters of udev.children-max=2048
and systemd.default_timeout_start_sec=300 for the fastest boot times of
under 5 minutes.

There may be opportunities for further improvements but the series here
has seen a fair amount of testing and thinking about what else these could
be. Discussing it with Rick Lindsay, I suspect improvements will get more
difficult to implement for somewhat less improvement so I think what we
have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (6):
  kernfs: switch kernfs to use an rwsem
  kernfs: move revalidate to be near lookup
  kernfs: improve kernfs path resolution
  kernfs: use revision to identify directory node changes
  kernfs: refactor attr locking
  kernfs: make attr_mutex a local kernfs node lock


 fs/kernfs/dir.c |  284 ---
 fs/kernfs/file.c|4 -
 fs/kernfs/inode.c   |   58 +
 fs/kernfs/kernfs-internal.h |   29 
 fs/kernfs/mount.c   |   12 +-
 fs/kernfs/symlink.c |4 -
 include/linux/kernfs.h  |7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian