date:20231212

Re: file handle in statx

2023-12-12 Thread Donald Buczek

On 12/12/23 16:20, Theodore Ts'o wrote:
> On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
>> On 12/12/23 06:53, Dave Chinner wrote:
>>
>>> So can someone please explain to me why we need to try to re-invent
>>> a generic filehandle concept in statx when we already have a
>>> have working and widely supported user API that provides exactly
>>> this functionality?
>>
>> name_to_handle_at() is fine, but userspace could profit from being
>> able to retrieve the filehandle together with the other metadata in
>> a single system call.
> 
> Can you say more?  What, specifically is the application that would
> want to do that, and is it really in such a hot path that it would be
> a user-visible improveable, let aloine something that can be actually
> be measured?

Probably not for the specific applications I mentioned (backup, mirror,
accounting). These are intended to run continuously, slowly and unnoticed
in the background, so they are memory and i/o throttled via cgroups anyway
and one is even using sleep after so-and-so many stat calls to reduce
its impact.

If they could tell a directory from a snapshot, I would probably stop them
from walking into snapshots. And if not, the snapshot id is all that is
needed to tell a clone in a snapshot from a hardlink. So these don't really
need the filehandle.

In the thread it was assumed, that there are other (unspecified)
applications which need the filehandle and currently use name_to_handle_at().

I though it was self-evident that a single syscall to retrieve all
information atomically is better than a set of syscalls. Each additional
syscall has overhead and you need to be concerned with the data changing
between the calls.

Userspace nfs server as an example of an application, where visible
performance is more relevant, was already mentioned by someone else.

Best
  Donald

> 
>   - Ted
-- 
Donald Buczek
buc...@molgen.mpg.de
Tel: +49 30 8413 1433

Re: [PATCH v2 3/3] statx04: Skip STATX_ATTR_COMPRESSED on Bcachefs

2023-12-12 Thread Kent Overstreet

On Thu, Dec 07, 2023 at 08:40:11PM +0100, Petr Vorel wrote:
> STATX_ATTR_COMPRESSED is not supported on Bcachefs, thus skip it
> (it's already skipped on tmpfs and XFS).

Hang on, bcachefs most definitely does hae compression. This would be
just because statx isn't plumbed through?

> 
> Suggested-by: Cyril Hrubis 
> Signed-off-by: Petr Vorel 
> ---
> New in v2
> 
>  testcases/kernel/syscalls/statx/statx04.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/testcases/kernel/syscalls/statx/statx04.c 
> b/testcases/kernel/syscalls/statx/statx04.c
> index 58296bd24..c2ed52deb 100644
> --- a/testcases/kernel/syscalls/statx/statx04.c
> +++ b/testcases/kernel/syscalls/statx/statx04.c
> @@ -96,8 +96,9 @@ static void setup(void)
>   for (i = 0, expected_mask = 0; i < ARRAY_SIZE(attr_list); i++)
>   expected_mask |= attr_list[i].attr;
>  
> - /* STATX_ATTR_COMPRESSED not supported on XFS, TMPFS */
> - if (!strcmp(tst_device->fs_type, "xfs") || !strcmp(tst_device->fs_type, 
> "tmpfs"))
> + /* STATX_ATTR_COMPRESSED not supported on Bcachefs, TMPFS, XFS */
> + if (!strcmp(tst_device->fs_type, "bcachefs") || 
> !strcmp(tst_device->fs_type, "tmpfs") ||
> + !strcmp(tst_device->fs_type, "xfs"))
>   expected_mask &= ~STATX_ATTR_COMPRESSED;
>  
>   /* Attribute support was added to Btrfs statx() in kernel v4.13 */
> -- 
> 2.43.0
>

Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis

2023-12-12 Thread Yu Kuai


Hi,

在 2023/12/12 21:14, Christoph Hellwig 写道:

On Mon, Dec 11, 2023 at 05:52:17PM +0100, Jan Kara wrote:

+void bdev_associated_mapping(struct block_device *bdev,
+struct address_space *mapping)
+{
+   mapping->host = bdev->bd_inode;
+}


Here I'm not sure - is the helper really a win? It seems a bit obscure to
me. This initialization of another mapping for a bdev looks really special.


If we want to hide bd_inode we'll something like this helper even if
I don't particularly like it either.

But it might be a good idea to move out of this series and into the
follow on removing bd_inode, as it's rather pointless without that
context.


Yes, this sounds good, I'll remove this from v3.

Thanks,
Kuai


.

Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis

2023-12-12 Thread Yu Kuai


Hi,

在 2023/12/12 21:16, Christoph Hellwig 写道:

+void invalidate_bdev_range(struct block_device *bdev, pgoff_t start,
+  pgoff_t end)
+{
+   invalidate_mapping_pages(bdev->bd_inode->i_mapping, start, end);
+}
+EXPORT_SYMBOL_GPL(invalidate_bdev_range);


Can we have kerneldoc comments for the new helpers please?


Of course, will definitely do this in v3.



+struct folio *__bdev_get_folio(struct block_device *bdev, loff_t pos,
+  fgf_t fgp_flags, gfp_t gfp)
+{
+   return __filemap_get_folio(bdev->bd_inode->i_mapping, pos >> PAGE_SHIFT,
+  fgp_flags, gfp);
+}
+EXPORT_SYMBOL_GPL(__bdev_get_folio);


It's a bit silly to have a __-prefixed API without a version that
doesn't have the prefix, so I'd prefer to drop it.  Unless willy has
a good argument for keeping it the same as the filemap API.


Ok, I'll drop it if willy doesn't against this.

Thanks,
Kuai


.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Wed, Dec 13, 2023 at 11:02:29AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote:
> > > On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > > > On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote:
> > > > > Kent Overstreet  wrote:
> > > > > 
> > > > > > I was chatting a bit with David Howells on IRC about this, and 
> > > > > > floated
> > > > > > adding the file handle to statx. It looks like there's enough space
> > > > > > reserved to make this feasible - probably going with a fixed maximum
> > > > > > size of 128-256 bits.
> > > > > 
> > > > > We can always save the last bit to indicate extension space/extension 
> > > > > record,
> > > > > so we're not that strapped for space.
> > > > 
> > > > So we'll need that if we want to round trip NFSv4 filehandles, they
> > > > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
> > > > bytes reserved).
> > > > 
> > > > Obvious question (Neal): do/will real world implementations ever come
> > > > close to making use of this, or was this a "future proofing gone wild"
> > > > thing?
> > > 
> > > I have no useful data.  I have seen lots of filehandles but I don't pay
> > > much attention to their length.  Certainly some are longer than 32 bytes.
> > > 
> > > > 
> > > > Say we do decide we want to spec it that large: _can_ we extend struct
> > > > statx? I'm wondering if the userspace side was thought through, I'm
> > > > sure glibc people will have something to say.
> > > 
> > > The man page says:
> > > 
> > >  Therefore, do not simply set mask to UINT_MAX (all bits set), as
> > >  one or more bits may, in the future, be used to specify an
> > >  extension to the buffer.
> > > 
> > > I suspect the glibc people read that.
> > 
> > The trouble is that C has no notion of which types are safe to pass
> > across a dynamic library boundary, so if we increase the size of struct
> > statx and someone's doing that things will break in nasty ways.
> > 
> 
> Maybe we don't increase the size of struct statx.
> Maybe we declare
> 
>struct statx2 {
>  struct statx base;
>  __u8 stx_handle[128];
>}
> 
> and pass then when we request STX_HANDLE.

yeah, I think that's what would be required.

David was originally proposing having userspace just pass in a size_t
for the buffer size, and I really think that would've been better - if
this thing can ever change size, we need to make that explicit in the
API.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > 
> > > What you are suggesting is that we now duplicate filehandle encoding
> > > into every filesystem's statx() implementation.  That's a bad
> > > trade-off from a maintenance, testing and consistency POV because
> > > now we end up with lots of individual, filehandle encoding
> > > implementations in addition to the generic filehandle
> > > infrastructure that we all have to test and validate.
> > 
> > Not correct.  We are suggesting an interface, not an implementation.
> > Here you are proposing a suboptimal implementation, pointing out its
> > weakness, and suggesting the has consequences for the interface
> > proposal.  Is that the strawman fallacy?
> 
> No, you simply haven't followed deep enough into the rabbit hole to
> understand Kent was suggesting potential implementation details to
> address hot path performance concerns with filehandle encoding.

Yes, I missed that.  Sorry.

NeilBrown


> 
> > vfs_getattr_nosec could, after calling i_op->getattr, check if
> > STATX_HANDLE is set in request_mask but not in ->result_mask.
> > If so it could call exportfs_encode_fh() and handle the result.
> >
> > No filesystem need to be changed.
> 
> Well, yes, it's pretty damn obvious that is exactly what I've been
> advocating for here - if we are going to put filehandles in statx(),
> then it must use the same infrastructure as name_to_handle_at().
> i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the
> filehandle.
> 
> The important discussion detail you've missed about
> exportfs_encode_fh() is that it *requires* adding a new indirect
> call (via export_ops->encode_fh) in the statx path to encode the
> filehandle, and that's exactly what Kent was suggesting we can code
> the implementation to avoid.
> 
> Avoiding an indirect function call is an implementation detail, not
> an interface design requirement.
> 
> And the only way to avoid adding new indirect calls to encoding
> filesystem specific filehandles is to implement the encoding in the
> existing individual filesystem i_op->getattr methods. i.e. duplicate
> the filehandle encoding in the statx path rather than use
> exportfs_encode_fh().
> 
> -Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
>

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > > On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote:
> > > > Kent Overstreet  wrote:
> > > > 
> > > > > I was chatting a bit with David Howells on IRC about this, and floated
> > > > > adding the file handle to statx. It looks like there's enough space
> > > > > reserved to make this feasible - probably going with a fixed maximum
> > > > > size of 128-256 bits.
> > > > 
> > > > We can always save the last bit to indicate extension space/extension 
> > > > record,
> > > > so we're not that strapped for space.
> > > 
> > > So we'll need that if we want to round trip NFSv4 filehandles, they
> > > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
> > > bytes reserved).
> > > 
> > > Obvious question (Neal): do/will real world implementations ever come
> > > close to making use of this, or was this a "future proofing gone wild"
> > > thing?
> > 
> > I have no useful data.  I have seen lots of filehandles but I don't pay
> > much attention to their length.  Certainly some are longer than 32 bytes.
> > 
> > > 
> > > Say we do decide we want to spec it that large: _can_ we extend struct
> > > statx? I'm wondering if the userspace side was thought through, I'm
> > > sure glibc people will have something to say.
> > 
> > The man page says:
> > 
> >  Therefore, do not simply set mask to UINT_MAX (all bits set), as
> >  one or more bits may, in the future, be used to specify an
> >  extension to the buffer.
> > 
> > I suspect the glibc people read that.
> 
> The trouble is that C has no notion of which types are safe to pass
> across a dynamic library boundary, so if we increase the size of struct
> statx and someone's doing that things will break in nasty ways.
> 

Maybe we don't increase the size of struct statx.
Maybe we declare

   struct statx2 {
 struct statx base;
 __u8 stx_handle[128];
   }

and pass then when we request STX_HANDLE.

Or maybe there is a better way.  I'm sure the glibc developers have face
this sort of problem before.

NeilBrown

Re: file handle in statx

2023-12-12 Thread Kent Overstreet

On Wed, Dec 13, 2023 at 10:44:07AM +1100, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 05:39:27PM -0500, Kent Overstreet wrote:
> > Like Neal mentioned we won't even be fetching the fh if it wasn't
> > explicitly requested - and like I mentioned, we can avoid the
> > .encode_fh() call for local filesystems with a bit of work at the VFS
> > layer.
> > 
> > OTOH, when you're running rsync in incremental mode, and detecting
> > hardlinks, your point that "statx can be called millions of times per
> > second" would apply just as much to the additional name_to_handle_at()
> > call - we'd be nearly doubling their overhead for scanning files that
> > don't need to be sent.
> 
> Hardlinked files are indicated by st_nlink > 1, not by requiring
> userspace to store every st_ino/dev it sees and having to compare
> the st-ino/dev of every newly stat()d inode against that ino/dev
> cache.
> 
> We only need ino/dev/filehandles for hardlink path disambiguation.
> 
> IOWs, this use case does not need name_to_handle_at() for millions
> of inodes - it is just needed on the regular file inodes that have
> st_nlink > 1.

Ok yeah, that's a really good point. Perhaps nanme_to_handle_at() is
sufficient, then.

If so, maybe we can just add STATX_ATTR_INUM_NOT_UNIQUE and STATX_VOL
now, and leave STATX_HANDLE until someone discovers an application where
it actually does matter.

> > > And then comes the cost of encoding dynamically sized information in
> > > struct statx - filehandles are not fixed size - and statx is most
> > > definitely not set up or intended for dynamically sized attribute
> > > data. This adds more complexity to statx because it wasn't designed
> > > or intended to handle dynamically sized attributes. Optional
> > > attributes, yes, but not attributes that might vary in size from fs
> > > to fs or even inode type to inode type within a fileystem (e.g. dir
> > > filehandles can, optionally, encode the parent inode in them).
> > 
> > Since it looks like expanding statx is not going to be quite as easy as
> > hoped, I proposed elsewhere in the thread that we reserve a smaller
> > fixed size in statx (32 bytes) and set a flag if it won't fit,
> > indicating that userspace needs to fall back to name_to_handle_at().
> 
> struct btrfs_fid is 40 bytes in size. Sure, that's not all used for
> name_to_handle_at(), but we already have in-kernel filehandles that
> can optionally configured to be bigger than 32 bytes...

The hell is all that for!? They never reuse inode numbers, why are there
generation numbers in there? And do they not have inode -> dirent
backrefs?

> > Stuffing a _dynamically_ sized attribute into statx would indeed be
> > painful - I believe were always talking about a fixed size buffer in
> > statx, the discussion's been over how big it needs to be...
> 
> The contents of the buffer is still dynamically sized, so there's
> still a length attribute that needs to be emitted to userspace with
> the buffer.

Correct

> And then what happens with the next attribute that someone wants
> statx() to expose that can be dynamically sized? Are we really
> planning to allow the struct statx to be expanded indefinitely
> with largely unused static data arrays?

Well, struct stat/statx is not a long lived object that anyone would
ever keep a lot of around; it's a short lived object that just needs to
be efficient to access and ABI stable, so yes, if this comes up again
that's what we should do.

The alternative would be adding fields with an [ offset, length ] scheme
and treating the statx buffer as a bump allocator, but simple and fast
to access beats space efficiency here...

Re: file handle in statx

2023-12-12 Thread Dave Chinner

On Tue, Dec 12, 2023 at 05:39:27PM -0500, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 09:23:18AM +1100, Dave Chinner wrote:
> > On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> > > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > > > >
> > > > > > > > So can someone please explain to me why we need to try to 
> > > > > > > > re-invent
> > > > > > > > a generic filehandle concept in statx when we already have a 
> > > > > > > > have
> > > > > > > > working and widely supported user API that provides exactly this
> > > > > > > > functionality?
> > > > > > >
> > > > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > > > able to retrieve the filehandle together with the other metadata 
> > > > > > > in a
> > > > > > > single system call.
> > > > > > 
> > > > > > Can you say more?  What, specifically is the application that would 
> > > > > > want
> > > > > to do
> > > > > > that, and is it really in such a hot path that it would be a 
> > > > > > user-visible
> > > > > > improveable, let aloine something that can be actually be measured?
> > > > > 
> > > > > A user space NFS server like Ganesha could benefit from getting 
> > > > > attributes
> > > > > and file handle in a single system call.
> > > > 
> > > > At the cost of every other application that doesn't need those
> > > > attributes.
> > > 
> > > Why do you think there would be a cost?
> > 
> > It's as much maintenance and testing cost as it is a runtime cost.
> > We have to test and check this functionality works as advertised,
> > and we have to maintain that in working order forever more. That's
> > not free, especially if it is decided that the implementation needs
> > to be hyper-optimised in each individual filesystem because of
> > performance cost reasons.
> > 
> > Indeed, even the runtime "do we need to fetch this information"
> > checks have a measurable cost, especially as statx() is a very hot
> > kernel path. We've been optimising branches out of things like
> > setting up kiocbs because when that path is taken millions of times
> > every second each logic branch that decides if something needs to be
> > done or not has a direct measurable cost. statx() is a hot path that
> > can be called millions of times a second.
> 
> Like Neal mentioned we won't even be fetching the fh if it wasn't
> explicitly requested - and like I mentioned, we can avoid the
> .encode_fh() call for local filesystems with a bit of work at the VFS
> layer.
> 
> OTOH, when you're running rsync in incremental mode, and detecting
> hardlinks, your point that "statx can be called millions of times per
> second" would apply just as much to the additional name_to_handle_at()
> call - we'd be nearly doubling their overhead for scanning files that
> don't need to be sent.

Hardlinked files are indicated by st_nlink > 1, not by requiring
userspace to store every st_ino/dev it sees and having to compare
the st-ino/dev of every newly stat()d inode against that ino/dev
cache.

We only need ino/dev/filehandles for hardlink path disambiguation.

IOWs, this use case does not need name_to_handle_at() for millions
of inodes - it is just needed on the regular file inodes that have
st_nlink > 1.

Hence even for wrokloads like rsync with hardlink detection, we
don't need filehandles for every inode being stat()d.  And that's
ignoring the fact that, outside of certain niche use cases,
hardlinks are rare.

I'm really struggling to see what filehandles in statx() actually
optimises in any meaningful manner

> > And then comes the cost of encoding dynamically sized information in
> > struct statx - filehandles are not fixed size - and statx is most
> > definitely not set up or intended for dynamically sized attribute
> > data. This adds more complexity to statx because it wasn't designed
> > or intended to handle dynamically sized attributes. Optional
> > attributes, yes, but not attributes that might vary in size from fs
> > to fs or even inode type to inode type within a fileystem (e.g. dir
> > filehandles can, optionally, encode the parent inode in them).
> 
> Since it looks like expanding statx is not going to be quite as easy as
> hoped, I proposed elsewhere in the thread that we reserve a smaller
> fixed size in statx (32 bytes) and set a flag if it won't fit,
> indicating that userspace needs to fall back to name_to_handle_at().

struct btrfs_fid is 40 bytes in size. Sure, that's not all used for
name_to_handle_at(), but we already have in-kernel filehandles that
can optionally configured to be bigger than 32 bytes...

> Stuffing a _dynamically_ sized attribute into statx would indeed be
> painful - I believe were always talking about a fixed size buffer in
> statx, the discussion's been over how big it needs to be...

The contents of t

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Kent Overstreet wrote:
> > On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote:
> > > Kent Overstreet  wrote:
> > > 
> > > > I was chatting a bit with David Howells on IRC about this, and floated
> > > > adding the file handle to statx. It looks like there's enough space
> > > > reserved to make this feasible - probably going with a fixed maximum
> > > > size of 128-256 bits.
> > > 
> > > We can always save the last bit to indicate extension space/extension 
> > > record,
> > > so we're not that strapped for space.
> > 
> > So we'll need that if we want to round trip NFSv4 filehandles, they
> > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
> > bytes reserved).
> > 
> > Obvious question (Neal): do/will real world implementations ever come
> > close to making use of this, or was this a "future proofing gone wild"
> > thing?
> 
> I have no useful data.  I have seen lots of filehandles but I don't pay
> much attention to their length.  Certainly some are longer than 32 bytes.
> 
> > 
> > Say we do decide we want to spec it that large: _can_ we extend struct
> > statx? I'm wondering if the userspace side was thought through, I'm
> > sure glibc people will have something to say.
> 
> The man page says:
> 
>  Therefore, do not simply set mask to UINT_MAX (all bits set), as
>  one or more bits may, in the future, be used to specify an
>  extension to the buffer.
> 
> I suspect the glibc people read that.

The trouble is that C has no notion of which types are safe to pass
across a dynamic library boundary, so if we increase the size of struct
statx and someone's doing that things will break in nasty ways.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Wed, Dec 13, 2023 at 10:06:37AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > 
> > > What you are suggesting is that we now duplicate filehandle encoding
> > > into every filesystem's statx() implementation.  That's a bad
> > > trade-off from a maintenance, testing and consistency POV because
> > > now we end up with lots of individual, filehandle encoding
> > > implementations in addition to the generic filehandle
> > > infrastructure that we all have to test and validate.
> > 
> > Not correct.  We are suggesting an interface, not an implementation.
> > Here you are proposing a suboptimal implementation, pointing out its
> > weakness, and suggesting the has consequences for the interface
> > proposal.  Is that the strawman fallacy?
> 
> No, you simply haven't followed deep enough into the rabbit hole to
> understand Kent was suggesting potential implementation details to
> address hot path performance concerns with filehandle encoding.
> 
> > vfs_getattr_nosec could, after calling i_op->getattr, check if
> > STATX_HANDLE is set in request_mask but not in ->result_mask.
> > If so it could call exportfs_encode_fh() and handle the result.
> >
> > No filesystem need to be changed.
> 
> Well, yes, it's pretty damn obvious that is exactly what I've been
> advocating for here - if we are going to put filehandles in statx(),
> then it must use the same infrastructure as name_to_handle_at().
> i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the
> filehandle.
> 
> The important discussion detail you've missed about
> exportfs_encode_fh() is that it *requires* adding a new indirect
> call (via export_ops->encode_fh) in the statx path to encode the
> filehandle, and that's exactly what Kent was suggesting we can code
> the implementation to avoid.
> 
> Avoiding an indirect function call is an implementation detail, not
> an interface design requirement.
> 
> And the only way to avoid adding new indirect calls to encoding
> filesystem specific filehandles is to implement the encoding in the
> existing individual filesystem i_op->getattr methods. i.e. duplicate
> the filehandle encoding in the statx path rather than use
> exportfs_encode_fh().

I was thinking along the lines of coming up with a common fh type for
local filesystems (why exactly do we need 15?) and adding a volume ID to
the VFS inode so this could live entirely in VS code for most
filesystems, but that's an option too.

Might be the best one, since btrfs and bcachefs actually do want a
different fh type (btrfs: 64 bit subvol, 64 bit ino, bcachefs: 64 bit
ino, 32 bit subvol, 32 bit generation), and we don't want to generate a
bigger fh than necessary for when it's being consumed by a stacking
filesystem that has to generate a new fh by concatanating something.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Dave Chinner

On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Dave Chinner wrote:
> > 
> > What you are suggesting is that we now duplicate filehandle encoding
> > into every filesystem's statx() implementation.  That's a bad
> > trade-off from a maintenance, testing and consistency POV because
> > now we end up with lots of individual, filehandle encoding
> > implementations in addition to the generic filehandle
> > infrastructure that we all have to test and validate.
> 
> Not correct.  We are suggesting an interface, not an implementation.
> Here you are proposing a suboptimal implementation, pointing out its
> weakness, and suggesting the has consequences for the interface
> proposal.  Is that the strawman fallacy?

No, you simply haven't followed deep enough into the rabbit hole to
understand Kent was suggesting potential implementation details to
address hot path performance concerns with filehandle encoding.

> vfs_getattr_nosec could, after calling i_op->getattr, check if
> STATX_HANDLE is set in request_mask but not in ->result_mask.
> If so it could call exportfs_encode_fh() and handle the result.
>
> No filesystem need to be changed.

Well, yes, it's pretty damn obvious that is exactly what I've been
advocating for here - if we are going to put filehandles in statx(),
then it must use the same infrastructure as name_to_handle_at().
i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the
filehandle.

The important discussion detail you've missed about
exportfs_encode_fh() is that it *requires* adding a new indirect
call (via export_ops->encode_fh) in the statx path to encode the
filehandle, and that's exactly what Kent was suggesting we can code
the implementation to avoid.

Avoiding an indirect function call is an implementation detail, not
an interface design requirement.

And the only way to avoid adding new indirect calls to encoding
filesystem specific filehandles is to implement the encoding in the
existing individual filesystem i_op->getattr methods. i.e. duplicate
the filehandle encoding in the statx path rather than use
exportfs_encode_fh().

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Kent Overstreet wrote:
> On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote:
> > Kent Overstreet  wrote:
> > 
> > > I was chatting a bit with David Howells on IRC about this, and floated
> > > adding the file handle to statx. It looks like there's enough space
> > > reserved to make this feasible - probably going with a fixed maximum
> > > size of 128-256 bits.
> > 
> > We can always save the last bit to indicate extension space/extension 
> > record,
> > so we're not that strapped for space.
> 
> So we'll need that if we want to round trip NFSv4 filehandles, they
> won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
> bytes reserved).
> 
> Obvious question (Neal): do/will real world implementations ever come
> close to making use of this, or was this a "future proofing gone wild"
> thing?

I have no useful data.  I have seen lots of filehandles but I don't pay
much attention to their length.  Certainly some are longer than 32 bytes.

> 
> Say we do decide we want to spec it that large: _can_ we extend struct
> statx? I'm wondering if the userspace side was thought through, I'm
> sure glibc people will have something to say.

The man page says:

 Therefore, do not simply set mask to UINT_MAX (all bits set), as
 one or more bits may, in the future, be used to specify an
 extension to the buffer.

I suspect the glibc people read that.

NeilBrown




> 
> Kernel side we can definitely extend struct statx, and we know how many
> bytes to copy to userspace because we know what fields userspace
> requested. The part I'm concerned about is that if we extend userspace's
> struct statx, that introduces obvious ABI compabitibility issues.
> 
> So this would probably force glibc to introduce a new version of struct
> statx, if I'm not mistaken.
> 
> Or: another option would be to reserve something small and sane in
> struct statx (32 bytes max, I'd say), and then set a flag to tell
> userspace they need to use name_to_handle_at() if it didn't fit.
>

Re: file handle in statx

2023-12-12 Thread Kent Overstreet

On Wed, Dec 13, 2023 at 09:23:18AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > > >
> > > > > > > So can someone please explain to me why we need to try to 
> > > > > > > re-invent
> > > > > > > a generic filehandle concept in statx when we already have a have
> > > > > > > working and widely supported user API that provides exactly this
> > > > > > > functionality?
> > > > > >
> > > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > > able to retrieve the filehandle together with the other metadata in 
> > > > > > a
> > > > > > single system call.
> > > > > 
> > > > > Can you say more?  What, specifically is the application that would 
> > > > > want
> > > > to do
> > > > > that, and is it really in such a hot path that it would be a 
> > > > > user-visible
> > > > > improveable, let aloine something that can be actually be measured?
> > > > 
> > > > A user space NFS server like Ganesha could benefit from getting 
> > > > attributes
> > > > and file handle in a single system call.
> > > 
> > > At the cost of every other application that doesn't need those
> > > attributes.
> > 
> > Why do you think there would be a cost?
> 
> It's as much maintenance and testing cost as it is a runtime cost.
> We have to test and check this functionality works as advertised,
> and we have to maintain that in working order forever more. That's
> not free, especially if it is decided that the implementation needs
> to be hyper-optimised in each individual filesystem because of
> performance cost reasons.
> 
> Indeed, even the runtime "do we need to fetch this information"
> checks have a measurable cost, especially as statx() is a very hot
> kernel path. We've been optimising branches out of things like
> setting up kiocbs because when that path is taken millions of times
> every second each logic branch that decides if something needs to be
> done or not has a direct measurable cost. statx() is a hot path that
> can be called millions of times a second.

Like Neal mentioned we won't even be fetching the fh if it wasn't
explicitly requested - and like I mentioned, we can avoid the
.encode_fh() call for local filesystems with a bit of work at the VFS
layer.

OTOH, when you're running rsync in incremental mode, and detecting
hardlinks, your point that "statx can be called millions of times per
second" would apply just as much to the additional name_to_handle_at()
call - we'd be nearly doubling their overhead for scanning files that
don't need to be sent.

> And then comes the cost of encoding dynamically sized information in
> struct statx - filehandles are not fixed size - and statx is most
> definitely not set up or intended for dynamically sized attribute
> data. This adds more complexity to statx because it wasn't designed
> or intended to handle dynamically sized attributes. Optional
> attributes, yes, but not attributes that might vary in size from fs
> to fs or even inode type to inode type within a fileystem (e.g. dir
> filehandles can, optionally, encode the parent inode in them).

Since it looks like expanding statx is not going to be quite as easy as
hoped, I proposed elsewhere in the thread that we reserve a smaller
fixed size in statx (32 bytes) and set a flag if it won't fit,
indicating that userspace needs to fall back to name_to_handle_at().

Stuffing a _dynamically_ sized attribute into statx would indeed be
painful - I believe were always talking about a fixed size buffer in
statx, the discussion's been over how big it needs to be...

Re: file handle in statx

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> > On Wed, 13 Dec 2023, Dave Chinner wrote:
> > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > > >
> > > > > > > So can someone please explain to me why we need to try to 
> > > > > > > re-invent
> > > > > > > a generic filehandle concept in statx when we already have a have
> > > > > > > working and widely supported user API that provides exactly this
> > > > > > > functionality?
> > > > > >
> > > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > > able to retrieve the filehandle together with the other metadata in 
> > > > > > a
> > > > > > single system call.
> > > > > 
> > > > > Can you say more?  What, specifically is the application that would 
> > > > > want
> > > > to do
> > > > > that, and is it really in such a hot path that it would be a 
> > > > > user-visible
> > > > > improveable, let aloine something that can be actually be measured?
> > > > 
> > > > A user space NFS server like Ganesha could benefit from getting 
> > > > attributes
> > > > and file handle in a single system call.
> > > 
> > > At the cost of every other application that doesn't need those
> > > attributes.
> > 
> > Why do you think there would be a cost?
> 
> It's as much maintenance and testing cost as it is a runtime cost.

You are moving the goal posts.  Maintenance is not a cost borne by
"every other application".

But if you think it is a concern it is certainly worth mentioning.

> We have to test and check this functionality works as advertised,
> and we have to maintain that in working order forever more. That's
> not free, especially if it is decided that the implementation needs
> to be hyper-optimised in each individual filesystem because of
> performance cost reasons.
> 
> Indeed, even the runtime "do we need to fetch this information"
> checks have a measurable cost, especially as statx() is a very hot
> kernel path. We've been optimising branches out of things like
> setting up kiocbs because when that path is taken millions of times
> every second each logic branch that decides if something needs to be
> done or not has a direct measurable cost. statx() is a hot path that
> can be called millions of times a second.
> 
> And then comes the cost of encoding dynamically sized information in
> struct statx - filehandles are not fixed size - and statx is most
> definitely not set up or intended for dynamically sized attribute
> data. This adds more complexity to statx because it wasn't designed
> or intended to handle dynamically sized attributes. Optional
> attributes, yes, but not attributes that might vary in size from fs
> to fs or even inode type to inode type within a fileystem (e.g. dir
> filehandles can, optionally, encode the parent inode in them).

Filehandles are fixed sized.  They are one integer in the range 0-128
plus 128 bytes.  We know/promise the when the integer is less than 128,
trailing bytes of the 128 will all be zero.

NeilBRown


> 
> -Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
>

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Dave Chinner wrote:
> 
> What you are suggesting is that we now duplicate filehandle encoding
> into every filesystem's statx() implementation.  That's a bad
> trade-off from a maintenance, testing and consistency POV because
> now we end up with lots of individual, filehandle encoding
> implementations in addition to the generic filehandle
> infrastructure that we all have to test and validate.

Not correct.  We are suggesting an interface, not an implementation.
Here you are proposing a suboptimal implementation, pointing out its
weakness, and suggesting the has consequences for the interface
proposal.  Is that the strawman fallacy?

vfs_getattr_nosec could, after calling i_op->getattr, check if
STATX_HANDLE is set in request_mask but not in ->result_mask.
If so it could call exportfs_encode_fh() and handle the result.

No filesystem need to be changed.

NeilBrown

Re: file handle in statx

2023-12-12 Thread Dave Chinner

On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote:
> On Wed, 13 Dec 2023, Dave Chinner wrote:
> > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > >
> > > > > > So can someone please explain to me why we need to try to re-invent
> > > > > > a generic filehandle concept in statx when we already have a have
> > > > > > working and widely supported user API that provides exactly this
> > > > > > functionality?
> > > > >
> > > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > > able to retrieve the filehandle together with the other metadata in a
> > > > > single system call.
> > > > 
> > > > Can you say more?  What, specifically is the application that would want
> > > to do
> > > > that, and is it really in such a hot path that it would be a 
> > > > user-visible
> > > > improveable, let aloine something that can be actually be measured?
> > > 
> > > A user space NFS server like Ganesha could benefit from getting attributes
> > > and file handle in a single system call.
> > 
> > At the cost of every other application that doesn't need those
> > attributes.
> 
> Why do you think there would be a cost?

It's as much maintenance and testing cost as it is a runtime cost.
We have to test and check this functionality works as advertised,
and we have to maintain that in working order forever more. That's
not free, especially if it is decided that the implementation needs
to be hyper-optimised in each individual filesystem because of
performance cost reasons.

Indeed, even the runtime "do we need to fetch this information"
checks have a measurable cost, especially as statx() is a very hot
kernel path. We've been optimising branches out of things like
setting up kiocbs because when that path is taken millions of times
every second each logic branch that decides if something needs to be
done or not has a direct measurable cost. statx() is a hot path that
can be called millions of times a second.

And then comes the cost of encoding dynamically sized information in
struct statx - filehandles are not fixed size - and statx is most
definitely not set up or intended for dynamically sized attribute
data. This adds more complexity to statx because it wasn't designed
or intended to handle dynamically sized attributes. Optional
attributes, yes, but not attributes that might vary in size from fs
to fs or even inode type to inode type within a fileystem (e.g. dir
filehandles can, optionally, encode the parent inode in them).

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Dave Chinner

On Tue, Dec 12, 2023 at 04:23:06PM -0500, Kent Overstreet wrote:
> On Wed, Dec 13, 2023 at 07:48:16AM +1100, Dave Chinner wrote:
> > On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> > > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > > > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > > > the middle of the room?
> > > > 
> > > > I mean, we already have name_to_handle_at() for userspace to get a
> > > > unique, opaque, filesystem defined file handle for any given file.
> > > > It's the same filehandle that filesystems hand to the nfsd so nfs
> > > > clients can uniquely identify the file they are asking the nfsd to
> > > > operate on.
> > > > 
> > > > The contents of these filehandles is entirely defined by the file
> > > > system and completely opaque to the user. The only thing that
> > > > parses the internal contents of the handle is the filesystem itself.
> > > > Therefore, as long as the fs encodes the information it needs into the
> > > > handle to determine what subvol/snapshot the inode belongs to when
> > > > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > > > nothing else needs to care how it is encoded.
> > > > 
> > > > So can someone please explain to me why we need to try to re-invent
> > > > a generic filehandle concept in statx when we already have a
> > > > have working and widely supported user API that provides exactly
> > > > this functionality?
> > > 
> > > Definitely should be part of the discussion :)
> > > 
> > > But I think it _does_ need to be in statx; because:
> > >  - we've determined that 64 bit ino_t just isn't a future proof
> > >interface, we're having real problems with it today
> > >  - statx is _the_ standard, future proofed interface for getting inode
> > >attributes
> > 
> > No, it most definitely isn't, and statx was never intended as a
> > dumping ground for anything and everything inode related. e.g. Any
> > inode attribute that can be modified needs to use a different
> > interface - one that has a corresponding "set" operation.
> 
> And here I thought the whole point of statx was to be an extensible way
> to request any sort of inode attribute.

Within reason. When the size of a single attribute is dynamically
sized, can double the size of the statx structure and there's an
existing syscall to retrieve that information from the filesystem,
it makes no sense *at all* to duplicate that information in statx().

statx() is for new attributes we don't have any other way of exposing to
userspace easily. it is designed for fixed size attributes and
flags, it is not designed for dynamically sized information to be
embedded in struct statx.

Filehandles are not new, and we already have APIs that expose them
to userspace that handle the dynamic size of filehandles just fine.
This functionality simply does not belong in statx() at all.

> > >  - therefore, if we want userspace programmers to be using filehandles,
> > >instead of inode numbers, so there code isn't broken, we need to be
> > >providing interfaces that guide them in that direction.
> > 
> > We already have a filehandle interface they can use for this
> > purpose. It is already used by some userspace applications for this
> > purpose.
> > 
> > Anything new API function do with statx() will require application
> > changes, and the vast majority of applications aren't using statx()
> > directly - they are using stat() which glibc wraps to statx()
> > internally. So they are going to need a change of API, anyway.
> > 
> > So, fundamentally, there is a change of API for most applications
> > that need to do thorough inode uniqueness checks regardless of
> > anything else. They can do this right now - just continue using
> > stat() as they do right now, and then use name_to_filehandle_at()
> > for uniqueness checks.
> > 
> > > Even assuming we can update all the documentation to say "filehandles
> > > are the correct way to test inode uniqueness", you know at least half of
> > > programmers will stick to stx_ino instead of the filehandle if the
> > > filehandle is an extra syscall.
> > 
> > Your argument is "programmers suck so we must design for the
> > lowest common denominator". That's an -awful- way to design APIs.
> 
> No, I'm saying if the old way doing things no longer works, we ought to
> make the new future proofed way as ergonomic and easy to use as the old
> way was - else it won't get used.
> 
> At the _very_ least we need to add a flag to statx for "inode number is
> unreliable for uniqueness checks".
> 
> bcachefs could leave this off until the first snapshot has been taken.
> 
> But even with that option, I think we ought to be telling anyone doing
> uniqueness checks to use the filehandle, because it includes i_generation.
> 
> > Further, this "programmers suck" design comes at a cost to every
> > statx() call that does not need filehandles. That's the vast
> > majority of statx() calls that are made on a

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > > the middle of the room?
> > > 
> > > I mean, we already have name_to_handle_at() for userspace to get a
> > > unique, opaque, filesystem defined file handle for any given file.
> > > It's the same filehandle that filesystems hand to the nfsd so nfs
> > > clients can uniquely identify the file they are asking the nfsd to
> > > operate on.
> > > 
> > > The contents of these filehandles is entirely defined by the file
> > > system and completely opaque to the user. The only thing that
> > > parses the internal contents of the handle is the filesystem itself.
> > > Therefore, as long as the fs encodes the information it needs into the
> > > handle to determine what subvol/snapshot the inode belongs to when
> > > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > > nothing else needs to care how it is encoded.
> > > 
> > > So can someone please explain to me why we need to try to re-invent
> > > a generic filehandle concept in statx when we already have a
> > > have working and widely supported user API that provides exactly
> > > this functionality?
> > 
> > Definitely should be part of the discussion :)
> > 
> > But I think it _does_ need to be in statx; because:
> >  - we've determined that 64 bit ino_t just isn't a future proof
> >interface, we're having real problems with it today
> >  - statx is _the_ standard, future proofed interface for getting inode
> >attributes
> 
> No, it most definitely isn't, and statx was never intended as a
> dumping ground for anything and everything inode related. e.g. Any
> inode attribute that can be modified needs to use a different
> interface - one that has a corresponding "set" operation.

stx_size, stx_mtime, stx_atime, stx_mode, stx_owner, stx_group,
stx_nlink, 

These can all be modified but don't have matched get and set operations.

stx_handle, of course, cannot be modified.

I think I must be completely misunderstanding you, because the way I
read your words, they don't make any sense at all.  Help!

NeilBrown

Re: file handle in statx

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > >
> > > > > So can someone please explain to me why we need to try to re-invent
> > > > > a generic filehandle concept in statx when we already have a have
> > > > > working and widely supported user API that provides exactly this
> > > > > functionality?
> > > >
> > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > able to retrieve the filehandle together with the other metadata in a
> > > > single system call.
> > > 
> > > Can you say more?  What, specifically is the application that would want
> > to do
> > > that, and is it really in such a hot path that it would be a user-visible
> > > improveable, let aloine something that can be actually be measured?
> > 
> > A user space NFS server like Ganesha could benefit from getting attributes
> > and file handle in a single system call.
> 
> At the cost of every other application that doesn't need those
> attributes.

Why do you think there would be a cost?

statx only returns the attributes that it was asked for.  If an
application doesn't ask for STATX_HANDLE, it doesn't get STATS_HANDLE
and (providing filesystems honour the flag) doesn't pay the price for
generating it (which is often trivial, but may not always be).

NeilBrown


>   That's not a good trade-off - the cost of a syscall is
> tiny compared to the rest of the work that has to be done to create
> a stable filehandle for an inode, even if we have a variant of
> name_to_handle_at() that takes an open fd rather than a path.
> 
> > Potentially it could also avoid some of the challenges of using
> > name_to_handle_at that is a privileged operation.
> 
> It isn't. open_by_handle_at() is privileged because it bypasses all
> the path based access checking that a normal open() does. Anyone can
> get a filehandle for a path from the kernel, but few can actually
> use it for anything other than file uniqueness checks
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
>

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 17:08, Kent Overstreet  
> wrote:
> 
> > In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
> > they _must_ do the new thing if they care about correctness; it provides
> > a way to tell userspace what guarantees we're able to provide.
> 
> That flag would not help with improving userspace software.

Are you sure?

Suppose I wanted to export an filesystem using some protocol (maybe one
called "NFSv4"), and suppose this protocol supported the communication
of an attribute called "fileid" which was optional but requires to be
fully unique if provided at all.

If I had access to STATX_ATTR_INUM_NOT_UNIQUE, then I could not export
the fileid when it didn't met the protocol requirements, but could when
it did.

This may not be a strong case for the inclusion of the flag it is, I
think, a clear indication that "would not help" is what our fact
checkers would call "over-reach".

NeilBrown

> 
> What would help, if said software started using a unique identifier.
> We already seem to have a unique ID in the form of file handles,
> though some exotic filesystems might allow more than one fh to refer
> to the same inode, so this still needs some looking into.
> 
> The big problem is that we can't do a lot about existing software, and
> must keep trying to keep st_ino unique for the foreseeable future.
> 
> Thanks,
> Miklos
>

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread NeilBrown

On Wed, 13 Dec 2023, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 16:35, Kent Overstreet  
> wrote:
> 
> > Other poeple have been finding ways to contribute to the technical
> > discussion; just calling things "ugly and broken" does not.
> 
> Kent, calm down please.  We call things "ugly and broken" all the
> time.  That's just an opinion, you are free to argue it, and no need
> to take it personally.

But maybe we shouldn't.  Maybe we should focus on saying what, exactly,
is unpleasant to look at and way.  Or what exactly causes poor
funcationality.
"ugly" and "broken" are not particularly useful words in a technical
discussion.  I understand people want to use them, but they really need
to be backed up with details.  It is details that matter.

NeilBrown

> 
> Thanks,
> Miklos
> 
>

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Wed, Dec 13, 2023 at 07:48:16AM +1100, Dave Chinner wrote:
> On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > > the middle of the room?
> > > 
> > > I mean, we already have name_to_handle_at() for userspace to get a
> > > unique, opaque, filesystem defined file handle for any given file.
> > > It's the same filehandle that filesystems hand to the nfsd so nfs
> > > clients can uniquely identify the file they are asking the nfsd to
> > > operate on.
> > > 
> > > The contents of these filehandles is entirely defined by the file
> > > system and completely opaque to the user. The only thing that
> > > parses the internal contents of the handle is the filesystem itself.
> > > Therefore, as long as the fs encodes the information it needs into the
> > > handle to determine what subvol/snapshot the inode belongs to when
> > > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > > nothing else needs to care how it is encoded.
> > > 
> > > So can someone please explain to me why we need to try to re-invent
> > > a generic filehandle concept in statx when we already have a
> > > have working and widely supported user API that provides exactly
> > > this functionality?
> > 
> > Definitely should be part of the discussion :)
> > 
> > But I think it _does_ need to be in statx; because:
> >  - we've determined that 64 bit ino_t just isn't a future proof
> >interface, we're having real problems with it today
> >  - statx is _the_ standard, future proofed interface for getting inode
> >attributes
> 
> No, it most definitely isn't, and statx was never intended as a
> dumping ground for anything and everything inode related. e.g. Any
> inode attribute that can be modified needs to use a different
> interface - one that has a corresponding "set" operation.

And here I thought the whole point of statx was to be an extensible way
to request any sort of inode attribute.

> >  - therefore, if we want userspace programmers to be using filehandles,
> >instead of inode numbers, so there code isn't broken, we need to be
> >providing interfaces that guide them in that direction.
> 
> We already have a filehandle interface they can use for this
> purpose. It is already used by some userspace applications for this
> purpose.
> 
> Anything new API function do with statx() will require application
> changes, and the vast majority of applications aren't using statx()
> directly - they are using stat() which glibc wraps to statx()
> internally. So they are going to need a change of API, anyway.
> 
> So, fundamentally, there is a change of API for most applications
> that need to do thorough inode uniqueness checks regardless of
> anything else. They can do this right now - just continue using
> stat() as they do right now, and then use name_to_filehandle_at()
> for uniqueness checks.
> 
> > Even assuming we can update all the documentation to say "filehandles
> > are the correct way to test inode uniqueness", you know at least half of
> > programmers will stick to stx_ino instead of the filehandle if the
> > filehandle is an extra syscall.
> 
> Your argument is "programmers suck so we must design for the
> lowest common denominator". That's an -awful- way to design APIs.

No, I'm saying if the old way doing things no longer works, we ought to
make the new future proofed way as ergonomic and easy to use as the old
way was - else it won't get used.

At the _very_ least we need to add a flag to statx for "inode number is
unreliable for uniqueness checks".

bcachefs could leave this off until the first snapshot has been taken.

But even with that option, I think we ought to be telling anyone doing
uniqueness checks to use the filehandle, because it includes i_generation.

> Further, this "programmers suck" design comes at a cost to every
> statx() call that does not need filehandles. That's the vast
> majority of statx() calls that are made on a system. Why should we
> slow down statx() for all users when so few applications actually
> need uniqueness and they can take the cost of robust uniqueness
> tests with an extra syscall entirely themselves?

For any local filesystem the filehandle is going to be the inode
generation number, the subvolume ID (if applicable), and the inode
number. That's 16 bytes on bcachefs, and if we make an attempt to
standardize how this works across filesystems we should be able to do it
without adding a new indirect function call to the statx() path. That
sounds pretty negligable to me.

The syscall overhead isn't going to be negligable when an application
has to do lots of scanning to find the files its interested - rsync.

Re: file handle in statx

2023-12-12 Thread Dave Chinner

On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > On 12/12/23 06:53, Dave Chinner wrote:
> > >
> > > > So can someone please explain to me why we need to try to re-invent
> > > > a generic filehandle concept in statx when we already have a have
> > > > working and widely supported user API that provides exactly this
> > > > functionality?
> > >
> > > name_to_handle_at() is fine, but userspace could profit from being
> > > able to retrieve the filehandle together with the other metadata in a
> > > single system call.
> > 
> > Can you say more?  What, specifically is the application that would want
> to do
> > that, and is it really in such a hot path that it would be a user-visible
> > improveable, let aloine something that can be actually be measured?
> 
> A user space NFS server like Ganesha could benefit from getting attributes
> and file handle in a single system call.

At the cost of every other application that doesn't need those
attributes. That's not a good trade-off - the cost of a syscall is
tiny compared to the rest of the work that has to be done to create
a stable filehandle for an inode, even if we have a variant of
name_to_handle_at() that takes an open fd rather than a path.

> Potentially it could also avoid some of the challenges of using
> name_to_handle_at that is a privileged operation.

It isn't. open_by_handle_at() is privileged because it bypasses all
the path based access checking that a normal open() does. Anyone can
get a filehandle for a path from the kernel, but few can actually
use it for anything other than file uniqueness checks

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote:
> Kent Overstreet  wrote:
> 
> > I was chatting a bit with David Howells on IRC about this, and floated
> > adding the file handle to statx. It looks like there's enough space
> > reserved to make this feasible - probably going with a fixed maximum
> > size of 128-256 bits.
> 
> We can always save the last bit to indicate extension space/extension record,
> so we're not that strapped for space.

So we'll need that if we want to round trip NFSv4 filehandles, they
won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96
bytes reserved).

Obvious question (Neal): do/will real world implementations ever come
close to making use of this, or was this a "future proofing gone wild"
thing?

Say we do decide we want to spec it that large: _can_ we extend struct
statx? I'm wondering if the userspace side was thought through, I'm
sure glibc people will have something to say.

Kernel side we can definitely extend struct statx, and we know how many
bytes to copy to userspace because we know what fields userspace
requested. The part I'm concerned about is that if we extend userspace's
struct statx, that introduces obvious ABI compabitibility issues.

So this would probably force glibc to introduce a new version of struct
statx, if I'm not mistaken.

Or: another option would be to reserve something small and sane in
struct statx (32 bytes max, I'd say), and then set a flag to tell
userspace they need to use name_to_handle_at() if it didn't fit.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Dave Chinner

On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > the middle of the room?
> > 
> > I mean, we already have name_to_handle_at() for userspace to get a
> > unique, opaque, filesystem defined file handle for any given file.
> > It's the same filehandle that filesystems hand to the nfsd so nfs
> > clients can uniquely identify the file they are asking the nfsd to
> > operate on.
> > 
> > The contents of these filehandles is entirely defined by the file
> > system and completely opaque to the user. The only thing that
> > parses the internal contents of the handle is the filesystem itself.
> > Therefore, as long as the fs encodes the information it needs into the
> > handle to determine what subvol/snapshot the inode belongs to when
> > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > nothing else needs to care how it is encoded.
> > 
> > So can someone please explain to me why we need to try to re-invent
> > a generic filehandle concept in statx when we already have a
> > have working and widely supported user API that provides exactly
> > this functionality?
> 
> Definitely should be part of the discussion :)
> 
> But I think it _does_ need to be in statx; because:
>  - we've determined that 64 bit ino_t just isn't a future proof
>interface, we're having real problems with it today
>  - statx is _the_ standard, future proofed interface for getting inode
>attributes

No, it most definitely isn't, and statx was never intended as a
dumping ground for anything and everything inode related. e.g. Any
inode attribute that can be modified needs to use a different
interface - one that has a corresponding "set" operation.

>  - therefore, if we want userspace programmers to be using filehandles,
>instead of inode numbers, so there code isn't broken, we need to be
>providing interfaces that guide them in that direction.

We already have a filehandle interface they can use for this
purpose. It is already used by some userspace applications for this
purpose.

Anything new API function do with statx() will require application
changes, and the vast majority of applications aren't using statx()
directly - they are using stat() which glibc wraps to statx()
internally. So they are going to need a change of API, anyway.

So, fundamentally, there is a change of API for most applications
that need to do thorough inode uniqueness checks regardless of
anything else. They can do this right now - just continue using
stat() as they do right now, and then use name_to_filehandle_at()
for uniqueness checks.

> Even assuming we can update all the documentation to say "filehandles
> are the correct way to test inode uniqueness", you know at least half of
> programmers will stick to stx_ino instead of the filehandle if the
> filehandle is an extra syscall.

Your argument is "programmers suck so we must design for the
lowest common denominator". That's an -awful- way to design APIs.

Further, this "programmers suck" design comes at a cost to every
statx() call that does not need filehandles. That's the vast
majority of statx() calls that are made on a system. Why should we
slow down statx() for all users when so few applications actually
need uniqueness and they can take the cost of robust uniqueness
tests with an extra syscall entirely themselves?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

RE: file handle in statx

2023-12-12 Thread Frank Filz

> On Tue, Dec 12, 2023 at 7:44 PM Kent Overstreet 
> wrote:
> >
> > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > > >
> > > > > > So can someone please explain to me why we need to try to
> > > > > > re-invent a generic filehandle concept in statx when we
> > > > > > already have a have working and widely supported user API that
> > > > > > provides exactly this functionality?
> > > > >
> > > > > name_to_handle_at() is fine, but userspace could profit from
> > > > > being able to retrieve the filehandle together with the other
> > > > > metadata in a single system call.
> > > >
> > > > Can you say more?  What, specifically is the application that
> > > > would want
> > > to do
> > > > that, and is it really in such a hot path that it would be a
> > > > user-visible improveable, let aloine something that can be actually be
> measured?
> > >
> > > A user space NFS server like Ganesha could benefit from getting
> > > attributes and file handle in a single system call.
> > >
> > > Potentially it could also avoid some of the challenges of using
> > > name_to_handle_at that is a privileged operation.
> >
> > which begs the question - why is name_to_handle_at() privileged?
> >
> 
> AFAICT, it is not privileged.
> Only open_by_handle_at() is privileged.

Ah, that makes sense. I'm a consumer of these interfaces, not so much in the 
kernel these days.

Ganesha depends on open_by_handle_at as well, so requires privilege to do 
handle stuff with kernel file systems.

In any case, Ganesha could easily benefit from a savings of system calls.

What would also be handy is a read directory with attributes that gave the 
statx results for each entry.

Frank

Re: file handle in statx

2023-12-12 Thread Amir Goldstein

On Tue, Dec 12, 2023 at 7:44 PM Kent Overstreet
 wrote:
>
> On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > > On 12/12/23 06:53, Dave Chinner wrote:
> > > >
> > > > > So can someone please explain to me why we need to try to re-invent
> > > > > a generic filehandle concept in statx when we already have a have
> > > > > working and widely supported user API that provides exactly this
> > > > > functionality?
> > > >
> > > > name_to_handle_at() is fine, but userspace could profit from being
> > > > able to retrieve the filehandle together with the other metadata in a
> > > > single system call.
> > >
> > > Can you say more?  What, specifically is the application that would want
> > to do
> > > that, and is it really in such a hot path that it would be a user-visible
> > > improveable, let aloine something that can be actually be measured?
> >
> > A user space NFS server like Ganesha could benefit from getting attributes
> > and file handle in a single system call.
> >
> > Potentially it could also avoid some of the challenges of using
> > name_to_handle_at that is a privileged operation.
>
> which begs the question - why is name_to_handle_at() privileged?
>

AFAICT, it is not privileged.
Only open_by_handle_at() is privileged.

Thanks,
Amir.

Re: file handle in statx

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote:
> > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > > On 12/12/23 06:53, Dave Chinner wrote:
> > >
> > > > So can someone please explain to me why we need to try to re-invent
> > > > a generic filehandle concept in statx when we already have a have
> > > > working and widely supported user API that provides exactly this
> > > > functionality?
> > >
> > > name_to_handle_at() is fine, but userspace could profit from being
> > > able to retrieve the filehandle together with the other metadata in a
> > > single system call.
> > 
> > Can you say more?  What, specifically is the application that would want
> to do
> > that, and is it really in such a hot path that it would be a user-visible
> > improveable, let aloine something that can be actually be measured?
> 
> A user space NFS server like Ganesha could benefit from getting attributes
> and file handle in a single system call.
> 
> Potentially it could also avoid some of the challenges of using
> name_to_handle_at that is a privileged operation.

which begs the question - why is name_to_handle_at() privileged?

RE: file handle in statx

2023-12-12 Thread Frank Filz

> On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> > On 12/12/23 06:53, Dave Chinner wrote:
> >
> > > So can someone please explain to me why we need to try to re-invent
> > > a generic filehandle concept in statx when we already have a have
> > > working and widely supported user API that provides exactly this
> > > functionality?
> >
> > name_to_handle_at() is fine, but userspace could profit from being
> > able to retrieve the filehandle together with the other metadata in a
> > single system call.
> 
> Can you say more?  What, specifically is the application that would want
to do
> that, and is it really in such a hot path that it would be a user-visible
> improveable, let aloine something that can be actually be measured?

A user space NFS server like Ganesha could benefit from getting attributes
and file handle in a single system call.

Potentially it could also avoid some of the challenges of using
name_to_handle_at that is a privileged operation.

Frank

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 05:30:23PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 17:08, Kent Overstreet  
> wrote:
> 
> > In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
> > they _must_ do the new thing if they care about correctness; it provides
> > a way to tell userspace what guarantees we're able to provide.
> 
> That flag would not help with improving userspace software.

I disagree. Just having it there and in the documentation will be a good
solid nudge to userspace programmers that this is something they need to
consider.

And: if we size this thing for NFSv4 filehandles, those are big: there's
an overhead associated with using those, since userspace generally has
to track in memory all inode numbers/file handles that it's seen.
STATX_ATTR_INUM_NOT_UNIQUE would allow userspace to avoid that overhead
when it is safe to do so.

(But remember that file handles include inode generation numbers, and
inode numbers do not; that is something we should remember to document,
and explain why it is important).

> What would help, if said software started using a unique identifier.
> We already seem to have a unique ID in the form of file handles,
> though some exotic filesystems might allow more than one fh to refer
> to the same inode, so this still needs some looking into.
> 
> The big problem is that we can't do a lot about existing software, and
> must keep trying to keep st_ino unique for the foreseeable future.

Whatever hacks we try to apply to st_ino are outside the scope of this
discussion. Right now, we need to be figuring out what our future
proofed interface is going to be, so that we don't end up kicking this
can down the road when st_ino will be _really_ space constrained.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Miklos Szeredi

On Tue, 12 Dec 2023 at 17:08, Kent Overstreet  wrote:

> In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
> they _must_ do the new thing if they care about correctness; it provides
> a way to tell userspace what guarantees we're able to provide.

That flag would not help with improving userspace software.

What would help, if said software started using a unique identifier.
We already seem to have a unique ID in the form of file handles,
though some exotic filesystems might allow more than one fh to refer
to the same inode, so this still needs some looking into.

The big problem is that we can't do a lot about existing software, and
must keep trying to keep st_ino unique for the foreseeable future.

Thanks,
Miklos

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 04:57:41PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 16:43, Kent Overstreet  
> wrote:
> >
> > On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote:
> > > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet  
> > > wrote:
> > >
> > > > Other poeple have been finding ways to contribute to the technical
> > > > discussion; just calling things "ugly and broken" does not.
> > >
> > > Kent, calm down please.  We call things "ugly and broken" all the
> > > time.  That's just an opinion, you are free to argue it, and no need
> > > to take it personally.
> >
> > It's an entirely subjective judgement that has no place in a discussion
> > where we're trying to decide things on technical merits.
> 
> Software is like architecture.  It must not collapse, but to function
> well it also needs to look good.  That's especially relevant to APIs,
> and yes, it's a matter of taste and can be very subjective.

Good taste is a highly important trait - in the course of normal work we
trust our subjective opinions constantly, because those opinions have
been trained by years and years of judgements and we can't reason
through everything.

But when you show up to a discussion that's been going on for a page,
where everything's been constructively gathering input, and you start
namecalling - and crucially, _without giving any technical justification
for your opinions_ - that's just you being a dick.

> On this particular point (STATX_ATTR_INUM_NOT_UNIQUE) I can completely
> understand Christian's reaction.  But if you think this would be a
> useful feature, please state your technical argument.

Well, you could have scanned through the prior thread, or read the
extensive LWN coverage, but in short - inode number uniqueness is
something that's already impossible to guarantee, today.

In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when
they _must_ do the new thing if they care about correctness; it provides
a way to tell userspace what guarantees we're able to provide.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Miklos Szeredi

On Tue, 12 Dec 2023 at 16:43, Kent Overstreet  wrote:
>
> On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote:
> > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet  
> > wrote:
> >
> > > Other poeple have been finding ways to contribute to the technical
> > > discussion; just calling things "ugly and broken" does not.
> >
> > Kent, calm down please.  We call things "ugly and broken" all the
> > time.  That's just an opinion, you are free to argue it, and no need
> > to take it personally.
>
> It's an entirely subjective judgement that has no place in a discussion
> where we're trying to decide things on technical merits.

Software is like architecture.  It must not collapse, but to function
well it also needs to look good.  That's especially relevant to APIs,
and yes, it's a matter of taste and can be very subjective.

On this particular point (STATX_ATTR_INUM_NOT_UNIQUE) I can completely
understand Christian's reaction.  But if you think this would be a
useful feature, please state your technical argument.

Thanks,
Miklos

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 16:35, Kent Overstreet  
> wrote:
> 
> > Other poeple have been finding ways to contribute to the technical
> > discussion; just calling things "ugly and broken" does not.
> 
> Kent, calm down please.  We call things "ugly and broken" all the
> time.  That's just an opinion, you are free to argue it, and no need
> to take it personally.

It's an entirely subjective judgement that has no place in a discussion
where we're trying to decide things on technical merits.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Miklos Szeredi

On Tue, 12 Dec 2023 at 16:35, Kent Overstreet  wrote:

> Other poeple have been finding ways to contribute to the technical
> discussion; just calling things "ugly and broken" does not.

Kent, calm down please.  We call things "ugly and broken" all the
time.  That's just an opinion, you are free to argue it, and no need
to take it personally.

Thanks,
Miklos

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 04:29:09PM +0100, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 10:16:31AM -0500, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote:
> > > On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > > >   same inode number
> > > 
> > > This is just ugly with questionable value. A constant reminder of how
> > > broken this is. Exposing the subvolume id also makes this somewhat 
> > > redundant.
> > 
> > Oh hell no. We finally get a reasonably productive discussion and I wake
> > up to you calling things ugly and broken, with no reason or
> > justification or any response to the justification that's already been
> > given?
> > 
> > Christain, that's not how we do things. Let's not turn what was a
> > productive discussion into a trainwreck, OK?
> 
> That's a major aggressive tone for no good reason. And probably not a
> good way to convince anyone. I'm also not sure why you're taking a
> dislike of this flag as a personal affront.

No, if you're going to show up with comments like that it is entirely
called for.

Other poeple have been finding ways to contribute to the technical
discussion; just calling things "ugly and broken" does not.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Christian Brauner

On Tue, Dec 12, 2023 at 10:16:31AM -0500, Kent Overstreet wrote:
> On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote:
> > On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > >   same inode number
> > 
> > This is just ugly with questionable value. A constant reminder of how
> > broken this is. Exposing the subvolume id also makes this somewhat 
> > redundant.
> 
> Oh hell no. We finally get a reasonably productive discussion and I wake
> up to you calling things ugly and broken, with no reason or
> justification or any response to the justification that's already been
> given?
> 
> Christain, that's not how we do things. Let's not turn what was a
> productive discussion into a trainwreck, OK?

That's a major aggressive tone for no good reason. And probably not a
good way to convince anyone. I'm also not sure why you're taking a
dislike of this flag as a personal affront.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 10:35:40AM +0100, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 10:28:12AM +0100, Miklos Szeredi wrote:
> > On Tue, 12 Dec 2023 at 10:23, Christian Brauner  wrote:
> > >
> > > On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote:
> > > > Christian Brauner  wrote:
> > > >
> > > > > > > > I suggest:
> > > > > > > >
> > > > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files 
> > > > > > > > have the
> > > > > > > >   same inode number
> > > > >
> > > > > This is just ugly with questionable value. A constant reminder of how
> > > > > broken this is. Exposing the subvolume id also makes this somewhat 
> > > > > redundant.
> > > >
> > > > There is a upcoming potential problem where even the 64-bit field I 
> > > > placed in
> > > > statx() may be insufficient.  The Auristor AFS server, for example, has 
> > > > a
> > > > 96-bit vnode ID, but I can't properly represent this in stx_ino.  
> > > > Currently, I
> > >
> > > Is that vnode ID akin to a volume? Because if so you could just
> > > piggy-back on a subvolume id field in statx() and expose it there.
> > 
> > And how would exporting a subvolume ID and expecting userspace to take
> > that into account when checking for hard links be meaningfully
> > different than expecting userspace to retrieve the file handle and
> > compare that?
> > 
> > The latter would at least be a generic solution, including stacking fs
> > inodes, instead of tackling just a specific corner of the problem
> > space.
> 
> So taking a step back here, please. The original motivation for this
> discussion was restricted to handle btrfs - and now bcachefs as well.
> Both have a concept of a subvolume so it made sense to go that route.
> IOW, it wasn't originally a generic problem or pitched as such.
> 
> Would overlayfs be able to utilize an extended inode field as well?

No, the original motivation was not just btrfs and bcachefs; overlayfs
fundamentally needs to export a bigger identifier than the host
filesystems - pigeonhole principle, if anyone remembers their
combinatorics.

This applies to any filesystem which is under the hood reexporting from
other filesystems; there's a lot of stuff going on in container
filesystems right now, and I expect it'll come up there (you can create
new identifiers if you're exporting file by file, but not if it's
directory trees).

And Neal brought up NFS re-exports; I think there's a compelling
argument to be made that we ought to be able to round trip NFS
filehandles.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Christian Brauner

On Tue, Dec 12, 2023 at 03:06:07PM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 14:48, Christian Brauner  wrote:
> 
> > Exposing the subvolume id in statx() is still fine imho. It's a concept
> > shared between btrfs and bcachefs and it's pretty useful for interested
> > userspace that wants to make use of these apis.
> 
> Exposing subvolume ID should be okay, as long as it's not advertised
> as a way to uniquely identify an inode.   Its use should be limited to
> finding subvolume boundaries.
> 
> > > It might help to have the fh in statx, since that's easier on the
> > > userspace programmer than having to deal with two interfaces (i_ino
> > > won't go away for some time, because of backward compatibility).
> > > OTOH I also don't like the way it would need to be shoehorned into
> > > statx.
> >
> > No, it really doesn't belong into statx().
> >
> > And besides, the file handle apis name_to_handle_at() are already
> > in wider use than a lot of people think. Not just for the exportfs case
> > but also for example, cgroupfs uses file handles to provide unique
> > identifiers for cgroups that can be compared.
> 
> The issue with name_to_handle_at() is its use of the old, non-unique
> mount ID.  Yes, yes, we can get away with
> 
>  fd = openat(dfd, path, O_PATH);
>  name_to_handle_at(fd, "", ..., AT_EMPTY_PATH);
>  statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, ...);
>  close(fd);
> 
> But that's *four* syscalls instead of one...

Yeah, but putting this into statx() isn't really nice imho. Once we do
actually land the unique mount id thing it wouldn't be the worst thing
to add name_to_handle_at2().

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote:
> Doesn't anyone else see or hear the elephant trumpeting loudly in
> the middle of the room?
> 
> I mean, we already have name_to_handle_at() for userspace to get a
> unique, opaque, filesystem defined file handle for any given file.
> It's the same filehandle that filesystems hand to the nfsd so nfs
> clients can uniquely identify the file they are asking the nfsd to
> operate on.
> 
> The contents of these filehandles is entirely defined by the file
> system and completely opaque to the user. The only thing that
> parses the internal contents of the handle is the filesystem itself.
> Therefore, as long as the fs encodes the information it needs into the
> handle to determine what subvol/snapshot the inode belongs to when
> the handle is passed back to it (e.g. from open_by_handle_at()) then
> nothing else needs to care how it is encoded.
> 
> So can someone please explain to me why we need to try to re-invent
> a generic filehandle concept in statx when we already have a
> have working and widely supported user API that provides exactly
> this functionality?

Definitely should be part of the discussion :)

But I think it _does_ need to be in statx; because:
 - we've determined that 64 bit ino_t just isn't a future proof
   interface, we're having real problems with it today
 - statx is _the_ standard, future proofed interface for getting inode
   attributes
 - therefore, if we want userspace programmers to be using filehandles,
   instead of inode numbers, so there code isn't broken, we need to be
   providing interfaces that guide them in that direction.

Even assuming we can update all the documentation to say "filehandles
are the correct way to test inode uniqueness", you know at least half of
programmers will stick to stx_ino instead of the filehandle if the
filehandle is an extra syscall.

Re: file handle in statx

2023-12-12 Thread Theodore Ts'o

On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote:
> On 12/12/23 06:53, Dave Chinner wrote:
> 
> > So can someone please explain to me why we need to try to re-invent
> > a generic filehandle concept in statx when we already have a
> > have working and widely supported user API that provides exactly
> > this functionality?
> 
> name_to_handle_at() is fine, but userspace could profit from being
> able to retrieve the filehandle together with the other metadata in
> a single system call.

Can you say more?  What, specifically is the application that would
want to do that, and is it really in such a hot path that it would be
a user-visible improveable, let aloine something that can be actually
be measured?

- Ted

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Kent Overstreet

On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > >   same inode number
> 
> This is just ugly with questionable value. A constant reminder of how
> broken this is. Exposing the subvolume id also makes this somewhat redundant.

Oh hell no. We finally get a reasonably productive discussion and I wake
up to you calling things ugly and broken, with no reason or
justification or any response to the justification that's already been
given?

Christain, that's not how we do things. Let's not turn what was a
productive discussion into a trainwreck, OK?

Re: [PATCH v2 0/3] Add support bcachefs filesystem

2023-12-12 Thread Jan Kara

On Thu 07-12-23 20:40:08, Petr Vorel wrote:
> Hi all,
> 
> @Amir @Jan, could you please have look on fanotify failures on Bcachefs?
> fanotify13.c, fanotify15.c and fanotify16.c produce many errors.
> 
> To test, just apply this patchset and then run with LTP_SINGLE_FS_TYPE:
> 
> # LTP_SINGLE_FS_TYPE=bcachefs ./fanotify15

I suspect some catch with fsids but we'll see... I hopefully get to this
tomorrow.

Honza

> changes v1->v2:
> Two new commits:
> * lib: Add Bcachefs magic
> * statx04: Skip STATX_ATTR_COMPRESSED on Bcachefs
> 
> Petr Vorel (3):
>   lib: Add Bcachefs magic
>   lib: Add support bcachefs filesystem to .all_filesystems
>   statx04: Skip STATX_ATTR_COMPRESSED on Bcachefs
> 
>  include/tst_fs.h  | 3 +++
>  lib/tst_fs_type.c | 2 ++
>  lib/tst_supported_fs_types.c  | 2 ++
>  testcases/kernel/syscalls/statx/statx04.c | 5 +++--
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> -- 
> 2.43.0
> 
-- 
Jan Kara 
SUSE Labs, CR

Re: [PATCH RFC v2 for-6.8/block 15/18] buffer: add a new helper to read sb block

2023-12-12 Thread Jan Kara

On Tue 12-12-23 05:25:25, Christoph Hellwig wrote:
> On Mon, Dec 11, 2023 at 10:07:53PM +0800, Yu Kuai wrote:
> > +static __always_inline int buffer_uptodate_or_error(struct buffer_head *bh)
> > +{
> > +   /*
> > +* If the buffer has the write error flag, data was failed to write
> > +* out in the block. In this case, set buffer uptodate to prevent
> > +* reading old data.
> > +*/
> > +   if (buffer_write_io_error(bh))
> > +   set_buffer_uptodate(bh);
> > +   return buffer_uptodate(bh);
> > +}
> 
> So - risking this blows up into a lot of nasty work: Why do we even
> clear the uptodate flag on write errors?  Doing so makes not sense to
> me as the data isn't any less uptodate just because we failed to write
> it..

Historic reasons I'd say (buffer_write_io_error isn't *that* old - from
2003 it seems). And yes, it would make a lot of sense to keep uptodate flag
set and just rely on buffer_write_io_error() but it also means going
through all buffer_uptodate() checks in filesystems and determining which
need changing to buffer_write_io_error() which is something nobody is keen
on doing ;)

Honza

-- 
Jan Kara 
SUSE Labs, CR

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Miklos Szeredi

On Tue, 12 Dec 2023 at 14:48, Christian Brauner  wrote:

> Exposing the subvolume id in statx() is still fine imho. It's a concept
> shared between btrfs and bcachefs and it's pretty useful for interested
> userspace that wants to make use of these apis.

Exposing subvolume ID should be okay, as long as it's not advertised
as a way to uniquely identify an inode.   Its use should be limited to
finding subvolume boundaries.

> > It might help to have the fh in statx, since that's easier on the
> > userspace programmer than having to deal with two interfaces (i_ino
> > won't go away for some time, because of backward compatibility).
> > OTOH I also don't like the way it would need to be shoehorned into
> > statx.
>
> No, it really doesn't belong into statx().
>
> And besides, the file handle apis name_to_handle_at() are already
> in wider use than a lot of people think. Not just for the exportfs case
> but also for example, cgroupfs uses file handles to provide unique
> identifiers for cgroups that can be compared.

The issue with name_to_handle_at() is its use of the old, non-unique
mount ID.  Yes, yes, we can get away with

 fd = openat(dfd, path, O_PATH);
 name_to_handle_at(fd, "", ..., AT_EMPTY_PATH);
 statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, ...);
 close(fd);

But that's *four* syscalls instead of one...

Thanks,
Miklos

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Christian Brauner

On Tue, Dec 12, 2023 at 10:42:58AM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 10:35, Christian Brauner  wrote:
> 
> > So taking a step back here, please. The original motivation for this
> > discussion was restricted to handle btrfs - and now bcachefs as well.
> > Both have a concept of a subvolume so it made sense to go that route.
> > IOW, it wasn't originally a generic problem or pitched as such.
> >
> > Would overlayfs be able to utilize an extended inode field as well?
> 
> Well, yes, but I don't think that's the right solution.

Exposing the subvolume id in statx() is still fine imho. It's a concept
shared between btrfs and bcachefs and it's pretty useful for interested
userspace that wants to make use of these apis.

> 
> I think the right solution is to move to using file handles instead of
> st_ino, the problem with that is that there's no way kernel can force
> upgrading userspace.

That's not our job tbh. I get why this is desirable. What we can do is
advertise better and newer apis but we don't have to make unpleasant
compromises in our interfaces for that.

> 
> It might help to have the fh in statx, since that's easier on the
> userspace programmer than having to deal with two interfaces (i_ino
> won't go away for some time, because of backward compatibility).
> OTOH I also don't like the way it would need to be shoehorned into
> statx.

No, it really doesn't belong into statx().

And besides, the file handle apis name_to_handle_at() are already
in wider use than a lot of people think. Not just for the exportfs case
but also for example, cgroupfs uses file handles to provide unique
identifiers for cgroups that can be compared.

Re: [PATCH RFC v2 for-6.8/block 15/18] buffer: add a new helper to read sb block

2023-12-12 Thread Christoph Hellwig

On Mon, Dec 11, 2023 at 10:07:53PM +0800, Yu Kuai wrote:
> +static __always_inline int buffer_uptodate_or_error(struct buffer_head *bh)
> +{
> + /*
> +  * If the buffer has the write error flag, data was failed to write
> +  * out in the block. In this case, set buffer uptodate to prevent
> +  * reading old data.
> +  */
> + if (buffer_write_io_error(bh))
> + set_buffer_uptodate(bh);
> + return buffer_uptodate(bh);
> +}

So - risking this blows up into a lot of nasty work: Why do we even
clear the uptodate flag on write errors?  Doing so makes not sense to
me as the data isn't any less uptodate just because we failed to write
it..

Re: [PATCH RFC v2 for-6.8/block 17/18] ext4: remove block_device_ejected()

2023-12-12 Thread Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig

Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis

2023-12-12 Thread Christoph Hellwig

> +void invalidate_bdev_range(struct block_device *bdev, pgoff_t start,
> +pgoff_t end)
> +{
> + invalidate_mapping_pages(bdev->bd_inode->i_mapping, start, end);
> +}
> +EXPORT_SYMBOL_GPL(invalidate_bdev_range);

Can we have kerneldoc comments for the new helpers please?

> +struct folio *__bdev_get_folio(struct block_device *bdev, loff_t pos,
> +fgf_t fgp_flags, gfp_t gfp)
> +{
> + return __filemap_get_folio(bdev->bd_inode->i_mapping, pos >> PAGE_SHIFT,
> +fgp_flags, gfp);
> +}
> +EXPORT_SYMBOL_GPL(__bdev_get_folio);

It's a bit silly to have a __-prefixed API without a version that
doesn't have the prefix, so I'd prefer to drop it.  Unless willy has
a good argument for keeping it the same as the filemap API.

Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis

2023-12-12 Thread Christoph Hellwig

On Mon, Dec 11, 2023 at 05:52:17PM +0100, Jan Kara wrote:
> > +void bdev_associated_mapping(struct block_device *bdev,
> > +struct address_space *mapping)
> > +{
> > +   mapping->host = bdev->bd_inode;
> > +}
> 
> Here I'm not sure - is the helper really a win? It seems a bit obscure to
> me. This initialization of another mapping for a bdev looks really special.

If we want to hide bd_inode we'll something like this helper even if
I don't particularly like it either.

But it might be a good idea to move out of this series and into the
follow on removing bd_inode, as it's rather pointless without that
context.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread David Howells

Christian Brauner  wrote:

> > There is a upcoming potential problem where even the 64-bit field I placed
> > in statx() may be insufficient.  The Auristor AFS server, for example, has
> > a 96-bit vnode ID, but I can't properly represent this in stx_ino.
> > Currently, I
> 
> Is that vnode ID akin to a volume? Because if so you could just
> piggy-back on a subvolume id field in statx() and expose it there.

No.  The volume ID is the ID of the volume.  The vnode is the equivalent of an
inode.

> > just truncate the value to fit and hope that the discarded part will be all
> > zero, but that's not really a good thing to do - especially when stx_ino is
> > used programmatically to check for hardlinks.
> > 
> > Would it be better to add an 'stx_ino_2' field and corresponding flag?
> 
> Would this be meaningfully different from using a file handle?

There's also the matter of presenting the "inode number" to the user - "ls -i"
for example.

David

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Miklos Szeredi

On Tue, 12 Dec 2023 at 10:35, Christian Brauner  wrote:

> So taking a step back here, please. The original motivation for this
> discussion was restricted to handle btrfs - and now bcachefs as well.
> Both have a concept of a subvolume so it made sense to go that route.
> IOW, it wasn't originally a generic problem or pitched as such.
>
> Would overlayfs be able to utilize an extended inode field as well?

Well, yes, but I don't think that's the right solution.

I think the right solution is to move to using file handles instead of
st_ino, the problem with that is that there's no way kernel can force
upgrading userspace.

It might help to have the fh in statx, since that's easier on the
userspace programmer than having to deal with two interfaces (i_ino
won't go away for some time, because of backward compatibility).
OTOH I also don't like the way it would need to be shoehorned into
statx.

Thanks,
Miklos

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Christian Brauner

On Tue, Dec 12, 2023 at 10:28:12AM +0100, Miklos Szeredi wrote:
> On Tue, 12 Dec 2023 at 10:23, Christian Brauner  wrote:
> >
> > On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote:
> > > Christian Brauner  wrote:
> > >
> > > > > > > I suggest:
> > > > > > >
> > > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have 
> > > > > > > the
> > > > > > >   same inode number
> > > >
> > > > This is just ugly with questionable value. A constant reminder of how
> > > > broken this is. Exposing the subvolume id also makes this somewhat 
> > > > redundant.
> > >
> > > There is a upcoming potential problem where even the 64-bit field I 
> > > placed in
> > > statx() may be insufficient.  The Auristor AFS server, for example, has a
> > > 96-bit vnode ID, but I can't properly represent this in stx_ino.  
> > > Currently, I
> >
> > Is that vnode ID akin to a volume? Because if so you could just
> > piggy-back on a subvolume id field in statx() and expose it there.
> 
> And how would exporting a subvolume ID and expecting userspace to take
> that into account when checking for hard links be meaningfully
> different than expecting userspace to retrieve the file handle and
> compare that?
> 
> The latter would at least be a generic solution, including stacking fs
> inodes, instead of tackling just a specific corner of the problem
> space.

So taking a step back here, please. The original motivation for this
discussion was restricted to handle btrfs - and now bcachefs as well.
Both have a concept of a subvolume so it made sense to go that route.
IOW, it wasn't originally a generic problem or pitched as such.

Would overlayfs be able to utilize an extended inode field as well?

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Miklos Szeredi

On Tue, 12 Dec 2023 at 10:23, Christian Brauner  wrote:
>
> On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote:
> > Christian Brauner  wrote:
> >
> > > > > > I suggest:
> > > > > >
> > > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > > >   same inode number
> > >
> > > This is just ugly with questionable value. A constant reminder of how
> > > broken this is. Exposing the subvolume id also makes this somewhat 
> > > redundant.
> >
> > There is a upcoming potential problem where even the 64-bit field I placed 
> > in
> > statx() may be insufficient.  The Auristor AFS server, for example, has a
> > 96-bit vnode ID, but I can't properly represent this in stx_ino.  
> > Currently, I
>
> Is that vnode ID akin to a volume? Because if so you could just
> piggy-back on a subvolume id field in statx() and expose it there.

And how would exporting a subvolume ID and expecting userspace to take
that into account when checking for hard links be meaningfully
different than expecting userspace to retrieve the file handle and
compare that?

The latter would at least be a generic solution, including stacking fs
inodes, instead of tackling just a specific corner of the problem
space.

Thanks,
Miklos

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Christian Brauner

On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote:
> Christian Brauner  wrote:
> 
> > > > > I suggest:
> > > > >
> > > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > > >   same inode number
> > 
> > This is just ugly with questionable value. A constant reminder of how
> > broken this is. Exposing the subvolume id also makes this somewhat 
> > redundant.
> 
> There is a upcoming potential problem where even the 64-bit field I placed in
> statx() may be insufficient.  The Auristor AFS server, for example, has a
> 96-bit vnode ID, but I can't properly represent this in stx_ino.  Currently, I

Is that vnode ID akin to a volume? Because if so you could just
piggy-back on a subvolume id field in statx() and expose it there.

> just truncate the value to fit and hope that the discarded part will be all
> zero, but that's not really a good thing to do - especially when stx_ino is
> used programmatically to check for hardlinks.
> 
> Would it be better to add an 'stx_ino_2' field and corresponding flag?

Would this be meaningfully different from using a file handle?

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread David Howells

Christian Brauner  wrote:

> > > > I suggest:
> > > >
> > > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > > >   same inode number
> 
> This is just ugly with questionable value. A constant reminder of how
> broken this is. Exposing the subvolume id also makes this somewhat redundant.

There is a upcoming potential problem where even the 64-bit field I placed in
statx() may be insufficient.  The Auristor AFS server, for example, has a
96-bit vnode ID, but I can't properly represent this in stx_ino.  Currently, I
just truncate the value to fit and hope that the discarded part will be all
zero, but that's not really a good thing to do - especially when stx_ino is
used programmatically to check for hardlinks.

Would it be better to add an 'stx_ino_2' field and corresponding flag?

David

Re: file handle in statx

2023-12-12 Thread Donald Buczek

On 12/12/23 06:53, Dave Chinner wrote:

> So can someone please explain to me why we need to try to re-invent
> a generic filehandle concept in statx when we already have a
> have working and widely supported user API that provides exactly
> this functionality?

name_to_handle_at() is fine, but userspace could profit from being able to 
retrieve
the filehandle together with the other metadata in a single system call.

One argument regarding stx_vol was that userspace tools, which walk the
filesystem tree, might want to avoid to run blindly into snapshots.

Additionally, if a volume defines the namespace for a set of unique inode 
numbers,
the tools could continue to use inode numbers just on a per-volume basis.

Best

  Donald

> Cheers,
> 
> Dave.

-- 
Donald Buczek
buc...@molgen.mpg.de
Tel: +49 30 8413 1433

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Christian Brauner

On Tue, Dec 12, 2023 at 01:13:07PM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > NFSv4 specs that for the maximum size? That is pretty hefty...
> > > 
> > > It is - but it needs room to identify the filesystem and it needs to be
> > > stable across time.  That need is more than a local filesystem needs.
> > > 
> > > NFSv2 allowed 32 bytes which is enough for a 16 byte filesys uuid, 8
> > > byte inum and 8byte generation num.  But only just.
> > > 
> > > NFSv3 allowed 64 bytes which was likely plenty for (nearly?) every
> > > situation.
> > > 
> > > NFSv4 doubled it again because  who knows.  "why not" I guess.
> > > Linux nfsd typically uses 20 or 28 bytes plus whatever the filesystem
> > > wants. (28 when the export point is not the root of the filesystem).
> > > I suspect this always fits within an NFSv3 handle except when
> > > re-exporting an NFS filesystem.  NFS re-export is an interesting case...
> > 
> > Now I'm really curious - i_generation wasn't enough? Are we including
> > filesystem UUIDs?
> 
> i_generation was invented so that it could be inserted into the NFS
> fileshandle.
> 
> The NFS filehandle is opaque.  It likely contains an inode number, a
> generation number, and a filesystem identifier.  But it is not possible
> to extract those from the handle.
> 
> > 
> > I suppose if we want to be able to round trip this stuff we do need to
> > allocate space for it, even if a local filesystem would never include
> > it.
> > 
> > > I suggest:
> > > 
> > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > >   same inode number
> > > 
> > >  
> > >  __u64 stx_vol Volume identifier.  Two files with same stx_vol and 
> > >stx_ino MUST be the same.  Exact meaning of volumes
> > >is filesys-specific
> > 
> > NFS reexport that you mentioned previously makes it seem like this
> > guarantee is impossible to provide in general (so I'd leave it out
> > entirely, it's just something for people to trip over).
> 
> NFS would not set stx_vol and would not return STATX_VOL in stx_mask.
> So it would not attempt to provide that guarantee.
> 
> Maybe we don't need to explicitly make this guarantee.
> 
> > 
> > But we definitely want stx_vol in there. Another thing that people ask
> > for is a way to ask "is this a subvolume root?" - we should make sure
> > that's clearly specified, or can we just include a bit for it?
> 
> The start way to test for a filesystem root - or mount point at least -
> is to stat the directory in question and its parent (..) and see if the
> have the same st_dev or not.

It depends. If you want to figure out whether it's a different
filesystem or a different btrfs subvolume then yes, this generally
works because of changing device ids. But it doesn't work for
bind-mounts as they don't change device numbers. But maybe you and I are
using mount point differently here.

> Applying the same logic to volumes means that a single stx_vol number is
> sufficient.

Yes, that would generally work.

> 
> I'm not strongly against a STATX_ATTR_VOL_ROOT flag providing everyone
> agrees what it means that we cannot imagine any awkward corner-cases
> (like a 'root' being different from a 'mount point').

I feel like you might have missed my previous mails where I strongly
argued for the addition of STATX_ATTR_SUBVOLUME_ROOT:

https://lore.kernel.org/linux-btrfs/20231108-herleiten-bezwangen-ffb2821f539e@brauner

The concept of a subvolume root and of a mount point should be kept
separate. Christoph tried mapping subvolumes to vfsmounts, something
that I (and Al) vehemently oppose for various reasons outlined in that
and other long threads.

I still think that we should start with exposing subvolume id first.

Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)

2023-12-12 Thread Christian Brauner

On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote:
> On Tue, Dec 12, 2023 at 7:53 AM Dave Chinner  wrote:
> >
> > On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote:
> > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote:
> > > > > On Tue, 12 Dec 2023, Kent Overstreet wrote:
> > > > > > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote:
> > > > > > > On Sat, 09 Dec 2023, Kent Overstreet wrote:
> > > > > > Thoughts?
> > > > > >
> > > > >
> > > > > I'm completely in favour of exporting the (full) filehandle through
> > > > > statx. (If the application asked for the filehandle, it will expect a
> > > > > larger structure to be returned.  We don't need to use the currently
> > > > > reserved space).
> > > > >
> > > > > I'm completely in favour of updating user-space tools to use the
> > > > > filehandle to check if two handles are for the same file.
> > > > >
> > > > > I'm not in favour of any filesystem depending on this for correct
> > > > > functionality today.  As long as the filesystem isn't so large that
> > > > > inum+volnum simply cannot fit in 64 bits, we should make a reasonable
> > > > > effort to present them both in 64 bits.  Depending on the filehandle 
> > > > > is a
> > > > > good plan for long term growth, not for basic functionality today.
> > > >
> > > > My standing policy in these situations is that I'll do the stopgap/hacky
> > > > measure... but not before doing actual, real work on the longterm
> > > > solution :)
> > >
> > > Eminently sensible.
> > >
> > > >
> > > > So if we're all in favor of statx as the real long term solution, how
> > > > about we see how far we get with that?
> > > >
> > >
> > > I suggest:
> > >
> > >  STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the
> > >   same inode number

This is just ugly with questionable value. A constant reminder of how
broken this is. Exposing the subvolume id also makes this somewhat redundant.

> > >
> > >
> > >  __u64 stx_vol Volume identifier.  Two files with same stx_vol and
> > >stx_ino MUST be the same.  Exact meaning of volumes
> > >is filesys-specific

Exposing this makes sense. I've mentioned that I'd be open to exporting
the subvolume id in statx before even though I know some people don't
like that. I still stand by that. Especially now that we have btrfs and
bcachefs that both have the concept of a subvolume id it makes sense to
put it into statx().

> > >
> > >  STATX_VOL Want stx_vol
> > >
> > >   __u8 stx_handle_len  Length of stx_handle if present
> > >   __u8 stx_handle[128] Unique stable identifier for this file.  Will
> > >NEVER be reused for a different file.
> > >This appears AFTER __statx_pad2, beyond
> > >the current 'struct statx'.
> > >  STATX_HANDLE  Want stx_handle_len and stx_handle. Buffer for
> > >receiving statx info has at least
> > >sizeof(struct statx)+128 bytes.

No, we're should not be reinventing another opaque handle. And even if
it really doesn't belong into statx().

> >
> > Hmmm.
> >
> > Doesn't anyone else see or hear the elephant trumpeting loudly in
> > the middle of the room?
> >
> > I mean, we already have name_to_handle_at() for userspace to get a
> > unique, opaque, filesystem defined file handle for any given file.

I agree.

> > It's the same filehandle that filesystems hand to the nfsd so nfs
> > clients can uniquely identify the file they are asking the nfsd to
> > operate on.
> >
> > The contents of these filehandles is entirely defined by the file
> > system and completely opaque to the user. The only thing that
> > parses the internal contents of the handle is the filesystem itself.
> > Therefore, as long as the fs encodes the information it needs into the
> > handle to determine what subvol/snapshot the inode belongs to when
> > the handle is passed back to it (e.g. from open_by_handle_at()) then
> > nothing else needs to care how it is encoded.
> >
> > So can someone please explain to me why we need to try to re-invent
> > a generic filehandle concept in statx when we already have a
> > have working and widely supported user API that provides exactly
> > this functionality?
> 
> Yeh.
> 
> Not to mention that since commit 64343119d7b8 ("exportfs: support encoding
> non-decodeable file handles by default"), exporting file handles as strong
> object identifiers is not limited to filesystems that support NFS export.
> 
> All fs have a default implementation of encode_fh() by encoding a file id
> of type FILEID_INO64_GEN from { i_ino, i_generation } and any fs can
> define its own encode_fh() operation (e.g. to include subvol id) even without
> implementing the decode fh operations needed for NFS export.

I also would like both btrfs and bcachefs to agree.

60 matches

Mail list logo