Re: file handle in statx
On 12/12/23 16:20, Theodore Ts'o wrote: > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: >> On 12/12/23 06:53, Dave Chinner wrote: >> >>> So can someone please explain to me why we need to try to re-invent >>> a generic filehandle concept in statx when we already have a >>> have working and widely supported user API that provides exactly >>> this functionality? >> >> name_to_handle_at() is fine, but userspace could profit from being >> able to retrieve the filehandle together with the other metadata in >> a single system call. > > Can you say more? What, specifically is the application that would > want to do that, and is it really in such a hot path that it would be > a user-visible improveable, let aloine something that can be actually > be measured? Probably not for the specific applications I mentioned (backup, mirror, accounting). These are intended to run continuously, slowly and unnoticed in the background, so they are memory and i/o throttled via cgroups anyway and one is even using sleep after so-and-so many stat calls to reduce its impact. If they could tell a directory from a snapshot, I would probably stop them from walking into snapshots. And if not, the snapshot id is all that is needed to tell a clone in a snapshot from a hardlink. So these don't really need the filehandle. In the thread it was assumed, that there are other (unspecified) applications which need the filehandle and currently use name_to_handle_at(). I though it was self-evident that a single syscall to retrieve all information atomically is better than a set of syscalls. Each additional syscall has overhead and you need to be concerned with the data changing between the calls. Userspace nfs server as an example of an application, where visible performance is more relevant, was already mentioned by someone else. Best Donald > > - Ted -- Donald Buczek buc...@molgen.mpg.de Tel: +49 30 8413 1433
Re: [PATCH v2 3/3] statx04: Skip STATX_ATTR_COMPRESSED on Bcachefs
On Thu, Dec 07, 2023 at 08:40:11PM +0100, Petr Vorel wrote: > STATX_ATTR_COMPRESSED is not supported on Bcachefs, thus skip it > (it's already skipped on tmpfs and XFS). Hang on, bcachefs most definitely does hae compression. This would be just because statx isn't plumbed through? > > Suggested-by: Cyril Hrubis > Signed-off-by: Petr Vorel > --- > New in v2 > > testcases/kernel/syscalls/statx/statx04.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/testcases/kernel/syscalls/statx/statx04.c > b/testcases/kernel/syscalls/statx/statx04.c > index 58296bd24..c2ed52deb 100644 > --- a/testcases/kernel/syscalls/statx/statx04.c > +++ b/testcases/kernel/syscalls/statx/statx04.c > @@ -96,8 +96,9 @@ static void setup(void) > for (i = 0, expected_mask = 0; i < ARRAY_SIZE(attr_list); i++) > expected_mask |= attr_list[i].attr; > > - /* STATX_ATTR_COMPRESSED not supported on XFS, TMPFS */ > - if (!strcmp(tst_device->fs_type, "xfs") || !strcmp(tst_device->fs_type, > "tmpfs")) > + /* STATX_ATTR_COMPRESSED not supported on Bcachefs, TMPFS, XFS */ > + if (!strcmp(tst_device->fs_type, "bcachefs") || > !strcmp(tst_device->fs_type, "tmpfs") || > + !strcmp(tst_device->fs_type, "xfs")) > expected_mask &= ~STATX_ATTR_COMPRESSED; > > /* Attribute support was added to Btrfs statx() in kernel v4.13 */ > -- > 2.43.0 >
Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis
Hi, 在 2023/12/12 21:14, Christoph Hellwig 写道: On Mon, Dec 11, 2023 at 05:52:17PM +0100, Jan Kara wrote: +void bdev_associated_mapping(struct block_device *bdev, +struct address_space *mapping) +{ + mapping->host = bdev->bd_inode; +} Here I'm not sure - is the helper really a win? It seems a bit obscure to me. This initialization of another mapping for a bdev looks really special. If we want to hide bd_inode we'll something like this helper even if I don't particularly like it either. But it might be a good idea to move out of this series and into the follow on removing bd_inode, as it's rather pointless without that context. Yes, this sounds good, I'll remove this from v3. Thanks, Kuai .
Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis
Hi, 在 2023/12/12 21:16, Christoph Hellwig 写道: +void invalidate_bdev_range(struct block_device *bdev, pgoff_t start, + pgoff_t end) +{ + invalidate_mapping_pages(bdev->bd_inode->i_mapping, start, end); +} +EXPORT_SYMBOL_GPL(invalidate_bdev_range); Can we have kerneldoc comments for the new helpers please? Of course, will definitely do this in v3. +struct folio *__bdev_get_folio(struct block_device *bdev, loff_t pos, + fgf_t fgp_flags, gfp_t gfp) +{ + return __filemap_get_folio(bdev->bd_inode->i_mapping, pos >> PAGE_SHIFT, + fgp_flags, gfp); +} +EXPORT_SYMBOL_GPL(__bdev_get_folio); It's a bit silly to have a __-prefixed API without a version that doesn't have the prefix, so I'd prefer to drop it. Unless willy has a good argument for keeping it the same as the filemap API. Ok, I'll drop it if willy doesn't against this. Thanks, Kuai .
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, Dec 13, 2023 at 11:02:29AM +1100, NeilBrown wrote: > On Wed, 13 Dec 2023, Kent Overstreet wrote: > > On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote: > > > On Wed, 13 Dec 2023, Kent Overstreet wrote: > > > > On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote: > > > > > Kent Overstreet wrote: > > > > > > > > > > > I was chatting a bit with David Howells on IRC about this, and > > > > > > floated > > > > > > adding the file handle to statx. It looks like there's enough space > > > > > > reserved to make this feasible - probably going with a fixed maximum > > > > > > size of 128-256 bits. > > > > > > > > > > We can always save the last bit to indicate extension space/extension > > > > > record, > > > > > so we're not that strapped for space. > > > > > > > > So we'll need that if we want to round trip NFSv4 filehandles, they > > > > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96 > > > > bytes reserved). > > > > > > > > Obvious question (Neal): do/will real world implementations ever come > > > > close to making use of this, or was this a "future proofing gone wild" > > > > thing? > > > > > > I have no useful data. I have seen lots of filehandles but I don't pay > > > much attention to their length. Certainly some are longer than 32 bytes. > > > > > > > > > > > Say we do decide we want to spec it that large: _can_ we extend struct > > > > statx? I'm wondering if the userspace side was thought through, I'm > > > > sure glibc people will have something to say. > > > > > > The man page says: > > > > > > Therefore, do not simply set mask to UINT_MAX (all bits set), as > > > one or more bits may, in the future, be used to specify an > > > extension to the buffer. > > > > > > I suspect the glibc people read that. > > > > The trouble is that C has no notion of which types are safe to pass > > across a dynamic library boundary, so if we increase the size of struct > > statx and someone's doing that things will break in nasty ways. > > > > Maybe we don't increase the size of struct statx. > Maybe we declare > >struct statx2 { > struct statx base; > __u8 stx_handle[128]; >} > > and pass then when we request STX_HANDLE. yeah, I think that's what would be required. David was originally proposing having userspace just pass in a size_t for the buffer size, and I really think that would've been better - if this thing can ever change size, we need to make that explicit in the API.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, 13 Dec 2023, Dave Chinner wrote: > On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote: > > On Wed, 13 Dec 2023, Dave Chinner wrote: > > > > > > What you are suggesting is that we now duplicate filehandle encoding > > > into every filesystem's statx() implementation. That's a bad > > > trade-off from a maintenance, testing and consistency POV because > > > now we end up with lots of individual, filehandle encoding > > > implementations in addition to the generic filehandle > > > infrastructure that we all have to test and validate. > > > > Not correct. We are suggesting an interface, not an implementation. > > Here you are proposing a suboptimal implementation, pointing out its > > weakness, and suggesting the has consequences for the interface > > proposal. Is that the strawman fallacy? > > No, you simply haven't followed deep enough into the rabbit hole to > understand Kent was suggesting potential implementation details to > address hot path performance concerns with filehandle encoding. Yes, I missed that. Sorry. NeilBrown > > > vfs_getattr_nosec could, after calling i_op->getattr, check if > > STATX_HANDLE is set in request_mask but not in ->result_mask. > > If so it could call exportfs_encode_fh() and handle the result. > > > > No filesystem need to be changed. > > Well, yes, it's pretty damn obvious that is exactly what I've been > advocating for here - if we are going to put filehandles in statx(), > then it must use the same infrastructure as name_to_handle_at(). > i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the > filehandle. > > The important discussion detail you've missed about > exportfs_encode_fh() is that it *requires* adding a new indirect > call (via export_ops->encode_fh) in the statx path to encode the > filehandle, and that's exactly what Kent was suggesting we can code > the implementation to avoid. > > Avoiding an indirect function call is an implementation detail, not > an interface design requirement. > > And the only way to avoid adding new indirect calls to encoding > filesystem specific filehandles is to implement the encoding in the > existing individual filesystem i_op->getattr methods. i.e. duplicate > the filehandle encoding in the statx path rather than use > exportfs_encode_fh(). > > -Dave. > -- > Dave Chinner > da...@fromorbit.com >
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, 13 Dec 2023, Kent Overstreet wrote: > On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote: > > On Wed, 13 Dec 2023, Kent Overstreet wrote: > > > On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote: > > > > Kent Overstreet wrote: > > > > > > > > > I was chatting a bit with David Howells on IRC about this, and floated > > > > > adding the file handle to statx. It looks like there's enough space > > > > > reserved to make this feasible - probably going with a fixed maximum > > > > > size of 128-256 bits. > > > > > > > > We can always save the last bit to indicate extension space/extension > > > > record, > > > > so we're not that strapped for space. > > > > > > So we'll need that if we want to round trip NFSv4 filehandles, they > > > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96 > > > bytes reserved). > > > > > > Obvious question (Neal): do/will real world implementations ever come > > > close to making use of this, or was this a "future proofing gone wild" > > > thing? > > > > I have no useful data. I have seen lots of filehandles but I don't pay > > much attention to their length. Certainly some are longer than 32 bytes. > > > > > > > > Say we do decide we want to spec it that large: _can_ we extend struct > > > statx? I'm wondering if the userspace side was thought through, I'm > > > sure glibc people will have something to say. > > > > The man page says: > > > > Therefore, do not simply set mask to UINT_MAX (all bits set), as > > one or more bits may, in the future, be used to specify an > > extension to the buffer. > > > > I suspect the glibc people read that. > > The trouble is that C has no notion of which types are safe to pass > across a dynamic library boundary, so if we increase the size of struct > statx and someone's doing that things will break in nasty ways. > Maybe we don't increase the size of struct statx. Maybe we declare struct statx2 { struct statx base; __u8 stx_handle[128]; } and pass then when we request STX_HANDLE. Or maybe there is a better way. I'm sure the glibc developers have face this sort of problem before. NeilBrown
Re: file handle in statx
On Wed, Dec 13, 2023 at 10:44:07AM +1100, Dave Chinner wrote: > On Tue, Dec 12, 2023 at 05:39:27PM -0500, Kent Overstreet wrote: > > Like Neal mentioned we won't even be fetching the fh if it wasn't > > explicitly requested - and like I mentioned, we can avoid the > > .encode_fh() call for local filesystems with a bit of work at the VFS > > layer. > > > > OTOH, when you're running rsync in incremental mode, and detecting > > hardlinks, your point that "statx can be called millions of times per > > second" would apply just as much to the additional name_to_handle_at() > > call - we'd be nearly doubling their overhead for scanning files that > > don't need to be sent. > > Hardlinked files are indicated by st_nlink > 1, not by requiring > userspace to store every st_ino/dev it sees and having to compare > the st-ino/dev of every newly stat()d inode against that ino/dev > cache. > > We only need ino/dev/filehandles for hardlink path disambiguation. > > IOWs, this use case does not need name_to_handle_at() for millions > of inodes - it is just needed on the regular file inodes that have > st_nlink > 1. Ok yeah, that's a really good point. Perhaps nanme_to_handle_at() is sufficient, then. If so, maybe we can just add STATX_ATTR_INUM_NOT_UNIQUE and STATX_VOL now, and leave STATX_HANDLE until someone discovers an application where it actually does matter. > > > And then comes the cost of encoding dynamically sized information in > > > struct statx - filehandles are not fixed size - and statx is most > > > definitely not set up or intended for dynamically sized attribute > > > data. This adds more complexity to statx because it wasn't designed > > > or intended to handle dynamically sized attributes. Optional > > > attributes, yes, but not attributes that might vary in size from fs > > > to fs or even inode type to inode type within a fileystem (e.g. dir > > > filehandles can, optionally, encode the parent inode in them). > > > > Since it looks like expanding statx is not going to be quite as easy as > > hoped, I proposed elsewhere in the thread that we reserve a smaller > > fixed size in statx (32 bytes) and set a flag if it won't fit, > > indicating that userspace needs to fall back to name_to_handle_at(). > > struct btrfs_fid is 40 bytes in size. Sure, that's not all used for > name_to_handle_at(), but we already have in-kernel filehandles that > can optionally configured to be bigger than 32 bytes... The hell is all that for!? They never reuse inode numbers, why are there generation numbers in there? And do they not have inode -> dirent backrefs? > > Stuffing a _dynamically_ sized attribute into statx would indeed be > > painful - I believe were always talking about a fixed size buffer in > > statx, the discussion's been over how big it needs to be... > > The contents of the buffer is still dynamically sized, so there's > still a length attribute that needs to be emitted to userspace with > the buffer. Correct > And then what happens with the next attribute that someone wants > statx() to expose that can be dynamically sized? Are we really > planning to allow the struct statx to be expanded indefinitely > with largely unused static data arrays? Well, struct stat/statx is not a long lived object that anyone would ever keep a lot of around; it's a short lived object that just needs to be efficient to access and ABI stable, so yes, if this comes up again that's what we should do. The alternative would be adding fields with an [ offset, length ] scheme and treating the statx buffer as a bump allocator, but simple and fast to access beats space efficiency here...
Re: file handle in statx
On Tue, Dec 12, 2023 at 05:39:27PM -0500, Kent Overstreet wrote: > On Wed, Dec 13, 2023 at 09:23:18AM +1100, Dave Chinner wrote: > > On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote: > > > On Wed, 13 Dec 2023, Dave Chinner wrote: > > > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > > > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > > > > > > > > > So can someone please explain to me why we need to try to > > > > > > > > re-invent > > > > > > > > a generic filehandle concept in statx when we already have a > > > > > > > > have > > > > > > > > working and widely supported user API that provides exactly this > > > > > > > > functionality? > > > > > > > > > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > > > > > able to retrieve the filehandle together with the other metadata > > > > > > > in a > > > > > > > single system call. > > > > > > > > > > > > Can you say more? What, specifically is the application that would > > > > > > want > > > > > to do > > > > > > that, and is it really in such a hot path that it would be a > > > > > > user-visible > > > > > > improveable, let aloine something that can be actually be measured? > > > > > > > > > > A user space NFS server like Ganesha could benefit from getting > > > > > attributes > > > > > and file handle in a single system call. > > > > > > > > At the cost of every other application that doesn't need those > > > > attributes. > > > > > > Why do you think there would be a cost? > > > > It's as much maintenance and testing cost as it is a runtime cost. > > We have to test and check this functionality works as advertised, > > and we have to maintain that in working order forever more. That's > > not free, especially if it is decided that the implementation needs > > to be hyper-optimised in each individual filesystem because of > > performance cost reasons. > > > > Indeed, even the runtime "do we need to fetch this information" > > checks have a measurable cost, especially as statx() is a very hot > > kernel path. We've been optimising branches out of things like > > setting up kiocbs because when that path is taken millions of times > > every second each logic branch that decides if something needs to be > > done or not has a direct measurable cost. statx() is a hot path that > > can be called millions of times a second. > > Like Neal mentioned we won't even be fetching the fh if it wasn't > explicitly requested - and like I mentioned, we can avoid the > .encode_fh() call for local filesystems with a bit of work at the VFS > layer. > > OTOH, when you're running rsync in incremental mode, and detecting > hardlinks, your point that "statx can be called millions of times per > second" would apply just as much to the additional name_to_handle_at() > call - we'd be nearly doubling their overhead for scanning files that > don't need to be sent. Hardlinked files are indicated by st_nlink > 1, not by requiring userspace to store every st_ino/dev it sees and having to compare the st-ino/dev of every newly stat()d inode against that ino/dev cache. We only need ino/dev/filehandles for hardlink path disambiguation. IOWs, this use case does not need name_to_handle_at() for millions of inodes - it is just needed on the regular file inodes that have st_nlink > 1. Hence even for wrokloads like rsync with hardlink detection, we don't need filehandles for every inode being stat()d. And that's ignoring the fact that, outside of certain niche use cases, hardlinks are rare. I'm really struggling to see what filehandles in statx() actually optimises in any meaningful manner > > And then comes the cost of encoding dynamically sized information in > > struct statx - filehandles are not fixed size - and statx is most > > definitely not set up or intended for dynamically sized attribute > > data. This adds more complexity to statx because it wasn't designed > > or intended to handle dynamically sized attributes. Optional > > attributes, yes, but not attributes that might vary in size from fs > > to fs or even inode type to inode type within a fileystem (e.g. dir > > filehandles can, optionally, encode the parent inode in them). > > Since it looks like expanding statx is not going to be quite as easy as > hoped, I proposed elsewhere in the thread that we reserve a smaller > fixed size in statx (32 bytes) and set a flag if it won't fit, > indicating that userspace needs to fall back to name_to_handle_at(). struct btrfs_fid is 40 bytes in size. Sure, that's not all used for name_to_handle_at(), but we already have in-kernel filehandles that can optionally configured to be bigger than 32 bytes... > Stuffing a _dynamically_ sized attribute into statx would indeed be > painful - I believe were always talking about a fixed size buffer in > statx, the discussion's been over how big it needs to be... The contents of t
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, Dec 13, 2023 at 09:57:22AM +1100, NeilBrown wrote: > On Wed, 13 Dec 2023, Kent Overstreet wrote: > > On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote: > > > Kent Overstreet wrote: > > > > > > > I was chatting a bit with David Howells on IRC about this, and floated > > > > adding the file handle to statx. It looks like there's enough space > > > > reserved to make this feasible - probably going with a fixed maximum > > > > size of 128-256 bits. > > > > > > We can always save the last bit to indicate extension space/extension > > > record, > > > so we're not that strapped for space. > > > > So we'll need that if we want to round trip NFSv4 filehandles, they > > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96 > > bytes reserved). > > > > Obvious question (Neal): do/will real world implementations ever come > > close to making use of this, or was this a "future proofing gone wild" > > thing? > > I have no useful data. I have seen lots of filehandles but I don't pay > much attention to their length. Certainly some are longer than 32 bytes. > > > > > Say we do decide we want to spec it that large: _can_ we extend struct > > statx? I'm wondering if the userspace side was thought through, I'm > > sure glibc people will have something to say. > > The man page says: > > Therefore, do not simply set mask to UINT_MAX (all bits set), as > one or more bits may, in the future, be used to specify an > extension to the buffer. > > I suspect the glibc people read that. The trouble is that C has no notion of which types are safe to pass across a dynamic library boundary, so if we increase the size of struct statx and someone's doing that things will break in nasty ways.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, Dec 13, 2023 at 10:06:37AM +1100, Dave Chinner wrote: > On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote: > > On Wed, 13 Dec 2023, Dave Chinner wrote: > > > > > > What you are suggesting is that we now duplicate filehandle encoding > > > into every filesystem's statx() implementation. That's a bad > > > trade-off from a maintenance, testing and consistency POV because > > > now we end up with lots of individual, filehandle encoding > > > implementations in addition to the generic filehandle > > > infrastructure that we all have to test and validate. > > > > Not correct. We are suggesting an interface, not an implementation. > > Here you are proposing a suboptimal implementation, pointing out its > > weakness, and suggesting the has consequences for the interface > > proposal. Is that the strawman fallacy? > > No, you simply haven't followed deep enough into the rabbit hole to > understand Kent was suggesting potential implementation details to > address hot path performance concerns with filehandle encoding. > > > vfs_getattr_nosec could, after calling i_op->getattr, check if > > STATX_HANDLE is set in request_mask but not in ->result_mask. > > If so it could call exportfs_encode_fh() and handle the result. > > > > No filesystem need to be changed. > > Well, yes, it's pretty damn obvious that is exactly what I've been > advocating for here - if we are going to put filehandles in statx(), > then it must use the same infrastructure as name_to_handle_at(). > i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the > filehandle. > > The important discussion detail you've missed about > exportfs_encode_fh() is that it *requires* adding a new indirect > call (via export_ops->encode_fh) in the statx path to encode the > filehandle, and that's exactly what Kent was suggesting we can code > the implementation to avoid. > > Avoiding an indirect function call is an implementation detail, not > an interface design requirement. > > And the only way to avoid adding new indirect calls to encoding > filesystem specific filehandles is to implement the encoding in the > existing individual filesystem i_op->getattr methods. i.e. duplicate > the filehandle encoding in the statx path rather than use > exportfs_encode_fh(). I was thinking along the lines of coming up with a common fh type for local filesystems (why exactly do we need 15?) and adding a volume ID to the VFS inode so this could live entirely in VS code for most filesystems, but that's an option too. Might be the best one, since btrfs and bcachefs actually do want a different fh type (btrfs: 64 bit subvol, 64 bit ino, bcachefs: 64 bit ino, 32 bit subvol, 32 bit generation), and we don't want to generate a bigger fh than necessary for when it's being consumed by a stacking filesystem that has to generate a new fh by concatanating something.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, Dec 13, 2023 at 09:31:13AM +1100, NeilBrown wrote: > On Wed, 13 Dec 2023, Dave Chinner wrote: > > > > What you are suggesting is that we now duplicate filehandle encoding > > into every filesystem's statx() implementation. That's a bad > > trade-off from a maintenance, testing and consistency POV because > > now we end up with lots of individual, filehandle encoding > > implementations in addition to the generic filehandle > > infrastructure that we all have to test and validate. > > Not correct. We are suggesting an interface, not an implementation. > Here you are proposing a suboptimal implementation, pointing out its > weakness, and suggesting the has consequences for the interface > proposal. Is that the strawman fallacy? No, you simply haven't followed deep enough into the rabbit hole to understand Kent was suggesting potential implementation details to address hot path performance concerns with filehandle encoding. > vfs_getattr_nosec could, after calling i_op->getattr, check if > STATX_HANDLE is set in request_mask but not in ->result_mask. > If so it could call exportfs_encode_fh() and handle the result. > > No filesystem need to be changed. Well, yes, it's pretty damn obvious that is exactly what I've been advocating for here - if we are going to put filehandles in statx(), then it must use the same infrastructure as name_to_handle_at(). i.e. calling exportfs_encode_fh(EXPORT_FH_FID) to generate the filehandle. The important discussion detail you've missed about exportfs_encode_fh() is that it *requires* adding a new indirect call (via export_ops->encode_fh) in the statx path to encode the filehandle, and that's exactly what Kent was suggesting we can code the implementation to avoid. Avoiding an indirect function call is an implementation detail, not an interface design requirement. And the only way to avoid adding new indirect calls to encoding filesystem specific filehandles is to implement the encoding in the existing individual filesystem i_op->getattr methods. i.e. duplicate the filehandle encoding in the statx path rather than use exportfs_encode_fh(). -Dave. -- Dave Chinner da...@fromorbit.com
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, 13 Dec 2023, Kent Overstreet wrote: > On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote: > > Kent Overstreet wrote: > > > > > I was chatting a bit with David Howells on IRC about this, and floated > > > adding the file handle to statx. It looks like there's enough space > > > reserved to make this feasible - probably going with a fixed maximum > > > size of 128-256 bits. > > > > We can always save the last bit to indicate extension space/extension > > record, > > so we're not that strapped for space. > > So we'll need that if we want to round trip NFSv4 filehandles, they > won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96 > bytes reserved). > > Obvious question (Neal): do/will real world implementations ever come > close to making use of this, or was this a "future proofing gone wild" > thing? I have no useful data. I have seen lots of filehandles but I don't pay much attention to their length. Certainly some are longer than 32 bytes. > > Say we do decide we want to spec it that large: _can_ we extend struct > statx? I'm wondering if the userspace side was thought through, I'm > sure glibc people will have something to say. The man page says: Therefore, do not simply set mask to UINT_MAX (all bits set), as one or more bits may, in the future, be used to specify an extension to the buffer. I suspect the glibc people read that. NeilBrown > > Kernel side we can definitely extend struct statx, and we know how many > bytes to copy to userspace because we know what fields userspace > requested. The part I'm concerned about is that if we extend userspace's > struct statx, that introduces obvious ABI compabitibility issues. > > So this would probably force glibc to introduce a new version of struct > statx, if I'm not mistaken. > > Or: another option would be to reserve something small and sane in > struct statx (32 bytes max, I'd say), and then set a flag to tell > userspace they need to use name_to_handle_at() if it didn't fit. >
Re: file handle in statx
On Wed, Dec 13, 2023 at 09:23:18AM +1100, Dave Chinner wrote: > On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote: > > On Wed, 13 Dec 2023, Dave Chinner wrote: > > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > > > > > > > So can someone please explain to me why we need to try to > > > > > > > re-invent > > > > > > > a generic filehandle concept in statx when we already have a have > > > > > > > working and widely supported user API that provides exactly this > > > > > > > functionality? > > > > > > > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > > > > able to retrieve the filehandle together with the other metadata in > > > > > > a > > > > > > single system call. > > > > > > > > > > Can you say more? What, specifically is the application that would > > > > > want > > > > to do > > > > > that, and is it really in such a hot path that it would be a > > > > > user-visible > > > > > improveable, let aloine something that can be actually be measured? > > > > > > > > A user space NFS server like Ganesha could benefit from getting > > > > attributes > > > > and file handle in a single system call. > > > > > > At the cost of every other application that doesn't need those > > > attributes. > > > > Why do you think there would be a cost? > > It's as much maintenance and testing cost as it is a runtime cost. > We have to test and check this functionality works as advertised, > and we have to maintain that in working order forever more. That's > not free, especially if it is decided that the implementation needs > to be hyper-optimised in each individual filesystem because of > performance cost reasons. > > Indeed, even the runtime "do we need to fetch this information" > checks have a measurable cost, especially as statx() is a very hot > kernel path. We've been optimising branches out of things like > setting up kiocbs because when that path is taken millions of times > every second each logic branch that decides if something needs to be > done or not has a direct measurable cost. statx() is a hot path that > can be called millions of times a second. Like Neal mentioned we won't even be fetching the fh if it wasn't explicitly requested - and like I mentioned, we can avoid the .encode_fh() call for local filesystems with a bit of work at the VFS layer. OTOH, when you're running rsync in incremental mode, and detecting hardlinks, your point that "statx can be called millions of times per second" would apply just as much to the additional name_to_handle_at() call - we'd be nearly doubling their overhead for scanning files that don't need to be sent. > And then comes the cost of encoding dynamically sized information in > struct statx - filehandles are not fixed size - and statx is most > definitely not set up or intended for dynamically sized attribute > data. This adds more complexity to statx because it wasn't designed > or intended to handle dynamically sized attributes. Optional > attributes, yes, but not attributes that might vary in size from fs > to fs or even inode type to inode type within a fileystem (e.g. dir > filehandles can, optionally, encode the parent inode in them). Since it looks like expanding statx is not going to be quite as easy as hoped, I proposed elsewhere in the thread that we reserve a smaller fixed size in statx (32 bytes) and set a flag if it won't fit, indicating that userspace needs to fall back to name_to_handle_at(). Stuffing a _dynamically_ sized attribute into statx would indeed be painful - I believe were always talking about a fixed size buffer in statx, the discussion's been over how big it needs to be...
Re: file handle in statx
On Wed, 13 Dec 2023, Dave Chinner wrote: > On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote: > > On Wed, 13 Dec 2023, Dave Chinner wrote: > > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > > > > > > > So can someone please explain to me why we need to try to > > > > > > > re-invent > > > > > > > a generic filehandle concept in statx when we already have a have > > > > > > > working and widely supported user API that provides exactly this > > > > > > > functionality? > > > > > > > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > > > > able to retrieve the filehandle together with the other metadata in > > > > > > a > > > > > > single system call. > > > > > > > > > > Can you say more? What, specifically is the application that would > > > > > want > > > > to do > > > > > that, and is it really in such a hot path that it would be a > > > > > user-visible > > > > > improveable, let aloine something that can be actually be measured? > > > > > > > > A user space NFS server like Ganesha could benefit from getting > > > > attributes > > > > and file handle in a single system call. > > > > > > At the cost of every other application that doesn't need those > > > attributes. > > > > Why do you think there would be a cost? > > It's as much maintenance and testing cost as it is a runtime cost. You are moving the goal posts. Maintenance is not a cost borne by "every other application". But if you think it is a concern it is certainly worth mentioning. > We have to test and check this functionality works as advertised, > and we have to maintain that in working order forever more. That's > not free, especially if it is decided that the implementation needs > to be hyper-optimised in each individual filesystem because of > performance cost reasons. > > Indeed, even the runtime "do we need to fetch this information" > checks have a measurable cost, especially as statx() is a very hot > kernel path. We've been optimising branches out of things like > setting up kiocbs because when that path is taken millions of times > every second each logic branch that decides if something needs to be > done or not has a direct measurable cost. statx() is a hot path that > can be called millions of times a second. > > And then comes the cost of encoding dynamically sized information in > struct statx - filehandles are not fixed size - and statx is most > definitely not set up or intended for dynamically sized attribute > data. This adds more complexity to statx because it wasn't designed > or intended to handle dynamically sized attributes. Optional > attributes, yes, but not attributes that might vary in size from fs > to fs or even inode type to inode type within a fileystem (e.g. dir > filehandles can, optionally, encode the parent inode in them). Filehandles are fixed sized. They are one integer in the range 0-128 plus 128 bytes. We know/promise the when the integer is less than 128, trailing bytes of the 128 will all be zero. NeilBRown > > -Dave. > -- > Dave Chinner > da...@fromorbit.com >
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, 13 Dec 2023, Dave Chinner wrote: > > What you are suggesting is that we now duplicate filehandle encoding > into every filesystem's statx() implementation. That's a bad > trade-off from a maintenance, testing and consistency POV because > now we end up with lots of individual, filehandle encoding > implementations in addition to the generic filehandle > infrastructure that we all have to test and validate. Not correct. We are suggesting an interface, not an implementation. Here you are proposing a suboptimal implementation, pointing out its weakness, and suggesting the has consequences for the interface proposal. Is that the strawman fallacy? vfs_getattr_nosec could, after calling i_op->getattr, check if STATX_HANDLE is set in request_mask but not in ->result_mask. If so it could call exportfs_encode_fh() and handle the result. No filesystem need to be changed. NeilBrown
Re: file handle in statx
On Wed, Dec 13, 2023 at 08:57:43AM +1100, NeilBrown wrote: > On Wed, 13 Dec 2023, Dave Chinner wrote: > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > > > > > So can someone please explain to me why we need to try to re-invent > > > > > > a generic filehandle concept in statx when we already have a have > > > > > > working and widely supported user API that provides exactly this > > > > > > functionality? > > > > > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > > > able to retrieve the filehandle together with the other metadata in a > > > > > single system call. > > > > > > > > Can you say more? What, specifically is the application that would want > > > to do > > > > that, and is it really in such a hot path that it would be a > > > > user-visible > > > > improveable, let aloine something that can be actually be measured? > > > > > > A user space NFS server like Ganesha could benefit from getting attributes > > > and file handle in a single system call. > > > > At the cost of every other application that doesn't need those > > attributes. > > Why do you think there would be a cost? It's as much maintenance and testing cost as it is a runtime cost. We have to test and check this functionality works as advertised, and we have to maintain that in working order forever more. That's not free, especially if it is decided that the implementation needs to be hyper-optimised in each individual filesystem because of performance cost reasons. Indeed, even the runtime "do we need to fetch this information" checks have a measurable cost, especially as statx() is a very hot kernel path. We've been optimising branches out of things like setting up kiocbs because when that path is taken millions of times every second each logic branch that decides if something needs to be done or not has a direct measurable cost. statx() is a hot path that can be called millions of times a second. And then comes the cost of encoding dynamically sized information in struct statx - filehandles are not fixed size - and statx is most definitely not set up or intended for dynamically sized attribute data. This adds more complexity to statx because it wasn't designed or intended to handle dynamically sized attributes. Optional attributes, yes, but not attributes that might vary in size from fs to fs or even inode type to inode type within a fileystem (e.g. dir filehandles can, optionally, encode the parent inode in them). -Dave. -- Dave Chinner da...@fromorbit.com
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 04:23:06PM -0500, Kent Overstreet wrote: > On Wed, Dec 13, 2023 at 07:48:16AM +1100, Dave Chinner wrote: > > On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote: > > > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote: > > > > Doesn't anyone else see or hear the elephant trumpeting loudly in > > > > the middle of the room? > > > > > > > > I mean, we already have name_to_handle_at() for userspace to get a > > > > unique, opaque, filesystem defined file handle for any given file. > > > > It's the same filehandle that filesystems hand to the nfsd so nfs > > > > clients can uniquely identify the file they are asking the nfsd to > > > > operate on. > > > > > > > > The contents of these filehandles is entirely defined by the file > > > > system and completely opaque to the user. The only thing that > > > > parses the internal contents of the handle is the filesystem itself. > > > > Therefore, as long as the fs encodes the information it needs into the > > > > handle to determine what subvol/snapshot the inode belongs to when > > > > the handle is passed back to it (e.g. from open_by_handle_at()) then > > > > nothing else needs to care how it is encoded. > > > > > > > > So can someone please explain to me why we need to try to re-invent > > > > a generic filehandle concept in statx when we already have a > > > > have working and widely supported user API that provides exactly > > > > this functionality? > > > > > > Definitely should be part of the discussion :) > > > > > > But I think it _does_ need to be in statx; because: > > > - we've determined that 64 bit ino_t just isn't a future proof > > >interface, we're having real problems with it today > > > - statx is _the_ standard, future proofed interface for getting inode > > >attributes > > > > No, it most definitely isn't, and statx was never intended as a > > dumping ground for anything and everything inode related. e.g. Any > > inode attribute that can be modified needs to use a different > > interface - one that has a corresponding "set" operation. > > And here I thought the whole point of statx was to be an extensible way > to request any sort of inode attribute. Within reason. When the size of a single attribute is dynamically sized, can double the size of the statx structure and there's an existing syscall to retrieve that information from the filesystem, it makes no sense *at all* to duplicate that information in statx(). statx() is for new attributes we don't have any other way of exposing to userspace easily. it is designed for fixed size attributes and flags, it is not designed for dynamically sized information to be embedded in struct statx. Filehandles are not new, and we already have APIs that expose them to userspace that handle the dynamic size of filehandles just fine. This functionality simply does not belong in statx() at all. > > > - therefore, if we want userspace programmers to be using filehandles, > > >instead of inode numbers, so there code isn't broken, we need to be > > >providing interfaces that guide them in that direction. > > > > We already have a filehandle interface they can use for this > > purpose. It is already used by some userspace applications for this > > purpose. > > > > Anything new API function do with statx() will require application > > changes, and the vast majority of applications aren't using statx() > > directly - they are using stat() which glibc wraps to statx() > > internally. So they are going to need a change of API, anyway. > > > > So, fundamentally, there is a change of API for most applications > > that need to do thorough inode uniqueness checks regardless of > > anything else. They can do this right now - just continue using > > stat() as they do right now, and then use name_to_filehandle_at() > > for uniqueness checks. > > > > > Even assuming we can update all the documentation to say "filehandles > > > are the correct way to test inode uniqueness", you know at least half of > > > programmers will stick to stx_ino instead of the filehandle if the > > > filehandle is an extra syscall. > > > > Your argument is "programmers suck so we must design for the > > lowest common denominator". That's an -awful- way to design APIs. > > No, I'm saying if the old way doing things no longer works, we ought to > make the new future proofed way as ergonomic and easy to use as the old > way was - else it won't get used. > > At the _very_ least we need to add a flag to statx for "inode number is > unreliable for uniqueness checks". > > bcachefs could leave this off until the first snapshot has been taken. > > But even with that option, I think we ought to be telling anyone doing > uniqueness checks to use the filehandle, because it includes i_generation. > > > Further, this "programmers suck" design comes at a cost to every > > statx() call that does not need filehandles. That's the vast > > majority of statx() calls that are made on a
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, 13 Dec 2023, Dave Chinner wrote: > On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote: > > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote: > > > Doesn't anyone else see or hear the elephant trumpeting loudly in > > > the middle of the room? > > > > > > I mean, we already have name_to_handle_at() for userspace to get a > > > unique, opaque, filesystem defined file handle for any given file. > > > It's the same filehandle that filesystems hand to the nfsd so nfs > > > clients can uniquely identify the file they are asking the nfsd to > > > operate on. > > > > > > The contents of these filehandles is entirely defined by the file > > > system and completely opaque to the user. The only thing that > > > parses the internal contents of the handle is the filesystem itself. > > > Therefore, as long as the fs encodes the information it needs into the > > > handle to determine what subvol/snapshot the inode belongs to when > > > the handle is passed back to it (e.g. from open_by_handle_at()) then > > > nothing else needs to care how it is encoded. > > > > > > So can someone please explain to me why we need to try to re-invent > > > a generic filehandle concept in statx when we already have a > > > have working and widely supported user API that provides exactly > > > this functionality? > > > > Definitely should be part of the discussion :) > > > > But I think it _does_ need to be in statx; because: > > - we've determined that 64 bit ino_t just isn't a future proof > >interface, we're having real problems with it today > > - statx is _the_ standard, future proofed interface for getting inode > >attributes > > No, it most definitely isn't, and statx was never intended as a > dumping ground for anything and everything inode related. e.g. Any > inode attribute that can be modified needs to use a different > interface - one that has a corresponding "set" operation. stx_size, stx_mtime, stx_atime, stx_mode, stx_owner, stx_group, stx_nlink, These can all be modified but don't have matched get and set operations. stx_handle, of course, cannot be modified. I think I must be completely misunderstanding you, because the way I read your words, they don't make any sense at all. Help! NeilBrown
Re: file handle in statx
On Wed, 13 Dec 2023, Dave Chinner wrote: > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > > > So can someone please explain to me why we need to try to re-invent > > > > > a generic filehandle concept in statx when we already have a have > > > > > working and widely supported user API that provides exactly this > > > > > functionality? > > > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > > able to retrieve the filehandle together with the other metadata in a > > > > single system call. > > > > > > Can you say more? What, specifically is the application that would want > > to do > > > that, and is it really in such a hot path that it would be a user-visible > > > improveable, let aloine something that can be actually be measured? > > > > A user space NFS server like Ganesha could benefit from getting attributes > > and file handle in a single system call. > > At the cost of every other application that doesn't need those > attributes. Why do you think there would be a cost? statx only returns the attributes that it was asked for. If an application doesn't ask for STATX_HANDLE, it doesn't get STATS_HANDLE and (providing filesystems honour the flag) doesn't pay the price for generating it (which is often trivial, but may not always be). NeilBrown > That's not a good trade-off - the cost of a syscall is > tiny compared to the rest of the work that has to be done to create > a stable filehandle for an inode, even if we have a variant of > name_to_handle_at() that takes an open fd rather than a path. > > > Potentially it could also avoid some of the challenges of using > > name_to_handle_at that is a privileged operation. > > It isn't. open_by_handle_at() is privileged because it bypasses all > the path based access checking that a normal open() does. Anyone can > get a filehandle for a path from the kernel, but few can actually > use it for anything other than file uniqueness checks > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com >
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, 13 Dec 2023, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 17:08, Kent Overstreet > wrote: > > > In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when > > they _must_ do the new thing if they care about correctness; it provides > > a way to tell userspace what guarantees we're able to provide. > > That flag would not help with improving userspace software. Are you sure? Suppose I wanted to export an filesystem using some protocol (maybe one called "NFSv4"), and suppose this protocol supported the communication of an attribute called "fileid" which was optional but requires to be fully unique if provided at all. If I had access to STATX_ATTR_INUM_NOT_UNIQUE, then I could not export the fileid when it didn't met the protocol requirements, but could when it did. This may not be a strong case for the inclusion of the flag it is, I think, a clear indication that "would not help" is what our fact checkers would call "over-reach". NeilBrown > > What would help, if said software started using a unique identifier. > We already seem to have a unique ID in the form of file handles, > though some exotic filesystems might allow more than one fh to refer > to the same inode, so this still needs some looking into. > > The big problem is that we can't do a lot about existing software, and > must keep trying to keep st_ino unique for the foreseeable future. > > Thanks, > Miklos >
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, 13 Dec 2023, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet > wrote: > > > Other poeple have been finding ways to contribute to the technical > > discussion; just calling things "ugly and broken" does not. > > Kent, calm down please. We call things "ugly and broken" all the > time. That's just an opinion, you are free to argue it, and no need > to take it personally. But maybe we shouldn't. Maybe we should focus on saying what, exactly, is unpleasant to look at and way. Or what exactly causes poor funcationality. "ugly" and "broken" are not particularly useful words in a technical discussion. I understand people want to use them, but they really need to be backed up with details. It is details that matter. NeilBrown > > Thanks, > Miklos > >
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Wed, Dec 13, 2023 at 07:48:16AM +1100, Dave Chinner wrote: > On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote: > > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote: > > > Doesn't anyone else see or hear the elephant trumpeting loudly in > > > the middle of the room? > > > > > > I mean, we already have name_to_handle_at() for userspace to get a > > > unique, opaque, filesystem defined file handle for any given file. > > > It's the same filehandle that filesystems hand to the nfsd so nfs > > > clients can uniquely identify the file they are asking the nfsd to > > > operate on. > > > > > > The contents of these filehandles is entirely defined by the file > > > system and completely opaque to the user. The only thing that > > > parses the internal contents of the handle is the filesystem itself. > > > Therefore, as long as the fs encodes the information it needs into the > > > handle to determine what subvol/snapshot the inode belongs to when > > > the handle is passed back to it (e.g. from open_by_handle_at()) then > > > nothing else needs to care how it is encoded. > > > > > > So can someone please explain to me why we need to try to re-invent > > > a generic filehandle concept in statx when we already have a > > > have working and widely supported user API that provides exactly > > > this functionality? > > > > Definitely should be part of the discussion :) > > > > But I think it _does_ need to be in statx; because: > > - we've determined that 64 bit ino_t just isn't a future proof > >interface, we're having real problems with it today > > - statx is _the_ standard, future proofed interface for getting inode > >attributes > > No, it most definitely isn't, and statx was never intended as a > dumping ground for anything and everything inode related. e.g. Any > inode attribute that can be modified needs to use a different > interface - one that has a corresponding "set" operation. And here I thought the whole point of statx was to be an extensible way to request any sort of inode attribute. > > - therefore, if we want userspace programmers to be using filehandles, > >instead of inode numbers, so there code isn't broken, we need to be > >providing interfaces that guide them in that direction. > > We already have a filehandle interface they can use for this > purpose. It is already used by some userspace applications for this > purpose. > > Anything new API function do with statx() will require application > changes, and the vast majority of applications aren't using statx() > directly - they are using stat() which glibc wraps to statx() > internally. So they are going to need a change of API, anyway. > > So, fundamentally, there is a change of API for most applications > that need to do thorough inode uniqueness checks regardless of > anything else. They can do this right now - just continue using > stat() as they do right now, and then use name_to_filehandle_at() > for uniqueness checks. > > > Even assuming we can update all the documentation to say "filehandles > > are the correct way to test inode uniqueness", you know at least half of > > programmers will stick to stx_ino instead of the filehandle if the > > filehandle is an extra syscall. > > Your argument is "programmers suck so we must design for the > lowest common denominator". That's an -awful- way to design APIs. No, I'm saying if the old way doing things no longer works, we ought to make the new future proofed way as ergonomic and easy to use as the old way was - else it won't get used. At the _very_ least we need to add a flag to statx for "inode number is unreliable for uniqueness checks". bcachefs could leave this off until the first snapshot has been taken. But even with that option, I think we ought to be telling anyone doing uniqueness checks to use the filehandle, because it includes i_generation. > Further, this "programmers suck" design comes at a cost to every > statx() call that does not need filehandles. That's the vast > majority of statx() calls that are made on a system. Why should we > slow down statx() for all users when so few applications actually > need uniqueness and they can take the cost of robust uniqueness > tests with an extra syscall entirely themselves? For any local filesystem the filehandle is going to be the inode generation number, the subvolume ID (if applicable), and the inode number. That's 16 bytes on bcachefs, and if we make an attempt to standardize how this works across filesystems we should be able to do it without adding a new indirect function call to the statx() path. That sounds pretty negligable to me. The syscall overhead isn't going to be negligable when an application has to do lots of scanning to find the files its interested - rsync.
Re: file handle in statx
On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > So can someone please explain to me why we need to try to re-invent > > > > a generic filehandle concept in statx when we already have a have > > > > working and widely supported user API that provides exactly this > > > > functionality? > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > able to retrieve the filehandle together with the other metadata in a > > > single system call. > > > > Can you say more? What, specifically is the application that would want > to do > > that, and is it really in such a hot path that it would be a user-visible > > improveable, let aloine something that can be actually be measured? > > A user space NFS server like Ganesha could benefit from getting attributes > and file handle in a single system call. At the cost of every other application that doesn't need those attributes. That's not a good trade-off - the cost of a syscall is tiny compared to the rest of the work that has to be done to create a stable filehandle for an inode, even if we have a variant of name_to_handle_at() that takes an open fd rather than a path. > Potentially it could also avoid some of the challenges of using > name_to_handle_at that is a privileged operation. It isn't. open_by_handle_at() is privileged because it bypasses all the path based access checking that a normal open() does. Anyone can get a filehandle for a path from the kernel, but few can actually use it for anything other than file uniqueness checks Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Mon, Dec 11, 2023 at 11:40:16PM +, David Howells wrote: > Kent Overstreet wrote: > > > I was chatting a bit with David Howells on IRC about this, and floated > > adding the file handle to statx. It looks like there's enough space > > reserved to make this feasible - probably going with a fixed maximum > > size of 128-256 bits. > > We can always save the last bit to indicate extension space/extension record, > so we're not that strapped for space. So we'll need that if we want to round trip NFSv4 filehandles, they won't fit in existing struct statx (nfsv4 specs 128 bytes, statx has 96 bytes reserved). Obvious question (Neal): do/will real world implementations ever come close to making use of this, or was this a "future proofing gone wild" thing? Say we do decide we want to spec it that large: _can_ we extend struct statx? I'm wondering if the userspace side was thought through, I'm sure glibc people will have something to say. Kernel side we can definitely extend struct statx, and we know how many bytes to copy to userspace because we know what fields userspace requested. The part I'm concerned about is that if we extend userspace's struct statx, that introduces obvious ABI compabitibility issues. So this would probably force glibc to introduce a new version of struct statx, if I'm not mistaken. Or: another option would be to reserve something small and sane in struct statx (32 bytes max, I'd say), and then set a flag to tell userspace they need to use name_to_handle_at() if it didn't fit.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 10:21:53AM -0500, Kent Overstreet wrote: > On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote: > > Doesn't anyone else see or hear the elephant trumpeting loudly in > > the middle of the room? > > > > I mean, we already have name_to_handle_at() for userspace to get a > > unique, opaque, filesystem defined file handle for any given file. > > It's the same filehandle that filesystems hand to the nfsd so nfs > > clients can uniquely identify the file they are asking the nfsd to > > operate on. > > > > The contents of these filehandles is entirely defined by the file > > system and completely opaque to the user. The only thing that > > parses the internal contents of the handle is the filesystem itself. > > Therefore, as long as the fs encodes the information it needs into the > > handle to determine what subvol/snapshot the inode belongs to when > > the handle is passed back to it (e.g. from open_by_handle_at()) then > > nothing else needs to care how it is encoded. > > > > So can someone please explain to me why we need to try to re-invent > > a generic filehandle concept in statx when we already have a > > have working and widely supported user API that provides exactly > > this functionality? > > Definitely should be part of the discussion :) > > But I think it _does_ need to be in statx; because: > - we've determined that 64 bit ino_t just isn't a future proof >interface, we're having real problems with it today > - statx is _the_ standard, future proofed interface for getting inode >attributes No, it most definitely isn't, and statx was never intended as a dumping ground for anything and everything inode related. e.g. Any inode attribute that can be modified needs to use a different interface - one that has a corresponding "set" operation. > - therefore, if we want userspace programmers to be using filehandles, >instead of inode numbers, so there code isn't broken, we need to be >providing interfaces that guide them in that direction. We already have a filehandle interface they can use for this purpose. It is already used by some userspace applications for this purpose. Anything new API function do with statx() will require application changes, and the vast majority of applications aren't using statx() directly - they are using stat() which glibc wraps to statx() internally. So they are going to need a change of API, anyway. So, fundamentally, there is a change of API for most applications that need to do thorough inode uniqueness checks regardless of anything else. They can do this right now - just continue using stat() as they do right now, and then use name_to_filehandle_at() for uniqueness checks. > Even assuming we can update all the documentation to say "filehandles > are the correct way to test inode uniqueness", you know at least half of > programmers will stick to stx_ino instead of the filehandle if the > filehandle is an extra syscall. Your argument is "programmers suck so we must design for the lowest common denominator". That's an -awful- way to design APIs. Further, this "programmers suck" design comes at a cost to every statx() call that does not need filehandles. That's the vast majority of statx() calls that are made on a system. Why should we slow down statx() for all users when so few applications actually need uniqueness and they can take the cost of robust uniqueness tests with an extra syscall entirely themselves? -Dave. -- Dave Chinner da...@fromorbit.com
RE: file handle in statx
> On Tue, Dec 12, 2023 at 7:44 PM Kent Overstreet > wrote: > > > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > > > > > So can someone please explain to me why we need to try to > > > > > > re-invent a generic filehandle concept in statx when we > > > > > > already have a have working and widely supported user API that > > > > > > provides exactly this functionality? > > > > > > > > > > name_to_handle_at() is fine, but userspace could profit from > > > > > being able to retrieve the filehandle together with the other > > > > > metadata in a single system call. > > > > > > > > Can you say more? What, specifically is the application that > > > > would want > > > to do > > > > that, and is it really in such a hot path that it would be a > > > > user-visible improveable, let aloine something that can be actually be > measured? > > > > > > A user space NFS server like Ganesha could benefit from getting > > > attributes and file handle in a single system call. > > > > > > Potentially it could also avoid some of the challenges of using > > > name_to_handle_at that is a privileged operation. > > > > which begs the question - why is name_to_handle_at() privileged? > > > > AFAICT, it is not privileged. > Only open_by_handle_at() is privileged. Ah, that makes sense. I'm a consumer of these interfaces, not so much in the kernel these days. Ganesha depends on open_by_handle_at as well, so requires privilege to do handle stuff with kernel file systems. In any case, Ganesha could easily benefit from a savings of system calls. What would also be handy is a read directory with attributes that gave the statx results for each entry. Frank
Re: file handle in statx
On Tue, Dec 12, 2023 at 7:44 PM Kent Overstreet wrote: > > On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > > > So can someone please explain to me why we need to try to re-invent > > > > > a generic filehandle concept in statx when we already have a have > > > > > working and widely supported user API that provides exactly this > > > > > functionality? > > > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > > able to retrieve the filehandle together with the other metadata in a > > > > single system call. > > > > > > Can you say more? What, specifically is the application that would want > > to do > > > that, and is it really in such a hot path that it would be a user-visible > > > improveable, let aloine something that can be actually be measured? > > > > A user space NFS server like Ganesha could benefit from getting attributes > > and file handle in a single system call. > > > > Potentially it could also avoid some of the challenges of using > > name_to_handle_at that is a privileged operation. > > which begs the question - why is name_to_handle_at() privileged? > AFAICT, it is not privileged. Only open_by_handle_at() is privileged. Thanks, Amir.
Re: file handle in statx
On Tue, Dec 12, 2023 at 09:15:29AM -0800, Frank Filz wrote: > > On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > > > So can someone please explain to me why we need to try to re-invent > > > > a generic filehandle concept in statx when we already have a have > > > > working and widely supported user API that provides exactly this > > > > functionality? > > > > > > name_to_handle_at() is fine, but userspace could profit from being > > > able to retrieve the filehandle together with the other metadata in a > > > single system call. > > > > Can you say more? What, specifically is the application that would want > to do > > that, and is it really in such a hot path that it would be a user-visible > > improveable, let aloine something that can be actually be measured? > > A user space NFS server like Ganesha could benefit from getting attributes > and file handle in a single system call. > > Potentially it could also avoid some of the challenges of using > name_to_handle_at that is a privileged operation. which begs the question - why is name_to_handle_at() privileged?
RE: file handle in statx
> On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > > On 12/12/23 06:53, Dave Chinner wrote: > > > > > So can someone please explain to me why we need to try to re-invent > > > a generic filehandle concept in statx when we already have a have > > > working and widely supported user API that provides exactly this > > > functionality? > > > > name_to_handle_at() is fine, but userspace could profit from being > > able to retrieve the filehandle together with the other metadata in a > > single system call. > > Can you say more? What, specifically is the application that would want to do > that, and is it really in such a hot path that it would be a user-visible > improveable, let aloine something that can be actually be measured? A user space NFS server like Ganesha could benefit from getting attributes and file handle in a single system call. Potentially it could also avoid some of the challenges of using name_to_handle_at that is a privileged operation. Frank
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 05:30:23PM +0100, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 17:08, Kent Overstreet > wrote: > > > In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when > > they _must_ do the new thing if they care about correctness; it provides > > a way to tell userspace what guarantees we're able to provide. > > That flag would not help with improving userspace software. I disagree. Just having it there and in the documentation will be a good solid nudge to userspace programmers that this is something they need to consider. And: if we size this thing for NFSv4 filehandles, those are big: there's an overhead associated with using those, since userspace generally has to track in memory all inode numbers/file handles that it's seen. STATX_ATTR_INUM_NOT_UNIQUE would allow userspace to avoid that overhead when it is safe to do so. (But remember that file handles include inode generation numbers, and inode numbers do not; that is something we should remember to document, and explain why it is important). > What would help, if said software started using a unique identifier. > We already seem to have a unique ID in the form of file handles, > though some exotic filesystems might allow more than one fh to refer > to the same inode, so this still needs some looking into. > > The big problem is that we can't do a lot about existing software, and > must keep trying to keep st_ino unique for the foreseeable future. Whatever hacks we try to apply to st_ino are outside the scope of this discussion. Right now, we need to be figuring out what our future proofed interface is going to be, so that we don't end up kicking this can down the road when st_ino will be _really_ space constrained.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, 12 Dec 2023 at 17:08, Kent Overstreet wrote: > In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when > they _must_ do the new thing if they care about correctness; it provides > a way to tell userspace what guarantees we're able to provide. That flag would not help with improving userspace software. What would help, if said software started using a unique identifier. We already seem to have a unique ID in the form of file handles, though some exotic filesystems might allow more than one fh to refer to the same inode, so this still needs some looking into. The big problem is that we can't do a lot about existing software, and must keep trying to keep st_ino unique for the foreseeable future. Thanks, Miklos
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 04:57:41PM +0100, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 16:43, Kent Overstreet > wrote: > > > > On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote: > > > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet > > > wrote: > > > > > > > Other poeple have been finding ways to contribute to the technical > > > > discussion; just calling things "ugly and broken" does not. > > > > > > Kent, calm down please. We call things "ugly and broken" all the > > > time. That's just an opinion, you are free to argue it, and no need > > > to take it personally. > > > > It's an entirely subjective judgement that has no place in a discussion > > where we're trying to decide things on technical merits. > > Software is like architecture. It must not collapse, but to function > well it also needs to look good. That's especially relevant to APIs, > and yes, it's a matter of taste and can be very subjective. Good taste is a highly important trait - in the course of normal work we trust our subjective opinions constantly, because those opinions have been trained by years and years of judgements and we can't reason through everything. But when you show up to a discussion that's been going on for a page, where everything's been constructively gathering input, and you start namecalling - and crucially, _without giving any technical justification for your opinions_ - that's just you being a dick. > On this particular point (STATX_ATTR_INUM_NOT_UNIQUE) I can completely > understand Christian's reaction. But if you think this would be a > useful feature, please state your technical argument. Well, you could have scanned through the prior thread, or read the extensive LWN coverage, but in short - inode number uniqueness is something that's already impossible to guarantee, today. In short, STATX_ATTR_INUM_NOT_UNIQUE is required to tell userspace when they _must_ do the new thing if they care about correctness; it provides a way to tell userspace what guarantees we're able to provide.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, 12 Dec 2023 at 16:43, Kent Overstreet wrote: > > On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote: > > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet > > wrote: > > > > > Other poeple have been finding ways to contribute to the technical > > > discussion; just calling things "ugly and broken" does not. > > > > Kent, calm down please. We call things "ugly and broken" all the > > time. That's just an opinion, you are free to argue it, and no need > > to take it personally. > > It's an entirely subjective judgement that has no place in a discussion > where we're trying to decide things on technical merits. Software is like architecture. It must not collapse, but to function well it also needs to look good. That's especially relevant to APIs, and yes, it's a matter of taste and can be very subjective. On this particular point (STATX_ATTR_INUM_NOT_UNIQUE) I can completely understand Christian's reaction. But if you think this would be a useful feature, please state your technical argument. Thanks, Miklos
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 04:38:29PM +0100, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 16:35, Kent Overstreet > wrote: > > > Other poeple have been finding ways to contribute to the technical > > discussion; just calling things "ugly and broken" does not. > > Kent, calm down please. We call things "ugly and broken" all the > time. That's just an opinion, you are free to argue it, and no need > to take it personally. It's an entirely subjective judgement that has no place in a discussion where we're trying to decide things on technical merits.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, 12 Dec 2023 at 16:35, Kent Overstreet wrote: > Other poeple have been finding ways to contribute to the technical > discussion; just calling things "ugly and broken" does not. Kent, calm down please. We call things "ugly and broken" all the time. That's just an opinion, you are free to argue it, and no need to take it personally. Thanks, Miklos
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 04:29:09PM +0100, Christian Brauner wrote: > On Tue, Dec 12, 2023 at 10:16:31AM -0500, Kent Overstreet wrote: > > On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote: > > > On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote: > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > > > > same inode number > > > > > > This is just ugly with questionable value. A constant reminder of how > > > broken this is. Exposing the subvolume id also makes this somewhat > > > redundant. > > > > Oh hell no. We finally get a reasonably productive discussion and I wake > > up to you calling things ugly and broken, with no reason or > > justification or any response to the justification that's already been > > given? > > > > Christain, that's not how we do things. Let's not turn what was a > > productive discussion into a trainwreck, OK? > > That's a major aggressive tone for no good reason. And probably not a > good way to convince anyone. I'm also not sure why you're taking a > dislike of this flag as a personal affront. No, if you're going to show up with comments like that it is entirely called for. Other poeple have been finding ways to contribute to the technical discussion; just calling things "ugly and broken" does not.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 10:16:31AM -0500, Kent Overstreet wrote: > On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote: > > On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote: > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > > > same inode number > > > > This is just ugly with questionable value. A constant reminder of how > > broken this is. Exposing the subvolume id also makes this somewhat > > redundant. > > Oh hell no. We finally get a reasonably productive discussion and I wake > up to you calling things ugly and broken, with no reason or > justification or any response to the justification that's already been > given? > > Christain, that's not how we do things. Let's not turn what was a > productive discussion into a trainwreck, OK? That's a major aggressive tone for no good reason. And probably not a good way to convince anyone. I'm also not sure why you're taking a dislike of this flag as a personal affront.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 10:35:40AM +0100, Christian Brauner wrote: > On Tue, Dec 12, 2023 at 10:28:12AM +0100, Miklos Szeredi wrote: > > On Tue, 12 Dec 2023 at 10:23, Christian Brauner wrote: > > > > > > On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote: > > > > Christian Brauner wrote: > > > > > > > > > > > > I suggest: > > > > > > > > > > > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files > > > > > > > > have the > > > > > > > > same inode number > > > > > > > > > > This is just ugly with questionable value. A constant reminder of how > > > > > broken this is. Exposing the subvolume id also makes this somewhat > > > > > redundant. > > > > > > > > There is a upcoming potential problem where even the 64-bit field I > > > > placed in > > > > statx() may be insufficient. The Auristor AFS server, for example, has > > > > a > > > > 96-bit vnode ID, but I can't properly represent this in stx_ino. > > > > Currently, I > > > > > > Is that vnode ID akin to a volume? Because if so you could just > > > piggy-back on a subvolume id field in statx() and expose it there. > > > > And how would exporting a subvolume ID and expecting userspace to take > > that into account when checking for hard links be meaningfully > > different than expecting userspace to retrieve the file handle and > > compare that? > > > > The latter would at least be a generic solution, including stacking fs > > inodes, instead of tackling just a specific corner of the problem > > space. > > So taking a step back here, please. The original motivation for this > discussion was restricted to handle btrfs - and now bcachefs as well. > Both have a concept of a subvolume so it made sense to go that route. > IOW, it wasn't originally a generic problem or pitched as such. > > Would overlayfs be able to utilize an extended inode field as well? No, the original motivation was not just btrfs and bcachefs; overlayfs fundamentally needs to export a bigger identifier than the host filesystems - pigeonhole principle, if anyone remembers their combinatorics. This applies to any filesystem which is under the hood reexporting from other filesystems; there's a lot of stuff going on in container filesystems right now, and I expect it'll come up there (you can create new identifiers if you're exporting file by file, but not if it's directory trees). And Neal brought up NFS re-exports; I think there's a compelling argument to be made that we ought to be able to round trip NFS filehandles.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 03:06:07PM +0100, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 14:48, Christian Brauner wrote: > > > Exposing the subvolume id in statx() is still fine imho. It's a concept > > shared between btrfs and bcachefs and it's pretty useful for interested > > userspace that wants to make use of these apis. > > Exposing subvolume ID should be okay, as long as it's not advertised > as a way to uniquely identify an inode. Its use should be limited to > finding subvolume boundaries. > > > > It might help to have the fh in statx, since that's easier on the > > > userspace programmer than having to deal with two interfaces (i_ino > > > won't go away for some time, because of backward compatibility). > > > OTOH I also don't like the way it would need to be shoehorned into > > > statx. > > > > No, it really doesn't belong into statx(). > > > > And besides, the file handle apis name_to_handle_at() are already > > in wider use than a lot of people think. Not just for the exportfs case > > but also for example, cgroupfs uses file handles to provide unique > > identifiers for cgroups that can be compared. > > The issue with name_to_handle_at() is its use of the old, non-unique > mount ID. Yes, yes, we can get away with > > fd = openat(dfd, path, O_PATH); > name_to_handle_at(fd, "", ..., AT_EMPTY_PATH); > statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, ...); > close(fd); > > But that's *four* syscalls instead of one... Yeah, but putting this into statx() isn't really nice imho. Once we do actually land the unique mount id thing it wouldn't be the worst thing to add name_to_handle_at2().
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 04:53:28PM +1100, Dave Chinner wrote: > Doesn't anyone else see or hear the elephant trumpeting loudly in > the middle of the room? > > I mean, we already have name_to_handle_at() for userspace to get a > unique, opaque, filesystem defined file handle for any given file. > It's the same filehandle that filesystems hand to the nfsd so nfs > clients can uniquely identify the file they are asking the nfsd to > operate on. > > The contents of these filehandles is entirely defined by the file > system and completely opaque to the user. The only thing that > parses the internal contents of the handle is the filesystem itself. > Therefore, as long as the fs encodes the information it needs into the > handle to determine what subvol/snapshot the inode belongs to when > the handle is passed back to it (e.g. from open_by_handle_at()) then > nothing else needs to care how it is encoded. > > So can someone please explain to me why we need to try to re-invent > a generic filehandle concept in statx when we already have a > have working and widely supported user API that provides exactly > this functionality? Definitely should be part of the discussion :) But I think it _does_ need to be in statx; because: - we've determined that 64 bit ino_t just isn't a future proof interface, we're having real problems with it today - statx is _the_ standard, future proofed interface for getting inode attributes - therefore, if we want userspace programmers to be using filehandles, instead of inode numbers, so there code isn't broken, we need to be providing interfaces that guide them in that direction. Even assuming we can update all the documentation to say "filehandles are the correct way to test inode uniqueness", you know at least half of programmers will stick to stx_ino instead of the filehandle if the filehandle is an extra syscall.
Re: file handle in statx
On Tue, Dec 12, 2023 at 10:10:23AM +0100, Donald Buczek wrote: > On 12/12/23 06:53, Dave Chinner wrote: > > > So can someone please explain to me why we need to try to re-invent > > a generic filehandle concept in statx when we already have a > > have working and widely supported user API that provides exactly > > this functionality? > > name_to_handle_at() is fine, but userspace could profit from being > able to retrieve the filehandle together with the other metadata in > a single system call. Can you say more? What, specifically is the application that would want to do that, and is it really in such a hot path that it would be a user-visible improveable, let aloine something that can be actually be measured? - Ted
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 09:56:45AM +0100, Christian Brauner wrote: > On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote: > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > > same inode number > > This is just ugly with questionable value. A constant reminder of how > broken this is. Exposing the subvolume id also makes this somewhat redundant. Oh hell no. We finally get a reasonably productive discussion and I wake up to you calling things ugly and broken, with no reason or justification or any response to the justification that's already been given? Christain, that's not how we do things. Let's not turn what was a productive discussion into a trainwreck, OK?
Re: [PATCH v2 0/3] Add support bcachefs filesystem
On Thu 07-12-23 20:40:08, Petr Vorel wrote: > Hi all, > > @Amir @Jan, could you please have look on fanotify failures on Bcachefs? > fanotify13.c, fanotify15.c and fanotify16.c produce many errors. > > To test, just apply this patchset and then run with LTP_SINGLE_FS_TYPE: > > # LTP_SINGLE_FS_TYPE=bcachefs ./fanotify15 I suspect some catch with fsids but we'll see... I hopefully get to this tomorrow. Honza > changes v1->v2: > Two new commits: > * lib: Add Bcachefs magic > * statx04: Skip STATX_ATTR_COMPRESSED on Bcachefs > > Petr Vorel (3): > lib: Add Bcachefs magic > lib: Add support bcachefs filesystem to .all_filesystems > statx04: Skip STATX_ATTR_COMPRESSED on Bcachefs > > include/tst_fs.h | 3 +++ > lib/tst_fs_type.c | 2 ++ > lib/tst_supported_fs_types.c | 2 ++ > testcases/kernel/syscalls/statx/statx04.c | 5 +++-- > 4 files changed, 10 insertions(+), 2 deletions(-) > > -- > 2.43.0 > -- Jan Kara SUSE Labs, CR
Re: [PATCH RFC v2 for-6.8/block 15/18] buffer: add a new helper to read sb block
On Tue 12-12-23 05:25:25, Christoph Hellwig wrote: > On Mon, Dec 11, 2023 at 10:07:53PM +0800, Yu Kuai wrote: > > +static __always_inline int buffer_uptodate_or_error(struct buffer_head *bh) > > +{ > > + /* > > +* If the buffer has the write error flag, data was failed to write > > +* out in the block. In this case, set buffer uptodate to prevent > > +* reading old data. > > +*/ > > + if (buffer_write_io_error(bh)) > > + set_buffer_uptodate(bh); > > + return buffer_uptodate(bh); > > +} > > So - risking this blows up into a lot of nasty work: Why do we even > clear the uptodate flag on write errors? Doing so makes not sense to > me as the data isn't any less uptodate just because we failed to write > it.. Historic reasons I'd say (buffer_write_io_error isn't *that* old - from 2003 it seems). And yes, it would make a lot of sense to keep uptodate flag set and just rely on buffer_write_io_error() but it also means going through all buffer_uptodate() checks in filesystems and determining which need changing to buffer_write_io_error() which is something nobody is keen on doing ;) Honza -- Jan Kara SUSE Labs, CR
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, 12 Dec 2023 at 14:48, Christian Brauner wrote: > Exposing the subvolume id in statx() is still fine imho. It's a concept > shared between btrfs and bcachefs and it's pretty useful for interested > userspace that wants to make use of these apis. Exposing subvolume ID should be okay, as long as it's not advertised as a way to uniquely identify an inode. Its use should be limited to finding subvolume boundaries. > > It might help to have the fh in statx, since that's easier on the > > userspace programmer than having to deal with two interfaces (i_ino > > won't go away for some time, because of backward compatibility). > > OTOH I also don't like the way it would need to be shoehorned into > > statx. > > No, it really doesn't belong into statx(). > > And besides, the file handle apis name_to_handle_at() are already > in wider use than a lot of people think. Not just for the exportfs case > but also for example, cgroupfs uses file handles to provide unique > identifiers for cgroups that can be compared. The issue with name_to_handle_at() is its use of the old, non-unique mount ID. Yes, yes, we can get away with fd = openat(dfd, path, O_PATH); name_to_handle_at(fd, "", ..., AT_EMPTY_PATH); statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, ...); close(fd); But that's *four* syscalls instead of one... Thanks, Miklos
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 10:42:58AM +0100, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 10:35, Christian Brauner wrote: > > > So taking a step back here, please. The original motivation for this > > discussion was restricted to handle btrfs - and now bcachefs as well. > > Both have a concept of a subvolume so it made sense to go that route. > > IOW, it wasn't originally a generic problem or pitched as such. > > > > Would overlayfs be able to utilize an extended inode field as well? > > Well, yes, but I don't think that's the right solution. Exposing the subvolume id in statx() is still fine imho. It's a concept shared between btrfs and bcachefs and it's pretty useful for interested userspace that wants to make use of these apis. > > I think the right solution is to move to using file handles instead of > st_ino, the problem with that is that there's no way kernel can force > upgrading userspace. That's not our job tbh. I get why this is desirable. What we can do is advertise better and newer apis but we don't have to make unpleasant compromises in our interfaces for that. > > It might help to have the fh in statx, since that's easier on the > userspace programmer than having to deal with two interfaces (i_ino > won't go away for some time, because of backward compatibility). > OTOH I also don't like the way it would need to be shoehorned into > statx. No, it really doesn't belong into statx(). And besides, the file handle apis name_to_handle_at() are already in wider use than a lot of people think. Not just for the exportfs case but also for example, cgroupfs uses file handles to provide unique identifiers for cgroups that can be compared.
Re: [PATCH RFC v2 for-6.8/block 15/18] buffer: add a new helper to read sb block
On Mon, Dec 11, 2023 at 10:07:53PM +0800, Yu Kuai wrote: > +static __always_inline int buffer_uptodate_or_error(struct buffer_head *bh) > +{ > + /* > + * If the buffer has the write error flag, data was failed to write > + * out in the block. In this case, set buffer uptodate to prevent > + * reading old data. > + */ > + if (buffer_write_io_error(bh)) > + set_buffer_uptodate(bh); > + return buffer_uptodate(bh); > +} So - risking this blows up into a lot of nasty work: Why do we even clear the uptodate flag on write errors? Doing so makes not sense to me as the data isn't any less uptodate just because we failed to write it..
Re: [PATCH RFC v2 for-6.8/block 17/18] ext4: remove block_device_ejected()
Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis
> +void invalidate_bdev_range(struct block_device *bdev, pgoff_t start, > +pgoff_t end) > +{ > + invalidate_mapping_pages(bdev->bd_inode->i_mapping, start, end); > +} > +EXPORT_SYMBOL_GPL(invalidate_bdev_range); Can we have kerneldoc comments for the new helpers please? > +struct folio *__bdev_get_folio(struct block_device *bdev, loff_t pos, > +fgf_t fgp_flags, gfp_t gfp) > +{ > + return __filemap_get_folio(bdev->bd_inode->i_mapping, pos >> PAGE_SHIFT, > +fgp_flags, gfp); > +} > +EXPORT_SYMBOL_GPL(__bdev_get_folio); It's a bit silly to have a __-prefixed API without a version that doesn't have the prefix, so I'd prefer to drop it. Unless willy has a good argument for keeping it the same as the filemap API.
Re: [PATCH RFC v2 for-6.8/block 01/18] block: add some bdev apis
On Mon, Dec 11, 2023 at 05:52:17PM +0100, Jan Kara wrote: > > +void bdev_associated_mapping(struct block_device *bdev, > > +struct address_space *mapping) > > +{ > > + mapping->host = bdev->bd_inode; > > +} > > Here I'm not sure - is the helper really a win? It seems a bit obscure to > me. This initialization of another mapping for a bdev looks really special. If we want to hide bd_inode we'll something like this helper even if I don't particularly like it either. But it might be a good idea to move out of this series and into the follow on removing bd_inode, as it's rather pointless without that context.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
Christian Brauner wrote: > > There is a upcoming potential problem where even the 64-bit field I placed > > in statx() may be insufficient. The Auristor AFS server, for example, has > > a 96-bit vnode ID, but I can't properly represent this in stx_ino. > > Currently, I > > Is that vnode ID akin to a volume? Because if so you could just > piggy-back on a subvolume id field in statx() and expose it there. No. The volume ID is the ID of the volume. The vnode is the equivalent of an inode. > > just truncate the value to fit and hope that the discarded part will be all > > zero, but that's not really a good thing to do - especially when stx_ino is > > used programmatically to check for hardlinks. > > > > Would it be better to add an 'stx_ino_2' field and corresponding flag? > > Would this be meaningfully different from using a file handle? There's also the matter of presenting the "inode number" to the user - "ls -i" for example. David
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, 12 Dec 2023 at 10:35, Christian Brauner wrote: > So taking a step back here, please. The original motivation for this > discussion was restricted to handle btrfs - and now bcachefs as well. > Both have a concept of a subvolume so it made sense to go that route. > IOW, it wasn't originally a generic problem or pitched as such. > > Would overlayfs be able to utilize an extended inode field as well? Well, yes, but I don't think that's the right solution. I think the right solution is to move to using file handles instead of st_ino, the problem with that is that there's no way kernel can force upgrading userspace. It might help to have the fh in statx, since that's easier on the userspace programmer than having to deal with two interfaces (i_ino won't go away for some time, because of backward compatibility). OTOH I also don't like the way it would need to be shoehorned into statx. Thanks, Miklos
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 10:28:12AM +0100, Miklos Szeredi wrote: > On Tue, 12 Dec 2023 at 10:23, Christian Brauner wrote: > > > > On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote: > > > Christian Brauner wrote: > > > > > > > > > > I suggest: > > > > > > > > > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have > > > > > > > the > > > > > > > same inode number > > > > > > > > This is just ugly with questionable value. A constant reminder of how > > > > broken this is. Exposing the subvolume id also makes this somewhat > > > > redundant. > > > > > > There is a upcoming potential problem where even the 64-bit field I > > > placed in > > > statx() may be insufficient. The Auristor AFS server, for example, has a > > > 96-bit vnode ID, but I can't properly represent this in stx_ino. > > > Currently, I > > > > Is that vnode ID akin to a volume? Because if so you could just > > piggy-back on a subvolume id field in statx() and expose it there. > > And how would exporting a subvolume ID and expecting userspace to take > that into account when checking for hard links be meaningfully > different than expecting userspace to retrieve the file handle and > compare that? > > The latter would at least be a generic solution, including stacking fs > inodes, instead of tackling just a specific corner of the problem > space. So taking a step back here, please. The original motivation for this discussion was restricted to handle btrfs - and now bcachefs as well. Both have a concept of a subvolume so it made sense to go that route. IOW, it wasn't originally a generic problem or pitched as such. Would overlayfs be able to utilize an extended inode field as well?
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, 12 Dec 2023 at 10:23, Christian Brauner wrote: > > On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote: > > Christian Brauner wrote: > > > > > > > > I suggest: > > > > > > > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > > > > same inode number > > > > > > This is just ugly with questionable value. A constant reminder of how > > > broken this is. Exposing the subvolume id also makes this somewhat > > > redundant. > > > > There is a upcoming potential problem where even the 64-bit field I placed > > in > > statx() may be insufficient. The Auristor AFS server, for example, has a > > 96-bit vnode ID, but I can't properly represent this in stx_ino. > > Currently, I > > Is that vnode ID akin to a volume? Because if so you could just > piggy-back on a subvolume id field in statx() and expose it there. And how would exporting a subvolume ID and expecting userspace to take that into account when checking for hard links be meaningfully different than expecting userspace to retrieve the file handle and compare that? The latter would at least be a generic solution, including stacking fs inodes, instead of tackling just a specific corner of the problem space. Thanks, Miklos
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 09:10:47AM +, David Howells wrote: > Christian Brauner wrote: > > > > > > I suggest: > > > > > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > > > same inode number > > > > This is just ugly with questionable value. A constant reminder of how > > broken this is. Exposing the subvolume id also makes this somewhat > > redundant. > > There is a upcoming potential problem where even the 64-bit field I placed in > statx() may be insufficient. The Auristor AFS server, for example, has a > 96-bit vnode ID, but I can't properly represent this in stx_ino. Currently, I Is that vnode ID akin to a volume? Because if so you could just piggy-back on a subvolume id field in statx() and expose it there. > just truncate the value to fit and hope that the discarded part will be all > zero, but that's not really a good thing to do - especially when stx_ino is > used programmatically to check for hardlinks. > > Would it be better to add an 'stx_ino_2' field and corresponding flag? Would this be meaningfully different from using a file handle?
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
Christian Brauner wrote: > > > > I suggest: > > > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > > same inode number > > This is just ugly with questionable value. A constant reminder of how > broken this is. Exposing the subvolume id also makes this somewhat redundant. There is a upcoming potential problem where even the 64-bit field I placed in statx() may be insufficient. The Auristor AFS server, for example, has a 96-bit vnode ID, but I can't properly represent this in stx_ino. Currently, I just truncate the value to fit and hope that the discarded part will be all zero, but that's not really a good thing to do - especially when stx_ino is used programmatically to check for hardlinks. Would it be better to add an 'stx_ino_2' field and corresponding flag? David
Re: file handle in statx
On 12/12/23 06:53, Dave Chinner wrote: > So can someone please explain to me why we need to try to re-invent > a generic filehandle concept in statx when we already have a > have working and widely supported user API that provides exactly > this functionality? name_to_handle_at() is fine, but userspace could profit from being able to retrieve the filehandle together with the other metadata in a single system call. One argument regarding stx_vol was that userspace tools, which walk the filesystem tree, might want to avoid to run blindly into snapshots. Additionally, if a volume defines the namespace for a set of unique inode numbers, the tools could continue to use inode numbers just on a per-volume basis. Best Donald > Cheers, > > Dave. -- Donald Buczek buc...@molgen.mpg.de Tel: +49 30 8413 1433
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 01:13:07PM +1100, NeilBrown wrote: > On Tue, 12 Dec 2023, Kent Overstreet wrote: > > On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote: > > > On Tue, 12 Dec 2023, Kent Overstreet wrote: > > > > NFSv4 specs that for the maximum size? That is pretty hefty... > > > > > > It is - but it needs room to identify the filesystem and it needs to be > > > stable across time. That need is more than a local filesystem needs. > > > > > > NFSv2 allowed 32 bytes which is enough for a 16 byte filesys uuid, 8 > > > byte inum and 8byte generation num. But only just. > > > > > > NFSv3 allowed 64 bytes which was likely plenty for (nearly?) every > > > situation. > > > > > > NFSv4 doubled it again because who knows. "why not" I guess. > > > Linux nfsd typically uses 20 or 28 bytes plus whatever the filesystem > > > wants. (28 when the export point is not the root of the filesystem). > > > I suspect this always fits within an NFSv3 handle except when > > > re-exporting an NFS filesystem. NFS re-export is an interesting case... > > > > Now I'm really curious - i_generation wasn't enough? Are we including > > filesystem UUIDs? > > i_generation was invented so that it could be inserted into the NFS > fileshandle. > > The NFS filehandle is opaque. It likely contains an inode number, a > generation number, and a filesystem identifier. But it is not possible > to extract those from the handle. > > > > > I suppose if we want to be able to round trip this stuff we do need to > > allocate space for it, even if a local filesystem would never include > > it. > > > > > I suggest: > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > same inode number > > > > > > > > > __u64 stx_vol Volume identifier. Two files with same stx_vol and > > >stx_ino MUST be the same. Exact meaning of volumes > > >is filesys-specific > > > > NFS reexport that you mentioned previously makes it seem like this > > guarantee is impossible to provide in general (so I'd leave it out > > entirely, it's just something for people to trip over). > > NFS would not set stx_vol and would not return STATX_VOL in stx_mask. > So it would not attempt to provide that guarantee. > > Maybe we don't need to explicitly make this guarantee. > > > > > But we definitely want stx_vol in there. Another thing that people ask > > for is a way to ask "is this a subvolume root?" - we should make sure > > that's clearly specified, or can we just include a bit for it? > > The start way to test for a filesystem root - or mount point at least - > is to stat the directory in question and its parent (..) and see if the > have the same st_dev or not. It depends. If you want to figure out whether it's a different filesystem or a different btrfs subvolume then yes, this generally works because of changing device ids. But it doesn't work for bind-mounts as they don't change device numbers. But maybe you and I are using mount point differently here. > Applying the same logic to volumes means that a single stx_vol number is > sufficient. Yes, that would generally work. > > I'm not strongly against a STATX_ATTR_VOL_ROOT flag providing everyone > agrees what it means that we cannot imagine any awkward corner-cases > (like a 'root' being different from a 'mount point'). I feel like you might have missed my previous mails where I strongly argued for the addition of STATX_ATTR_SUBVOLUME_ROOT: https://lore.kernel.org/linux-btrfs/20231108-herleiten-bezwangen-ffb2821f539e@brauner The concept of a subvolume root and of a mount point should be kept separate. Christoph tried mapping subvolumes to vfsmounts, something that I (and Al) vehemently oppose for various reasons outlined in that and other long threads. I still think that we should start with exposing subvolume id first.
Re: file handle in statx (was: Re: How to cope with subvolumes and snapshots on muti-user systems?)
On Tue, Dec 12, 2023 at 08:32:55AM +0200, Amir Goldstein wrote: > On Tue, Dec 12, 2023 at 7:53 AM Dave Chinner wrote: > > > > On Tue, Dec 12, 2023 at 11:59:51AM +1100, NeilBrown wrote: > > > On Tue, 12 Dec 2023, Kent Overstreet wrote: > > > > On Tue, Dec 12, 2023 at 10:53:07AM +1100, NeilBrown wrote: > > > > > On Tue, 12 Dec 2023, Kent Overstreet wrote: > > > > > > On Tue, Dec 12, 2023 at 09:43:27AM +1100, NeilBrown wrote: > > > > > > > On Sat, 09 Dec 2023, Kent Overstreet wrote: > > > > > > Thoughts? > > > > > > > > > > > > > > > > I'm completely in favour of exporting the (full) filehandle through > > > > > statx. (If the application asked for the filehandle, it will expect a > > > > > larger structure to be returned. We don't need to use the currently > > > > > reserved space). > > > > > > > > > > I'm completely in favour of updating user-space tools to use the > > > > > filehandle to check if two handles are for the same file. > > > > > > > > > > I'm not in favour of any filesystem depending on this for correct > > > > > functionality today. As long as the filesystem isn't so large that > > > > > inum+volnum simply cannot fit in 64 bits, we should make a reasonable > > > > > effort to present them both in 64 bits. Depending on the filehandle > > > > > is a > > > > > good plan for long term growth, not for basic functionality today. > > > > > > > > My standing policy in these situations is that I'll do the stopgap/hacky > > > > measure... but not before doing actual, real work on the longterm > > > > solution :) > > > > > > Eminently sensible. > > > > > > > > > > > So if we're all in favor of statx as the real long term solution, how > > > > about we see how far we get with that? > > > > > > > > > > I suggest: > > > > > > STATX_ATTR_INUM_NOT_UNIQUE - it is possible that two files have the > > > same inode number This is just ugly with questionable value. A constant reminder of how broken this is. Exposing the subvolume id also makes this somewhat redundant. > > > > > > > > > __u64 stx_vol Volume identifier. Two files with same stx_vol and > > >stx_ino MUST be the same. Exact meaning of volumes > > >is filesys-specific Exposing this makes sense. I've mentioned that I'd be open to exporting the subvolume id in statx before even though I know some people don't like that. I still stand by that. Especially now that we have btrfs and bcachefs that both have the concept of a subvolume id it makes sense to put it into statx(). > > > > > > STATX_VOL Want stx_vol > > > > > > __u8 stx_handle_len Length of stx_handle if present > > > __u8 stx_handle[128] Unique stable identifier for this file. Will > > >NEVER be reused for a different file. > > >This appears AFTER __statx_pad2, beyond > > >the current 'struct statx'. > > > STATX_HANDLE Want stx_handle_len and stx_handle. Buffer for > > >receiving statx info has at least > > >sizeof(struct statx)+128 bytes. No, we're should not be reinventing another opaque handle. And even if it really doesn't belong into statx(). > > > > Hmmm. > > > > Doesn't anyone else see or hear the elephant trumpeting loudly in > > the middle of the room? > > > > I mean, we already have name_to_handle_at() for userspace to get a > > unique, opaque, filesystem defined file handle for any given file. I agree. > > It's the same filehandle that filesystems hand to the nfsd so nfs > > clients can uniquely identify the file they are asking the nfsd to > > operate on. > > > > The contents of these filehandles is entirely defined by the file > > system and completely opaque to the user. The only thing that > > parses the internal contents of the handle is the filesystem itself. > > Therefore, as long as the fs encodes the information it needs into the > > handle to determine what subvol/snapshot the inode belongs to when > > the handle is passed back to it (e.g. from open_by_handle_at()) then > > nothing else needs to care how it is encoded. > > > > So can someone please explain to me why we need to try to re-invent > > a generic filehandle concept in statx when we already have a > > have working and widely supported user API that provides exactly > > this functionality? > > Yeh. > > Not to mention that since commit 64343119d7b8 ("exportfs: support encoding > non-decodeable file handles by default"), exporting file handles as strong > object identifiers is not limited to filesystems that support NFS export. > > All fs have a default implementation of encode_fh() by encoding a file id > of type FILEID_INO64_GEN from { i_ino, i_generation } and any fs can > define its own encode_fh() operation (e.g. to include subvol id) even without > implementing the decode fh operations needed for NFS export. I also would like both btrfs and bcachefs to agree.