Re: [PATCH 86/87] fs: switch timespec64 fields in inode to discrete integers

2023-09-28 Thread Amir Goldstein
On Fri, Sep 29, 2023 at 3:19 AM Linus Torvalds
 wrote:
...
> So yes, real programs to cache stat information, and it matters for 
> performance.
>
> But I don't think any actual reasonable program will have
> *correctness* issues, though -

I beg to disagree.

> because there are certainly filesystems
> out there that don't do nanosecond resolution (and other operations
> like copying trees around will obviously also change times).
>
> Anybody doing steganography in the timestamps is already not going to
> have a great time, really.
>

Your thesis implies that all applications are portable across different
filesystems and all applications are expected to cope with copying
trees around.

There are applications that work on specific filesystems and those
applications are very much within sanity if they expect that past
observed values of nsec will not to change if the file was not changed.

But even if we agree that will "only" hurt performance, your example of
performance hit (10s of git diff) is nowhere close to the performance
hit of invalidating the mtime cache of billions of files at once (i.e. after
kernel upgrade), which means that rsync-like programs need to
re-read all the data from remote locations.

I am not saying that filesystems cannot decide to *stop storing nsec
granularity* from this day forth, but like btrfs pre-historic timestamps,
those fs have an obligation to preserve existing metadata, unless
users opted to throw it away.

OTOH, it is perfectly fine if the vfs wants to stop providing sub 100ns
services to filesystems. It's just going to be the fs problem and the
preserved pre-historic/fine-grained time on existing files would only
need to be provided in getattr(). It does not need to be in __i_mtime.

Thanks,
Amir.


Re: [PATCH 86/87] fs: switch timespec64 fields in inode to discrete integers

2023-09-28 Thread Amir Goldstein
On Thu, Sep 28, 2023 at 8:19 PM Darrick J. Wong  wrote:
>
> On Thu, Sep 28, 2023 at 01:06:03PM -0400, Jeff Layton wrote:
> > On Thu, 2023-09-28 at 11:48 -0400, Arnd Bergmann wrote:
> > > On Thu, Sep 28, 2023, at 07:05, Jeff Layton wrote:
> > > > This shaves 8 bytes off struct inode, according to pahole.
> > > >
> > > > Signed-off-by: Jeff Layton 
> > >
> > > FWIW, this is similar to the approach that Deepa suggested
> > > back in 2016:
> > >
> > > https://lore.kernel.org/lkml/1452144972-15802-3-git-send-email-deepa.ker...@gmail.com/
> > >
> > > It was NaKed at the time because of the added complexity,
> > > though it would have been much easier to do it then,
> > > as we had to touch all the timespec references anyway.
> > >
> > > The approach still seems ok to me, but I'm not sure it's worth
> > > doing it now if we didn't do it then.
> > >
> >
> > I remember seeing those patches go by. I don't remember that change
> > being NaK'ed, but I wasn't paying close attention at the time
> >
> > Looking at it objectively now, I think it's worth it to recover 8 bytes
> > per inode and open a 4 byte hole that Amir can use to grow the
> > i_fsnotify_mask. We might even able to shave off another 12 bytes
> > eventually if we can move to a single 64-bit word per timestamp.
>
> I don't think you can, since btrfs timestamps utilize s64 seconds
> counting in both directions from the Unix epoch.  They also support ns
> resolution:
>
> struct btrfs_timespec {
> __le64 sec;
> __le32 nsec;
> } __attribute__ ((__packed__));
>
> --D
>

Sure we can.
That's what btrfs_inode is for.
vfs inode also does not store i_otime (birth time) and there is even a
precedent of vfs/btrfs variable size mismatch:

/* full 64 bit generation number, struct vfs_inode doesn't have a big
 * enough field for this.
 */
u64 generation;

If we decide that vfs should use "bigtime", btrfs pre-historic
timestamps are not a show stopper.

Thanks,
Amir.


Re: [PATCH 87/87] fs: move i_blocks up a few places in struct inode

2023-09-28 Thread Amir Goldstein
On Thu, Sep 28, 2023 at 2:06 PM Jeff Layton  wrote:
>
> The recent change to use discrete integers instead of struct timespec64
> in struct inode shaved 8 bytes off of it, but it also moves the i_lock
> into the previous cacheline, away from the fields that it protects.
>
> Move i_blocks up above the i_lock, which moves the new 4 byte hole to
> just after the timestamps, without changing the size of the structure.
>

Instead of creating an implicit hole, can you please move i_generation
to fill the 4 bytes hole.

It makes sense in the same cache line with i_ino and I could
use the vacant 4 bytes hole above i_fsnotify_mask to expand the
mask to 64bit (the 32bit event mask space is running out).

Thanks,
Amir.

> Signed-off-by: Jeff Layton 
> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index de902ff2938b..3e0fe0f52e7c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -677,11 +677,11 @@ struct inode {
> u32 i_atime_nsec;
> u32 i_mtime_nsec;
> u32 i_ctime_nsec;
> +   blkcnt_ti_blocks;
> spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
> unsigned short  i_bytes;
> u8  i_blkbits;
> u8  i_write_hint;
> -   blkcnt_ti_blocks;
>
>  #ifdef __NEED_I_SIZE_ORDERED
> seqcount_t  i_size_seqcount;
> --
> 2.41.0
>


Re: [RFC PATCH v2] fs/xattr: add *at family syscalls

2023-05-15 Thread Amir Goldstein
On Mon, May 15, 2023 at 4:52 PM Christian Brauner  wrote:
>
> On Mon, May 15, 2023 at 04:04:21PM +0300, Amir Goldstein wrote:
> > On Mon, May 15, 2023 at 1:33 PM Christian Brauner  
> > wrote:
> > >
> > > On Thu, May 11, 2023 at 05:08:02PM +0200, Christian Göttsche wrote:
> > > > Add the four syscalls setxattrat(), getxattrat(), listxattrat() and
> > > > removexattrat().  Those can be used to operate on extended attributes,
> > > > especially security related ones, either relative to a pinned directory
> > > > or on a file descriptor without read access, avoiding a
> > > > /proc//fd/ detour, requiring a mounted procfs.
> > > >
> > > > One use case will be setfiles(8) setting SELinux file contexts
> > > > ("security.selinux") without race conditions.
> > > >
> > > > Add XATTR flags to the private namespace of AT_* flags.
> > > >
> > > > Use the do_{name}at() pattern from fs/open.c.
> > > >
> > > > Use a single flag parameter for extended attribute flags (currently
> > > > XATTR_CREATE and XATTR_REPLACE) and *at() flags to not exceed six
> > > > syscall arguments in setxattrat().
> > > >
> > > > Previous approach ("f*xattr: allow O_PATH descriptors"): 
> > > > https://lore.kernel.org/all/20220607153139.35588-1-cgzo...@googlemail.com/
> > > > v1 discussion: 
> > > > https://lore.kernel.org/all/20220830152858.14866-2-cgzo...@googlemail.com/
> > > >
> > > > Signed-off-by: Christian Göttsche 
> > > > CC: x...@kernel.org
> > > > CC: linux-al...@vger.kernel.org
> > > > CC: linux-ker...@vger.kernel.org
> > > > CC: linux-arm-ker...@lists.infradead.org
> > > > CC: linux-i...@vger.kernel.org
> > > > CC: linux-m...@lists.linux-m68k.org
> > > > CC: linux-m...@vger.kernel.org
> > > > CC: linux-par...@vger.kernel.org
> > > > CC: linuxppc-dev@lists.ozlabs.org
> > > > CC: linux-s...@vger.kernel.org
> > > > CC: linux...@vger.kernel.org
> > > > CC: sparcli...@vger.kernel.org
> > > > CC: linux-fsde...@vger.kernel.org
> > > > CC: au...@vger.kernel.org
> > > > CC: linux-a...@vger.kernel.org
> > > > CC: linux-...@vger.kernel.org
> > > > CC: linux-security-mod...@vger.kernel.org
> > > > CC: seli...@vger.kernel.org
> > > > ---
> > >
> > > Fwiw, your header doesn't let me see who the mail was directly sent to
> > > so I'm only able to reply to lists which is a bit pointless...
> > >
> > > > v2:
> > > >   - squash syscall introduction and wire up commits
> > > >   - add AT_XATTR_CREATE and AT_XATTR_REPLACE constants
> > >
> > > > +#define AT_XATTR_CREATE  0x1 /* setxattrat(2): set 
> > > > value, fail if attr already exists */
> > > > +#define AT_XATTR_REPLACE 0x2 /* setxattrat(2): set value, fail 
> > > > if attr does not exist */
> > >
> > > We really shouldn't waste any AT_* flags for this. Otherwise we'll run
> > > out of them rather quickly. Two weeks ago we added another AT_* flag
> > > which is up for merging for v6.5 iirc and I've glimpsed another AT_*
> > > flag proposal in one of the talks at last weeks Vancouver conference
> > > extravaganza.
> > >
> > > Even if we reuse 0x200 for AT_XATTR_CREATE (like we did for AT_EACCESS
> > > and AT_REMOVEDIR) we still need another bit for AT_XATTR_REPLACE.
> > >
> > > Plus, this is really ugly since AT_XATTR_{CREATE,REPLACE} really isn't
> > > in any way related to lookup and we're mixing it in with lookup
> > > modifying flags.
> > >
> > > So my proposal for {g,s}etxattrat() would be:
> > >
> > > struct xattr_args {
> > > __aligned_u64 value;
> > > __u32 size;
> > > __u32 cmd;
> > > };
> > >
> > > So everything's nicely 64bit aligned in the struct. Use the @cmd member
> > > to set either XATTR_REPLACE or XATTR_CREATE and treat it as a proper
> > > enum and not as a flag argument like the old calls did.
> > >
> > > So then we'd have:
> > >
> > > setxattrat(int dfd, const char *path, const char __user *name,
> > >struct xattr_args __user *args, size_t size, unsigned int 
> > > flags)
> > > getxattrat(int dfd, const char *path, const char __user *name,
> > >  

Re: [RFC PATCH v2] fs/xattr: add *at family syscalls

2023-05-15 Thread Amir Goldstein
On Mon, May 15, 2023 at 1:33 PM Christian Brauner  wrote:
>
> On Thu, May 11, 2023 at 05:08:02PM +0200, Christian Göttsche wrote:
> > Add the four syscalls setxattrat(), getxattrat(), listxattrat() and
> > removexattrat().  Those can be used to operate on extended attributes,
> > especially security related ones, either relative to a pinned directory
> > or on a file descriptor without read access, avoiding a
> > /proc//fd/ detour, requiring a mounted procfs.
> >
> > One use case will be setfiles(8) setting SELinux file contexts
> > ("security.selinux") without race conditions.
> >
> > Add XATTR flags to the private namespace of AT_* flags.
> >
> > Use the do_{name}at() pattern from fs/open.c.
> >
> > Use a single flag parameter for extended attribute flags (currently
> > XATTR_CREATE and XATTR_REPLACE) and *at() flags to not exceed six
> > syscall arguments in setxattrat().
> >
> > Previous approach ("f*xattr: allow O_PATH descriptors"): 
> > https://lore.kernel.org/all/20220607153139.35588-1-cgzo...@googlemail.com/
> > v1 discussion: 
> > https://lore.kernel.org/all/20220830152858.14866-2-cgzo...@googlemail.com/
> >
> > Signed-off-by: Christian Göttsche 
> > CC: x...@kernel.org
> > CC: linux-al...@vger.kernel.org
> > CC: linux-ker...@vger.kernel.org
> > CC: linux-arm-ker...@lists.infradead.org
> > CC: linux-i...@vger.kernel.org
> > CC: linux-m...@lists.linux-m68k.org
> > CC: linux-m...@vger.kernel.org
> > CC: linux-par...@vger.kernel.org
> > CC: linuxppc-dev@lists.ozlabs.org
> > CC: linux-s...@vger.kernel.org
> > CC: linux...@vger.kernel.org
> > CC: sparcli...@vger.kernel.org
> > CC: linux-fsde...@vger.kernel.org
> > CC: au...@vger.kernel.org
> > CC: linux-a...@vger.kernel.org
> > CC: linux-...@vger.kernel.org
> > CC: linux-security-mod...@vger.kernel.org
> > CC: seli...@vger.kernel.org
> > ---
>
> Fwiw, your header doesn't let me see who the mail was directly sent to
> so I'm only able to reply to lists which is a bit pointless...
>
> > v2:
> >   - squash syscall introduction and wire up commits
> >   - add AT_XATTR_CREATE and AT_XATTR_REPLACE constants
>
> > +#define AT_XATTR_CREATE  0x1 /* setxattrat(2): set value, 
> > fail if attr already exists */
> > +#define AT_XATTR_REPLACE 0x2 /* setxattrat(2): set value, fail if 
> > attr does not exist */
>
> We really shouldn't waste any AT_* flags for this. Otherwise we'll run
> out of them rather quickly. Two weeks ago we added another AT_* flag
> which is up for merging for v6.5 iirc and I've glimpsed another AT_*
> flag proposal in one of the talks at last weeks Vancouver conference
> extravaganza.
>
> Even if we reuse 0x200 for AT_XATTR_CREATE (like we did for AT_EACCESS
> and AT_REMOVEDIR) we still need another bit for AT_XATTR_REPLACE.
>
> Plus, this is really ugly since AT_XATTR_{CREATE,REPLACE} really isn't
> in any way related to lookup and we're mixing it in with lookup
> modifying flags.
>
> So my proposal for {g,s}etxattrat() would be:
>
> struct xattr_args {
> __aligned_u64 value;
> __u32 size;
> __u32 cmd;
> };
>
> So everything's nicely 64bit aligned in the struct. Use the @cmd member
> to set either XATTR_REPLACE or XATTR_CREATE and treat it as a proper
> enum and not as a flag argument like the old calls did.
>
> So then we'd have:
>
> setxattrat(int dfd, const char *path, const char __user *name,
>struct xattr_args __user *args, size_t size, unsigned int flags)
> getxattrat(int dfd, const char *path, const char __user *name,
>struct xattr_args __user *args, size_t size, unsigned int flags)
>
> The current in-kernel struct xattr_ctx would be renamed to struct
> kernel_xattr_args and then we do the usual copy_struct_from_user()
> dance:
>
> struct xattr_args args;
> err = copy_struct_from_user(, sizeof(args), uargs, usize);
>
> and then go on to handle value/size for setxattrat()/getxattrat()
> accordingly.
>
> getxattr()/setxattr() aren't meaningfully filterable by seccomp already
> so there's not point in not using a struct.
>
> If that isn't very appealing then another option is to add a new flag
> namespace just for setxattrat() similar to fspick() and move_mount()
> duplicating the needed lookup modifying flags.
> Thoughts?

Here is a thought: I am not sure if I am sorry we did not discuss this API
issue in LSFMM or happy that we did not waste our time on this... :-/

I must say that I dislike redefined flag namespace like FSPICK_*
just as much as I dislike overloading the AT_* namespace and TBH,
I am not crazy about avoiding this problem with xattr_args either.

A more sane solution IMO could have been:
- Use lower word of flags for generic AT_ flags
- Use the upper word of flags for syscall specific flags

So if it were up to me, I would vote starting this practice:

+ /* Start of syscall specific range */
+ #define AT_XATTR_CREATE   0x1 /* setxattrat(2): set
value, fail if attr already exists */
+ #define AT_XATTR_REPLACE 0x2 /*