Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Steven Rostedt
On Fri, 12 Jan 2024 08:53:44 -0500
Steven Rostedt  wrote:

> > // We managed to open the directory so we have permission to list
> > // directory entries in "xfs".
> > fd = open("/sys/kernel/tracing/events/xfs");
> > 
> > // Remove ownership so we can't open the directory anymore
> > chown("/sys/kernel/tracing/events/xfs", 0, 0);
> > 
> > // Or just remove exec bit for the group and restrict to owner
> > chmod("/sys/kernel/tracing/events/xfs", 700);
> > 
> > // Drop caches to force an eventfs_root_lookup() on everything
> > write("/proc/sys/vm/drop_caches", "3", 1);  
> 
> This requires opening the directory, then having it's permissions
> change, and then immediately dropping the caches.
> 
> > 
> > // Returns 0 even though directory has a lot of entries and we should be
> > // able to list them
> > getdents64(fd, ...);  
> 
> And do we care?

Hmm, maybe the issue you have is with the inconsistency of the action?

For this to fail, it would require the admin to do both change the
permission and to flush caches. If you don't flush the caches then the
task with the dir open can still read it regardless. If the dentries
were already created.

In that case I'm fine if we change the creation of the dentries to not
check the permission.

But for now, it's just a weird side effect that I don't really see how
it would affect any user's workflow.

-- Steve




Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Steven Rostedt
On Fri, 12 Jan 2024 09:27:24 +0100
Christian Brauner  wrote:

> On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> > On Thu, 11 Jan 2024 22:01:32 +0100
> > Christian Brauner  wrote:
> >   
> > > What I'm pointing out in the current logic is that the caller is
> > > taxed twice:
> > > 
> > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> > > _because_ the permission check in lookup_one_len() is the exact
> > > same permission check again that the vfs has done
> > > inode_permission(MAY_EXEC, "xfs").  
> > 
> > As I described in: 
> > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/
> > 
> > The eventfs files below "events" doesn't need the .permissions callback at
> > all. It's only there because the "events" inode uses it.
> > 
> > The .permissions call for eventfs has:  
> 
> It doesn't matter whether there's a ->permission handler. If you don't
> add one explicitly the VFS will simply call generic_permission():
> 
> inode_permission()
> -> do_inode_permission()  
>{
> if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
> if (likely(inode->i_op->permission))
> return inode->i_op->permission(idmap, inode, mask);
>
> /* This gets set once for the inode lifetime */
> spin_lock(>i_lock);
> inode->i_opflags |= IOP_FASTPERM;
> spin_unlock(>i_lock);
> }
> return generic_permission(idmap, inode, mask);
>}

Yes I know that, because that's where I knew what to call in the non
"events" dir case.

> 
> > Anyway, the issue is with "events" directory and remounting, because like
> > the tracefs system, the inode and dentry for "evnets" is created at boot
> > up, before the mount happens. The VFS layer is going to check the
> > permissions of its inode and dentry, which will be incorrect if the mount
> > was mounted with a "gid" option.  
> 
> The gid option has nothing to do with this and it is just handled fine
> if you remove the second permission checking in (2).

I guess I'm confused to what you are having an issue with. Is it just
that the permission check gets called twice?

> 
> You need to remove the inode_permission() code from
> eventfs_start_creating(). It is just an internal lookup and the fact
> that you have it in there allows userspace to break readdir on the
> eventfs portions of tracefs as I've shown in the parts of the mail that
> you cut off.

That's because I didn't see how it was related to the way I fixed the
mount=gid issue. Are you only concerned because of the check in
eventfs_start_creating()?

Yes, you posted code that would make things act funny for some code
that I see no real world use case for. Yeah, it may not act "properly"
but I'm not sure that's bad.

Here, I'll paste it back:

> // We managed to open the directory so we have permission to list
> // directory entries in "xfs".
> fd = open("/sys/kernel/tracing/events/xfs");
> 
> // Remove ownership so we can't open the directory anymore
> chown("/sys/kernel/tracing/events/xfs", 0, 0);
> 
> // Or just remove exec bit for the group and restrict to owner
> chmod("/sys/kernel/tracing/events/xfs", 700);
> 
> // Drop caches to force an eventfs_root_lookup() on everything
> write("/proc/sys/vm/drop_caches", "3", 1);

This requires opening the directory, then having it's permissions
change, and then immediately dropping the caches.

> 
> // Returns 0 even though directory has a lot of entries and we should be
> // able to list them
> getdents64(fd, ...);

And do we care?

Since tracing exposes internal kernel information, perhaps this is a
feature and not a bug. If someone who had access to the tracing system
and you wanted to stop them, if they had a directory open that they no
longer have access to, you don't want them to see what's left in the
directory.

In other words, I like the idea that the getends64(fd, ...) will fail!

If there's a file underneath that wasn't change, and the admin thought
that just keeping the top directory permissions off is good enough,
then that attacker having that directory open before the directory had
it's file permissions change is a way to still have access to the files
below it.

> 
> And the failure is in the inode_permission(MAY_EXEC, "xfs") call in
> lookup_one_len() in eventfs_start_creating() which now fails.

And I think is a good thing!

Again, tracefs is special. It gives you access and possibly control to
the kernel behavior. I like the fact that as soon as someone loses
permission to a directory, they immediately lose it.

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Christian Brauner
On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> On Thu, 11 Jan 2024 22:01:32 +0100
> Christian Brauner  wrote:
> 
> > What I'm pointing out in the current logic is that the caller is
> > taxed twice:
> > 
> > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> > _because_ the permission check in lookup_one_len() is the exact
> > same permission check again that the vfs has done
> > inode_permission(MAY_EXEC, "xfs").
> 
> As I described in: 
> https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/
> 
> The eventfs files below "events" doesn't need the .permissions callback at
> all. It's only there because the "events" inode uses it.
> 
> The .permissions call for eventfs has:

It doesn't matter whether there's a ->permission handler. If you don't
add one explicitly the VFS will simply call generic_permission():

inode_permission()
-> do_inode_permission()
   {
if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
if (likely(inode->i_op->permission))
return inode->i_op->permission(idmap, inode, mask);
   
/* This gets set once for the inode lifetime */
spin_lock(>i_lock);
inode->i_opflags |= IOP_FASTPERM;
spin_unlock(>i_lock);
}
return generic_permission(idmap, inode, mask);
   }

> Anyway, the issue is with "events" directory and remounting, because like
> the tracefs system, the inode and dentry for "evnets" is created at boot
> up, before the mount happens. The VFS layer is going to check the
> permissions of its inode and dentry, which will be incorrect if the mount
> was mounted with a "gid" option.

The gid option has nothing to do with this and it is just handled fine
if you remove the second permission checking in (2).

You need to remove the inode_permission() code from
eventfs_start_creating(). It is just an internal lookup and the fact
that you have it in there allows userspace to break readdir on the
eventfs portions of tracefs as I've shown in the parts of the mail that
you cut off.



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-11 Thread Steven Rostedt
On Thu, 11 Jan 2024 22:01:32 +0100
Christian Brauner  wrote:

> What I'm pointing out in the current logic is that the caller is
> taxed twice:
> 
> (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> (2) And again when you call lookup_one_len() in eventfs_start_creating()
> _because_ the permission check in lookup_one_len() is the exact
> same permission check again that the vfs has done
> inode_permission(MAY_EXEC, "xfs").

As I described in: 
https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/

The eventfs files below "events" doesn't need the .permissions callback at
all. It's only there because the "events" inode uses it.

The .permissions call for eventfs has:

static int eventfs_permission(struct mnt_idmap *idmap,
  struct inode *inode, int mask)
{
set_top_events_ownership(inode);
return generic_permission(idmap, inode, mask);
}

Where the "set_top_events_ownership() is a nop for everything but the
"events" directory.

I guess I could have two ops:

static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
.permission = eventfs_permission,
};

static const struct inode_operations eventfs_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
};

And use the second one for all dentries below the root, but I figured it's
not that big of a deal if I called the permissions on all. Perhaps I should
do it with two?

Anyway, the issue is with "events" directory and remounting, because like
the tracefs system, the inode and dentry for "evnets" is created at boot
up, before the mount happens. The VFS layer is going to check the
permissions of its inode and dentry, which will be incorrect if the mount
was mounted with a "gid" option.


-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-11 Thread Christian Brauner
On Wed, Jan 10, 2024 at 08:07:46AM -0500, Steven Rostedt wrote:
> On Wed, 10 Jan 2024 12:45:36 +0100
> Christian Brauner  wrote:
> 
> > So say you do:
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> > 
> > After this has returned we know everything we need to know about the new
> > tracefs instance including the ownership and the mode of all inodes in
> > /sys/kernel/tracing/instances/foo/events/* and below precisely because
> > ownership is always inherited from the parent dentry and recorded in the
> > metadata struct eventfs_inode.
> > 
> > So say someone does:
> > 
> > open("/sys/kernel/tracing/instances/foo/events/xfs");
> > 
> > and say this is the first time that someone accesses that events/
> > directory.
> > 
> > When the open pathwalk is done, the vfs will determine via
> > 
> > [1] may_lookup(inode_of(events))
> > 
> > whether you are able to list entries such as "xfs" in that directory.
> > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> > it ends up calling i_op->eventfs_root_lookup(events).
> > 
> > At this point tracefs/eventfs adds the inodes for all entries in that
> > "events" directory including "xfs" based on the metadata it recorded
> > during the mkdir. Since now someone is actually interested in them. And
> > it initializes the inodes with ownership and everything and adds the
> > dentries that belong into that directory.
> > 
> > Nothing here depends on the permissions of the caller. The only
> > permission that mattered was done in the VFS in [1]. If the caller has
> > permissions to enter a directory they can lookup and list its contents.
> > And its contents where determined/fixed etc when mkdir was called.
> > 
> > So we just need to add the required objects into the caches (inode,
> > dentry) whose addition we intentionally defered until someone actually
> > needed them.
> > 
> > So, eventfs_root_lookup() now initializes the inodes with the ownership
> > from the stored metadata or from the parent dentry and splices in inodes
> > and dentries. No permission checking is needed for this because it is
> > always a recheck of what the vfs did in [1].
> > 
> > We now return to the vfs and path walk continues to the final component
> > that you actually want to open which is that "xfs" directory in this
> > example. We check the permissions on that inode via may_open("xfs") and
> > we open that directory returning an fd to userspace ultimately.
> > 
> > (I'm going by memory since I need to step out the door.)
> 
> So, let's say we do:
> 
>  chgrp -R rostedt /sys/kernel/tracing/

The rostedt group needs exec permissions and "other" cannot have exec
permissions otherwise you can trivially list the entries even if it's
owned by root:

chmod 750 /sys/kernel/tracing

user1@localhost:~$ ls -aln /sys/kernel/ | grep tracing
drwxr-x---   6 0 10000 Jan 11 18:23 tracing

> 
> But I don't want rostedt to have access to xfs
> 
>  chgrp -R root /sys/kernel/tracing/events/xfs

chmod 750 /sys/kernel/tracing/events/xfs

user1@localhost:~$ ls -aln /sys/kernel/tracing/events/ | grep xfs
drwxr-x--- 601 0 0 0 Jan 11 18:24 xfs

This ensure that if a user is in the group and the group has exec perms
lookup is possible (For root this will usually work because
CAP_DAC_READ_SEARCH overrides the exec requirement.).

> 
> Both actions will create the inodes and dentries of all files and
> directories (because of "-R"). But once that is done, the ref counts go to
> zero. They stay around until reclaim. But then I open Chrome ;-) and it
> reclaims all the dentries and inodes, so we are back to here we were on
> boot.
> 
> Now as rostedt I do:
> 
>  ls /sys/kernel/tracing/events/xfs
> 
> The VFS layer doesn't know if I have permission to that or not, because all
> the inodes and dentries have been freed. It has to call back to eventfs to
> find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
> recreated the inodes with the proper permission.

Very roughly, ignoring most of the complexity of lookup and focussing on
the permission checking:

When a caller looks up an entry in a directory then the VFS will call
inode_permission(MAY_EXEC) on the directory the caller is trying to
perform that lookup in.

If the caller wants to lookup the "events" entry in the "tracing"
directory then the VFS will call inode_permission(MAY_EXEC, "tracing")
and then - assuming it's not in the cache - call into the lookup method
of the filesystem.

After the VFS has determined that the caller has the permissions to
lookup the "events" entry in the "tracing" directory no further
permission checks are needed.

Path lookup now moves on to the next part which is looking up the "xfs"
entry in the "events" directory. So again, VFS calls
inode_permission(MAY_EXEC, "events") finds that caller has the permissions
and then goes on to call into eventfs_root_lookup().

The VFS has already given the caller a dentry for "xfs" and is passing
that down to tracefs/eventfs. So eventfs gets

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt  wrote:

> On Wed, 10 Jan 2024 08:07:46 -0500
> Steven Rostedt  wrote:
> 
> > Or are you saying that I don't need the ".permission" callback, because
> > eventfs does it when it creates the inodes? But for eventfs to know what
> > the permissions changes are, it uses .getattr and .setattr.  
> 
> OK, if your main argument is that we do not need .permission, I agree with
> you. But that's a trivial change and doesn't affect the complexity that
> eventfs is doing. In fact, removing the "permission" check is simply this
> patch:
> 
> --
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index fdff53d5a1f8..f2af07a857e2 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
>   return 0;
>  }
>  
> -static int eventfs_permission(struct mnt_idmap *idmap,
> -   struct inode *inode, int mask)
> -{
> - set_top_events_ownership(inode);
> - return generic_permission(idmap, inode, mask);
> -}
> -
>  static const struct inode_operations eventfs_root_dir_inode_operations = {
>   .lookup = eventfs_root_lookup,
>   .setattr= eventfs_set_attr,
>   .getattr= eventfs_get_attr,
> - .permission = eventfs_permission,
>  };
>  
>  static const struct inode_operations eventfs_file_inode_operations = {
> --
> 
> I only did that because Linus mentioned it, and I thought it was needed.
> I'll apply this patch too, as it appears to work with this code.

Oh, eventfs files and directories don't need the .permissions because its
inodes and dentries are not created until accessed. But the "events"
directory itself has its dentry and inode created at boot up, but still
uses the eventfs_root_dir_inode_operations. So the .permissions is still
needed!

If you look at the "set_top_events_ownership()" function, it has:

/* The top events directory doesn't get automatically updated */
if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
return;

That is, it does nothing if the entry is not the "events" directory. It
falls back to he default "->permissions()" function for everything but the
top level "events" directory.

But this and .getattr are still needed for the events directory, because it
suffers the same issue as the other tracefs entries. That is, it's inodes
and dentries are created at boot up before it is mounted. So if the mount
has gid=1000, it will be ignored.

The .getattr is called by "stat" which ls does. So after boot up if you
just do:

 # chmod 0750 /sys/kernel/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ls /sys/kernel/tracing/events/
9pext4iomapmodule  raw_syscalls  thermal
alarmtimerfib iommumsr rcu   thp
avc   fib6io_uring napiregmaptimer
block filelockipi  neigh   regulator tlb
bpf_test_run  filemap irq  net resctrl   udp
bpf_trace ftrace  irq_matrix   netfs   rpm   virtio_gpu[
...]

The above works because "ls" does a stat() on the directory first, which
does a .getattr() call that updates the permissions of the existing "events"
directory inode.

  BUT!

If I had used my own getents() program that has:

fd = openat(AT_FDCWD, argv[1], O_RDONLY);
if (fd < 0)
perror("openat");

n = getdents64(fd, buf, BUF_SIZE);
if (n < 0)
perror("getdents64");

Where it calls the openat() without doing a stat fist, and after boot, had done:

 # chmod 0750 /sys/kernel/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ./getdents /sys/kernel/tracing/events
openat: Permission denied
getdents64: Bad file descriptor

It errors because he "events" inode permission hasn't been updated yet.
Now after getting the above error, if I do the "ls" and then run it again:

 $ ls /sys/kernel/tracing/events > /dev/null
 $ ./getdents /sys/kernel/tracing/events
enable
header_page
header_event
initcall
vsyscall
syscalls

it works!

so no, I can't remove that .permissions callback from eventfs.

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt  wrote:

> I'll apply this patch too, as it appears to work with this code.

I meant "appears to work without this code".

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 08:07:46 -0500
Steven Rostedt  wrote:

> Or are you saying that I don't need the ".permission" callback, because
> eventfs does it when it creates the inodes? But for eventfs to know what
> the permissions changes are, it uses .getattr and .setattr.

OK, if your main argument is that we do not need .permission, I agree with
you. But that's a trivial change and doesn't affect the complexity that
eventfs is doing. In fact, removing the "permission" check is simply this
patch:

--
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..f2af07a857e2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
return 0;
 }
 
-static int eventfs_permission(struct mnt_idmap *idmap,
- struct inode *inode, int mask)
-{
-   set_top_events_ownership(inode);
-   return generic_permission(idmap, inode, mask);
-}
-
 static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
-   .permission = eventfs_permission,
 };
 
 static const struct inode_operations eventfs_file_inode_operations = {
--

I only did that because Linus mentioned it, and I thought it was needed.
I'll apply this patch too, as it appears to work with this code.

Thanks!

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 12:45:36 +0100
Christian Brauner  wrote:

> So say you do:
> 
> mkdir /sys/kernel/tracing/instances/foo
> 
> After this has returned we know everything we need to know about the new
> tracefs instance including the ownership and the mode of all inodes in
> /sys/kernel/tracing/instances/foo/events/* and below precisely because
> ownership is always inherited from the parent dentry and recorded in the
> metadata struct eventfs_inode.
> 
> So say someone does:
> 
> open("/sys/kernel/tracing/instances/foo/events/xfs");
> 
> and say this is the first time that someone accesses that events/
> directory.
> 
> When the open pathwalk is done, the vfs will determine via
> 
> [1] may_lookup(inode_of(events))
> 
> whether you are able to list entries such as "xfs" in that directory.
> The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> it ends up calling i_op->eventfs_root_lookup(events).
> 
> At this point tracefs/eventfs adds the inodes for all entries in that
> "events" directory including "xfs" based on the metadata it recorded
> during the mkdir. Since now someone is actually interested in them. And
> it initializes the inodes with ownership and everything and adds the
> dentries that belong into that directory.
> 
> Nothing here depends on the permissions of the caller. The only
> permission that mattered was done in the VFS in [1]. If the caller has
> permissions to enter a directory they can lookup and list its contents.
> And its contents where determined/fixed etc when mkdir was called.
> 
> So we just need to add the required objects into the caches (inode,
> dentry) whose addition we intentionally defered until someone actually
> needed them.
> 
> So, eventfs_root_lookup() now initializes the inodes with the ownership
> from the stored metadata or from the parent dentry and splices in inodes
> and dentries. No permission checking is needed for this because it is
> always a recheck of what the vfs did in [1].
> 
> We now return to the vfs and path walk continues to the final component
> that you actually want to open which is that "xfs" directory in this
> example. We check the permissions on that inode via may_open("xfs") and
> we open that directory returning an fd to userspace ultimately.
> 
> (I'm going by memory since I need to step out the door.)

So, let's say we do:

 chgrp -R rostedt /sys/kernel/tracing/

But I don't want rostedt to have access to xfs

 chgrp -R root /sys/kernel/tracing/events/xfs

Both actions will create the inodes and dentries of all files and
directories (because of "-R"). But once that is done, the ref counts go to
zero. They stay around until reclaim. But then I open Chrome ;-) and it
reclaims all the dentries and inodes, so we are back to here we were on
boot.

Now as rostedt I do:

 ls /sys/kernel/tracing/events/xfs

The VFS layer doesn't know if I have permission to that or not, because all
the inodes and dentries have been freed. It has to call back to eventfs to
find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
recreated the inodes with the proper permission.

Or are you saying that I don't need the ".permission" callback, because
eventfs does it when it creates the inodes? But for eventfs to know what
the permissions changes are, it uses .getattr and .setattr.

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Christian Brauner
On Mon, Jan 08, 2024 at 10:23:31AM -0500, Steven Rostedt wrote:
> On Mon, 8 Jan 2024 12:04:54 +0100
> Christian Brauner  wrote:
> 
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.  
> > > 
> > > I don't think so.  
> > 
> > I'm very well aware that the dentries and inode aren't created during
> > mkdir but the completely directory layout is determined. You're just
> > splicing in dentries and inodes during lookup and readdir.
> > 
> > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> > do a lookup/readdir on
> > 
> > ls -al /sys/kernel/tracing/instances/foo/events
> > 
> > Why should the creation of the dentries and inodes ever fail due to a
> > permission failure?
> 
> They shouldn't.
> 
> > The vfs did already verify that you had the required
> > permissions to list entries in that directory. Why should filling up
> > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
> > That tracefs instance would be half-functional. And again, right now
> > that inode_permission() check cannot even fail.
> 
> And it shouldn't. But without dentries and inodes, how does VFS know what
> is allowed to open the files?

So say you do:

mkdir /sys/kernel/tracing/instances/foo

After this has returned we know everything we need to know about the new
tracefs instance including the ownership and the mode of all inodes in
/sys/kernel/tracing/instances/foo/events/* and below precisely because
ownership is always inherited from the parent dentry and recorded in the
metadata struct eventfs_inode.

So say someone does:

open("/sys/kernel/tracing/instances/foo/events/xfs");

and say this is the first time that someone accesses that events/
directory.

When the open pathwalk is done, the vfs will determine via

[1] may_lookup(inode_of(events))

whether you are able to list entries such as "xfs" in that directory.
The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
it ends up calling i_op->eventfs_root_lookup(events).

At this point tracefs/eventfs adds the inodes for all entries in that
"events" directory including "xfs" based on the metadata it recorded
during the mkdir. Since now someone is actually interested in them. And
it initializes the inodes with ownership and everything and adds the
dentries that belong into that directory.

Nothing here depends on the permissions of the caller. The only
permission that mattered was done in the VFS in [1]. If the caller has
permissions to enter a directory they can lookup and list its contents.
And its contents where determined/fixed etc when mkdir was called.

So we just need to add the required objects into the caches (inode,
dentry) whose addition we intentionally defered until someone actually
needed them.

So, eventfs_root_lookup() now initializes the inodes with the ownership
from the stored metadata or from the parent dentry and splices in inodes
and dentries. No permission checking is needed for this because it is
always a recheck of what the vfs did in [1].

We now return to the vfs and path walk continues to the final component
that you actually want to open which is that "xfs" directory in this
example. We check the permissions on that inode via may_open("xfs") and
we open that directory returning an fd to userspace ultimately.

(I'm going by memory since I need to step out the door.)



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Steven Rostedt
On Mon, 8 Jan 2024 12:32:46 +0100
Christian Brauner  wrote:

> On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> > On Sun, 7 Jan 2024 13:29:12 -0500
> > Steven Rostedt  wrote:
> >   
> > > > 
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.
> > > 
> > > I don't think so.  
> > 
> > Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> > It exists without that. Although one rationale to do eventfs was so  
> 
> Every instance/foo/ tracefs instances also contains an events directory
> and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's
> not a separate filesystem. Both eventfs and tracefs are on the same
> single, system wide superblock.
> 
> > that the instance directories wouldn't recreate the same 10thousands
> > event inodes and dentries for every mkdir done.  
> 
> I know but that's irrelevant to what I'm trying to tell you.
> 
> A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
> instance. With or without the on-demand dentry and inode creation for
> the eventfs portion that tracefs "instance" has now been created in its
> entirety including all the required information for someone to later
> come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.
> 
> All you've done is to defer the addition of the dentries and inodes when
> someone does actually look at the events directory of the tracefs
> instance.
> 
> Whether you choose to splice in the dentries and inodes for the eventfs
> portion during lookup and readdir or if you had chosen to not do the
> on-demand thing at all and the entries were created at the same time as
> the mkdir call are equivalent from the perspective of permission
> checking.
> 
> If you have the required permissions to look at the events directory
> then there's no reason why listing the directory entries in there should
> fail. This can't even happen right now.

Ah, I think I know where the confusion lies. The tracing information in
kernel/trace/*.c doesn't keep track of permission. It relies totally on
fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with
'gid=xxx' then it's up to tracefs to maintain that information and not the
tracing subsystem. The tracing subsystem only gives the "default"
permissions (before boot finishes).

The difference between normal file systems and pseudo file systems like
debugfs and tracefs, is that normal file systems keep the permission
information stored on the external device. That is, when the inodes and
dentries are created, the information is retrieved from the stored file
system.

I think this may actually be a failure of debugfs (and tracefs as it was
based on debugfs), in that the inodes and dentries are created at the same
time the "files" backing them are. Which is normally at boot up and before
the file system is mounted.

That is, inodes and dentries are actually coupled with the data they
represent. It's not a cache for a back store like a hard drive partition.

To create a file in debugfs you do:

struct dentry *debugfs_create_file(const char *name, umode_t mode,
   struct dentry *parent, void *data,
   const struct file_operations *fops)

That is, you pass a the name, the mode, the parent dentry, data, and the
fops and that will create an inode and dentry (which is returned).

This happens at boot up before user space is running and before debugfs is
even mounted.

Because debugfs is mostly for debugging, people don't care about how it's
mounted. It is usually restricted to root only access. Especially since
there's a lot of sensitive information that shouldn't be exposed to
non-privileged users.

The reason tracefs came about is that people asked me to be able to have
access to tracing without needing to even enable debugfs. They also want to
easily make it accessible to non root users and talking with Kees Cook, he
recommended using ACL. But because it inherited a lot from debugfs, I
started doing these tricks like walking the dentry tree to make it work a
bit better. Because the dentries and inodes were created before mount, I
had to play these tricks.

But as Linus pointed out, that was the wrong way to do that. The right way
was to use .getattr and .permission callbacks to figure out what the
permissions to the files are.

This has nothing to do with the creation of the files, it's about who has
access to the files that the inodes point to.

This sounds like another topic to bring up at LSFMM ;-)  "Can we
standardize pseudo file systems like debugfs and tracefs to act more like
real file systems, and have inodes and dentries act as cache and not be so
coupled to the data?"

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Steven Rostedt
On Mon, 8 Jan 2024 12:04:54 +0100
Christian Brauner  wrote:

> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.  
> > 
> > I don't think so.  
> 
> I'm very well aware that the dentries and inode aren't created during
> mkdir but the completely directory layout is determined. You're just
> splicing in dentries and inodes during lookup and readdir.
> 
> If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> do a lookup/readdir on
> 
> ls -al /sys/kernel/tracing/instances/foo/events
> 
> Why should the creation of the dentries and inodes ever fail due to a
> permission failure?

They shouldn't.

> The vfs did already verify that you had the required
> permissions to list entries in that directory. Why should filling up
> /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
> That tracefs instance would be half-functional. And again, right now
> that inode_permission() check cannot even fail.

And it shouldn't. But without dentries and inodes, how does VFS know what
is allowed to open the files?

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Christian Brauner
On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> On Sun, 7 Jan 2024 13:29:12 -0500
> Steven Rostedt  wrote:
> 
> > > 
> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.  
> > 
> > I don't think so.
> 
> Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> It exists without that. Although one rationale to do eventfs was so

Every instance/foo/ tracefs instances also contains an events directory
and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's
not a separate filesystem. Both eventfs and tracefs are on the same
single, system wide superblock.

> that the instance directories wouldn't recreate the same 10thousands
> event inodes and dentries for every mkdir done.

I know but that's irrelevant to what I'm trying to tell you.

A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
instance. With or without the on-demand dentry and inode creation for
the eventfs portion that tracefs "instance" has now been created in its
entirety including all the required information for someone to later
come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.

All you've done is to defer the addition of the dentries and inodes when
someone does actually look at the events directory of the tracefs
instance.

Whether you choose to splice in the dentries and inodes for the eventfs
portion during lookup and readdir or if you had chosen to not do the
on-demand thing at all and the entries were created at the same time as
the mkdir call are equivalent from the perspective of permission
checking.

If you have the required permissions to look at the events directory
then there's no reason why listing the directory entries in there should
fail. This can't even happen right now.



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Christian Brauner
> > * Tracefs supports the creation of instances from userspace via mkdir.
> >   For example,
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> > 
> >   And here the idmapping is relevant so we need to make the helpers
> >   aware of the idmapping.
> > 
> >   I just went and plumbed this through to most helpers.
> > 
> > There's some subtlety in eventfs. Afaict, the directories and files for
> > the individual events are created on-demand during lookup or readdir.
> > 
> > The ownership of these events is again inherited from the parent inode
> > or recovered from stored state. In both cases the actual idmapping is
> > irrelevant.
> > 
> > The callchain here is:
> > 
> > eventfs_root_lookup("xfs", "events")
> > -> create_{dir,file}_dentry("xfs", "events")
> >-> create_{dir,file}("xfs", "events")
> >   -> eventfs_start_creating("xfs", "events")
> >  -> lookup_one_len("xfs", "events")  
> > 
> > And the subtlety is that lookup_one_len() does permission checking on
> > the parent inode (IOW, if you want a dentry for "blech" under "events"
> > it'll do a permission check on events->d_inode) for exec permissions
> > and then goes on to give you a new dentry.
> > 
> > Usually this call would have to be changed to lookup_one() and the
> > idmapping be handed down to it. But I think that's irrelevant here.
> > 
> > Lookup generally doesn't need to be aware of idmappings at all. The
> > permission checking is done purely in the vfs via may_lookup() and the
> > idmapping is irrelevant because we always initialize inodes with the
> > filesystem level ownership (see the idmappings.rst) documentation if
> > you're interested in excessive details (otherwise you get inode aliases
> > which you really don't want).
> > 
> > For tracefs it would not matter for lookup per se but only because
> > tracefs seemingly creates inodes/dentries during lookup (and readdir()).
> 
> tracefs creates the inodes/dentries at boot up, it's eventfs that does
> it on demand during lookup.
> 
> For inodes/dentries:
> 
>  /sys/kernel/tracing/* is all created at boot up, except for "events".

Yes.

>  /sys/kernel/tracing/events/* is created on demand.

Yes.

> 
> > 
> > But imho the permission checking done in current eventfs_root_lookup()
> > via lookup_one_len() is meaningless in any way; possibly even
> > (conceptually) wrong.
> > 
> > Because, the actual permission checking for the creation of the eventfs
> > entries isn't really done during lookup or readdir, it's done when mkdir
> > is called:
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> 
> No. that creates a entire new tracefs instance, which happens to
> include another eventfs directory.

Yes, I'm aware of all that.

> No. Only the meta data is created for the eventfs directory with a
> mkdir instances/foo. The inodes and dentries are not there.

I know, that is what I'm saying.

> 
> > 
> > When one goes and looksup stuff under foo/events/ or readdir the entries
> > in that directory:
> > 
> > fd = open("foo/events")
> > readdir(fd, ...)
> > 
> > then they are licensed to list an entry in that directory. So all that
> > needs to be done is to actually list those files in that directory. And
> > since they already exist (they were created during mkdir) we just need
> > to splice in inodes and dentries for them. But for that we shouldn't
> > check permissions on the directory again. Because we've done that
> > already correctly when the VFS called may_lookup().
> 
> No they do not exist.

I am aware.

> 
> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.
> 
> I don't think so.

I'm very well aware that the dentries and inode aren't created during
mkdir but the completely directory layout is determined. You're just
splicing in dentries and inodes during lookup and readdir.

If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
do a lookup/readdir on

ls -al /sys/kernel/tracing/instances/foo/events

Why should the creation of the dentries and inodes ever fail due to a
permission failure? The vfs did already verify that you had the required
permissions to list entries in that directory. Why should filling up
/sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
That tracefs instance would be half-functional. And again, right now
that inode_permission() check cannot even fail.



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Steven Rostedt
On Sun, 7 Jan 2024 13:29:12 -0500
Steven Rostedt  wrote:

> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.  
> 
> I don't think so.

Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
It exists without that. Although one rationale to do eventfs was so
that the instance directories wouldn't recreate the same 10thousands
event inodes and dentries for every mkdir done.

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Steven Rostedt
On Sun, 7 Jan 2024 13:42:39 +0100
Christian Brauner  wrote:
> 
> So, I tried to do an exploratory patch even though I promised myself not
> to do it. But hey...
> 
> Some notes:
> 
> * Permission handling for idmapped mounts is done completely in the
>   VFS. That's the case for all filesytems that don't have a custom
>   ->permission() handler. So there's nothing to do for us here.  
> 
> * Idmapped mount handling for ->getattr() is done completely by the VFS
>   if the filesystem doesn't have a custom ->getattr() handler. So we're
>   done here.
> 
> * Tracefs doesn't support attribute changes via ->setattr() (chown(),
>   chmod etc.). So there's nothing to here.
> 
> * Eventfs does support attribute changes via ->setattr(). But it relies
>   on simple_setattr() which is already idmapped mount aware. So there's
>   nothing for us to do.
> 
> * Ownership is inherited from the parent inode (tracefs) or optionally
>   from stashed ownership information (eventfs). That means the idmapping
>   is irrelevant here. It's similar to the "inherit gid from parent
>   directory" logic we have in some circumstances. TL;DR nothing to do
>   here as well.

The reason ownership is inherited from the parent is because the inodes
are created at boot up before user space starts.

eventfs inodes are created on demand after user space so it needs to
either check the "default" ownership and permissions, or if the user
changed a specific file/directory, it must save it and use that again
if the inode/dentry are reclaimed and then referenced again and
recreated.

> 
> * Tracefs supports the creation of instances from userspace via mkdir.
>   For example,
> 
>   mkdir /sys/kernel/tracing/instances/foo
> 
>   And here the idmapping is relevant so we need to make the helpers
>   aware of the idmapping.
> 
>   I just went and plumbed this through to most helpers.
> 
> There's some subtlety in eventfs. Afaict, the directories and files for
> the individual events are created on-demand during lookup or readdir.
> 
> The ownership of these events is again inherited from the parent inode
> or recovered from stored state. In both cases the actual idmapping is
> irrelevant.
> 
> The callchain here is:
> 
> eventfs_root_lookup("xfs", "events")
> -> create_{dir,file}_dentry("xfs", "events")
>-> create_{dir,file}("xfs", "events")
>   -> eventfs_start_creating("xfs", "events")
>  -> lookup_one_len("xfs", "events")  
> 
> And the subtlety is that lookup_one_len() does permission checking on
> the parent inode (IOW, if you want a dentry for "blech" under "events"
> it'll do a permission check on events->d_inode) for exec permissions
> and then goes on to give you a new dentry.
> 
> Usually this call would have to be changed to lookup_one() and the
> idmapping be handed down to it. But I think that's irrelevant here.
> 
> Lookup generally doesn't need to be aware of idmappings at all. The
> permission checking is done purely in the vfs via may_lookup() and the
> idmapping is irrelevant because we always initialize inodes with the
> filesystem level ownership (see the idmappings.rst) documentation if
> you're interested in excessive details (otherwise you get inode aliases
> which you really don't want).
> 
> For tracefs it would not matter for lookup per se but only because
> tracefs seemingly creates inodes/dentries during lookup (and readdir()).

tracefs creates the inodes/dentries at boot up, it's eventfs that does
it on demand during lookup.

For inodes/dentries:

 /sys/kernel/tracing/* is all created at boot up, except for "events".
 /sys/kernel/tracing/events/* is created on demand.

> 
> But imho the permission checking done in current eventfs_root_lookup()
> via lookup_one_len() is meaningless in any way; possibly even
> (conceptually) wrong.
> 
> Because, the actual permission checking for the creation of the eventfs
> entries isn't really done during lookup or readdir, it's done when mkdir
> is called:
> 
> mkdir /sys/kernel/tracing/instances/foo

No. that creates a entire new tracefs instance, which happens to
include another eventfs directory.

The eventsfs directory is in "/sys/kernel/tracing/events" and
 "/sys/kernel/tracing/instances/*/events"

eventfs has 10s of thousands of files and directories which is why I
changed it to be on-demand inode/dentry creation and also reclaiming
when no longer accessed.

 # ls /sys/kernel/tracing/events/

will create the inodes and dentries, and a memory stress program will
free those created inodes and dentries.

> 
> Here, all possible entries beneath foo including "events" and further
> below are recorded and stored. So once mkdir returns it basically means
> that it succeeded with the creation of all the necessary directories and
> files. For all purposes the foo/events/ directory and below have all the
> entries that matter. They have been created. It's comparable to them not
> being in the {d,i}cache, I guess.

No. Only the meta data is created for the eventfs 

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Christian Brauner
On Sun, Jan 07, 2024 at 06:42:33PM +0100, Christian Brauner wrote:
> On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote:
> > > > So tracefs supports remounting with different uid/gid mount options and
> > > > then actually wades through _all_ of the inodes and changes their
> > > > ownership internally? What's the use-case for this? Containers?
> > > 
> > > No, in fact tracing doesn't work well with containers as tracing is global
> > > to the entire machine. It can work with privileged containers though.
> > 
> > At least the tracefs interface is easily supportable within a delegation
> > model. IOW, you have a privileged process that delegates relevant
> > portions to a container via idmapped mounts _without_ doing the insane thing
> > and making it mountable by a container aka the fs-to-CVE pipeline.
> > 
> > > 
> > > The reason for this is because tracefs was based off of debugfs where the
> > > files and directores are created at boot up and mounted later. The reason
> > > to do this was to allow users to mount with gid=GID to allow a given group
> > > to have access to tracing. Without this update, tracefs would ignore it
> > > like debugfs and proc does today.
> > > 
> > > I think its time I explain the purpose of tracefs and how it came to be.
> > > 
> > > The tracing system required a way to control tracing and read the traces.
> > > It could have just used a new system like perf (although
> > > /sys/kernel/debug/tracing predates perf), where it created a single 
> > > ioctl()
> > > like system call do do everything.
> > > 
> > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own 
> > > logdev
> > > tracer, which both have an embedded background, I chose an interface that
> > > could work with just an unmodified version of busybox. That is, I wanted 
> > > it
> > > to work with just cat and echo.
> > > 
> > > The main difference with tracefs compared to other file systems is that it
> > > is a control interface, where writes happen as much as reads. The data 
> > > read
> > > is controlled. The closest thing I can think of is how cgroups work.
> > > 
> > > As tracing is a privileged operation, but something that could be changed
> > > to allow a group to have access to, I wanted to make it easy for an admin
> > > to decide who gets to do what at boot up via the /etc/fstab file.
> > 
> > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
> > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.
> > 
> > mount(8) should just give you the ability to specify "map the ids I 
> > explicitly
> > want to remap to something else and for the rest use the identity mapping". 
> > I
> > wanted that for other reasons anyway.
> > 
> > So in one of the next versions of mount(8) you can then do (where --beneath
> > means place the mount beneath the current one and --replace is
> > self-explanatory):
> > 
> > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' 
> > /sys/kernel/tracing
> > sudo umount /sys/kernel/tracing
> > 
> > or as a shortcut provided by mount(8):
> > 
> > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' 
> > /sys/kernel/tracing 
> > 
> > In both cases you replace the mount without unmounting tracefs.
> > 
> > I can illustrate this right now though:
> > 
> > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 
> > u:0:1000:1' /sys/kernel/tracing/ /mnt/
> > 
> > # This is a tool I wrote for testing the patchset I wrote back then.
> > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath 
> > --detached /mnt /sys/kernel/tracing
> > Mounting beneath top mount
> > Creating anonymous mount
> > Attaching mount /mnt -> /sys/kernel/tracing
> > Creating single detached mount
> > 
> > user1@localhost:~/data/move-mount-beneath$
> > 
> > # Now there's two mounts stacked on top of each other.
> > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> > | `-/sys/kernel/tracingtracefstracefs 
> > rw,nosuid,nodev,noexec,relatime,idmapped
> > |   `-/sys/kernel/tracing  tracefstracefs 
> > rw,nosuid,nodev,noexec,relatime
> > 
> > user1@localhost:~/data/move-mount-beneath$ sudo ls -al 
> > /sys/kernel/tracing/| head
> > total 0
> > drwx--  6 root root 0 Jan  7 13:33 .
> > drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
> > -r--r-  1 root root 0 Jan  7 13:33 README
> > -r--r-  1 root root 0 Jan  7 13:33 available_events
> > -r--r-  1 root root 0 Jan  7 13:33 available_filter_functions
> > -r--r-  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
> > -r--r-  1 root root 0 Jan  7 13:33 available_tracers
> > -rw-r-  1 root root 0 Jan  7 13:33 buffer_percent
> > -rw-r-  1 root root 0 Jan  7 13:33 buffer_size_kb
> > 
> > # Reveal updated mount
> > user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing
> > 
> > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> > | 

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Christian Brauner
On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote:
> > > So tracefs supports remounting with different uid/gid mount options and
> > > then actually wades through _all_ of the inodes and changes their
> > > ownership internally? What's the use-case for this? Containers?
> > 
> > No, in fact tracing doesn't work well with containers as tracing is global
> > to the entire machine. It can work with privileged containers though.
> 
> At least the tracefs interface is easily supportable within a delegation
> model. IOW, you have a privileged process that delegates relevant
> portions to a container via idmapped mounts _without_ doing the insane thing
> and making it mountable by a container aka the fs-to-CVE pipeline.
> 
> > 
> > The reason for this is because tracefs was based off of debugfs where the
> > files and directores are created at boot up and mounted later. The reason
> > to do this was to allow users to mount with gid=GID to allow a given group
> > to have access to tracing. Without this update, tracefs would ignore it
> > like debugfs and proc does today.
> > 
> > I think its time I explain the purpose of tracefs and how it came to be.
> > 
> > The tracing system required a way to control tracing and read the traces.
> > It could have just used a new system like perf (although
> > /sys/kernel/debug/tracing predates perf), where it created a single ioctl()
> > like system call do do everything.
> > 
> > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
> > tracer, which both have an embedded background, I chose an interface that
> > could work with just an unmodified version of busybox. That is, I wanted it
> > to work with just cat and echo.
> > 
> > The main difference with tracefs compared to other file systems is that it
> > is a control interface, where writes happen as much as reads. The data read
> > is controlled. The closest thing I can think of is how cgroups work.
> > 
> > As tracing is a privileged operation, but something that could be changed
> > to allow a group to have access to, I wanted to make it easy for an admin
> > to decide who gets to do what at boot up via the /etc/fstab file.
> 
> Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
> just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.
> 
> mount(8) should just give you the ability to specify "map the ids I explicitly
> want to remap to something else and for the rest use the identity mapping". I
> wanted that for other reasons anyway.
> 
> So in one of the next versions of mount(8) you can then do (where --beneath
> means place the mount beneath the current one and --replace is
> self-explanatory):
> 
> sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing
> sudo umount /sys/kernel/tracing
> 
> or as a shortcut provided by mount(8):
> 
> sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' 
> /sys/kernel/tracing 
> 
> In both cases you replace the mount without unmounting tracefs.
> 
> I can illustrate this right now though:
> 
> user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' 
> /sys/kernel/tracing/ /mnt/
> 
> # This is a tool I wrote for testing the patchset I wrote back then.
> user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath 
> --detached /mnt /sys/kernel/tracing
> Mounting beneath top mount
> Creating anonymous mount
> Attaching mount /mnt -> /sys/kernel/tracing
> Creating single detached mount
> 
> user1@localhost:~/data/move-mount-beneath$
> 
> # Now there's two mounts stacked on top of each other.
> user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> | `-/sys/kernel/tracingtracefstracefs 
> rw,nosuid,nodev,noexec,relatime,idmapped
> |   `-/sys/kernel/tracing  tracefstracefs 
> rw,nosuid,nodev,noexec,relatime
> 
> user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| 
> head
> total 0
> drwx--  6 root root 0 Jan  7 13:33 .
> drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
> -r--r-  1 root root 0 Jan  7 13:33 README
> -r--r-  1 root root 0 Jan  7 13:33 available_events
> -r--r-  1 root root 0 Jan  7 13:33 available_filter_functions
> -r--r-  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
> -r--r-  1 root root 0 Jan  7 13:33 available_tracers
> -rw-r-  1 root root 0 Jan  7 13:33 buffer_percent
> -rw-r-  1 root root 0 Jan  7 13:33 buffer_size_kb
> 
> # Reveal updated mount
> user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing
> 
> user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> | `-/sys/kernel/tracingtracefstracefs 
> rw,nosuid,nodev,noexec,relatime,idmapped
> 
> user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| 
> head
> total 0
> drwx--  6 user1 user1 0 Jan  7 13:33 .
> drwxr-xr-x 16 root  root  0 Jan  7 13:33 ..
> -r--r-  1 user1 user1 0 

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Christian Brauner
> > So tracefs supports remounting with different uid/gid mount options and
> > then actually wades through _all_ of the inodes and changes their
> > ownership internally? What's the use-case for this? Containers?
> 
> No, in fact tracing doesn't work well with containers as tracing is global
> to the entire machine. It can work with privileged containers though.

At least the tracefs interface is easily supportable within a delegation
model. IOW, you have a privileged process that delegates relevant
portions to a container via idmapped mounts _without_ doing the insane thing
and making it mountable by a container aka the fs-to-CVE pipeline.

> 
> The reason for this is because tracefs was based off of debugfs where the
> files and directores are created at boot up and mounted later. The reason
> to do this was to allow users to mount with gid=GID to allow a given group
> to have access to tracing. Without this update, tracefs would ignore it
> like debugfs and proc does today.
> 
> I think its time I explain the purpose of tracefs and how it came to be.
> 
> The tracing system required a way to control tracing and read the traces.
> It could have just used a new system like perf (although
> /sys/kernel/debug/tracing predates perf), where it created a single ioctl()
> like system call do do everything.
> 
> As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
> tracer, which both have an embedded background, I chose an interface that
> could work with just an unmodified version of busybox. That is, I wanted it
> to work with just cat and echo.
> 
> The main difference with tracefs compared to other file systems is that it
> is a control interface, where writes happen as much as reads. The data read
> is controlled. The closest thing I can think of is how cgroups work.
> 
> As tracing is a privileged operation, but something that could be changed
> to allow a group to have access to, I wanted to make it easy for an admin
> to decide who gets to do what at boot up via the /etc/fstab file.

Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.

mount(8) should just give you the ability to specify "map the ids I explicitly
want to remap to something else and for the rest use the identity mapping". I
wanted that for other reasons anyway.

So in one of the next versions of mount(8) you can then do (where --beneath
means place the mount beneath the current one and --replace is
self-explanatory):

sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing
sudo umount /sys/kernel/tracing

or as a shortcut provided by mount(8):

sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing 

In both cases you replace the mount without unmounting tracefs.

I can illustrate this right now though:

user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' 
/sys/kernel/tracing/ /mnt/

# This is a tool I wrote for testing the patchset I wrote back then.
user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath 
--detached /mnt /sys/kernel/tracing
Mounting beneath top mount
Creating anonymous mount
Attaching mount /mnt -> /sys/kernel/tracing
Creating single detached mount

user1@localhost:~/data/move-mount-beneath$

# Now there's two mounts stacked on top of each other.
user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
| `-/sys/kernel/tracingtracefstracefs 
rw,nosuid,nodev,noexec,relatime,idmapped
|   `-/sys/kernel/tracing  tracefstracefs 
rw,nosuid,nodev,noexec,relatime

user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| 
head
total 0
drwx--  6 root root 0 Jan  7 13:33 .
drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
-r--r-  1 root root 0 Jan  7 13:33 README
-r--r-  1 root root 0 Jan  7 13:33 available_events
-r--r-  1 root root 0 Jan  7 13:33 available_filter_functions
-r--r-  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
-r--r-  1 root root 0 Jan  7 13:33 available_tracers
-rw-r-  1 root root 0 Jan  7 13:33 buffer_percent
-rw-r-  1 root root 0 Jan  7 13:33 buffer_size_kb

# Reveal updated mount
user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing

user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
| `-/sys/kernel/tracingtracefstracefs 
rw,nosuid,nodev,noexec,relatime,idmapped

user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| 
head
total 0
drwx--  6 user1 user1 0 Jan  7 13:33 .
drwxr-xr-x 16 root  root  0 Jan  7 13:33 ..
-r--r-  1 user1 user1 0 Jan  7 13:33 README
-r--r-  1 user1 user1 0 Jan  7 13:33 available_events
-r--r-  1 user1 user1 0 Jan  7 13:33 available_filter_functions
-r--r-  1 user1 user1 0 Jan  7 13:33 available_filter_functions_addrs
-r--r-  1 user1 user1 0 Jan  7 13:33 available_tracers

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-05 Thread Steven Rostedt
On Fri, 5 Jan 2024 15:26:28 +0100
Christian Brauner  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > Instead of walking the dentries on mount/remount to update the gid values of
> > all the dentries if a gid option is specified on mount, just update the root
> > inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> > operations to update the permissions of the files and directories.
> > 
> > For all files and directories in the top level instance:
> > 
> >  /sys/kernel/tracing/*
> > 
> > It will use the root inode as the default permissions. The inode that
> > represents: /sys/kernel/tracing (or wherever it is mounted).
> > 
> > When an instance is created:
> > 
> >  mkdir /sys/kernel/tracing/instance/foo
> > 
> > The directory "foo" and all its files and directories underneath will use
> > the default of what foo is when it was created. A remount of tracefs will
> > not affect it.  
> 
> That kinda sounds like eventfs should actually be a separate filesystem.
> But I don't know enough about the relationship between the two concepts.

It may someday become one, as eventfs is used by perf where the rest of the
tracefs system is not.

> 
> > 
> > If a user were to modify the permissions of any file or directory in
> > tracefs, it will also no longer be modified by a change in ownership of a
> > remount.  
> 
> Very odd semantics and I would recommend to avoid that. It's just plain
> weird imo.
> 
> > 
> > The events directory, if it is in the top level instance, will use the
> > tracefs root inode as the default ownership for itself and all the files and
> > directories below it.
> > 
> > For the events directory in an instance ("foo"), it will keep the ownership
> > of what it was when it was created, and that will be used as the default
> > ownership for the files and directories beneath it.
> > 
> > Link: 
> > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/
> > 
> > Signed-off-by: Steven Rostedt (Google) 
> > ---  
> 
> So tracefs supports remounting with different uid/gid mount options and
> then actually wades through _all_ of the inodes and changes their
> ownership internally? What's the use-case for this? Containers?

No, in fact tracing doesn't work well with containers as tracing is global
to the entire machine. It can work with privileged containers though.

The reason for this is because tracefs was based off of debugfs where the
files and directores are created at boot up and mounted later. The reason
to do this was to allow users to mount with gid=GID to allow a given group
to have access to tracing. Without this update, tracefs would ignore it
like debugfs and proc does today.

I think its time I explain the purpose of tracefs and how it came to be.

The tracing system required a way to control tracing and read the traces.
It could have just used a new system like perf (although
/sys/kernel/debug/tracing predates perf), where it created a single ioctl()
like system call do do everything.

As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
tracer, which both have an embedded background, I chose an interface that
could work with just an unmodified version of busybox. That is, I wanted it
to work with just cat and echo.

The main difference with tracefs compared to other file systems is that it
is a control interface, where writes happen as much as reads. The data read
is controlled. The closest thing I can think of is how cgroups work.

As tracing is a privileged operation, but something that could be changed
to allow a group to have access to, I wanted to make it easy for an admin
to decide who gets to do what at boot up via the /etc/fstab file.

> 
> Aside from optimizing this and the special semantics for this eventfs
> stuff that you really should think twice of doing, here's one idea for
> an extension that might alleviate some of the pain:
> 
> If you need flexible dynamic ownership change to e.g., be able to
> delegate (all, a directory, a single file of) tracefs to
> unprivileged/containers/whatever then you might want to consider
> supporting idmapped mounts for tracefs. Because then you can do stuff
> like:

I'm a novice here and have no idea on how id maps work ;-)

> 
> user1@localhost:~/data/scripts$ sudo mount --bind -o 
> X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt
> user1@localhost:~/data/scripts$ ls -ln /run/
> total 12
> drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
> drwx--  2 0  0   40 Jan  5 11:57 cryptsetup
> drwxr-xr-x  2 0  0   60 Jan  5 11:57 dbus
> drwx--  6 0  0  280 Jan  5 11:57 incus_agent
> prw---  1 0  00 Jan  5 11:57 initctl
> drwxrwxrwt  4 0  0   80 Jan  5 11:57 lock
> drwxr-xr-x  3 0  0   60 Jan  5 11:57 log
> drwx--  2 0  0   40 Jan  5 11:57 lvm
> -r--r--r--  1 0  0   33 Jan  5 11:57 machine-id
> -rw-r--r--  1 0  0  101 Jan  5 11:58 motd.dynamic
> drwxr-xr-x  

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-05 Thread Christian Brauner
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> Instead of walking the dentries on mount/remount to update the gid values of
> all the dentries if a gid option is specified on mount, just update the root
> inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> operations to update the permissions of the files and directories.
> 
> For all files and directories in the top level instance:
> 
>  /sys/kernel/tracing/*
> 
> It will use the root inode as the default permissions. The inode that
> represents: /sys/kernel/tracing (or wherever it is mounted).
> 
> When an instance is created:
> 
>  mkdir /sys/kernel/tracing/instance/foo
> 
> The directory "foo" and all its files and directories underneath will use
> the default of what foo is when it was created. A remount of tracefs will
> not affect it.

That kinda sounds like eventfs should actually be a separate filesystem.
But I don't know enough about the relationship between the two concepts.

> 
> If a user were to modify the permissions of any file or directory in
> tracefs, it will also no longer be modified by a change in ownership of a
> remount.

Very odd semantics and I would recommend to avoid that. It's just plain
weird imo.

> 
> The events directory, if it is in the top level instance, will use the
> tracefs root inode as the default ownership for itself and all the files and
> directories below it.
> 
> For the events directory in an instance ("foo"), it will keep the ownership
> of what it was when it was created, and that will be used as the default
> ownership for the files and directories beneath it.
> 
> Link: 
> https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---

So tracefs supports remounting with different uid/gid mount options and
then actually wades through _all_ of the inodes and changes their
ownership internally? What's the use-case for this? Containers?

Aside from optimizing this and the special semantics for this eventfs
stuff that you really should think twice of doing, here's one idea for
an extension that might alleviate some of the pain:

If you need flexible dynamic ownership change to e.g., be able to
delegate (all, a directory, a single file of) tracefs to
unprivileged/containers/whatever then you might want to consider
supporting idmapped mounts for tracefs. Because then you can do stuff
like:

user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 
u:0:1234:1' /run/ /mnt
user1@localhost:~/data/scripts$ ls -ln /run/
total 12
drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
drwx--  2 0  0   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 0  0   60 Jan  5 11:57 dbus
drwx--  6 0  0  280 Jan  5 11:57 incus_agent
prw---  1 0  00 Jan  5 11:57 initctl
drwxrwxrwt  4 0  0   80 Jan  5 11:57 lock
drwxr-xr-x  3 0  0   60 Jan  5 11:57 log
drwx--  2 0  0   40 Jan  5 11:57 lvm
-r--r--r--  1 0  0   33 Jan  5 11:57 machine-id
-rw-r--r--  1 0  0  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 0  0   40 Jan  5 11:57 mount
drwx--  2 0  0   40 Jan  5 11:57 multipath
drwxr-xr-x  2 0  0   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 0  08 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 0  0   40 Jan  5 11:57 sudo
drwxr-xr-x 24 0  0  660 Jan  5 14:30 systemd
drwxr-xr-x  6 0  0  140 Jan  5 14:30 udev
drwxr-xr-x  4 0  0   80 Jan  5 11:58 user
-rw-rw-r--  1 0 43 2304 Jan  5 15:15 utmp

user1@localhost:~/data/scripts$ ls -ln /mnt/
total 12
drwxr-xr-x  2 1234  1000   40 Jan  5 12:12 credentials
drwx--  2 1234  1000   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 1234  1000   60 Jan  5 11:57 dbus
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 incus_agent
prw---  1 1234  10000 Jan  5 11:57 initctl
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 lock
drwxr-xr-x  3 1234  1000   60 Jan  5 11:57 log
drwx--  2 1234  1000   40 Jan  5 11:57 lvm
-r--r--r--  1 1234  1000   33 Jan  5 11:57 machine-id
-rw-r--r--  1 1234  1000  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 mount
drwx--  2 1234  1000   40 Jan  5 11:57 multipath
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 1234  10008 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 1234  1000   40 Jan  5 11:57 sudo
drwxr-xr-x 24 1234  1000  660 Jan  5 14:30 systemd
drwxr-xr-x  6 1234  1000  140 Jan  5 14:30 udev
drwxr-xr-x  4 1234  1000   80 Jan  5 11:58 user
-rw-rw-r--  1 1234 65534 2304 Jan  5 15:15 utmp

Where you can see that ownership of this tmpfs instance in this example
is changed. I'm not trying to advocate here but this will probably
ultimately be nicer for your users because it means that a container
manager or whatever can be handed a part of tracefs (or all of it) and
the ownership and access rights for that thing is correct. And you can
get rid of that gid based access completely.

You can change uids, gids, or both. You can 

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Al Viro
On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote:
> On Thu, 4 Jan 2024 01:48:37 +
> Al Viro  wrote:
> 
> > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > 
> > > + /* Get the tracefs root from the parent */
> > > + inode = d_inode(dentry->d_parent);
> > > + inode = d_inode(inode->i_sb->s_root);  
> > 
> > That makes no sense.  First of all, for any positive dentry we have
> > dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> > dentries on given superblock.  So what's the point of that dance?
> > If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> > and be done with that...
> 
> That was more of thinking that the dentry and dentry->d_parent are
> different. As dentry is part of eventfs and dentry->d_parent is part of
> tracefs.

???

>Currently they both have the same superblock so yeah, I could just
> write it that way too and it would work. But in my head, I was thinking
> that they behave differently and maybe one day eventfs would get its own
> superblock which would not work.

->d_parent *always* points to the same filesystem; if you get an (automounted?)
mountpoint there, ->d_parent simply won't work - it will point to dentry itself.

> To explain this better:
> 
>   /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events
> 
> But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
> Everything in /sys/kernel/tracing/events is part of eventfs.
> 
> That was my thought process. But as both tracefs and eventfs still use
> tracefs_get_inode(), it would work as you state.
> 
> I'll update that, as I don't foresee that eventfs will become its own file
> system.

There is no way to get to underlying mountpoint by dentry - simply because
the same fs instance can be mounted in any number of places.

A crude overview of taxonomy:

file_system_type: what filesystem instances belong to.  Not quite the same
thing as fs driver (one driver can provide several of those).  Usually
it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).

super_block: individual filesystem instance.  Hosts dentry tree (connected or
several disconnected parts - think NFSv4 or the state while trying to get
a dentry by fhandle, etc.).

dentry: object in a filesystem's directory tree(s).  Always belongs to
specific filesystem instance - that relationship never changes.  Tree
structure (and names) _within_ _filesystem_ belong on that level.
->d_parent is part of that tree structure; never NULL, root of a (sub)tree
has it pointing to itself.  Might be negative, might refer to a filesystem 
object
(file, directory, symlink, etc.).

inode: filesystem object (file, directory, etc.).  Always belongs to
specific filesystem instance.  Non-directory inodes might have any
number of dentry instances (aliases) refering to it; a directory one - no 
more than one.  Filesystem object contents belongs here; multiple hardlinks
have different dentries and the same inode.  Of course, filesystem type in
question might have no such thing as multiple hardlinks - that's up to
filesystem.  In general there is no way to find (or enumerate) such links;
e.g. a local filesystem might have an extra hardlink somewhere we had
never looked at and there won't be any dentries for such hardlinks and
no way to get them short of doing the full search of the entire tree.
The situation with e.g. NFS client is even worse, obviously.

mount: in a sense, mount to super_block is what dentry is to inode.  It
provides a view of (sub)tree hosted in given filesystem instance.  The
same filesystem may have any number of mounts, refering to its subtrees
(possibly the same subtree for each, possibly all different - up to
the callers of mount(2)).  They form mount tree(s) - that's where the
notions related to "this mounted on top of that" belong.  Note that
they can be moved around - with no telling the filesystem about that
happening.  Again, there's no such thing as "the mountpoint of given
filesystem instance" - it might be mounted in any number of places
at the same time.  Specific mount - sure, no problem, except that it
can move around.

namespace: mount tree.  Unlike everything prior, this one is a part of
process state - same as descriptor table, mappings, etc.

file: opened IO channel.  It does refer to specific mount and specific
dentry (and thus filesystem instance and an inode on it).  Current
IO position lives here, so does any per-open(2) state.

descriptor table: mapping from numbers to IO channels (opened files).
Again, a part of process state.  dup() creates a new entry, with
reference to the same file as the old one; multiple open() of the
same pathname will each yield a separate opened file.  _Some_ state
belongs here (close-on-exec, mostly).  Note that there's no such
thing as "the descriptor of this file" - not even "the user-supplied
number that had been used to get the file we are currently reading
from", since that number might be refering to 

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt
On Thu, 4 Jan 2024 01:48:37 +
Al Viro  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +   /* Get the tracefs root from the parent */
> > +   inode = d_inode(dentry->d_parent);
> > +   inode = d_inode(inode->i_sb->s_root);  
> 
> That makes no sense.  First of all, for any positive dentry we have
> dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> dentries on given superblock.  So what's the point of that dance?
> If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> and be done with that...

That was more of thinking that the dentry and dentry->d_parent are
different. As dentry is part of eventfs and dentry->d_parent is part of
tracefs. Currently they both have the same superblock so yeah, I could just
write it that way too and it would work. But in my head, I was thinking
that they behave differently and maybe one day eventfs would get its own
superblock which would not work.

To explain this better:

  /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events

But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
Everything in /sys/kernel/tracing/events is part of eventfs.

That was my thought process. But as both tracefs and eventfs still use
tracefs_get_inode(), it would work as you state.

I'll update that, as I don't foresee that eventfs will become its own file
system.

Thanks,

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt
On Thu, 4 Jan 2024 01:59:10 +
Al Viro  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +static struct inode *instance_inode(struct dentry *parent, struct inode 
> > *inode)
> > +{
> > +   struct tracefs_inode *ti;
> > +   struct inode *root_inode;
> > +
> > +   root_inode = d_inode(inode->i_sb->s_root);
> > +
> > +   /* If parent is NULL then use root inode */
> > +   if (!parent)
> > +   return root_inode;
> > +
> > +   /* Find the inode that is flagged as an instance or the root inode */
> > +   do {
> > +   inode = d_inode(parent);
> > +   if (inode == root_inode)
> > +   return root_inode;
> > +
> > +   ti = get_tracefs(inode);
> > +
> > +   if (ti->flags & TRACEFS_INSTANCE_INODE)
> > +   return inode;
> > +   } while ((parent = parent->d_parent));  
> 
> *blink*
> 
> This is equivalent to
>   ...
>   parent = parent->d_parent;
>   } while (true);

Yeah, that loop went through a few iterations as I first thought that root
was a tracefs_inode and the get_tracefs() would work on it. No, it was not,
and it caused a cash. But I didn't rewrite the loop well after fixing that.

I was also not sure if parent could be NULL, and wanted to protect against it.

> 
> ->d_parent is *never* NULL.  And what the hell is that loop supposed to do,  
> anyway?  Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE?
> 
> If root is not marked that way, I would suggest
>   if (!parent)
>   parent = inode->i_sb->s_root;
>   while (!IS_ROOT(parent)) {
>   struct tracefs_inode *ti = get_tracefs(parent->d_inode);
>   if (ti->flags & TRACEFS_INSTANCE_INODE)
>   break;
>   parent = parent->d_parent;
>   }
>   return parent->d_inode;

Sure, I could rewrite it that way too.

Thanks,

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Al Viro
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:

> +static struct inode *instance_inode(struct dentry *parent, struct inode 
> *inode)
> +{
> + struct tracefs_inode *ti;
> + struct inode *root_inode;
> +
> + root_inode = d_inode(inode->i_sb->s_root);
> +
> + /* If parent is NULL then use root inode */
> + if (!parent)
> + return root_inode;
> +
> + /* Find the inode that is flagged as an instance or the root inode */
> + do {
> + inode = d_inode(parent);
> + if (inode == root_inode)
> + return root_inode;
> +
> + ti = get_tracefs(inode);
> +
> + if (ti->flags & TRACEFS_INSTANCE_INODE)
> + return inode;
> + } while ((parent = parent->d_parent));

*blink*

This is equivalent to
...
parent = parent->d_parent;
} while (true);

->d_parent is *never* NULL.  And what the hell is that loop supposed to do,
anyway?  Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE?

If root is not marked that way, I would suggest
if (!parent)
parent = inode->i_sb->s_root;
while (!IS_ROOT(parent)) {
struct tracefs_inode *ti = get_tracefs(parent->d_inode);
if (ti->flags & TRACEFS_INSTANCE_INODE)
break;
parent = parent->d_parent;
}
return parent->d_inode;



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Al Viro
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:

> + /* Get the tracefs root from the parent */
> + inode = d_inode(dentry->d_parent);
> + inode = d_inode(inode->i_sb->s_root);

That makes no sense.  First of all, for any positive dentry we have
dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
dentries on given superblock.  So what's the point of that dance?
If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
and be done with that...



[PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Instead of walking the dentries on mount/remount to update the gid values of
all the dentries if a gid option is specified on mount, just update the root
inode. Add .getattr, .setattr, and .permissions on the tracefs inode
operations to update the permissions of the files and directories.

For all files and directories in the top level instance:

 /sys/kernel/tracing/*

It will use the root inode as the default permissions. The inode that
represents: /sys/kernel/tracing (or wherever it is mounted).

When an instance is created:

 mkdir /sys/kernel/tracing/instance/foo

The directory "foo" and all its files and directories underneath will use
the default of what foo is when it was created. A remount of tracefs will
not affect it.

If a user were to modify the permissions of any file or directory in
tracefs, it will also no longer be modified by a change in ownership of a
remount.

The events directory, if it is in the top level instance, will use the
tracefs root inode as the default ownership for itself and all the files and
directories below it.

For the events directory in an instance ("foo"), it will keep the ownership
of what it was when it was created, and that will be used as the default
ownership for the files and directories beneath it.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c |  80 ++-
 fs/tracefs/inode.c   | 205 ++-
 fs/tracefs/internal.h|   3 +
 3 files changed, 198 insertions(+), 90 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 53d34a4b5a2b..641bffa0f139 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -45,6 +45,7 @@ enum {
EVENTFS_SAVE_MODE   = BIT(16),
EVENTFS_SAVE_UID= BIT(17),
EVENTFS_SAVE_GID= BIT(18),
+   EVENTFS_TOPLEVEL= BIT(19),
 };
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
@@ -115,10 +116,17 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * The events directory dentry is never freed, unless its
 * part of an instance that is deleted. It's attr is the
 * default for its child files and directories.
-* Do not update it. It's not used for its own mode or ownership
+* Do not update it. It's not used for its own mode or 
ownership.
 */
-   if (!ei->is_events)
+   if (ei->is_events) {
+   /* But it still needs to know if it was modified */
+   if (iattr->ia_valid & ATTR_UID)
+   ei->attr.mode |= EVENTFS_SAVE_UID;
+   if (iattr->ia_valid & ATTR_GID)
+   ei->attr.mode |= EVENTFS_SAVE_GID;
+   } else {
update_attr(>attr, iattr);
+   }
 
} else {
name = dentry->d_name.name;
@@ -136,9 +144,67 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
return ret;
 }
 
+static void update_top_events_attr(struct eventfs_inode *ei, struct dentry 
*dentry)
+{
+   struct inode *inode;
+
+   /* Only update if the "events" was on the top level */
+   if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   return;
+
+   /* Get the tracefs root from the parent */
+   inode = d_inode(dentry->d_parent);
+   inode = d_inode(inode->i_sb->s_root);
+   ei->attr.uid = inode->i_uid;
+   ei->attr.gid = inode->i_gid;
+}
+
+static void set_top_events_ownership(struct inode *inode)
+{
+   struct tracefs_inode *ti = get_tracefs(inode);
+   struct eventfs_inode *ei = ti->private;
+   struct dentry *dentry;
+
+   /* The top events directory doesn't get automatically updated */
+   if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   return;
+
+   dentry = ei->dentry;
+
+   update_top_events_attr(ei, dentry);
+
+   if (!(ei->attr.mode & EVENTFS_SAVE_UID))
+   inode->i_uid = ei->attr.uid;
+
+   if (!(ei->attr.mode & EVENTFS_SAVE_GID))
+   inode->i_gid = ei->attr.gid;
+}
+
+static int eventfs_get_attr(struct mnt_idmap *idmap,
+   const struct path *path, struct kstat *stat,
+   u32 request_mask, unsigned int flags)
+{
+   struct dentry *dentry = path->dentry;
+   struct inode *inode = d_backing_inode(dentry);
+
+   set_top_events_ownership(inode);
+
+   generic_fillattr(idmap, request_mask, inode, stat);
+   return 0;
+}
+
+static int eventfs_permission(struct mnt_idmap *idmap,
+ struct inode *inode, int mask)
+{
+   set_top_events_ownership(inode);
+   return