Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 12 Jan 2024 08:53:44 -0500 Steven Rostedt wrote: > > // We managed to open the directory so we have permission to list > > // directory entries in "xfs". > > fd = open("/sys/kernel/tracing/events/xfs"); > > > > // Remove ownership so we can't open the directory anymore > > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > > > // Or just remove exec bit for the group and restrict to owner > > chmod("/sys/kernel/tracing/events/xfs", 700); > > > > // Drop caches to force an eventfs_root_lookup() on everything > > write("/proc/sys/vm/drop_caches", "3", 1); > > This requires opening the directory, then having it's permissions > change, and then immediately dropping the caches. > > > > > // Returns 0 even though directory has a lot of entries and we should be > > // able to list them > > getdents64(fd, ...); > > And do we care? Hmm, maybe the issue you have is with the inconsistency of the action? For this to fail, it would require the admin to do both change the permission and to flush caches. If you don't flush the caches then the task with the dir open can still read it regardless. If the dentries were already created. In that case I'm fine if we change the creation of the dentries to not check the permission. But for now, it's just a weird side effect that I don't really see how it would affect any user's workflow. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 12 Jan 2024 09:27:24 +0100 Christian Brauner wrote: > On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote: > > On Thu, 11 Jan 2024 22:01:32 +0100 > > Christian Brauner wrote: > > > > > What I'm pointing out in the current logic is that the caller is > > > taxed twice: > > > > > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > > > (2) And again when you call lookup_one_len() in eventfs_start_creating() > > > _because_ the permission check in lookup_one_len() is the exact > > > same permission check again that the vfs has done > > > inode_permission(MAY_EXEC, "xfs"). > > > > As I described in: > > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ > > > > The eventfs files below "events" doesn't need the .permissions callback at > > all. It's only there because the "events" inode uses it. > > > > The .permissions call for eventfs has: > > It doesn't matter whether there's a ->permission handler. If you don't > add one explicitly the VFS will simply call generic_permission(): > > inode_permission() > -> do_inode_permission() >{ > if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { > if (likely(inode->i_op->permission)) > return inode->i_op->permission(idmap, inode, mask); > > /* This gets set once for the inode lifetime */ > spin_lock(>i_lock); > inode->i_opflags |= IOP_FASTPERM; > spin_unlock(>i_lock); > } > return generic_permission(idmap, inode, mask); >} Yes I know that, because that's where I knew what to call in the non "events" dir case. > > > Anyway, the issue is with "events" directory and remounting, because like > > the tracefs system, the inode and dentry for "evnets" is created at boot > > up, before the mount happens. The VFS layer is going to check the > > permissions of its inode and dentry, which will be incorrect if the mount > > was mounted with a "gid" option. > > The gid option has nothing to do with this and it is just handled fine > if you remove the second permission checking in (2). I guess I'm confused to what you are having an issue with. Is it just that the permission check gets called twice? > > You need to remove the inode_permission() code from > eventfs_start_creating(). It is just an internal lookup and the fact > that you have it in there allows userspace to break readdir on the > eventfs portions of tracefs as I've shown in the parts of the mail that > you cut off. That's because I didn't see how it was related to the way I fixed the mount=gid issue. Are you only concerned because of the check in eventfs_start_creating()? Yes, you posted code that would make things act funny for some code that I see no real world use case for. Yeah, it may not act "properly" but I'm not sure that's bad. Here, I'll paste it back: > // We managed to open the directory so we have permission to list > // directory entries in "xfs". > fd = open("/sys/kernel/tracing/events/xfs"); > > // Remove ownership so we can't open the directory anymore > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > // Or just remove exec bit for the group and restrict to owner > chmod("/sys/kernel/tracing/events/xfs", 700); > > // Drop caches to force an eventfs_root_lookup() on everything > write("/proc/sys/vm/drop_caches", "3", 1); This requires opening the directory, then having it's permissions change, and then immediately dropping the caches. > > // Returns 0 even though directory has a lot of entries and we should be > // able to list them > getdents64(fd, ...); And do we care? Since tracing exposes internal kernel information, perhaps this is a feature and not a bug. If someone who had access to the tracing system and you wanted to stop them, if they had a directory open that they no longer have access to, you don't want them to see what's left in the directory. In other words, I like the idea that the getends64(fd, ...) will fail! If there's a file underneath that wasn't change, and the admin thought that just keeping the top directory permissions off is good enough, then that attacker having that directory open before the directory had it's file permissions change is a way to still have access to the files below it. > > And the failure is in the inode_permission(MAY_EXEC, "xfs") call in > lookup_one_len() in eventfs_start_creating() which now fails. And I think is a good thing! Again, tracefs is special. It gives you access and possibly control to the kernel behavior. I like the fact that as soon as someone loses permission to a directory, they immediately lose it. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote: > On Thu, 11 Jan 2024 22:01:32 +0100 > Christian Brauner wrote: > > > What I'm pointing out in the current logic is that the caller is > > taxed twice: > > > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > > (2) And again when you call lookup_one_len() in eventfs_start_creating() > > _because_ the permission check in lookup_one_len() is the exact > > same permission check again that the vfs has done > > inode_permission(MAY_EXEC, "xfs"). > > As I described in: > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ > > The eventfs files below "events" doesn't need the .permissions callback at > all. It's only there because the "events" inode uses it. > > The .permissions call for eventfs has: It doesn't matter whether there's a ->permission handler. If you don't add one explicitly the VFS will simply call generic_permission(): inode_permission() -> do_inode_permission() { if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { if (likely(inode->i_op->permission)) return inode->i_op->permission(idmap, inode, mask); /* This gets set once for the inode lifetime */ spin_lock(>i_lock); inode->i_opflags |= IOP_FASTPERM; spin_unlock(>i_lock); } return generic_permission(idmap, inode, mask); } > Anyway, the issue is with "events" directory and remounting, because like > the tracefs system, the inode and dentry for "evnets" is created at boot > up, before the mount happens. The VFS layer is going to check the > permissions of its inode and dentry, which will be incorrect if the mount > was mounted with a "gid" option. The gid option has nothing to do with this and it is just handled fine if you remove the second permission checking in (2). You need to remove the inode_permission() code from eventfs_start_creating(). It is just an internal lookup and the fact that you have it in there allows userspace to break readdir on the eventfs portions of tracefs as I've shown in the parts of the mail that you cut off.
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 11 Jan 2024 22:01:32 +0100 Christian Brauner wrote: > What I'm pointing out in the current logic is that the caller is > taxed twice: > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > (2) And again when you call lookup_one_len() in eventfs_start_creating() > _because_ the permission check in lookup_one_len() is the exact > same permission check again that the vfs has done > inode_permission(MAY_EXEC, "xfs"). As I described in: https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ The eventfs files below "events" doesn't need the .permissions callback at all. It's only there because the "events" inode uses it. The .permissions call for eventfs has: static int eventfs_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { set_top_events_ownership(inode); return generic_permission(idmap, inode, mask); } Where the "set_top_events_ownership() is a nop for everything but the "events" directory. I guess I could have two ops: static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, .permission = eventfs_permission, }; static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, }; And use the second one for all dentries below the root, but I figured it's not that big of a deal if I called the permissions on all. Perhaps I should do it with two? Anyway, the issue is with "events" directory and remounting, because like the tracefs system, the inode and dentry for "evnets" is created at boot up, before the mount happens. The VFS layer is going to check the permissions of its inode and dentry, which will be incorrect if the mount was mounted with a "gid" option. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 10, 2024 at 08:07:46AM -0500, Steven Rostedt wrote: > On Wed, 10 Jan 2024 12:45:36 +0100 > Christian Brauner wrote: > > > So say you do: > > > > mkdir /sys/kernel/tracing/instances/foo > > > > After this has returned we know everything we need to know about the new > > tracefs instance including the ownership and the mode of all inodes in > > /sys/kernel/tracing/instances/foo/events/* and below precisely because > > ownership is always inherited from the parent dentry and recorded in the > > metadata struct eventfs_inode. > > > > So say someone does: > > > > open("/sys/kernel/tracing/instances/foo/events/xfs"); > > > > and say this is the first time that someone accesses that events/ > > directory. > > > > When the open pathwalk is done, the vfs will determine via > > > > [1] may_lookup(inode_of(events)) > > > > whether you are able to list entries such as "xfs" in that directory. > > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds > > it ends up calling i_op->eventfs_root_lookup(events). > > > > At this point tracefs/eventfs adds the inodes for all entries in that > > "events" directory including "xfs" based on the metadata it recorded > > during the mkdir. Since now someone is actually interested in them. And > > it initializes the inodes with ownership and everything and adds the > > dentries that belong into that directory. > > > > Nothing here depends on the permissions of the caller. The only > > permission that mattered was done in the VFS in [1]. If the caller has > > permissions to enter a directory they can lookup and list its contents. > > And its contents where determined/fixed etc when mkdir was called. > > > > So we just need to add the required objects into the caches (inode, > > dentry) whose addition we intentionally defered until someone actually > > needed them. > > > > So, eventfs_root_lookup() now initializes the inodes with the ownership > > from the stored metadata or from the parent dentry and splices in inodes > > and dentries. No permission checking is needed for this because it is > > always a recheck of what the vfs did in [1]. > > > > We now return to the vfs and path walk continues to the final component > > that you actually want to open which is that "xfs" directory in this > > example. We check the permissions on that inode via may_open("xfs") and > > we open that directory returning an fd to userspace ultimately. > > > > (I'm going by memory since I need to step out the door.) > > So, let's say we do: > > chgrp -R rostedt /sys/kernel/tracing/ The rostedt group needs exec permissions and "other" cannot have exec permissions otherwise you can trivially list the entries even if it's owned by root: chmod 750 /sys/kernel/tracing user1@localhost:~$ ls -aln /sys/kernel/ | grep tracing drwxr-x--- 6 0 10000 Jan 11 18:23 tracing > > But I don't want rostedt to have access to xfs > > chgrp -R root /sys/kernel/tracing/events/xfs chmod 750 /sys/kernel/tracing/events/xfs user1@localhost:~$ ls -aln /sys/kernel/tracing/events/ | grep xfs drwxr-x--- 601 0 0 0 Jan 11 18:24 xfs This ensure that if a user is in the group and the group has exec perms lookup is possible (For root this will usually work because CAP_DAC_READ_SEARCH overrides the exec requirement.). > > Both actions will create the inodes and dentries of all files and > directories (because of "-R"). But once that is done, the ref counts go to > zero. They stay around until reclaim. But then I open Chrome ;-) and it > reclaims all the dentries and inodes, so we are back to here we were on > boot. > > Now as rostedt I do: > > ls /sys/kernel/tracing/events/xfs > > The VFS layer doesn't know if I have permission to that or not, because all > the inodes and dentries have been freed. It has to call back to eventfs to > find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will > recreated the inodes with the proper permission. Very roughly, ignoring most of the complexity of lookup and focussing on the permission checking: When a caller looks up an entry in a directory then the VFS will call inode_permission(MAY_EXEC) on the directory the caller is trying to perform that lookup in. If the caller wants to lookup the "events" entry in the "tracing" directory then the VFS will call inode_permission(MAY_EXEC, "tracing") and then - assuming it's not in the cache - call into the lookup method of the filesystem. After the VFS has determined that the caller has the permissions to lookup the "events" entry in the "tracing" directory no further permission checks are needed. Path lookup now moves on to the next part which is looking up the "xfs" entry in the "events" directory. So again, VFS calls inode_permission(MAY_EXEC, "events") finds that caller has the permissions and then goes on to call into eventfs_root_lookup(). The VFS has already given the caller a dentry for "xfs" and is passing that down to tracefs/eventfs. So eventfs gets
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 10:52:51 -0500 Steven Rostedt wrote: > On Wed, 10 Jan 2024 08:07:46 -0500 > Steven Rostedt wrote: > > > Or are you saying that I don't need the ".permission" callback, because > > eventfs does it when it creates the inodes? But for eventfs to know what > > the permissions changes are, it uses .getattr and .setattr. > > OK, if your main argument is that we do not need .permission, I agree with > you. But that's a trivial change and doesn't affect the complexity that > eventfs is doing. In fact, removing the "permission" check is simply this > patch: > > -- > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index fdff53d5a1f8..f2af07a857e2 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, > return 0; > } > > -static int eventfs_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > -{ > - set_top_events_ownership(inode); > - return generic_permission(idmap, inode, mask); > -} > - > static const struct inode_operations eventfs_root_dir_inode_operations = { > .lookup = eventfs_root_lookup, > .setattr= eventfs_set_attr, > .getattr= eventfs_get_attr, > - .permission = eventfs_permission, > }; > > static const struct inode_operations eventfs_file_inode_operations = { > -- > > I only did that because Linus mentioned it, and I thought it was needed. > I'll apply this patch too, as it appears to work with this code. Oh, eventfs files and directories don't need the .permissions because its inodes and dentries are not created until accessed. But the "events" directory itself has its dentry and inode created at boot up, but still uses the eventfs_root_dir_inode_operations. So the .permissions is still needed! If you look at the "set_top_events_ownership()" function, it has: /* The top events directory doesn't get automatically updated */ if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; That is, it does nothing if the entry is not the "events" directory. It falls back to he default "->permissions()" function for everything but the top level "events" directory. But this and .getattr are still needed for the events directory, because it suffers the same issue as the other tracefs entries. That is, it's inodes and dentries are created at boot up before it is mounted. So if the mount has gid=1000, it will be ignored. The .getattr is called by "stat" which ls does. So after boot up if you just do: # chmod 0750 /sys/kernel/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ls /sys/kernel/tracing/events/ 9pext4iomapmodule raw_syscalls thermal alarmtimerfib iommumsr rcu thp avc fib6io_uring napiregmaptimer block filelockipi neigh regulator tlb bpf_test_run filemap irq net resctrl udp bpf_trace ftrace irq_matrix netfs rpm virtio_gpu[ ...] The above works because "ls" does a stat() on the directory first, which does a .getattr() call that updates the permissions of the existing "events" directory inode. BUT! If I had used my own getents() program that has: fd = openat(AT_FDCWD, argv[1], O_RDONLY); if (fd < 0) perror("openat"); n = getdents64(fd, buf, BUF_SIZE); if (n < 0) perror("getdents64"); Where it calls the openat() without doing a stat fist, and after boot, had done: # chmod 0750 /sys/kernel/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ./getdents /sys/kernel/tracing/events openat: Permission denied getdents64: Bad file descriptor It errors because he "events" inode permission hasn't been updated yet. Now after getting the above error, if I do the "ls" and then run it again: $ ls /sys/kernel/tracing/events > /dev/null $ ./getdents /sys/kernel/tracing/events enable header_page header_event initcall vsyscall syscalls it works! so no, I can't remove that .permissions callback from eventfs. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 10:52:51 -0500 Steven Rostedt wrote: > I'll apply this patch too, as it appears to work with this code. I meant "appears to work without this code". -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 08:07:46 -0500 Steven Rostedt wrote: > Or are you saying that I don't need the ".permission" callback, because > eventfs does it when it creates the inodes? But for eventfs to know what > the permissions changes are, it uses .getattr and .setattr. OK, if your main argument is that we do not need .permission, I agree with you. But that's a trivial change and doesn't affect the complexity that eventfs is doing. In fact, removing the "permission" check is simply this patch: -- diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..f2af07a857e2 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, return 0; } -static int eventfs_permission(struct mnt_idmap *idmap, - struct inode *inode, int mask) -{ - set_top_events_ownership(inode); - return generic_permission(idmap, inode, mask); -} - static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, - .permission = eventfs_permission, }; static const struct inode_operations eventfs_file_inode_operations = { -- I only did that because Linus mentioned it, and I thought it was needed. I'll apply this patch too, as it appears to work with this code. Thanks! -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 12:45:36 +0100 Christian Brauner wrote: > So say you do: > > mkdir /sys/kernel/tracing/instances/foo > > After this has returned we know everything we need to know about the new > tracefs instance including the ownership and the mode of all inodes in > /sys/kernel/tracing/instances/foo/events/* and below precisely because > ownership is always inherited from the parent dentry and recorded in the > metadata struct eventfs_inode. > > So say someone does: > > open("/sys/kernel/tracing/instances/foo/events/xfs"); > > and say this is the first time that someone accesses that events/ > directory. > > When the open pathwalk is done, the vfs will determine via > > [1] may_lookup(inode_of(events)) > > whether you are able to list entries such as "xfs" in that directory. > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds > it ends up calling i_op->eventfs_root_lookup(events). > > At this point tracefs/eventfs adds the inodes for all entries in that > "events" directory including "xfs" based on the metadata it recorded > during the mkdir. Since now someone is actually interested in them. And > it initializes the inodes with ownership and everything and adds the > dentries that belong into that directory. > > Nothing here depends on the permissions of the caller. The only > permission that mattered was done in the VFS in [1]. If the caller has > permissions to enter a directory they can lookup and list its contents. > And its contents where determined/fixed etc when mkdir was called. > > So we just need to add the required objects into the caches (inode, > dentry) whose addition we intentionally defered until someone actually > needed them. > > So, eventfs_root_lookup() now initializes the inodes with the ownership > from the stored metadata or from the parent dentry and splices in inodes > and dentries. No permission checking is needed for this because it is > always a recheck of what the vfs did in [1]. > > We now return to the vfs and path walk continues to the final component > that you actually want to open which is that "xfs" directory in this > example. We check the permissions on that inode via may_open("xfs") and > we open that directory returning an fd to userspace ultimately. > > (I'm going by memory since I need to step out the door.) So, let's say we do: chgrp -R rostedt /sys/kernel/tracing/ But I don't want rostedt to have access to xfs chgrp -R root /sys/kernel/tracing/events/xfs Both actions will create the inodes and dentries of all files and directories (because of "-R"). But once that is done, the ref counts go to zero. They stay around until reclaim. But then I open Chrome ;-) and it reclaims all the dentries and inodes, so we are back to here we were on boot. Now as rostedt I do: ls /sys/kernel/tracing/events/xfs The VFS layer doesn't know if I have permission to that or not, because all the inodes and dentries have been freed. It has to call back to eventfs to find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will recreated the inodes with the proper permission. Or are you saying that I don't need the ".permission" callback, because eventfs does it when it creates the inodes? But for eventfs to know what the permissions changes are, it uses .getattr and .setattr. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, Jan 08, 2024 at 10:23:31AM -0500, Steven Rostedt wrote: > On Mon, 8 Jan 2024 12:04:54 +0100 > Christian Brauner wrote: > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > > redundant and just wrong. > > > > > > I don't think so. > > > > I'm very well aware that the dentries and inode aren't created during > > mkdir but the completely directory layout is determined. You're just > > splicing in dentries and inodes during lookup and readdir. > > > > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later > > do a lookup/readdir on > > > > ls -al /sys/kernel/tracing/instances/foo/events > > > > Why should the creation of the dentries and inodes ever fail due to a > > permission failure? > > They shouldn't. > > > The vfs did already verify that you had the required > > permissions to list entries in that directory. Why should filling up > > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't > > That tracefs instance would be half-functional. And again, right now > > that inode_permission() check cannot even fail. > > And it shouldn't. But without dentries and inodes, how does VFS know what > is allowed to open the files? So say you do: mkdir /sys/kernel/tracing/instances/foo After this has returned we know everything we need to know about the new tracefs instance including the ownership and the mode of all inodes in /sys/kernel/tracing/instances/foo/events/* and below precisely because ownership is always inherited from the parent dentry and recorded in the metadata struct eventfs_inode. So say someone does: open("/sys/kernel/tracing/instances/foo/events/xfs"); and say this is the first time that someone accesses that events/ directory. When the open pathwalk is done, the vfs will determine via [1] may_lookup(inode_of(events)) whether you are able to list entries such as "xfs" in that directory. The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds it ends up calling i_op->eventfs_root_lookup(events). At this point tracefs/eventfs adds the inodes for all entries in that "events" directory including "xfs" based on the metadata it recorded during the mkdir. Since now someone is actually interested in them. And it initializes the inodes with ownership and everything and adds the dentries that belong into that directory. Nothing here depends on the permissions of the caller. The only permission that mattered was done in the VFS in [1]. If the caller has permissions to enter a directory they can lookup and list its contents. And its contents where determined/fixed etc when mkdir was called. So we just need to add the required objects into the caches (inode, dentry) whose addition we intentionally defered until someone actually needed them. So, eventfs_root_lookup() now initializes the inodes with the ownership from the stored metadata or from the parent dentry and splices in inodes and dentries. No permission checking is needed for this because it is always a recheck of what the vfs did in [1]. We now return to the vfs and path walk continues to the final component that you actually want to open which is that "xfs" directory in this example. We check the permissions on that inode via may_open("xfs") and we open that directory returning an fd to userspace ultimately. (I'm going by memory since I need to step out the door.)
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, 8 Jan 2024 12:32:46 +0100 Christian Brauner wrote: > On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote: > > On Sun, 7 Jan 2024 13:29:12 -0500 > > Steven Rostedt wrote: > > > > > > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > > redundant and just wrong. > > > > > > I don't think so. > > > > Just to make it clear. eventfs has nothing to do with mkdir instance/foo. > > It exists without that. Although one rationale to do eventfs was so > > Every instance/foo/ tracefs instances also contains an events directory > and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's > not a separate filesystem. Both eventfs and tracefs are on the same > single, system wide superblock. > > > that the instance directories wouldn't recreate the same 10thousands > > event inodes and dentries for every mkdir done. > > I know but that's irrelevant to what I'm trying to tell you. > > A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs > instance. With or without the on-demand dentry and inode creation for > the eventfs portion that tracefs "instance" has now been created in its > entirety including all the required information for someone to later > come along and perform a lookup on /sys/kernel/tracing/instances/foo/events. > > All you've done is to defer the addition of the dentries and inodes when > someone does actually look at the events directory of the tracefs > instance. > > Whether you choose to splice in the dentries and inodes for the eventfs > portion during lookup and readdir or if you had chosen to not do the > on-demand thing at all and the entries were created at the same time as > the mkdir call are equivalent from the perspective of permission > checking. > > If you have the required permissions to look at the events directory > then there's no reason why listing the directory entries in there should > fail. This can't even happen right now. Ah, I think I know where the confusion lies. The tracing information in kernel/trace/*.c doesn't keep track of permission. It relies totally on fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with 'gid=xxx' then it's up to tracefs to maintain that information and not the tracing subsystem. The tracing subsystem only gives the "default" permissions (before boot finishes). The difference between normal file systems and pseudo file systems like debugfs and tracefs, is that normal file systems keep the permission information stored on the external device. That is, when the inodes and dentries are created, the information is retrieved from the stored file system. I think this may actually be a failure of debugfs (and tracefs as it was based on debugfs), in that the inodes and dentries are created at the same time the "files" backing them are. Which is normally at boot up and before the file system is mounted. That is, inodes and dentries are actually coupled with the data they represent. It's not a cache for a back store like a hard drive partition. To create a file in debugfs you do: struct dentry *debugfs_create_file(const char *name, umode_t mode, struct dentry *parent, void *data, const struct file_operations *fops) That is, you pass a the name, the mode, the parent dentry, data, and the fops and that will create an inode and dentry (which is returned). This happens at boot up before user space is running and before debugfs is even mounted. Because debugfs is mostly for debugging, people don't care about how it's mounted. It is usually restricted to root only access. Especially since there's a lot of sensitive information that shouldn't be exposed to non-privileged users. The reason tracefs came about is that people asked me to be able to have access to tracing without needing to even enable debugfs. They also want to easily make it accessible to non root users and talking with Kees Cook, he recommended using ACL. But because it inherited a lot from debugfs, I started doing these tricks like walking the dentry tree to make it work a bit better. Because the dentries and inodes were created before mount, I had to play these tricks. But as Linus pointed out, that was the wrong way to do that. The right way was to use .getattr and .permission callbacks to figure out what the permissions to the files are. This has nothing to do with the creation of the files, it's about who has access to the files that the inodes point to. This sounds like another topic to bring up at LSFMM ;-) "Can we standardize pseudo file systems like debugfs and tracefs to act more like real file systems, and have inodes and dentries act as cache and not be so coupled to the data?" -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, 8 Jan 2024 12:04:54 +0100 Christian Brauner wrote: > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > redundant and just wrong. > > > > I don't think so. > > I'm very well aware that the dentries and inode aren't created during > mkdir but the completely directory layout is determined. You're just > splicing in dentries and inodes during lookup and readdir. > > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later > do a lookup/readdir on > > ls -al /sys/kernel/tracing/instances/foo/events > > Why should the creation of the dentries and inodes ever fail due to a > permission failure? They shouldn't. > The vfs did already verify that you had the required > permissions to list entries in that directory. Why should filling up > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't > That tracefs instance would be half-functional. And again, right now > that inode_permission() check cannot even fail. And it shouldn't. But without dentries and inodes, how does VFS know what is allowed to open the files? -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote: > On Sun, 7 Jan 2024 13:29:12 -0500 > Steven Rostedt wrote: > > > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > redundant and just wrong. > > > > I don't think so. > > Just to make it clear. eventfs has nothing to do with mkdir instance/foo. > It exists without that. Although one rationale to do eventfs was so Every instance/foo/ tracefs instances also contains an events directory and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's not a separate filesystem. Both eventfs and tracefs are on the same single, system wide superblock. > that the instance directories wouldn't recreate the same 10thousands > event inodes and dentries for every mkdir done. I know but that's irrelevant to what I'm trying to tell you. A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs instance. With or without the on-demand dentry and inode creation for the eventfs portion that tracefs "instance" has now been created in its entirety including all the required information for someone to later come along and perform a lookup on /sys/kernel/tracing/instances/foo/events. All you've done is to defer the addition of the dentries and inodes when someone does actually look at the events directory of the tracefs instance. Whether you choose to splice in the dentries and inodes for the eventfs portion during lookup and readdir or if you had chosen to not do the on-demand thing at all and the entries were created at the same time as the mkdir call are equivalent from the perspective of permission checking. If you have the required permissions to look at the events directory then there's no reason why listing the directory entries in there should fail. This can't even happen right now.
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
> > * Tracefs supports the creation of instances from userspace via mkdir. > > For example, > > > > mkdir /sys/kernel/tracing/instances/foo > > > > And here the idmapping is relevant so we need to make the helpers > > aware of the idmapping. > > > > I just went and plumbed this through to most helpers. > > > > There's some subtlety in eventfs. Afaict, the directories and files for > > the individual events are created on-demand during lookup or readdir. > > > > The ownership of these events is again inherited from the parent inode > > or recovered from stored state. In both cases the actual idmapping is > > irrelevant. > > > > The callchain here is: > > > > eventfs_root_lookup("xfs", "events") > > -> create_{dir,file}_dentry("xfs", "events") > >-> create_{dir,file}("xfs", "events") > > -> eventfs_start_creating("xfs", "events") > > -> lookup_one_len("xfs", "events") > > > > And the subtlety is that lookup_one_len() does permission checking on > > the parent inode (IOW, if you want a dentry for "blech" under "events" > > it'll do a permission check on events->d_inode) for exec permissions > > and then goes on to give you a new dentry. > > > > Usually this call would have to be changed to lookup_one() and the > > idmapping be handed down to it. But I think that's irrelevant here. > > > > Lookup generally doesn't need to be aware of idmappings at all. The > > permission checking is done purely in the vfs via may_lookup() and the > > idmapping is irrelevant because we always initialize inodes with the > > filesystem level ownership (see the idmappings.rst) documentation if > > you're interested in excessive details (otherwise you get inode aliases > > which you really don't want). > > > > For tracefs it would not matter for lookup per se but only because > > tracefs seemingly creates inodes/dentries during lookup (and readdir()). > > tracefs creates the inodes/dentries at boot up, it's eventfs that does > it on demand during lookup. > > For inodes/dentries: > > /sys/kernel/tracing/* is all created at boot up, except for "events". Yes. > /sys/kernel/tracing/events/* is created on demand. Yes. > > > > > But imho the permission checking done in current eventfs_root_lookup() > > via lookup_one_len() is meaningless in any way; possibly even > > (conceptually) wrong. > > > > Because, the actual permission checking for the creation of the eventfs > > entries isn't really done during lookup or readdir, it's done when mkdir > > is called: > > > > mkdir /sys/kernel/tracing/instances/foo > > No. that creates a entire new tracefs instance, which happens to > include another eventfs directory. Yes, I'm aware of all that. > No. Only the meta data is created for the eventfs directory with a > mkdir instances/foo. The inodes and dentries are not there. I know, that is what I'm saying. > > > > > When one goes and looksup stuff under foo/events/ or readdir the entries > > in that directory: > > > > fd = open("foo/events") > > readdir(fd, ...) > > > > then they are licensed to list an entry in that directory. So all that > > needs to be done is to actually list those files in that directory. And > > since they already exist (they were created during mkdir) we just need > > to splice in inodes and dentries for them. But for that we shouldn't > > check permissions on the directory again. Because we've done that > > already correctly when the VFS called may_lookup(). > > No they do not exist. I am aware. > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > redundant and just wrong. > > I don't think so. I'm very well aware that the dentries and inode aren't created during mkdir but the completely directory layout is determined. You're just splicing in dentries and inodes during lookup and readdir. If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later do a lookup/readdir on ls -al /sys/kernel/tracing/instances/foo/events Why should the creation of the dentries and inodes ever fail due to a permission failure? The vfs did already verify that you had the required permissions to list entries in that directory. Why should filling up /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't That tracefs instance would be half-functional. And again, right now that inode_permission() check cannot even fail.
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, 7 Jan 2024 13:29:12 -0500 Steven Rostedt wrote: > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > redundant and just wrong. > > I don't think so. Just to make it clear. eventfs has nothing to do with mkdir instance/foo. It exists without that. Although one rationale to do eventfs was so that the instance directories wouldn't recreate the same 10thousands event inodes and dentries for every mkdir done. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, 7 Jan 2024 13:42:39 +0100 Christian Brauner wrote: > > So, I tried to do an exploratory patch even though I promised myself not > to do it. But hey... > > Some notes: > > * Permission handling for idmapped mounts is done completely in the > VFS. That's the case for all filesytems that don't have a custom > ->permission() handler. So there's nothing to do for us here. > > * Idmapped mount handling for ->getattr() is done completely by the VFS > if the filesystem doesn't have a custom ->getattr() handler. So we're > done here. > > * Tracefs doesn't support attribute changes via ->setattr() (chown(), > chmod etc.). So there's nothing to here. > > * Eventfs does support attribute changes via ->setattr(). But it relies > on simple_setattr() which is already idmapped mount aware. So there's > nothing for us to do. > > * Ownership is inherited from the parent inode (tracefs) or optionally > from stashed ownership information (eventfs). That means the idmapping > is irrelevant here. It's similar to the "inherit gid from parent > directory" logic we have in some circumstances. TL;DR nothing to do > here as well. The reason ownership is inherited from the parent is because the inodes are created at boot up before user space starts. eventfs inodes are created on demand after user space so it needs to either check the "default" ownership and permissions, or if the user changed a specific file/directory, it must save it and use that again if the inode/dentry are reclaimed and then referenced again and recreated. > > * Tracefs supports the creation of instances from userspace via mkdir. > For example, > > mkdir /sys/kernel/tracing/instances/foo > > And here the idmapping is relevant so we need to make the helpers > aware of the idmapping. > > I just went and plumbed this through to most helpers. > > There's some subtlety in eventfs. Afaict, the directories and files for > the individual events are created on-demand during lookup or readdir. > > The ownership of these events is again inherited from the parent inode > or recovered from stored state. In both cases the actual idmapping is > irrelevant. > > The callchain here is: > > eventfs_root_lookup("xfs", "events") > -> create_{dir,file}_dentry("xfs", "events") >-> create_{dir,file}("xfs", "events") > -> eventfs_start_creating("xfs", "events") > -> lookup_one_len("xfs", "events") > > And the subtlety is that lookup_one_len() does permission checking on > the parent inode (IOW, if you want a dentry for "blech" under "events" > it'll do a permission check on events->d_inode) for exec permissions > and then goes on to give you a new dentry. > > Usually this call would have to be changed to lookup_one() and the > idmapping be handed down to it. But I think that's irrelevant here. > > Lookup generally doesn't need to be aware of idmappings at all. The > permission checking is done purely in the vfs via may_lookup() and the > idmapping is irrelevant because we always initialize inodes with the > filesystem level ownership (see the idmappings.rst) documentation if > you're interested in excessive details (otherwise you get inode aliases > which you really don't want). > > For tracefs it would not matter for lookup per se but only because > tracefs seemingly creates inodes/dentries during lookup (and readdir()). tracefs creates the inodes/dentries at boot up, it's eventfs that does it on demand during lookup. For inodes/dentries: /sys/kernel/tracing/* is all created at boot up, except for "events". /sys/kernel/tracing/events/* is created on demand. > > But imho the permission checking done in current eventfs_root_lookup() > via lookup_one_len() is meaningless in any way; possibly even > (conceptually) wrong. > > Because, the actual permission checking for the creation of the eventfs > entries isn't really done during lookup or readdir, it's done when mkdir > is called: > > mkdir /sys/kernel/tracing/instances/foo No. that creates a entire new tracefs instance, which happens to include another eventfs directory. The eventsfs directory is in "/sys/kernel/tracing/events" and "/sys/kernel/tracing/instances/*/events" eventfs has 10s of thousands of files and directories which is why I changed it to be on-demand inode/dentry creation and also reclaiming when no longer accessed. # ls /sys/kernel/tracing/events/ will create the inodes and dentries, and a memory stress program will free those created inodes and dentries. > > Here, all possible entries beneath foo including "events" and further > below are recorded and stored. So once mkdir returns it basically means > that it succeeded with the creation of all the necessary directories and > files. For all purposes the foo/events/ directory and below have all the > entries that matter. They have been created. It's comparable to them not > being in the {d,i}cache, I guess. No. Only the meta data is created for the eventfs
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, Jan 07, 2024 at 06:42:33PM +0100, Christian Brauner wrote: > On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote: > > > > So tracefs supports remounting with different uid/gid mount options and > > > > then actually wades through _all_ of the inodes and changes their > > > > ownership internally? What's the use-case for this? Containers? > > > > > > No, in fact tracing doesn't work well with containers as tracing is global > > > to the entire machine. It can work with privileged containers though. > > > > At least the tracefs interface is easily supportable within a delegation > > model. IOW, you have a privileged process that delegates relevant > > portions to a container via idmapped mounts _without_ doing the insane thing > > and making it mountable by a container aka the fs-to-CVE pipeline. > > > > > > > > The reason for this is because tracefs was based off of debugfs where the > > > files and directores are created at boot up and mounted later. The reason > > > to do this was to allow users to mount with gid=GID to allow a given group > > > to have access to tracing. Without this update, tracefs would ignore it > > > like debugfs and proc does today. > > > > > > I think its time I explain the purpose of tracefs and how it came to be. > > > > > > The tracing system required a way to control tracing and read the traces. > > > It could have just used a new system like perf (although > > > /sys/kernel/debug/tracing predates perf), where it created a single > > > ioctl() > > > like system call do do everything. > > > > > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own > > > logdev > > > tracer, which both have an embedded background, I chose an interface that > > > could work with just an unmodified version of busybox. That is, I wanted > > > it > > > to work with just cat and echo. > > > > > > The main difference with tracefs compared to other file systems is that it > > > is a control interface, where writes happen as much as reads. The data > > > read > > > is controlled. The closest thing I can think of is how cgroups work. > > > > > > As tracing is a privileged operation, but something that could be changed > > > to allow a group to have access to, I wanted to make it easy for an admin > > > to decide who gets to do what at boot up via the /etc/fstab file. > > > > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You > > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. > > > > mount(8) should just give you the ability to specify "map the ids I > > explicitly > > want to remap to something else and for the rest use the identity mapping". > > I > > wanted that for other reasons anyway. > > > > So in one of the next versions of mount(8) you can then do (where --beneath > > means place the mount beneath the current one and --replace is > > self-explanatory): > > > > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' > > /sys/kernel/tracing > > sudo umount /sys/kernel/tracing > > > > or as a shortcut provided by mount(8): > > > > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' > > /sys/kernel/tracing > > > > In both cases you replace the mount without unmounting tracefs. > > > > I can illustrate this right now though: > > > > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 > > u:0:1000:1' /sys/kernel/tracing/ /mnt/ > > > > # This is a tool I wrote for testing the patchset I wrote back then. > > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath > > --detached /mnt /sys/kernel/tracing > > Mounting beneath top mount > > Creating anonymous mount > > Attaching mount /mnt -> /sys/kernel/tracing > > Creating single detached mount > > > > user1@localhost:~/data/move-mount-beneath$ > > > > # Now there's two mounts stacked on top of each other. > > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > > | `-/sys/kernel/tracingtracefstracefs > > rw,nosuid,nodev,noexec,relatime,idmapped > > | `-/sys/kernel/tracing tracefstracefs > > rw,nosuid,nodev,noexec,relatime > > > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al > > /sys/kernel/tracing/| head > > total 0 > > drwx-- 6 root root 0 Jan 7 13:33 . > > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > > -r--r- 1 root root 0 Jan 7 13:33 README > > -r--r- 1 root root 0 Jan 7 13:33 available_events > > -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions > > -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs > > -r--r- 1 root root 0 Jan 7 13:33 available_tracers > > -rw-r- 1 root root 0 Jan 7 13:33 buffer_percent > > -rw-r- 1 root root 0 Jan 7 13:33 buffer_size_kb > > > > # Reveal updated mount > > user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing > > > > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > > |
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote: > > > So tracefs supports remounting with different uid/gid mount options and > > > then actually wades through _all_ of the inodes and changes their > > > ownership internally? What's the use-case for this? Containers? > > > > No, in fact tracing doesn't work well with containers as tracing is global > > to the entire machine. It can work with privileged containers though. > > At least the tracefs interface is easily supportable within a delegation > model. IOW, you have a privileged process that delegates relevant > portions to a container via idmapped mounts _without_ doing the insane thing > and making it mountable by a container aka the fs-to-CVE pipeline. > > > > > The reason for this is because tracefs was based off of debugfs where the > > files and directores are created at boot up and mounted later. The reason > > to do this was to allow users to mount with gid=GID to allow a given group > > to have access to tracing. Without this update, tracefs would ignore it > > like debugfs and proc does today. > > > > I think its time I explain the purpose of tracefs and how it came to be. > > > > The tracing system required a way to control tracing and read the traces. > > It could have just used a new system like perf (although > > /sys/kernel/debug/tracing predates perf), where it created a single ioctl() > > like system call do do everything. > > > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev > > tracer, which both have an embedded background, I chose an interface that > > could work with just an unmodified version of busybox. That is, I wanted it > > to work with just cat and echo. > > > > The main difference with tracefs compared to other file systems is that it > > is a control interface, where writes happen as much as reads. The data read > > is controlled. The closest thing I can think of is how cgroups work. > > > > As tracing is a privileged operation, but something that could be changed > > to allow a group to have access to, I wanted to make it easy for an admin > > to decide who gets to do what at boot up via the /etc/fstab file. > > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. > > mount(8) should just give you the ability to specify "map the ids I explicitly > want to remap to something else and for the rest use the identity mapping". I > wanted that for other reasons anyway. > > So in one of the next versions of mount(8) you can then do (where --beneath > means place the mount beneath the current one and --replace is > self-explanatory): > > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing > sudo umount /sys/kernel/tracing > > or as a shortcut provided by mount(8): > > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' > /sys/kernel/tracing > > In both cases you replace the mount without unmounting tracefs. > > I can illustrate this right now though: > > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' > /sys/kernel/tracing/ /mnt/ > > # This is a tool I wrote for testing the patchset I wrote back then. > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath > --detached /mnt /sys/kernel/tracing > Mounting beneath top mount > Creating anonymous mount > Attaching mount /mnt -> /sys/kernel/tracing > Creating single detached mount > > user1@localhost:~/data/move-mount-beneath$ > > # Now there's two mounts stacked on top of each other. > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > | `-/sys/kernel/tracingtracefstracefs > rw,nosuid,nodev,noexec,relatime,idmapped > | `-/sys/kernel/tracing tracefstracefs > rw,nosuid,nodev,noexec,relatime > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| > head > total 0 > drwx-- 6 root root 0 Jan 7 13:33 . > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > -r--r- 1 root root 0 Jan 7 13:33 README > -r--r- 1 root root 0 Jan 7 13:33 available_events > -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions > -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs > -r--r- 1 root root 0 Jan 7 13:33 available_tracers > -rw-r- 1 root root 0 Jan 7 13:33 buffer_percent > -rw-r- 1 root root 0 Jan 7 13:33 buffer_size_kb > > # Reveal updated mount > user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing > > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > | `-/sys/kernel/tracingtracefstracefs > rw,nosuid,nodev,noexec,relatime,idmapped > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| > head > total 0 > drwx-- 6 user1 user1 0 Jan 7 13:33 . > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > -r--r- 1 user1 user1 0
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
> > So tracefs supports remounting with different uid/gid mount options and > > then actually wades through _all_ of the inodes and changes their > > ownership internally? What's the use-case for this? Containers? > > No, in fact tracing doesn't work well with containers as tracing is global > to the entire machine. It can work with privileged containers though. At least the tracefs interface is easily supportable within a delegation model. IOW, you have a privileged process that delegates relevant portions to a container via idmapped mounts _without_ doing the insane thing and making it mountable by a container aka the fs-to-CVE pipeline. > > The reason for this is because tracefs was based off of debugfs where the > files and directores are created at boot up and mounted later. The reason > to do this was to allow users to mount with gid=GID to allow a given group > to have access to tracing. Without this update, tracefs would ignore it > like debugfs and proc does today. > > I think its time I explain the purpose of tracefs and how it came to be. > > The tracing system required a way to control tracing and read the traces. > It could have just used a new system like perf (although > /sys/kernel/debug/tracing predates perf), where it created a single ioctl() > like system call do do everything. > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev > tracer, which both have an embedded background, I chose an interface that > could work with just an unmodified version of busybox. That is, I wanted it > to work with just cat and echo. > > The main difference with tracefs compared to other file systems is that it > is a control interface, where writes happen as much as reads. The data read > is controlled. The closest thing I can think of is how cgroups work. > > As tracing is a privileged operation, but something that could be changed > to allow a group to have access to, I wanted to make it easy for an admin > to decide who gets to do what at boot up via the /etc/fstab file. Yeah, ok. I think you could achieve the same thing via idmapped mounts. You just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. mount(8) should just give you the ability to specify "map the ids I explicitly want to remap to something else and for the rest use the identity mapping". I wanted that for other reasons anyway. So in one of the next versions of mount(8) you can then do (where --beneath means place the mount beneath the current one and --replace is self-explanatory): sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing sudo umount /sys/kernel/tracing or as a shortcut provided by mount(8): sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing In both cases you replace the mount without unmounting tracefs. I can illustrate this right now though: user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/ # This is a tool I wrote for testing the patchset I wrote back then. user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing Mounting beneath top mount Creating anonymous mount Attaching mount /mnt -> /sys/kernel/tracing Creating single detached mount user1@localhost:~/data/move-mount-beneath$ # Now there's two mounts stacked on top of each other. user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing | `-/sys/kernel/tracingtracefstracefs rw,nosuid,nodev,noexec,relatime,idmapped | `-/sys/kernel/tracing tracefstracefs rw,nosuid,nodev,noexec,relatime user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head total 0 drwx-- 6 root root 0 Jan 7 13:33 . drwxr-xr-x 16 root root 0 Jan 7 13:33 .. -r--r- 1 root root 0 Jan 7 13:33 README -r--r- 1 root root 0 Jan 7 13:33 available_events -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs -r--r- 1 root root 0 Jan 7 13:33 available_tracers -rw-r- 1 root root 0 Jan 7 13:33 buffer_percent -rw-r- 1 root root 0 Jan 7 13:33 buffer_size_kb # Reveal updated mount user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing | `-/sys/kernel/tracingtracefstracefs rw,nosuid,nodev,noexec,relatime,idmapped user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head total 0 drwx-- 6 user1 user1 0 Jan 7 13:33 . drwxr-xr-x 16 root root 0 Jan 7 13:33 .. -r--r- 1 user1 user1 0 Jan 7 13:33 README -r--r- 1 user1 user1 0 Jan 7 13:33 available_events -r--r- 1 user1 user1 0 Jan 7 13:33 available_filter_functions -r--r- 1 user1 user1 0 Jan 7 13:33 available_filter_functions_addrs -r--r- 1 user1 user1 0 Jan 7 13:33 available_tracers
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 5 Jan 2024 15:26:28 +0100 Christian Brauner wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > Instead of walking the dentries on mount/remount to update the gid values of > > all the dentries if a gid option is specified on mount, just update the root > > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > > operations to update the permissions of the files and directories. > > > > For all files and directories in the top level instance: > > > > /sys/kernel/tracing/* > > > > It will use the root inode as the default permissions. The inode that > > represents: /sys/kernel/tracing (or wherever it is mounted). > > > > When an instance is created: > > > > mkdir /sys/kernel/tracing/instance/foo > > > > The directory "foo" and all its files and directories underneath will use > > the default of what foo is when it was created. A remount of tracefs will > > not affect it. > > That kinda sounds like eventfs should actually be a separate filesystem. > But I don't know enough about the relationship between the two concepts. It may someday become one, as eventfs is used by perf where the rest of the tracefs system is not. > > > > > If a user were to modify the permissions of any file or directory in > > tracefs, it will also no longer be modified by a change in ownership of a > > remount. > > Very odd semantics and I would recommend to avoid that. It's just plain > weird imo. > > > > > The events directory, if it is in the top level instance, will use the > > tracefs root inode as the default ownership for itself and all the files and > > directories below it. > > > > For the events directory in an instance ("foo"), it will keep the ownership > > of what it was when it was created, and that will be used as the default > > ownership for the files and directories beneath it. > > > > Link: > > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ > > > > Signed-off-by: Steven Rostedt (Google) > > --- > > So tracefs supports remounting with different uid/gid mount options and > then actually wades through _all_ of the inodes and changes their > ownership internally? What's the use-case for this? Containers? No, in fact tracing doesn't work well with containers as tracing is global to the entire machine. It can work with privileged containers though. The reason for this is because tracefs was based off of debugfs where the files and directores are created at boot up and mounted later. The reason to do this was to allow users to mount with gid=GID to allow a given group to have access to tracing. Without this update, tracefs would ignore it like debugfs and proc does today. I think its time I explain the purpose of tracefs and how it came to be. The tracing system required a way to control tracing and read the traces. It could have just used a new system like perf (although /sys/kernel/debug/tracing predates perf), where it created a single ioctl() like system call do do everything. As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev tracer, which both have an embedded background, I chose an interface that could work with just an unmodified version of busybox. That is, I wanted it to work with just cat and echo. The main difference with tracefs compared to other file systems is that it is a control interface, where writes happen as much as reads. The data read is controlled. The closest thing I can think of is how cgroups work. As tracing is a privileged operation, but something that could be changed to allow a group to have access to, I wanted to make it easy for an admin to decide who gets to do what at boot up via the /etc/fstab file. > > Aside from optimizing this and the special semantics for this eventfs > stuff that you really should think twice of doing, here's one idea for > an extension that might alleviate some of the pain: > > If you need flexible dynamic ownership change to e.g., be able to > delegate (all, a directory, a single file of) tracefs to > unprivileged/containers/whatever then you might want to consider > supporting idmapped mounts for tracefs. Because then you can do stuff > like: I'm a novice here and have no idea on how id maps work ;-) > > user1@localhost:~/data/scripts$ sudo mount --bind -o > X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt > user1@localhost:~/data/scripts$ ls -ln /run/ > total 12 > drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials > drwx-- 2 0 0 40 Jan 5 11:57 cryptsetup > drwxr-xr-x 2 0 0 60 Jan 5 11:57 dbus > drwx-- 6 0 0 280 Jan 5 11:57 incus_agent > prw--- 1 0 00 Jan 5 11:57 initctl > drwxrwxrwt 4 0 0 80 Jan 5 11:57 lock > drwxr-xr-x 3 0 0 60 Jan 5 11:57 log > drwx-- 2 0 0 40 Jan 5 11:57 lvm > -r--r--r-- 1 0 0 33 Jan 5 11:57 machine-id > -rw-r--r-- 1 0 0 101 Jan 5 11:58 motd.dynamic > drwxr-xr-x
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > Instead of walking the dentries on mount/remount to update the gid values of > all the dentries if a gid option is specified on mount, just update the root > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > operations to update the permissions of the files and directories. > > For all files and directories in the top level instance: > > /sys/kernel/tracing/* > > It will use the root inode as the default permissions. The inode that > represents: /sys/kernel/tracing (or wherever it is mounted). > > When an instance is created: > > mkdir /sys/kernel/tracing/instance/foo > > The directory "foo" and all its files and directories underneath will use > the default of what foo is when it was created. A remount of tracefs will > not affect it. That kinda sounds like eventfs should actually be a separate filesystem. But I don't know enough about the relationship between the two concepts. > > If a user were to modify the permissions of any file or directory in > tracefs, it will also no longer be modified by a change in ownership of a > remount. Very odd semantics and I would recommend to avoid that. It's just plain weird imo. > > The events directory, if it is in the top level instance, will use the > tracefs root inode as the default ownership for itself and all the files and > directories below it. > > For the events directory in an instance ("foo"), it will keep the ownership > of what it was when it was created, and that will be used as the default > ownership for the files and directories beneath it. > > Link: > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ > > Signed-off-by: Steven Rostedt (Google) > --- So tracefs supports remounting with different uid/gid mount options and then actually wades through _all_ of the inodes and changes their ownership internally? What's the use-case for this? Containers? Aside from optimizing this and the special semantics for this eventfs stuff that you really should think twice of doing, here's one idea for an extension that might alleviate some of the pain: If you need flexible dynamic ownership change to e.g., be able to delegate (all, a directory, a single file of) tracefs to unprivileged/containers/whatever then you might want to consider supporting idmapped mounts for tracefs. Because then you can do stuff like: user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt user1@localhost:~/data/scripts$ ls -ln /run/ total 12 drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials drwx-- 2 0 0 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 0 0 60 Jan 5 11:57 dbus drwx-- 6 0 0 280 Jan 5 11:57 incus_agent prw--- 1 0 00 Jan 5 11:57 initctl drwxrwxrwt 4 0 0 80 Jan 5 11:57 lock drwxr-xr-x 3 0 0 60 Jan 5 11:57 log drwx-- 2 0 0 40 Jan 5 11:57 lvm -r--r--r-- 1 0 0 33 Jan 5 11:57 machine-id -rw-r--r-- 1 0 0 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 0 0 40 Jan 5 11:57 mount drwx-- 2 0 0 40 Jan 5 11:57 multipath drwxr-xr-x 2 0 0 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 0 08 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 0 0 40 Jan 5 11:57 sudo drwxr-xr-x 24 0 0 660 Jan 5 14:30 systemd drwxr-xr-x 6 0 0 140 Jan 5 14:30 udev drwxr-xr-x 4 0 0 80 Jan 5 11:58 user -rw-rw-r-- 1 0 43 2304 Jan 5 15:15 utmp user1@localhost:~/data/scripts$ ls -ln /mnt/ total 12 drwxr-xr-x 2 1234 1000 40 Jan 5 12:12 credentials drwx-- 2 1234 1000 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 1234 1000 60 Jan 5 11:57 dbus drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 incus_agent prw--- 1 1234 10000 Jan 5 11:57 initctl drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 lock drwxr-xr-x 3 1234 1000 60 Jan 5 11:57 log drwx-- 2 1234 1000 40 Jan 5 11:57 lvm -r--r--r-- 1 1234 1000 33 Jan 5 11:57 machine-id -rw-r--r-- 1 1234 1000 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 mount drwx-- 2 1234 1000 40 Jan 5 11:57 multipath drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 1234 10008 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 1234 1000 40 Jan 5 11:57 sudo drwxr-xr-x 24 1234 1000 660 Jan 5 14:30 systemd drwxr-xr-x 6 1234 1000 140 Jan 5 14:30 udev drwxr-xr-x 4 1234 1000 80 Jan 5 11:58 user -rw-rw-r-- 1 1234 65534 2304 Jan 5 15:15 utmp Where you can see that ownership of this tmpfs instance in this example is changed. I'm not trying to advocate here but this will probably ultimately be nicer for your users because it means that a container manager or whatever can be handed a part of tracefs (or all of it) and the ownership and access rights for that thing is correct. And you can get rid of that gid based access completely. You can change uids, gids, or both. You can
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote: > On Thu, 4 Jan 2024 01:48:37 + > Al Viro wrote: > > > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > > > + /* Get the tracefs root from the parent */ > > > + inode = d_inode(dentry->d_parent); > > > + inode = d_inode(inode->i_sb->s_root); > > > > That makes no sense. First of all, for any positive dentry we have > > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > > dentries on given superblock. So what's the point of that dance? > > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > > and be done with that... > > That was more of thinking that the dentry and dentry->d_parent are > different. As dentry is part of eventfs and dentry->d_parent is part of > tracefs. ??? >Currently they both have the same superblock so yeah, I could just > write it that way too and it would work. But in my head, I was thinking > that they behave differently and maybe one day eventfs would get its own > superblock which would not work. ->d_parent *always* points to the same filesystem; if you get an (automounted?) mountpoint there, ->d_parent simply won't work - it will point to dentry itself. > To explain this better: > > /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events > > But everything but "events" in /sys/kernel/tracing/* is part of tracefs. > Everything in /sys/kernel/tracing/events is part of eventfs. > > That was my thought process. But as both tracefs and eventfs still use > tracefs_get_inode(), it would work as you state. > > I'll update that, as I don't foresee that eventfs will become its own file > system. There is no way to get to underlying mountpoint by dentry - simply because the same fs instance can be mounted in any number of places. A crude overview of taxonomy: file_system_type: what filesystem instances belong to. Not quite the same thing as fs driver (one driver can provide several of those). Usually it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...). super_block: individual filesystem instance. Hosts dentry tree (connected or several disconnected parts - think NFSv4 or the state while trying to get a dentry by fhandle, etc.). dentry: object in a filesystem's directory tree(s). Always belongs to specific filesystem instance - that relationship never changes. Tree structure (and names) _within_ _filesystem_ belong on that level. ->d_parent is part of that tree structure; never NULL, root of a (sub)tree has it pointing to itself. Might be negative, might refer to a filesystem object (file, directory, symlink, etc.). inode: filesystem object (file, directory, etc.). Always belongs to specific filesystem instance. Non-directory inodes might have any number of dentry instances (aliases) refering to it; a directory one - no more than one. Filesystem object contents belongs here; multiple hardlinks have different dentries and the same inode. Of course, filesystem type in question might have no such thing as multiple hardlinks - that's up to filesystem. In general there is no way to find (or enumerate) such links; e.g. a local filesystem might have an extra hardlink somewhere we had never looked at and there won't be any dentries for such hardlinks and no way to get them short of doing the full search of the entire tree. The situation with e.g. NFS client is even worse, obviously. mount: in a sense, mount to super_block is what dentry is to inode. It provides a view of (sub)tree hosted in given filesystem instance. The same filesystem may have any number of mounts, refering to its subtrees (possibly the same subtree for each, possibly all different - up to the callers of mount(2)). They form mount tree(s) - that's where the notions related to "this mounted on top of that" belong. Note that they can be moved around - with no telling the filesystem about that happening. Again, there's no such thing as "the mountpoint of given filesystem instance" - it might be mounted in any number of places at the same time. Specific mount - sure, no problem, except that it can move around. namespace: mount tree. Unlike everything prior, this one is a part of process state - same as descriptor table, mappings, etc. file: opened IO channel. It does refer to specific mount and specific dentry (and thus filesystem instance and an inode on it). Current IO position lives here, so does any per-open(2) state. descriptor table: mapping from numbers to IO channels (opened files). Again, a part of process state. dup() creates a new entry, with reference to the same file as the old one; multiple open() of the same pathname will each yield a separate opened file. _Some_ state belongs here (close-on-exec, mostly). Note that there's no such thing as "the descriptor of this file" - not even "the user-supplied number that had been used to get the file we are currently reading from", since that number might be refering to
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 4 Jan 2024 01:48:37 + Al Viro wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > + /* Get the tracefs root from the parent */ > > + inode = d_inode(dentry->d_parent); > > + inode = d_inode(inode->i_sb->s_root); > > That makes no sense. First of all, for any positive dentry we have > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > dentries on given superblock. So what's the point of that dance? > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > and be done with that... That was more of thinking that the dentry and dentry->d_parent are different. As dentry is part of eventfs and dentry->d_parent is part of tracefs. Currently they both have the same superblock so yeah, I could just write it that way too and it would work. But in my head, I was thinking that they behave differently and maybe one day eventfs would get its own superblock which would not work. To explain this better: /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events But everything but "events" in /sys/kernel/tracing/* is part of tracefs. Everything in /sys/kernel/tracing/events is part of eventfs. That was my thought process. But as both tracefs and eventfs still use tracefs_get_inode(), it would work as you state. I'll update that, as I don't foresee that eventfs will become its own file system. Thanks, -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 4 Jan 2024 01:59:10 + Al Viro wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > +static struct inode *instance_inode(struct dentry *parent, struct inode > > *inode) > > +{ > > + struct tracefs_inode *ti; > > + struct inode *root_inode; > > + > > + root_inode = d_inode(inode->i_sb->s_root); > > + > > + /* If parent is NULL then use root inode */ > > + if (!parent) > > + return root_inode; > > + > > + /* Find the inode that is flagged as an instance or the root inode */ > > + do { > > + inode = d_inode(parent); > > + if (inode == root_inode) > > + return root_inode; > > + > > + ti = get_tracefs(inode); > > + > > + if (ti->flags & TRACEFS_INSTANCE_INODE) > > + return inode; > > + } while ((parent = parent->d_parent)); > > *blink* > > This is equivalent to > ... > parent = parent->d_parent; > } while (true); Yeah, that loop went through a few iterations as I first thought that root was a tracefs_inode and the get_tracefs() would work on it. No, it was not, and it caused a cash. But I didn't rewrite the loop well after fixing that. I was also not sure if parent could be NULL, and wanted to protect against it. > > ->d_parent is *never* NULL. And what the hell is that loop supposed to do, > anyway? Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE? > > If root is not marked that way, I would suggest > if (!parent) > parent = inode->i_sb->s_root; > while (!IS_ROOT(parent)) { > struct tracefs_inode *ti = get_tracefs(parent->d_inode); > if (ti->flags & TRACEFS_INSTANCE_INODE) > break; > parent = parent->d_parent; > } > return parent->d_inode; Sure, I could rewrite it that way too. Thanks, -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > +static struct inode *instance_inode(struct dentry *parent, struct inode > *inode) > +{ > + struct tracefs_inode *ti; > + struct inode *root_inode; > + > + root_inode = d_inode(inode->i_sb->s_root); > + > + /* If parent is NULL then use root inode */ > + if (!parent) > + return root_inode; > + > + /* Find the inode that is flagged as an instance or the root inode */ > + do { > + inode = d_inode(parent); > + if (inode == root_inode) > + return root_inode; > + > + ti = get_tracefs(inode); > + > + if (ti->flags & TRACEFS_INSTANCE_INODE) > + return inode; > + } while ((parent = parent->d_parent)); *blink* This is equivalent to ... parent = parent->d_parent; } while (true); ->d_parent is *never* NULL. And what the hell is that loop supposed to do, anyway? Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE? If root is not marked that way, I would suggest if (!parent) parent = inode->i_sb->s_root; while (!IS_ROOT(parent)) { struct tracefs_inode *ti = get_tracefs(parent->d_inode); if (ti->flags & TRACEFS_INSTANCE_INODE) break; parent = parent->d_parent; } return parent->d_inode;
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > + /* Get the tracefs root from the parent */ > + inode = d_inode(dentry->d_parent); > + inode = d_inode(inode->i_sb->s_root); That makes no sense. First of all, for any positive dentry we have dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all dentries on given superblock. So what's the point of that dance? If you want the root inode, just go for d_inode(dentry->d_sb->s_root) and be done with that...
[PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
From: "Steven Rostedt (Google)" Instead of walking the dentries on mount/remount to update the gid values of all the dentries if a gid option is specified on mount, just update the root inode. Add .getattr, .setattr, and .permissions on the tracefs inode operations to update the permissions of the files and directories. For all files and directories in the top level instance: /sys/kernel/tracing/* It will use the root inode as the default permissions. The inode that represents: /sys/kernel/tracing (or wherever it is mounted). When an instance is created: mkdir /sys/kernel/tracing/instance/foo The directory "foo" and all its files and directories underneath will use the default of what foo is when it was created. A remount of tracefs will not affect it. If a user were to modify the permissions of any file or directory in tracefs, it will also no longer be modified by a change in ownership of a remount. The events directory, if it is in the top level instance, will use the tracefs root inode as the default ownership for itself and all the files and directories below it. For the events directory in an instance ("foo"), it will keep the ownership of what it was when it was created, and that will be used as the default ownership for the files and directories beneath it. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 80 ++- fs/tracefs/inode.c | 205 ++- fs/tracefs/internal.h| 3 + 3 files changed, 198 insertions(+), 90 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 53d34a4b5a2b..641bffa0f139 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -45,6 +45,7 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), + EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -115,10 +116,17 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, * The events directory dentry is never freed, unless its * part of an instance that is deleted. It's attr is the * default for its child files and directories. -* Do not update it. It's not used for its own mode or ownership +* Do not update it. It's not used for its own mode or ownership. */ - if (!ei->is_events) + if (ei->is_events) { + /* But it still needs to know if it was modified */ + if (iattr->ia_valid & ATTR_UID) + ei->attr.mode |= EVENTFS_SAVE_UID; + if (iattr->ia_valid & ATTR_GID) + ei->attr.mode |= EVENTFS_SAVE_GID; + } else { update_attr(>attr, iattr); + } } else { name = dentry->d_name.name; @@ -136,9 +144,67 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } +static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry) +{ + struct inode *inode; + + /* Only update if the "events" was on the top level */ + if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + /* Get the tracefs root from the parent */ + inode = d_inode(dentry->d_parent); + inode = d_inode(inode->i_sb->s_root); + ei->attr.uid = inode->i_uid; + ei->attr.gid = inode->i_gid; +} + +static void set_top_events_ownership(struct inode *inode) +{ + struct tracefs_inode *ti = get_tracefs(inode); + struct eventfs_inode *ei = ti->private; + struct dentry *dentry; + + /* The top events directory doesn't get automatically updated */ + if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + dentry = ei->dentry; + + update_top_events_attr(ei, dentry); + + if (!(ei->attr.mode & EVENTFS_SAVE_UID)) + inode->i_uid = ei->attr.uid; + + if (!(ei->attr.mode & EVENTFS_SAVE_GID)) + inode->i_gid = ei->attr.gid; +} + +static int eventfs_get_attr(struct mnt_idmap *idmap, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags) +{ + struct dentry *dentry = path->dentry; + struct inode *inode = d_backing_inode(dentry); + + set_top_events_ownership(inode); + + generic_fillattr(idmap, request_mask, inode, stat); + return 0; +} + +static int eventfs_permission(struct mnt_idmap *idmap, + struct inode *inode, int mask) +{ + set_top_events_ownership(inode); + return