Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]
On Mon, Jun 18, 2018 at 09:30:50PM +0100, David Howells wrote:
>
> The fscontext code *requires* you to parse the parameters *before* any attempt
> to access the superblock is made. Note that this will actually be a problem
> for, say, ext4 which passes a text string stored in the superblock through the
> parser *before* parsing the mount syscall data. Fun.
>
> I'm intending to deal with that particular case by having ext4 create multiple
> private contexts, one filled in from the user data, and then a second one
> filled in from the superblock string. These can then be validated one against
> the other before the super_block struct is published.

Yeah, what we're trying to do is let the options in the superblock act
as defaults which then can be overridden by what the user specifies on
the command line.

So when you parse the user-supplied data, will there be a way to
determine what was specified explicitly, versus what was implied by the
defaults? I'll need that in order to be able to merge the two contexts
together.

- Ted
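One way to picture the merge Ted describes is to have each parsed context carry a bitmask of the options that were given explicitly, so that the merge step can tell "set by the user" apart from "left at its default". The following is only a userspace sketch under that assumption. The names `opt_ctx`, `opt_set()`, `opt_merge()` and the three option identifiers are hypothetical and are not taken from ext4 or from the fscontext patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical option identifiers, for illustration only. */
enum opt_id { OPT_JOURNAL_CHECKSUM, OPT_DISCARD, OPT_COMMIT_SECS, OPT_COUNT };

/* A parsed option set: values plus a record of what was given explicitly. */
struct opt_ctx {
    unsigned int set_mask;      /* bit i set => option i was specified */
    int value[OPT_COUNT];
};

/* Record an explicitly specified option. */
static void opt_set(struct opt_ctx *ctx, enum opt_id id, int value)
{
    assert(id < OPT_COUNT);
    ctx->value[id] = value;
    ctx->set_mask |= 1u << id;
}

static bool opt_is_set(const struct opt_ctx *ctx, enum opt_id id)
{
    return (ctx->set_mask & (1u << id)) != 0;
}

/*
 * Merge with the precedence described in the mail: explicit user options
 * override superblock-stored defaults, which override built-in defaults.
 * The result's set_mask keeps only what the user asked for explicitly.
 */
static struct opt_ctx opt_merge(const struct opt_ctx *user,
                                const struct opt_ctx *sb_defaults,
                                const int builtin[OPT_COUNT])
{
    struct opt_ctx out = { 0 };

    for (int i = 0; i < OPT_COUNT; i++) {
        if (opt_is_set(user, i)) {
            out.value[i] = user->value[i];
            out.set_mask |= 1u << i;
        } else if (opt_is_set(sb_defaults, i)) {
            out.value[i] = sb_defaults->value[i];
        } else {
            out.value[i] = builtin[i];
        }
    }
    return out;
}
```

With a mask like this, a later validation pass could also reject a user option that contradicts something the on-disk superblock requires, since both sides record what they actually specified.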
Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]
David Howells writes:

> Eric W. Biederman wrote:
>
>> I have read through these patches and I noticed a significant issue.
>>
>> Today in mount_bdev we do something that looks like:
>>
>> mount_bdev(...)
>> {
>>         s = sget(..., bdev);
>>         if (s->s_root) {
>>                 /* Noop */
>>         } else {
>>                 err = fill_super(s, ...);
>>                 if (err) {
>>                         deactivate_locked_super(s);
>>                         return ERR_PTR(err);
>>                 }
>>                 s->s_flags |= SB_ACTIVE;
>>                 bdev->bd_super = s;
>>         }
>>         return dget(s->s_root);
>> }
>>
>> The key point is that we don't process the mount options at all if
>> a super block already exists in the kernel. Similar to what
>> your fscontext changes are doing (after parsing the options).
>
> Actually, no, that's not the case.
>
> The fscontext code *requires* you to parse the parameters *before* any attempt
> to access the superblock is made. Note that this will actually be a problem
> for, say, ext4 which passes a text string stored in the superblock through the
> parser *before* parsing the mount syscall data. Fun.
>
> I'm intending to deal with that particular case by having ext4 create multiple
> private contexts, one filled in from the user data, and then a second one
> filled in from the superblock string. These can then be validated one against
> the other before the super_block struct is published.
>
> And if the super_block struct already exists, the user's specified parameters
> can be validated against that.

I did not say parse. I said process. My meaning is that today, on a
second mount of a filesystem like ext2 (but really most of them), we
ignore those options.

>> Your fscontext changes do not improve upon this area of the mount api at
>> all and that concerns me. This is an area where people can and already
>> do shoot themselves in their feet.
>
> This *will* be dealt with, but I wanted to get the core changes upstream
> before tackling all the filesystems. The legacy wrapper is just that and
> should be got rid of when all the filesystems have been converted.

This is an area where you are making things explicit, where before there
was no way to talk about this. For filesystems that are in legacy mode I
realize we might miss this corner case, but we should still think about
it and handle it. The proc filesystem has the same behavior and it is
one you are converting.

>> ...
>>
>> Creating a new mount and finding an old mount are the same operation in
>> the kernel today. This is fundamentally confusing. In the new api
>> could we please separate these two operations?
>>
>> Perhaps something like:
>> x create
>> x find
>>
>> With the "x create" case failing if the filesystem already exists,
>> still allowing "x find"? And with the "x find" case failing if
>> the superblock is not already created in the kernel.
>
> No. What you're suggesting introduces a userspace-userspace and a
> userspace-kernel race - unless you're willing to let userspace lock against
> superblock creation by other parties.
>
> Further, some filesystems *have* to be parameterised before you can do the
> check for the superblock. Network filesystems, for example, where you have to
> set the network parameters upfront and the key to the superblock might not be
> known until you've queried the server.

I am not talking about skipping the parameterization. I am talking about
actually acting on those options. Parsing and validating them ahead of
time is not my concern. When we make the super block honor those options
is my concern.

Further I am not suggesting something that has a meaningful race. I am
suggesting something that is the equivalent of the O_EXCL logic.

I am proposing that "x create" fail if the superblock already exists in
the kernel.

I am proposing that "x find" will fail if the superblock does not
already exist.

In the worst case you have to iterate a time or two when another user is
racing with you to create the super block. But this gives you very
valuable information.
Knowledge of whether the superblock is honoring all of your specified
mount options or not.

It removes an existing nasty race today where people think they mount a
filesystem like "proc" with one set of options and those options are
ignored because an internal kernel mount already exists.

This is at the level of the fscontext API. I don't care what filesystems
that have not been updated to fscontext do. I just want to avoid the
nasty nasty confusion that is possible with the existing API.

My motivation is I am in the middle of closing a regression in option
parsing in proc that caused a security option to get ignored.

I would be happy even with a result value of "x create" that reported
whether the superblock was "created" or "found", instead of having two
different options.

But I want to be able to say to userspace very clearly. If this
superblock already exists. You need to
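
The O_EXCL-style split Eric proposes can be sketched with a toy, single-threaded superblock table in userspace. Everything here is hypothetical (`toy_sb`, `toy_get_sb()`, `toy_mount()`, the `TOY_CREATE`/`TOY_FIND` modes); it is not the fscontext API, and it has no locking, so it only illustrates the semantics: create fails with -EEXIST if the superblock exists, find fails with -ENOENT if it does not, and the caller learns whether its options were actually applied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_SB 8

enum toy_mode { TOY_CREATE, TOY_FIND };   /* stand-ins for "x create" / "x find" */

struct toy_sb {
    char key[32];     /* what identifies the superblock, e.g. the fs name */
    char opts[64];    /* options the superblock was created with */
    int in_use;
};

static struct toy_sb toy_table[TOY_MAX_SB];

static struct toy_sb *toy_lookup(const char *key)
{
    for (size_t i = 0; i < TOY_MAX_SB; i++)
        if (toy_table[i].in_use && strcmp(toy_table[i].key, key) == 0)
            return &toy_table[i];
    return NULL;
}

/* Returns 0 on success or a negative errno value. */
static int toy_get_sb(const char *key, const char *opts,
                      enum toy_mode mode, struct toy_sb **out)
{
    struct toy_sb *sb;

    assert(key && opts && out);
    sb = toy_lookup(key);

    if (mode == TOY_FIND) {
        if (!sb)
            return -ENOENT;       /* nothing to find */
        *out = sb;                /* note: opts are NOT applied here */
        return 0;
    }

    if (sb)
        return -EEXIST;           /* exclusive create, like O_CREAT|O_EXCL */

    for (size_t i = 0; i < TOY_MAX_SB; i++) {
        if (!toy_table[i].in_use) {
            sb = &toy_table[i];
            snprintf(sb->key, sizeof(sb->key), "%s", key);
            snprintf(sb->opts, sizeof(sb->opts), "%s", opts);
            sb->in_use = 1;
            *out = sb;
            return 0;
        }
    }
    return -ENOSPC;
}

/* Try create, fall back to find, and report which one happened. */
static int toy_mount(const char *key, const char *opts,
                     struct toy_sb **out, int *created)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        int err = toy_get_sb(key, opts, TOY_CREATE, out);

        if (err == 0) {
            *created = 1;
            return 0;
        }
        if (err != -EEXIST)
            return err;

        err = toy_get_sb(key, opts, TOY_FIND, out);
        if (err == 0) {
            *created = 0;
            return 0;
        }
        if (err != -ENOENT)
            return err;
        /* The superblock vanished between the two calls; go round again. */
    }
    return -EAGAIN;
}
```

A caller that gets `created == 0` back knows that the options it passed were not what configured the superblock, which is the piece of information the existing mount path never reports.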
Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]
Eric W. Biederman wrote:

> I have read through these patches and I noticed a significant issue.
>
> Today in mount_bdev we do something that looks like:
>
> mount_bdev(...)
> {
>         s = sget(..., bdev);
>         if (s->s_root) {
>                 /* Noop */
>         } else {
>                 err = fill_super(s, ...);
>                 if (err) {
>                         deactivate_locked_super(s);
>                         return ERR_PTR(err);
>                 }
>                 s->s_flags |= SB_ACTIVE;
>                 bdev->bd_super = s;
>         }
>         return dget(s->s_root);
> }
>
> The key point is that we don't process the mount options at all if
> a super block already exists in the kernel. Similar to what
> your fscontext changes are doing (after parsing the options).

Actually, no, that's not the case.

The fscontext code *requires* you to parse the parameters *before* any attempt
to access the superblock is made. Note that this will actually be a problem
for, say, ext4 which passes a text string stored in the superblock through the
parser *before* parsing the mount syscall data. Fun.

I'm intending to deal with that particular case by having ext4 create multiple
private contexts, one filled in from the user data, and then a second one
filled in from the superblock string. These can then be validated one against
the other before the super_block struct is published.

And if the super_block struct already exists, the user's specified parameters
can be validated against that.

> Your fscontext changes do not improve upon this area of the mount api at
> all and that concerns me. This is an area where people can and already
> do shoot themselves in their feet.

This *will* be dealt with, but I wanted to get the core changes upstream
before tackling all the filesystems. The legacy wrapper is just that and
should be got rid of when all the filesystems have been converted.

> ...
>
> Creating a new mount and finding an old mount are the same operation in
> the kernel today. This is fundamentally confusing. In the new api
> could we please separate these two operations?
>
> Perhaps something like:
> x create
> x find
>
> With the "x create" case failing if the filesystem already exists,
> still allowing "x find"? And with the "x find" case failing if
> the superblock is not already created in the kernel.

No. What you're suggesting introduces a userspace-userspace and a
userspace-kernel race - unless you're willing to let userspace lock against
superblock creation by other parties.

Further, some filesystems *have* to be parameterised before you can do the
check for the superblock. Network filesystems, for example, where you have to
set the network parameters upfront and the key to the superblock might not be
known until you've queried the server.

> That should make it clear to a userspace program what is going on
> and give it a chance to mount a filesystem anyway.

That said, I'm willing to provide a "fail if already extant" option if we
think that's actually likely to be of use. However, you'd still have to
provide parameters before the check can be made.

> In a similar vein could we please clarify what the rules for changing mount
> options for an existing superblock are in the new api?

You mean remount/reconfigure? Note that we have to provide backward
compatibility with every single filesystem.

> Today mount assumes that it has to provide all of the existing options to
> reconfigure a mount. What people want to do and what most filesystems
> support is just specifying the options that need to be changed. Can we
> please make this the rule of how this is expected to work for fscontext?
> That only the mount options being changed need to be specified before: "x
> reconfigure"

Fine by me - but it must *also* support every option being specified if that
is what mount currently does.

I don't really want to supply extra parsers if I can avoid it. Miklós, for
example wanted a different, incompatible interface, so you'd do:

        write(fd, "o +foo");
        write(fd, "o -bar");
        write(fd, "x reconfig");

sort of thing to enable or disable options...
but this assumes that options are binary and requires a separate parser to
the one that does the initial configuration - and you still have to support
the old remount data parse.

I'm okay with specifying that you should just specify the options you want to
change and that the normal way to 'disable' something is to prefix it with
"no". I guess I could pass a flag through to indicate that this came from
sys_mount(MS_REMOUNT) rather than the new method.

David
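
The rule David settles on (pass only the options that change, and disable one by prefixing it with "no") can be sketched for boolean options as below. This is a userspace illustration only; `toy_reconfigure()`, `toy_opt_index()` and the three option names in the table are hypothetical, and real filesystems also have valued options that this sketch does not cover.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical boolean options; bit i of the flags word is toy_opt_names[i]. */
static const char *const toy_opt_names[] = { "acl", "barrier", "discard" };
#define TOY_NOPTS (sizeof(toy_opt_names) / sizeof(toy_opt_names[0]))

/* Return the index of the option named by name[0..len), or -1 if unknown. */
static int toy_opt_index(const char *name, size_t len)
{
    for (size_t i = 0; i < TOY_NOPTS; i++)
        if (strlen(toy_opt_names[i]) == len &&
            memcmp(toy_opt_names[i], name, len) == 0)
            return (int)i;
    return -1;
}

/*
 * Apply a comma-separated list of changes to *flags.  Options that are not
 * mentioned keep their current value.  Returns 0 on success or -EINVAL on
 * an unknown option, in which case *flags is left untouched.
 */
static int toy_reconfigure(unsigned int *flags, const char *changes)
{
    assert(flags && changes);

    unsigned int next = *flags;   /* work on a copy until everything parses */
    const char *p = changes;

    while (*p) {
        size_t len = strcspn(p, ",");

        if (len > 0) {
            int enable = 1;
            int idx = toy_opt_index(p, len);

            /* Not a known name as written: try it as "no" + name. */
            if (idx < 0 && len > 2 && strncmp(p, "no", 2) == 0) {
                idx = toy_opt_index(p + 2, len - 2);
                enable = 0;
            }
            if (idx < 0)
                return -EINVAL;

            if (enable)
                next |= 1u << idx;
            else
                next &= ~(1u << idx);
        }
        p += len;
        if (*p == ',')
            p++;
    }

    *flags = next;
    return 0;
}
```

Because the same name table serves both initial configuration and reconfiguration, no second parser is needed, which is the property David is asking for. Passing the full option set, as mount(8) does today, still works with the same function.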
Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]
David Howells writes:

> Here are a set of patches to create a filesystem context prior to setting
> up a new mount, populating it with the parsed options/binary data, creating
> the superblock and then effecting the mount. This is also used for remount
> since much of the parsing stuff is common in many filesystems.

Dave,

I have read through these patches and I noticed a significant issue.

Today in mount_bdev we do something that looks like:

mount_bdev(...)
{
        s = sget(..., bdev);
        if (s->s_root) {
                /* Noop */
        } else {
                err = fill_super(s, ...);
                if (err) {
                        deactivate_locked_super(s);
                        return ERR_PTR(err);
                }
                s->s_flags |= SB_ACTIVE;
                bdev->bd_super = s;
        }
        return dget(s->s_root);
}

The key point is that we don't process the mount options at all if
a super block already exists in the kernel. Similar to what
your fscontext changes are doing (after parsing the options).

Your fscontext changes do not improve upon this area of the mount api at
all and that concerns me. This is an area where people can and already
do shoot themselves in their feet.

The real world security issue we had with this involved devpts. The
devpts filesystem requires the mode and gid parameters for new ttys to
be specified to be posix compliant. People were setting up chroot
environments and mounting devpts with the wrong arguments. As these two
devpts mounts shared a super block, a change of arguments on one was a
change of arguments on the other. Which meant the chroots were
periodically breaking the primary devpts and causing new terminals to be
opened with essentially unusable permissions. Fun when you are trying to
ssh in to a box.

Creating a new mount and finding an old mount are the same operation in
the kernel today. This is fundamentally confusing. In the new api
could we please separate these two operations?

Perhaps something like:
x create
x find

With the "x create" case failing if the filesystem already exists,
still allowing "x find"?
And with the "x find" case failing if the superblock is not already
created in the kernel.

That should make it clear to a userspace program what is going on
and give it a chance to mount a filesystem anyway.

In a similar vein could we please clarify what the rules for changing mount
options for an existing superblock are in the new api?

Today mount assumes that it has to provide all of the existing options to
reconfigure a mount. What people want to do and what most filesystems
support is just specifying the options that need to be changed. Can we
please make this the rule of how this is expected to work for fscontext?
That only the mount options being changed need to be specified before: "x
reconfigure"

Eric
[PATCH 00/32] VFS: Introduce filesystem context [ver #8]
Hi Al,

Can you take a look at this please, in particular the last 6 patches?

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount. This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

        fd = fsopen("nfs");
        write(fd, "option=val", ...);
        mfd = fsmount(fd, MS_NODEV);
        move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by the legacy filesystem wrapper.

This post is mostly about the internal filesystem context and the special
kernel interface filesystems. I've included the fsopen() and fsmount()
syscall implementations for reference, but I expect these to undergo some
reconsideration during LSF. The last five patches relate to the AFS
conversion and are included as an example.

Significant changes:

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
     of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co..

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
     actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
     certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
     with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from a
     dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
     vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

     [!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
     filesystem. The idea being that this will, in future, work with the
     fd from fsopen() too and permit querying of the parameters and
     metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions
     that take the data pointer. This fixes a problem where someone (say
     SELinux) tries to copy the mount data, assuming it to be a page in
     size, and overruns the buffer - thereby incurring an oops by hitting
     a guard page.

 (*) Made the AFS filesystem use them as an example. This is much easier
     to deal with than NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
     replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
     passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
     caller always provide a struct file_system_type pointer and the
     parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
     FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context(). The purpose is now
     used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to
     the task_struct rather than the sb_config so that error messages can
     be obtained from NFS doing a mount-root-and-pathwalk inside the
     nfs_get_tree() operation.

     Further, made this managed and read by prctl rather than through the
     mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
     directly to functions that want to call it. NFS now calls
     nfs_fill_super() directly rather than jumping through a pointer to it
     since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved