Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

2018-06-18 Thread Theodore Y. Ts'o
On Mon, Jun 18, 2018 at 09:30:50PM +0100, David Howells wrote:
> 
> The fscontext code *requires* you to parse the parameters *before* any attempt
> to access the superblock is made.  Note that this will actually be a problem
> for, say, ext4 which passes a text string stored in the superblock through the
> parser *before* parsing the mount syscall data.  Fun.
> 
> I'm intending to deal with that particular case by having ext4 create multiple
> private contexts, one filled in from the user data, and then a second one
> filled in from the superblock string.  These can then be validated one against
> the other before the super_block struct is published.

Yeah, what we're trying to do is let the options in the superblock act
as defaults which then can be overridden by what the user specifies on
the command line.

So when you parse the user-supplied data, will there be a way to
determine what was specified explicitly, versus what was implied by
the defaults?  I'll need that in order to be able to merge the two
contexts together.

- Ted
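
A minimal sketch of the merge asked about above, assuming the parser
records a bit for every option it actually saw.  This is not ext4 or
fscontext code: struct opt_ctx, OPT_COMMIT, OPT_ERRORS and merge_ctx()
are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical option identifiers; not real ext4 option names. */
enum { OPT_COMMIT = 1u << 0, OPT_ERRORS = 1u << 1 };

/* A parsed option set plus a mask of what the parser actually saw. */
struct opt_ctx {
	uint32_t specified;	/* bits set only for explicitly given options */
	unsigned int commit;	/* meaningful only if OPT_COMMIT is set */
	unsigned int errors;	/* meaningful only if OPT_ERRORS is set */
};

/*
 * Start from the superblock defaults and let every option the user
 * specified explicitly override the corresponding default.
 */
struct opt_ctx merge_ctx(const struct opt_ctx *defaults,
			 const struct opt_ctx *user)
{
	struct opt_ctx out = *defaults;

	if (user->specified & OPT_COMMIT)
		out.commit = user->commit;
	if (user->specified & OPT_ERRORS)
		out.errors = user->errors;
	out.specified |= user->specified;
	return out;
}
```

With a "specified" mask in each context, the merge does not have to
guess whether a value equal to the built-in default was typed by the
user or merely left untouched.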


Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

2018-06-18 Thread Eric W. Biederman
David Howells  writes:

> Eric W. Biederman  wrote:
>
>> I have read through these patches and I noticed a significant issue.
>> 
>> Today in mount_bdev we do something that looks like:
>> 
>> mount_bdev(...)
>> {
>> 	s = sget(..., bdev);
>> 	if (s->s_root) {
>> 		/* Noop */
>> 	} else {
>> 		err = fill_super(s, ...);
>> 		if (err) {
>> 			deactivate_locked_super(s);
>> 			return ERR_PTR(err);
>> 		}
>> 		s->s_flags |= SB_ACTIVE;
>> 		bdev->bd_super = s;
>> 	}
>> 	return dget(s->s_root);
>> }
>> 
>> The key point is that we don't process the mount options at all if
>> a super block already exists in the kernel.  Similar to what
>> your fscontext changes are doing (after parsing the options).
>
> Actually, no, that's not the case.
>
> The fscontext code *requires* you to parse the parameters *before* any attempt
> to access the superblock is made.  Note that this will actually be a problem
> for, say, ext4 which passes a text string stored in the superblock through the
> parser *before* parsing the mount syscall data.  Fun.
>
> I'm intending to deal with that particular case by having ext4 create multiple
> private contexts, one filled in from the user data, and then a second one
> filled in from the superblock string.  These can then be validated one against
> the other before the super_block struct is published.
>
> And if the super_block struct already exists, the user's specified parameters
> can be validated against that.

I did not say parse.  I said process.  My meaning is that today, on a
second mount of a filesystem like ext2 (but really most of them), we
ignore those options.

>> Your fscontext changes do not improve upon this area of the mount api at
>> all and that concerns me.  This is an area where people can and already
>> do shoot themselves in their feet.
>
> This *will* be dealt with, but I wanted to get the core changes upstream
> before tackling all the filesystems.  The legacy wrapper is just that and
> should be got rid of when all the filesystems have been converted.

This is an area where you are making things explicit, where before
there was no way to talk about this.   For filesystems that are in
legacy mode I realize we might miss this corner case, but we should
still think about it and handle it.  The proc filesystem has the same
behavior and it is one you are converting.

>> ...
>>
>> Creating a new mount and finding an old mount are the same operation in
>> the kernel today.  This is fundamentally confusing.  In the new api
>> could we please separate these two operations?
>>
>> Perhaps something like:
>> x create
>> x find
>> 
>> With the "x create" case failing if the filesystem already exists,
>> still allowing "x find"?  And with the "x find" case failing if
>> the superblock is not already created in the kernel.
>
> No.  What you're suggesting introduces a userspace-userspace and a
> userspace-kernel race - unless you're willing to let userspace lock against
> superblock creation by other parties.
>
> Further, some filesystems *have* to be parameterised before you can do the
> check for the superblock.  Network filesystems, for example, where you have to
> set the network parameters upfront and the key to the superblock might not be
> known until you've queried the server.

I am not talking about skipping the parameterization.  I am talking
about actually acting on those options.  Parsing and validating them
ahead of time is not my concern.  When we make the super block
honor those options is my concern.

Further I am not suggesting something that has a meaningful race.
I am suggesting something that is the equivalent of the O_EXCL logic.
I am proposing that "x create" fail if the superblock already exists
in the kernel.  I am proposing that "x find" will fail if the
superblock does not already exist.

In the worst case you have to iterate a time or two when another
user is racing with you to create the super block.  But this
gives you very valuable information: knowledge of whether the superblock
is honoring all of your specified mount options.
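
For comparison, the O_EXCL pattern referred to above looks roughly like
this for ordinary files.  It is only an analogy using open(2), not
fscontext code, and create_or_find() is a hypothetical helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Open `path`, reporting through *created whether this call created the
 * file or found an existing one.  Returns an fd, or -1 with errno set.
 */
int create_or_find(const char *path, bool *created)
{
	for (;;) {
		/* "create": fails with EEXIST if someone got there first. */
		int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

		if (fd >= 0) {
			*created = true;
			return fd;
		}
		if (errno != EEXIST)
			return -1;

		/* "find": fails with ENOENT if it vanished in between. */
		fd = open(path, O_RDWR);
		if (fd >= 0) {
			*created = false;
			return fd;
		}
		if (errno != ENOENT)
			return -1;
		/* Lost the race both ways; go round again. */
	}
}
```

The caller always learns which of the two cases happened, at the cost
of an occasional extra iteration when another party is racing.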

It removes an existing nasty race today where people think they mount a
filesystem like "proc" with one set of options and those options are
ignored because an internal kernel mount already exists.

This is at the level of the fscontext API.

I don't care what filesystems that have not been updated to fscontext
do. I just want to avoid the nasty nasty confusion that is possible
with the existing API.

My motivation is I am in the middle of closing a regression in option
parsing in proc that caused a security option to get ignored.

I would be happy even with a result value of "x create" that reported
whether the superblock was "created" or "found", instead of having two
different options.

But I want to be able to say to userspace very clearly.  If this
superblock already exists.  You need to 

Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

2018-06-18 Thread David Howells
Eric W. Biederman  wrote:

> I have read through these patches and I noticed a significant issue.
> 
> Today in mount_bdev we do something that looks like:
> 
> mount_bdev(...)
> {
> 	s = sget(..., bdev);
> 	if (s->s_root) {
> 		/* Noop */
> 	} else {
> 		err = fill_super(s, ...);
> 		if (err) {
> 			deactivate_locked_super(s);
> 			return ERR_PTR(err);
> 		}
> 		s->s_flags |= SB_ACTIVE;
> 		bdev->bd_super = s;
> 	}
> 	return dget(s->s_root);
> }
> 
> The key point is that we don't process the mount options at all if
> a super block already exists in the kernel.  Similar to what
> your fscontext changes are doing (after parsing the options).

Actually, no, that's not the case.

The fscontext code *requires* you to parse the parameters *before* any attempt
to access the superblock is made.  Note that this will actually be a problem
for, say, ext4 which passes a text string stored in the superblock through the
parser *before* parsing the mount syscall data.  Fun.

I'm intending to deal with that particular case by having ext4 create multiple
private contexts, one filled in from the user data, and then a second one
filled in from the superblock string.  These can then be validated one against
the other before the super_block struct is published.

And if the super_block struct already exists, the user's specified parameters
can be validated against that.

> Your fscontext changes do not improve upon this area of the mount api at
> all and that concerns me.  This is an area where people can and already
> do shoot themselves in their feet.

This *will* be dealt with, but I wanted to get the core changes upstream
before tackling all the filesystems.  The legacy wrapper is just that and
should be got rid of when all the filesystems have been converted.

> ...
>
> Creating a new mount and finding an old mount are the same operation in
> the kernel today.  This is fundamentally confusing.  In the new api
> could we please separate these two operations?
>
> Perhaps something like:
> x create
> x find
> 
> With the "x create" case failing if the filesystem already exists,
> still allowing "x find"?  And with the "x find" case failing if
> the superblock is not already created in the kernel.

No.  What you're suggesting introduces a userspace-userspace and a
userspace-kernel race - unless you're willing to let userspace lock against
superblock creation by other parties.

Further, some filesystems *have* to be parameterised before you can do the
check for the superblock.  Network filesystems, for example, where you have to
set the network parameters upfront and the key to the superblock might not be
known until you've queried the server.

> That should make it clear to a userspace program what is going on
> and give it a chance to mount a filesystem anyway.

That said, I'm willing to provide a "fail if already extant" option if we
think that's actually likely to be of use.  However, you'd still have to
provide parameters before the check can be made.

> In a similar vein could we please clarify what the rules for changing mount
> options for an existing superblock are in the new api?

You mean remount/reconfigure?  Note that we have to provide backward
compatibility with every single filesystem.

> Today mount assumes that it has to provide all of the existing options to
> reconfigure a mount.  What people want to do and what most filesystems
> support is just specifying the options that need to be changed.  Can we
> please make this the rule of how things are expected to work for fscontext?
> That only changing mount options need to be specified before: "x
> reconfigure"

Fine by me - but it must *also* support every option being specified if that
is what mount currently does.

I don't really want to supply extra parsers if I can avoid it.  Miklós, for
example wanted a different, incompatible interface, so you'd do:

write(fd, "o +foo");
write(fd, "o -bar");
write(fd, "x reconfig");

sort of thing to enable or disable options... but this assumes that options
are binary and requires a separate parser to the one that does the initial
configuration - and you still have to support the old remount data parse.

I'm okay with specifying that you should just specify the options you want to
change and that the normal way to 'disable' something is to prefix it with
"no".
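
A minimal sketch of that convention, assuming purely boolean options
held in a flags word.  The option names "foo" and "bar" are taken from
the example above; opt_table and apply_option() are hypothetical and do
not correspond to any existing kernel parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical boolean options, named after the examples in this thread. */
enum { OPT_FOO = 1u << 0, OPT_BAR = 1u << 1 };

static const struct {
	const char *name;
	unsigned int bit;
} opt_table[] = {
	{ "foo", OPT_FOO },
	{ "bar", OPT_BAR },
};

/*
 * Apply one option to *flags.  "foo" sets the bit and "nofoo" clears
 * it; bits for options that are not mentioned are left alone.
 * Returns 0 on success or -1 for an unknown option.
 */
int apply_option(unsigned int *flags, const char *opt)
{
	size_t i;

	for (i = 0; i < sizeof(opt_table) / sizeof(opt_table[0]); i++) {
		if (strcmp(opt, opt_table[i].name) == 0) {
			*flags |= opt_table[i].bit;
			return 0;
		}
		if (strncmp(opt, "no", 2) == 0 &&
		    strcmp(opt + 2, opt_table[i].name) == 0) {
			*flags &= ~opt_table[i].bit;
			return 0;
		}
	}
	return -1;
}
```

Because the same routine handles both the initial configuration and a
later reconfigure, no second parser is needed; a caller that passes
every option and one that passes only the changed ones both end up with
a consistent flags word.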

I guess I could pass a flag through to indicate that this came from
sys_mount(MS_REMOUNT) rather than the new method.

David


Re: [PATCH 00/32] VFS: Introduce filesystem context [ver #8]

2018-06-14 Thread Eric W. Biederman
David Howells  writes:

> Here are a set of patches to create a filesystem context prior to setting
> up a new mount, populating it with the parsed options/binary data, creating
> the superblock and then effecting the mount.  This is also used for remount
> since much of the parsing stuff is common in many filesystems.

Dave,
I have read through these patches and I noticed a significant issue.

Today in mount_bdev we do something that looks like:

mount_bdev(...)
{
	s = sget(..., bdev);
	if (s->s_root) {
		/* Noop */
	} else {
		err = fill_super(s, ...);
		if (err) {
			deactivate_locked_super(s);
			return ERR_PTR(err);
		}
		s->s_flags |= SB_ACTIVE;
		bdev->bd_super = s;
	}
	return dget(s->s_root);
}

The key point is that we don't process the mount options at all if
a super block already exists in the kernel.  Similar to what
your fscontext changes are doing (after parsing the options).

Your fscontext changes do not improve upon this area of the mount api at
all and that concerns me.  This is an area where people can and already
do shoot themselves in their feet.

The real world security issue we had with this involved devpts.  The
devpts filesystem requires the mode and gid parameters for new ttys to
be specified to be posix compliant.  People were setting up chroot
environments and mounting devpts with the wrong arguments.  As these two
devpts mounts shared a super block a change of arguments on one was a
change of arguments on the other.  Which meant the chroots were
periodically breaking the primary devpts and causing new terminals to be
opened with essentially unusable permissions.  Fun when you are trying
to ssh in to a box.

Creating a new mount and finding an old mount are the same operation in
the kernel today.  This is fundamentally confusing.  In the new api
could we please separate these two operations?

Perhaps something like:
x create
x find

With the "x create" case failing if the filesystem already exists,
still allowing "x find"?  And with the "x find" case failing if
the superblock is not already created in the kernel.

That should make it clear to a userspace program what is going on
and give it a chance to mount a filesystem anyway.
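
The proposed semantics can be modelled in userspace roughly as follows.
This is a toy, single-threaded sketch, not kernel code: struct toy_sb,
toy_create() and toy_find() are hypothetical stand-ins, and a real
implementation would need locking around the lookup and insert.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_SB 8

/* Toy stand-in for a superblock: a key plus the options it was created with. */
struct toy_sb {
	char key[32];
	unsigned int opts;
	int in_use;
};

static struct toy_sb sb_table[MAX_SB];

static struct toy_sb *toy_lookup(const char *key)
{
	size_t i;

	for (i = 0; i < MAX_SB; i++)
		if (sb_table[i].in_use && strcmp(sb_table[i].key, key) == 0)
			return &sb_table[i];
	return NULL;
}

/* "x create": make a new superblock; -EEXIST if one already exists. */
int toy_create(const char *key, unsigned int opts, struct toy_sb **out)
{
	size_t i;

	if (strlen(key) >= sizeof(sb_table[0].key))
		return -ENAMETOOLONG;
	if (toy_lookup(key))
		return -EEXIST;
	for (i = 0; i < MAX_SB; i++) {
		if (!sb_table[i].in_use) {
			strcpy(sb_table[i].key, key);
			sb_table[i].opts = opts;
			sb_table[i].in_use = 1;
			*out = &sb_table[i];
			return 0;
		}
	}
	return -ENOSPC;
}

/* "x find": return an existing superblock; -ENOENT if there is none. */
int toy_find(const char *key, struct toy_sb **out)
{
	struct toy_sb *sb = toy_lookup(key);

	if (!sb)
		return -ENOENT;
	*out = sb;
	return 0;
}
```

A second toy_create() with different options fails loudly instead of
silently handing back a superblock that ignores them, which is the
devpts problem described above.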



In a similar vein could we please clarify what the rules for changing mount
options for an existing superblock are in the new api?

Today mount assumes that it has to provide all of the existing options
to reconfigure a mount.  What people want to do and what most
filesystems support is just specifying the options that need to be
changed.  Can we please make this the rule of how things are expected
to work for fscontext?  That only changing mount options need to
be specified before: "x reconfigure"

Eric




[PATCH 00/32] VFS: Introduce filesystem context [ver #8]

2018-05-24 Thread David Howells

Hi Al,

Can you take a look at this please, in particular the last 6 patches?

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount.  This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

fd = fsopen("nfs");
write(fd, "option=val", ...);
mfd = fsmount(fd, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by the legacy filesystem wrapper.

This post is mostly about the internal filesystem context and the special
kernel interface filesystems.  I've included the fsopen() and fsmount()
syscall implementations for reference, but I expect these to undergo some
reconsideration during LSF.  The last five patches relate to the AFS
conversion and are included as an example.

Significant changes:

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
 of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co.

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
 actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
 certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
 with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from a
 dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
 vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
 filesystem.  The idea being that this will, in future, work with the
 fd from fsopen() too and permit querying of the parameters and
 metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
 take the data pointer.  This fixes a problem where someone (say SELinux)
 tries to copy the mount data, assuming it to be a page in size, and
 overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is much easier to
 deal with than NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
 replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
 passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
 caller always provide a struct file_system_type pointer and the
 parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
 FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose is now
 used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
 task_struct rather than the sb_config so that error messages can be
 obtained from NFS doing a mount-root-and-pathwalk inside the
 nfs_get_tree() operation.

 Further, made this managed and read by prctl rather than through the
 mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
 directly to functions that want to call it.  NFS now calls
 nfs_fill_super() directly rather than jumping through a pointer to it
 since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved 

[PATCH 00/32] VFS: Introduce filesystem context [ver #8]

2018-05-24 Thread David Howells

Hi Al,

Can you take a look at this please, in particular the last 6 patches?

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount.  This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

fd = fsopen("nfs");
write(fd, "option=val", ...);
mfd = fsmount(fd, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).
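
As a rough illustration of what each write() in the sequence above carries,
here is a userspace sketch that splits a "key=val" string into its key and
value.  The names struct fs_param and split_fs_param() are hypothetical and
do not come from the patches; the in-kernel parsing in the series may well
work differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for one parsed parameter; not a kernel structure. */
struct fs_param {
	char key[64];
	char value[64];	/* empty string if the parameter was a bare flag */
};

/*
 * Split "key=val" (or a bare "key") into its two halves.
 * Returns 0 on success, or -1 if the input is empty, has an empty key,
 * or does not fit in the fixed-size buffers.
 */
int split_fs_param(const char *text, struct fs_param *out)
{
	const char *eq;
	size_t klen, vlen;

	if (!text || !out || !*text)
		return -1;

	eq = strchr(text, '=');
	klen = eq ? (size_t)(eq - text) : strlen(text);
	vlen = eq ? strlen(eq + 1) : 0;

	if (klen == 0 || klen >= sizeof(out->key) || vlen >= sizeof(out->value))
		return -1;

	memcpy(out->key, text, klen);
	out->key[klen] = '\0';
	memcpy(out->value, eq ? eq + 1 : "", vlen);
	out->value[vlen] = '\0';
	return 0;
}
```

With this sketch, "option=val" yields the key "option" and the value "val",
and a bare flag such as "ro" yields an empty value.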

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by the legacy filesystem wrapper.

This post is mostly about the internal filesystem context and the special
kernel interface filesystems.  I've included the fsopen() and fsmount()
syscall implementations for reference, but I expect these to undergo some
reconsideration during LSF.  The last five patches relate to the AFS
conversion and are included as an example.

Significant changes:

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
 of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co.

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
 actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
 certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
 with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from a
 dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
 vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
 filesystem.  The idea is that this will, in future, work with the
 fd from fsopen() too and permit querying of the parameters and
 metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
 take the data pointer.  This fixes a problem where someone (say SELinux)
 tries to copy the mount data, assuming it to be a page in size, and
 overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is much easier to
 deal with than NFS or Ext4 as there are very few mount options.
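
The buffer-size change in ver #7 can be pictured with a small userspace
sketch: the copier is told how many bytes are valid rather than assuming a
full page.  The function name copy_mount_data() and the MOUNT_DATA_MAX limit
are assumptions made for this illustration, not identifiers taken from the
patches.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed upper bound for this sketch; the series may use a different limit. */
#define MOUNT_DATA_MAX 4096

/*
 * Duplicate exactly 'size' bytes of mount data.
 * Returns a heap copy that the caller must free(), or NULL if there is
 * nothing to copy, the size is out of range, or the allocation fails.
 */
void *copy_mount_data(const void *data, size_t size)
{
	void *copy;

	if (!data || size == 0 || size > MOUNT_DATA_MAX)
		return NULL;

	copy = malloc(size);
	if (copy)
		memcpy(copy, data, size);	/* never reads past 'size' bytes */
	return copy;
}
```

Because the length travels with the pointer, a short option string is copied
as-is and the copy never touches memory beyond the end of the source buffer.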

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
 replaced it with three reserved pointers that must be NULL.
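
The "reserved pointers that must be NULL" rule from ver #6 amounts to a
simple argument check, sketched below.  The helper name
check_reserved_args() is hypothetical, and rejecting with -EINVAL is an
assumption here rather than something stated in this cover letter.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Reject any call in which a reserved argument has been used.
 * Returns 0 if all three are NULL, otherwise -EINVAL (assumed error code).
 */
int check_reserved_args(const void *reserved3, const void *reserved4,
			const void *reserved5)
{
	if (reserved3 || reserved4 || reserved5)
		return -EINVAL;
	return 0;
}
```

Insisting on NULL now leaves room to give the arguments a meaning later
without breaking callers that already pass NULL.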

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
 passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
 caller always provide a struct file_system_type pointer and the
 parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
 FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose is now
 used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
 task_struct rather than the sb_config so that error messages can be
 obtained from NFS doing a mount-root-and-pathwalk inside the
 nfs_get_tree() operation.

 Further, made this managed and read by prctl rather than through the
 mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
 directly to functions that want to call it.  NFS now calls
 nfs_fill_super() directly rather than jumping through a pointer to it
 since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved