Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-10 Thread Ingo Molnar


* Linus Torvalds  wrote:

> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski  wrote:
> >
> > Linus, you rejected resolveat() because you wanted a *nice* API
> 
> No. I rejected resoveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
> 
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
> 
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.

I suspect there is a "add a clean new flags namespace" analogy to the 
classic "add a clean new standard" XKCD:

https://xkcd.com/927/

Thanks,

Ingo


Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-08 Thread Aleksa Sarai
On 2019-09-07, Jeff Layton  wrote:
> On Thu, 2019-09-05 at 06:19 +1000, Aleksa Sarai wrote:
> > + * @flags: O_* flags.
> > + * @mode: O_CREAT/O_TMPFILE file mode.
> > + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> > + * @resolve: RESOLVE_* flags.
> > + */
> > +struct open_how {
> > +   __u32 flags;
> > +   union {
> > +   __u16 mode;
> > +   __u16 upgrade_mask;
> > +   };
> > +   __u16 resolve;
> > +};
> > +
> > +#define OPEN_HOW_SIZE_VER0 8 /* sizeof first published struct */
> > +
> 
> Hmm, there is no version field. When you want to expand this in the
> future, what is the plan? Add a new flag to indicate that it's some
> length?

The "version number" is the size of the struct. Any extensions we make
are appended to the struct (openat2 now takes a size_t argument), and
the new copy_struct_{to,from}_user() helpers handle all of the
permutations of {old,new} kernel and {old,new} user space.

This is how clone3(), sched_[gs]etattr() and perf_event_open() all
operate (all of the sigset syscalls operate similarly but don't
gracefully handle different kernel vintages -- you just get -EINVAL).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-07 Thread Andy Lutomirski


> On Sep 7, 2019, at 10:45 AM, Linus Torvalds  
> wrote:
> 
>> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski  wrote:
>> 
>> Linus, you rejected resolveat() because you wanted a *nice* API
> 
> No. I rejected resoveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
> 
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
> 
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.
> 
> 

If we keep all the flag bits in the same mask with the same values, then we’re 
stuck with O_RDONLY=0 and everything that implies.  We’ll have UPGRADE_READ 
that works differently from the missing plain-old-READ bit, and we can’t 
express execute-only-no-read-or-write. This sucks.

Can we at least split the permission bits into their own mask and make bits 0 
and 1 illegal in the main set of flags in openat2?

There’s another thread going on right now about adding a bit along the lines of 
“MAYEXEC”, and one of the conclusions was that it should wait for openat2 so 
that it can have same semantics. If we’re stuck with O_RDONLY and friends, then 
MAYEXEC is doomed to being at least a bit nonsensical.

As an analogy, AMD64 introduced bigger PTEs but kept the same nonsense encoding 
of read and write permission. And then we got NX, and now we’re getting little 
holes in the encoding stolen by CET to mean new silly things.  I don’t know if 
you’ve been following the various rounds of patches, but it is truly horrible. 
The mapping from meaning to the actual bits is *shit*, and AMD64 should have 
made a clean break instead.

open()’s permission bits are basically the same situation. And the kernel 
*already* has a non-type-safe translation layer. Please, please let openat2() 
at least get rid of the turd in open()’s bits 0 and 1.



Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-07 Thread Linus Torvalds
On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski  wrote:
>
> Linus, you rejected resolveat() because you wanted a *nice* API

No. I rejected resoveat() because it was a completely broken garbage
API that couldn't do even basic stuff right (like O_CREAT).

We have a ton of flag space in the new openat2() model, we might as
well leave the old flags alone that people are (a) used to and (b) we
have code to support _anyway_.

Making up a new flag namespace is only going to cause us - and users -
more work, and more confusion. For no actual advantage. It's not going
to be "cleaner". It's just going to be worse.

 Linus


Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-07 Thread Andy Lutomirski



> On Sep 7, 2019, at 9:58 AM, Linus Torvalds  
> wrote:
> 
>> On Sat, Sep 7, 2019 at 5:40 AM Jeff Layton  wrote:
>> 
>> After thinking about this a bit, I wonder if we might be better served
>> with a new set of OA2_* flags instead of repurposing the O_* flags?
> 
> I'd hate to have yet _another_ set of translation functions, and
> another chance of people just getting it wrong either in user space or
> the kernel.
> 
> So no. Let's not make another set of flags that has no sane way to
> have type-safety to avoid more confusion.
> 
> The new flags that _only_ work with openat2() might be named with a
> prefix/suffix to mark that, but I'm not sure it's a huge deal.
> 
>

I agree with the philosophy, but I think it doesn’t apply in this case.  Here 
are the flags:

O_RDONLY, O_WRONLY, O_RDWR: not even a proper bitmask. The kernel already has 
the FMODE_ bits to make this make sense. How about we make the openat2 
permission bits consistent with the internal representation and let the O_ 
permission bits remain as an awful translation.  The kernel already translates 
like this, and it already sucks.

O_CREAT, O_TMPFILE, O_NOCTTY, O_TRUNC: not modes on the fd at all.  These 
affect the meaning of open().  Heck, for openat2, NOCTTY should be this default.

O_EXCL: hopelessly overloaded.

O_APPEND, O_DIRECT, O_SYNC, O_DSYNC, O_LARGEFILE, O_NOATIME, O_PATH, 
O_NONBLOCK: genuine mode bits

O_CLOEXEC: special because it affects the fd, not the struct file.

Linus, you rejected resolveat() because you wanted a *nice* API that people 
would use and that might even be adopted by other OSes. Let’s please not make 
openat2() be a giant pile of crap in the name of consistency with open().  
open(), frankly, is horrible.



Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-07 Thread Linus Torvalds
On Sat, Sep 7, 2019 at 5:40 AM Jeff Layton  wrote:
>
> After thinking about this a bit, I wonder if we might be better served
> with a new set of OA2_* flags instead of repurposing the O_* flags?

I'd hate to have yet _another_ set of translation functions, and
another chance of people just getting it wrong either in user space or
the kernel.

So no. Let's not make another set of flags that has no sane way to
have type-safety to avoid more confusion.

The new flags that _only_ work with openat2() might be named with a
prefix/suffix to mark that, but I'm not sure it's a huge deal.

 Linus


Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-07 Thread Jeff Layton
On Thu, 2019-09-05 at 06:19 +1000, Aleksa Sarai wrote:
> The most obvious syscall to add support for the new LOOKUP_* scoping
> flags would be openat(2). However, there are a few reasons why this is
> not the best course of action:
> 
>  * The new LOOKUP_* flags are intended to be security features, and
>openat(2) will silently ignore all unknown flags. This means that
>users would need to avoid foot-gunning themselves constantly when
>using this interface if it were part of openat(2). This can be fixed
>by having userspace libraries handle this for users[1], but should be
>avoided if possible.
> 
>  * Resolution scoping feels like a different operation to the existing
>O_* flags. And since openat(2) has limited flag space, it seems to be
>quite wasteful to clutter it with 5 flags that are all
>resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
>its entire purpose is to error out if you encounter a trailing
>symlink -- not to scope resolution.
> 
>  * Other systems would be able to reimplement this syscall allowing for
>cross-OS standardisation rather than being hidden amongst O_* flags
>which may result in it not being used by all the parties that might
>want to use it (file servers, web servers, container runtimes, etc).
> 
>  * It gives us the opportunity to iterate on the O_PATH interface. In
>particular, the new @how->upgrade_mask field for fd re-opening is
>only possible because we have a clean slate without needing to re-use
>the ACC_MODE flag design nor the existing openat(2) @mode semantics.
> 
> To this end, we introduce the openat2(2) syscall. It provides all of the
> features of openat(2) through the @how->flags argument, but also
> also provides a new @how->resolve argument which exposes RESOLVE_* flags
> that map to our new LOOKUP_* flags. It also eliminates the long-standing
> ugliness of variadic-open(2) by embedding it in a struct.
> 
> In order to allow for userspace to lock down their usage of file
> descriptor re-opening, openat2(2) has the ability for users to disallow
> certain re-opening modes through @how->upgrade_mask. At the moment,
> there is no UPGRADE_NOEXEC.
> 
> [1]: https://github.com/openSUSE/libpathrs
> 
> Suggested-by: Christian Brauner 
> Signed-off-by: Aleksa Sarai 
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl  |  1 +
>  arch/arm/tools/syscall.tbl  |  1 +
>  arch/arm64/include/asm/unistd.h |  2 +-
>  arch/arm64/include/asm/unistd32.h   |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl   |  1 +
>  arch/m68k/kernel/syscalls/syscall.tbl   |  1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
>  arch/parisc/kernel/syscalls/syscall.tbl |  1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl|  1 +
>  arch/s390/kernel/syscalls/syscall.tbl   |  1 +
>  arch/sh/kernel/syscalls/syscall.tbl |  1 +
>  arch/sparc/kernel/syscalls/syscall.tbl  |  1 +
>  arch/x86/entry/syscalls/syscall_32.tbl  |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl |  1 +
>  fs/open.c   | 94 -
>  include/linux/fcntl.h   | 19 -
>  include/linux/fs.h  |  4 +-
>  include/linux/syscalls.h| 14 ++-
>  include/uapi/asm-generic/unistd.h   |  5 +-
>  include/uapi/linux/fcntl.h  | 42 +
>  24 files changed, 168 insertions(+), 30 deletions(-)
> 

[...]

> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 1d338357df8a..479baf2da10e 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -93,5 +93,47 @@
>  
>  #define AT_RECURSIVE 0x8000  /* Apply to the entire subtree */
>  
> +/**
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates identically to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. In addition, @mode (or @upgrade_mask) must be
> + * zero unless one of {O_CREAT, O_TMPFILE, O_PATH} are set.
> + *

After thinking about this a bit, I wonder if we might be better served
with a new set of OA2_* flags instead of repurposing the O_* flags?

Yes, those flags are familiar, but this is an entirely new syscall. We
have a chance to make a fresh start. Does something like O_LARGEFILE
have any real place in openat2? I'd argue no.

Also, once you want to add a new flag, then we get into the mess of how
to document whether open/openat also support it. It'd be good to freeze
changes on those syscalls and aim to only introduce new functionality in
openat2.

That would also allow us to drop some 

Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-04 Thread Randy Dunlap
Hi,
just noisy nits here:

On 9/4/19 1:19 PM, Aleksa Sarai wrote:

> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 1d338357df8a..479baf2da10e 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -93,5 +93,47 @@
>  
>  #define AT_RECURSIVE 0x8000  /* Apply to the entire subtree */
>  
> +/**

/** means "the following is kernel-doc", but it's not, so please either make
it kernel-doc format or just use /* to begin the comment.

> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates identically to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. In addition, @mode (or @upgrade_mask) must be
> + * zero unless one of {O_CREAT, O_TMPFILE, O_PATH} are set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __u32 flags;
> + union {
> + __u16 mode;
> + __u16 upgrade_mask;
> + };
> + __u16 resolve;
> +};


-- 
~Randy