Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Aleksa Sarai
On 2020-11-24, Florian Weimer  wrote:
> This documents a way to safely use new security-related system calls
> while preserving compatibility with container runtimes that require
> insecure emulation (because they filter the system call by default).
> Admittedly, it is somewhat hackish, but it can be implemented by
> userspace today, for existing system calls such as faccessat2,
> without kernel or container runtime changes.
> 
> Signed-off-by: Florian Weimer 
> 
> ---
>  Documentation/process/adding-syscalls.rst | 37 
> +++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/Documentation/process/adding-syscalls.rst 
> b/Documentation/process/adding-syscalls.rst
> index a3ecb236576c..7d1e578a1df1 100644
> --- a/Documentation/process/adding-syscalls.rst
> +++ b/Documentation/process/adding-syscalls.rst
> @@ -436,6 +436,40 @@ simulates registers etc).  Fixing this is as simple as 
> adding a #define to
>  
>  #define stub_xyzzy sys_xyzzy
>  
> +Container Compatibility and seccomp
> +-----------------------------------
> +
> +The Linux Foundation Open Container Initiative Runtime Specification
> +requires that by default, implementations install seccomp system call
> +filters which cause system calls to fail with ``EPERM``.  As a result,
> +all new system calls in such containers fail with ``EPERM`` instead of
> +``ENOSYS``.  This design is problematic because ``EPERM`` is a
> +legitimate system call result which should not trigger fallback to a
> +userspace emulation, particularly for security-related system calls.
> +(With ``ENOSYS``, it is clear that a fallback implementation has to be
> +used to maintain compatibility with older kernels or container
> +runtimes.)
> +
> +New system calls should therefore provide a way to reliably trigger an
> +error distinct from ``EPERM``, without any side effects.  Some ways to
> +achieve that are:
> +
> + - ``EBADFD`` for the invalid file descriptor -1
> + - ``EFAULT`` for a null pointer
> + - ``EINVAL`` for a contradictory set of flags that will remain invalid
> +   in the future
> +
> +If a system call has such error behavior, upon encountering an
> +``EPERM`` error, userspace applications can perform further
> +invocations of the same system call to check if the ``EPERM`` error
> +persists for those known error conditions.  If those also fail with
> +``EPERM``, that likely means that the original ``EPERM`` error was the
> +result of a seccomp filter, and should be treated like ``ENOSYS``
> +(e.g., trigger an alternative fallback implementation).  If those
> +probing system calls do not fail with ``EPERM``, the error likely came
> +from a real implementation, and should be reported to the caller
> +directly, without resorting to ``ENOSYS``-style fallback.
> +
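
The probing step described in the quoted text can be sketched in userspace
roughly as follows. This is only an illustration, not code from the patch:
the names classify_probe_errno() and probe_faccessat2() are made up here,
the fallback number 439 is the asm-generic value and differs on a few
architectures, and the probe relies on a real faccessat2 rejecting a null
pathname with EFAULT. (The list above says EBADFD; the errno Linux normally
reports for an invalid descriptor is EBADF.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 439 is the asm-generic number; some architectures differ. */
#ifndef SYS_faccessat2
#define SYS_faccessat2 439
#endif

enum probe_verdict {
	PROBE_USE_FALLBACK,	/* filtered or unimplemented: emulate instead */
	PROBE_REPORT_EPERM,	/* a real implementation answered: EPERM is genuine */
};

/* Decision step only: interpret the errno of a side-effect-free probe. */
enum probe_verdict classify_probe_errno(int probe_errno)
{
	if (probe_errno == EPERM || probe_errno == ENOSYS)
		return PROBE_USE_FALLBACK;
	return PROBE_REPORT_EPERM;
}

/*
 * Hypothetical probe: call faccessat2 with a null pathname. A real
 * implementation is expected to fail with EFAULT; a seccomp filter in
 * the EPERM-everything style fails with EPERM before the kernel looks
 * at the arguments.
 */
enum probe_verdict probe_faccessat2(void)
{
	long ret = syscall(SYS_faccessat2, -1L, (const char *)NULL, 0L, 0L);

	return classify_probe_errno(ret == -1 ? errno : 0);
}
```

A caller would only run probe_faccessat2() after the real call has already
failed with EPERM, and could cache the verdict for the rest of the process.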

As I mentioned in the runc thread[1], this is really down to Docker's
default policy configuration. The EPERM-everything behaviour in OCI was
inherited from Docker, and it boils down to not having an additional
seccomp rule which does ENOSYS for unknown syscall numbers (Docker can
just add the rule without modifying the OCI runtime-spec -- so it's
something Docker can fix entirely on their own). I'll prepare a patch
for Docker this week.
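
To make the "additional seccomp rule" concrete, here is a rough classic-BPF
sketch of a filter tail that returns ENOSYS for syscall numbers newer than
the profile knows about. It is not the Docker patch: the cut-off
LAST_KNOWN_SYSCALL and the names enosys_tail and enosys_tail_prog() are
assumptions for illustration, the architecture check that any real filter
needs is left out, and the final ALLOW stands in for the rest of a real
profile.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical cut-off: the newest syscall number the profile knows. */
#define LAST_KNOWN_SYSCALL 439

/*
 * Sketch only. A real filter must check seccomp_data.arch first, and the
 * final ALLOW would be replaced by the profile's allow/deny rules.
 */
static struct sock_filter enosys_tail[] = {
	/* Load the syscall number. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* nr > cut-off: fall through to ENOSYS; otherwise skip one insn. */
	BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, LAST_KNOWN_SYSCALL, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Wrap the instructions in the structure the seccomp interfaces expect. */
struct sock_fprog enosys_tail_prog(void)
{
	struct sock_fprog prog = {
		.len = sizeof(enosys_tail) / sizeof(enosys_tail[0]),
		.filter = enosys_tail,
	};

	return prog;
}
```

With a rule like this in place, userspace sees ENOSYS for syscalls the
profile has never heard of and can fall back as it would on an old kernel.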

IMHO it's also slightly overkill to change the kernel API design
guidelines in response to this issue.

[1]: https://github.com/opencontainers/runc/issues/2151

>  Other Details
>  -------------
> @@ -575,3 +609,6 @@ References and Sources
>   - Recommendation from Linus Torvalds that x32 system calls should prefer
> compatibility with 64-bit versions rather than 32-bit versions:
> https://lkml.org/lkml/2011/8/31/244
> + - Linux Configuration section of the Open Container Initiative
> +   Runtime Specification:
> +   https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




[PATCH] openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT

2020-10-07 Thread Aleksa Sarai
This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).

This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.

Cc:  # v5.6+
Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall")
Acked-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 fs/open.c  | 4 
 tools/testing/selftests/openat2/openat2_test.c | 8 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index 9af548fb841b..4d7537ae59df 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1010,6 +1010,10 @@ inline int build_open_flags(const struct open_how *how, 
struct open_flags *op)
if (how->resolve & ~VALID_RESOLVE_FLAGS)
return -EINVAL;
 
+   /* Scoping flags are mutually exclusive. */
+   if ((how->resolve & RESOLVE_BENEATH) && (how->resolve & 
RESOLVE_IN_ROOT))
+   return -EINVAL;
+
/* Deal with the mode. */
if (WILL_CREATE(flags)) {
if (how->mode & ~S_IALLUGO)
diff --git a/tools/testing/selftests/openat2/openat2_test.c 
b/tools/testing/selftests/openat2/openat2_test.c
index b386367c606b..381d874cce99 100644
--- a/tools/testing/selftests/openat2/openat2_test.c
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -155,7 +155,7 @@ struct flag_test {
int err;
 };
 
-#define NUM_OPENAT2_FLAG_TESTS 23
+#define NUM_OPENAT2_FLAG_TESTS 24
 
 void test_openat2_flags(void)
 {
@@ -210,6 +210,12 @@ void test_openat2_flags(void)
  .how.flags = O_TMPFILE | O_RDWR,
  .how.mode = 0xA000ULL, .err = -EINVAL },
 
+   /* ->resolve flags must not conflict. */
+   { .name = "incompatible resolve flags (BENEATH | IN_ROOT)",
+ .how.flags = O_RDONLY,
+ .how.resolve = RESOLVE_BENEATH | RESOLVE_IN_ROOT,
+ .err = -EINVAL },
+
/* ->resolve must only contain RESOLVE_* flags. */
{ .name = "invalid how.resolve and O_RDONLY",
  .how.flags = O_RDONLY,
-- 
2.28.0
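
For readers following along outside the kernel tree, the new check can be
mirrored in a small userspace sketch. The function name
check_resolve_scoping() is made up for this example; the two flag values
are the ones from the openat2 UAPI header and are only defined here in case
the installed headers predate openat2.

```c
#include <errno.h>
#include <stdint.h>

/* Values as in the openat2 UAPI header, for older installed headers. */
#ifndef RESOLVE_BENEATH
#define RESOLVE_BENEATH 0x08
#endif
#ifndef RESOLVE_IN_ROOT
#define RESOLVE_IN_ROOT 0x10
#endif

/* Userspace mirror of the added check: the scoping flags are exclusive. */
int check_resolve_scoping(uint64_t resolve)
{
	if ((resolve & RESOLVE_BENEATH) && (resolve & RESOLVE_IN_ROOT))
		return -EINVAL;
	return 0;
}
```

Either flag alone still passes; only the combination is rejected, which is
what the new selftest entry exercises.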



Re: [PATCH v2 2/2] vfs: add fchmodat2 syscall

2020-09-16 Thread Aleksa Sarai
lls/syscall.tbl 
> b/arch/sparc/kernel/syscalls/syscall.tbl
> index 4af114e84f20..e817416f81df 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -485,3 +485,4 @@
>  437  common  openat2 sys_openat2
>  438  common  pidfd_getfd sys_pidfd_getfd
>  439  common  faccessat2  sys_faccessat2
> +440  common  fchmodat2   sys_fchmodat2
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 9d1102873666..208b06650cef 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -444,3 +444,4 @@
>  437  i386  openat2 sys_openat2
>  438  i386  pidfd_getfd sys_pidfd_getfd
>  439  i386  faccessat2  sys_faccessat2
> +440  i386  fchmodat2   sys_fchmodat2
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f30d6ae9a688..d9a591db72fb 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -361,6 +361,7 @@
>  437  common  openat2 sys_openat2
>  438  common  pidfd_getfd sys_pidfd_getfd
>  439  common  faccessat2  sys_faccessat2
> +440  common  fchmodat2   sys_fchmodat2
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl 
> b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 6276e3c2d3fc..ff756cb2f5d7 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -410,3 +410,4 @@
>  437  common  openat2 sys_openat2
>  438  common  pidfd_getfd sys_pidfd_getfd
>  439  common  faccessat2  sys_faccessat2
> +440  common  fchmodat2   sys_fchmodat2
> diff --git a/fs/open.c b/fs/open.c
> index cdb7964aaa6e..f492c782c0ed 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -616,11 +616,16 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
>   return err;
>  }
>  
> -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
> +static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, 
> int flags)
>  {
>   struct path path;
>   int error;
>   unsigned int lookup_flags = LOOKUP_FOLLOW;
> +
> + if (flags & ~AT_SYMLINK_NOFOLLOW)
> + return -EINVAL;
> + if (flags & AT_SYMLINK_NOFOLLOW)
> + lookup_flags &= ~LOOKUP_FOLLOW;
>  retry:
>   error = user_path_at(dfd, filename, lookup_flags, &path);
>   if (!error) {
> @@ -634,15 +639,21 @@ static int do_fchmodat(int dfd, const char __user 
> *filename, umode_t mode)
>   return error;
>  }
>  
> +SYSCALL_DEFINE4(fchmodat2, int, dfd, const char __user *, filename,
> + umode_t, mode, int, flags)
> +{
> + return do_fchmodat(dfd, filename, mode, flags);
> +}
> +
>  SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename,
>   umode_t, mode)
>  {
> - return do_fchmodat(dfd, filename, mode);
> + return do_fchmodat(dfd, filename, mode, 0);
>  }
>  
>  SYSCALL_DEFINE2(chmod, const char __user *, filename, umode_t, mode)
>  {
> - return do_fchmodat(AT_FDCWD, filename, mode);
> + return do_fchmodat(AT_FDCWD, filename, mode, 0);
>  }
>  
>  int chown_common(const struct path *path, uid_t user, gid_t group)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 75ac7f8ae93c..ced00c56eba7 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -435,6 +435,8 @@ asmlinkage long sys_chroot(const char __user *filename);
>  asmlinkage long sys_fchmod(unsigned int fd, umode_t mode);
>  asmlinkage long sys_fchmodat(int dfd, const char __user * filename,
>umode_t mode);
> +asmlinkage long sys_fchmodat2(int dfd, const char __user * filename,
> +   umode_t mode, int flags);
>  asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t 
> user,
>gid_t group, int flag);
>  asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
> diff --git a/include/uapi/asm-generic/unistd.h 
> b/include/uapi/asm-generic/unistd.h
> index 995b36c2ea7d..ebf5cdb3f444 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -859,9 +859,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
>  __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
>  #define __NR_faccessat2 439
>  __SYSCALL(__NR_faccessat2, sys_faccessat2)
> +#define __NR_fchmodat2 440
> +__SYSCALL(__NR_fchmodat2, sys_fchmodat2)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 440
> +#define __NR_syscalls 441
>  
>  /*
>   * 32 bit systems traditionally used different
> -- 
> 2.21.0
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH v8 1/2] Add a "nosymfollow" mount option.

2020-08-27 Thread Aleksa Sarai
On 2020-08-27, Al Viro  wrote:
> On Wed, Aug 26, 2020 at 02:48:19PM -0600, Ross Zwisler wrote:
> 
> > Al, now that the changes to fs/namei.c have landed and we're past the merge
> > window for v5.9, what are your thoughts on this patch and the associated 
> > test?
> 
> Humm...  should that be nd->path.mnt->mnt_flags or link->mnt->mnt_flags?
> Usually it's the same thing, but they might differ.  IOW, is that about the
> directory we'd found it in, or is it about the link itself?

Now that you mention it, I think link->mnt->mnt_flags makes more sense.
The restriction should apply in the context of whatever filesystem
contains the symlink, and that would match FreeBSD's semantics (at
least as far as I can tell from a quick look at sys/kern/vfs_lookup.c).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH RESEND] fs: Move @f_count to different cacheline with @f_mode

2020-08-26 Thread Aleksa Sarai
On 2020-08-26, Shaokun Zhang  wrote:
> On 2020/8/22 0:02, Will Deacon wrote:
> >   - This thing is tagged with __randomize_layout, so it doesn't help anybody
> > using that crazy plugin
> 
> This patch completely isolates @f_count from @f_mode, and we don't care about the
> base address of the structure, unless I am missing something in what you said.

__randomize_layout randomises the order of fields in a structure on each
kernel rebuild (to make attacks against sensitive kernel structures
theoretically harder because the offset of a field is per-build). It is
separate to ASLR or other base-related randomisation. However it depends
on having CONFIG_GCC_PLUGIN_RANDSTRUCT=y and I believe (at least for
distribution kernels) this isn't a widely-used configuration.
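
To illustrate why field offsets (rather than the structure's base address)
are what matter here, the following userspace sketch checks whether two
fields land in the same cacheline. The structure sketch_file, the helper
same_cacheline() and the 64-byte line size are assumptions for the example;
it does not model what the randstruct plugin does to a real struct file.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumed cacheline size; real hardware may differ. */
#define SKETCH_CACHELINE 64

/* Toy stand-in for a structure with a hot counter and a read-mostly field. */
struct sketch_file {
	unsigned int f_mode;
	/* Force the counter onto its own cacheline. */
	alignas(SKETCH_CACHELINE) long f_count;
};

/* Two offsets share a cacheline if they fall in the same 64-byte block. */
int same_cacheline(size_t off_a, size_t off_b)
{
	return off_a / SKETCH_CACHELINE == off_b / SKETCH_CACHELINE;
}
```

If the field order is shuffled at build time, the offsets fed into a check
like this change from build to build, which is why a hand-picked ordering
cannot be relied upon with that plugin enabled.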

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH] MAINTAINERS: add namespace entry

2020-08-26 Thread Aleksa Sarai
On 2020-08-25, Eric W. Biederman  wrote:
> C) You have overstated what I have agreed to here.
>I have previously said that I agree that having a MAINTAINERS
>entry so people who are unfamiliar with the situation with namespaces
>can find us.  Given that most of the changes going forward are likely
>to be maintenance changes.
> 
>I also said we need to talk about how we plan to maintain the code
>here.
> 
>It feels like you are pushing this hard, and I am not certain why you
>are pushing and rushing this.  With my maintainer hat on my big
>concern is that we catch the problems that will introduce security issues.
>Recently I have seen a report that there is an issue on Ubuntu
>kernels where anyone can read /etc/shadow.  The problem is that
>Ubuntu has not been cautious and has not taken the time to figure out
>how to enable things for unprivileged users safely, and has just
>enabled the code to be used by unprivileged users because it is
>useful.
> 
>In combination with you pushing hard and not taking the time to
>complete this conversation in private with me, this MAINTAINERS entry
>makes me uneasy as it feels like you may be looking for a way to push
>the code into the mainline kernel like has been pushed into the
>Ubuntu kernel.  I may be completely wrong I just don't know what to
>make of your not finishing our conversation in private, and forcing
>my hand by posting this patch publicly.

Eric, with all due respect, Christian is not a sleeper agent of some
shadow Ubuntu kernel team that is tirelessly trying to slip things by
you. I have no idea where you could have possibly gotten this
impression, given his track record of the past few years.

I also don't understand why you feel the need to talk about things which
he had nothing to do with -- what does the /etc/shadow
thing have to do with his work and track record? Were Debian kernel
contributors considered untrustworthy because of the OpenSSL weak keys
issue? Would it be fair to question your competence because some RHEL
kernel backports were borked? Of course not -- that would be ridiculous!

> At the same time I am not convinced you are actually going to do the
> work to make new code maintainable and not a problem for other kernel
> developers.
> 
> A big part of the job over the years has been to make the namespace ideas
> proposed sane, and to keep the burden from other maintainers of naive
> and terrible code.  Pushing this change before we finished our private
> conversation makes me very nervous on that front.

What gives you that impression? This whole thing seems incredibly
strange -- we've all met IRL several times, and have had many long
discussions about the best way to solve problems without placing undue
burden on kernel maintenance.

Furthermore, I don't think this is an acceptable way to talk about a
peer within the kernel community -- attributing malicious intent without
any justification other than "I feel this is the case" is little more
than a character assassination, and I don't see why you would feel that
such a statement is justified.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH v7] Add a "nosymfollow" mount option.

2020-08-11 Thread Aleksa Sarai
On 2020-08-11, Ross Zwisler  wrote:
> From: Mattias Nissler 
> 
> For mounts that have the new "nosymfollow" option, don't follow symlinks
> when resolving paths. The new option is similar in spirit to the
> existing "nodev", "noexec", and "nosuid" options, as well as to the
> LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
> variants have been supporting the "nosymfollow" mount option for a long
> time with equivalent implementations.
> 
> Note that symlinks may still be created on file systems mounted with
> the "nosymfollow" option present. readlink() remains functional, so
> user space code that is aware of symlinks can still choose to follow
> them explicitly.
> 
> Setting the "nosymfollow" mount option helps prevent privileged
> writers from modifying files unintentionally in case there is an
> unexpected link along the accessed path. The "nosymfollow" option is
> thus useful as a defensive measure for systems that need to deal with
> untrusted file systems in privileged contexts.
> 
> More information on the history and motivation for this patch can be
> found here:
> 
> https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal

Looks good. Did you plan to add an in-tree test for this (you could
shove it in tools/testing/selftests/mount)?
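
As a starting point for such a test, something along these lines could
check whether a mount reports the option through statvfs(3). This is only a
sketch: the helper names are made up, and the fallback value 0x2000 for
ST_NOSYMFOLLOW is an assumption taken from the kernel's statfs flag list
for libc headers that do not define it yet.

```c
#include <sys/statvfs.h>

/* Assumed value for older libc headers that lack the flag. */
#ifndef ST_NOSYMFOLLOW
#define ST_NOSYMFOLLOW 0x2000
#endif

/* Pure helper: does a statvfs f_flag word carry the nosymfollow bit? */
int flags_have_nosymfollow(unsigned long f_flag)
{
	return (f_flag & ST_NOSYMFOLLOW) != 0;
}

/* Returns 1 or 0 for the mount containing path, or -1 if statvfs fails. */
int mount_has_nosymfollow(const char *path)
{
	struct statvfs st;

	if (statvfs(path, &st) != 0)
		return -1;
	return flags_have_nosymfollow(st.f_flag);
}
```

A fuller selftest would also mount a tmpfs with the option, create a
symlink on it, and check that opening through the link fails with ELOOP
while readlink(2) still works.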

Reviewed-by: Aleksa Sarai 

> Signed-off-by: Mattias Nissler 
> Signed-off-by: Ross Zwisler 
> ---
> Changes since v6 [1]:
>  * Rebased onto v5.8.
>  * Another round of testing including readlink(1), readlink(2),
>realpath(1), realpath(3), statfs(2) and mount(2) to make sure
>everything still works.
> 
> After this lands I will upstream changes to util-linux[2] and man-pages
> [3].
> 
> [1]: https://lkml.org/lkml/2020/3/4/770
> [2]: 
> https://github.com/rzwisler/util-linux/commit/7f8771acd85edb70d97921c026c55e1e724d4e15
> [3]: 
> https://github.com/rzwisler/man-pages/commit/b8fe8079f64b5068940c0144586e580399a71668
> ---
>  fs/namei.c | 3 ++-
>  fs/namespace.c | 2 ++
>  fs/proc_namespace.c| 1 +
>  fs/statfs.c| 2 ++
>  include/linux/mount.h  | 3 ++-
>  include/linux/statfs.h | 1 +
>  include/uapi/linux/mount.h | 1 +
>  7 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 72d4219c93acb..ed68478fb1fb6 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1626,7 +1626,8 @@ static const char *pick_link(struct nameidata *nd, 
> struct path *link,
>   return ERR_PTR(error);
>   }
>  
> - if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
> + if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS) ||
> + unlikely(nd->path.mnt->mnt_flags & MNT_NOSYMFOLLOW))
>   return ERR_PTR(-ELOOP);
>  
>   if (!(nd->flags & LOOKUP_RCU)) {
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4a0f600a33285..1cbbf5a9b954f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3167,6 +3167,8 @@ long do_mount(const char *dev_name, const char __user 
> *dir_name,
>   mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
>   if (flags & MS_RDONLY)
>   mnt_flags |= MNT_READONLY;
> + if (flags & MS_NOSYMFOLLOW)
> + mnt_flags |= MNT_NOSYMFOLLOW;
>  
>   /* The default atime for remount is preservation */
>   if ((flags & MS_REMOUNT) &&
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 3059a9394c2d6..e59d4bb3a89e4 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -70,6 +70,7 @@ static void show_mnt_opts(struct seq_file *m, struct 
> vfsmount *mnt)
>   { MNT_NOATIME, ",noatime" },
>   { MNT_NODIRATIME, ",nodiratime" },
>   { MNT_RELATIME, ",relatime" },
> + { MNT_NOSYMFOLLOW, ",nosymfollow" },
>   { 0, NULL }
>   };
>   const struct proc_fs_opts *fs_infop;
> diff --git a/fs/statfs.c b/fs/statfs.c
> index 2616424012ea7..59f33752c1311 100644
> --- a/fs/statfs.c
> +++ b/fs/statfs.c
> @@ -29,6 +29,8 @@ static int flags_by_mnt(int mnt_flags)
>   flags |= ST_NODIRATIME;
>   if (mnt_flags & MNT_RELATIME)
>   flags |= ST_RELATIME;
> + if (mnt_flags & MNT_NOSYMFOLLOW)
> + flags |= ST_NOSYMFOLLOW;
>   return flags;
>  }
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index de657bd211fa6..aaf343b38671c 100644
> --

Re: [PATCH] Userfaultfd: Avoid double free of userfault_ctx and remove O_CLOEXEC

2020-08-06 Thread Aleksa Sarai
On 2020-08-04, Eric Biggers  wrote:
> On Wed, Aug 05, 2020 at 01:47:58PM +1000, Aleksa Sarai wrote:
> > On 2020-08-04, Lokesh Gidra  wrote:
> > > when get_unused_fd_flags returns error, ctx will be freed by
> > > userfaultfd's release function, which is indirectly called by fput().
> > > Also, if anon_inode_getfile_secure() returns an error, then
> > > userfaultfd_ctx_put() is called, which calls mmdrop() and frees ctx.
> > > 
> > > Also, the O_CLOEXEC was inadvertently added to the call to
> > > get_unused_fd_flags() [1].
> > 
> > I disagree that it is "wrong" to do O_CLOEXEC-by-default (after all,
> > it's trivial to disable O_CLOEXEC, but it's non-trivial to enable it on
> > an existing file descriptor because it's possible for another thread to
> > exec() before you set the flag). Several new syscalls and fd-returning
> > facilities are O_CLOEXEC-by-default now (the most obvious being pidfds
> > and seccomp notifier fds).
> 
> Sure, O_CLOEXEC *should* be the default, but this is an existing syscall so it
> has to keep the existing behavior.

Ah, I missed that this was a UAPI breakage. :P

> > At the very least there should be a new flag added that sets O_CLOEXEC.
> 
> There already is one (but these patches broke it).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH] Userfaultfd: Avoid double free of userfault_ctx and remove O_CLOEXEC

2020-08-04 Thread Aleksa Sarai
On 2020-08-04, Lokesh Gidra  wrote:
> when get_unused_fd_flags returns error, ctx will be freed by
> userfaultfd's release function, which is indirectly called by fput().
> Also, if anon_inode_getfile_secure() returns an error, then
> userfaultfd_ctx_put() is called, which calls mmdrop() and frees ctx.
> 
> Also, the O_CLOEXEC was inadvertently added to the call to
> get_unused_fd_flags() [1].

I disagree that it is "wrong" to do O_CLOEXEC-by-default (after all,
it's trivial to disable O_CLOEXEC, but it's non-trivial to enable it on
an existing file descriptor because it's possible for another thread to
exec() before you set the flag). Several new syscalls and fd-returning
facilities are O_CLOEXEC-by-default now (the most obvious being pidfds
and seccomp notifier fds).

At the very least there should be a new flag added that sets O_CLOEXEC.
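
To spell out the asymmetry: asking for close-on-exec at creation time is
atomic, and turning it off afterwards is a plain fcntl(2) round trip,
whereas turning it on afterwards leaves a window in which another thread
can fork and exec. A small sketch with made-up helper names, using open(2)
as a stand-in for any fd-returning call:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Atomic: the descriptor is never visible without FD_CLOEXEC set. */
int open_cloexec(const char *path)
{
	return open(path, O_RDONLY | O_CLOEXEC);
}

/* Opting out later is easy and race-free for the caller that wants it. */
int clear_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

The reverse helper (setting FD_CLOEXEC after the fact) would look almost
identical, but it cannot close the window between the descriptor being
created and the flag being set.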

> Adding Al Viro's suggested-by, based on [2].
> 
> [1] 
> https://lore.kernel.org/lkml/1f69c0ab-5791-974f-8bc0-3997ab1d6...@dancol.org/
> [2] https://lore.kernel.org/lkml/20200719165746.gj2786...@zeniv.linux.org.uk/
> 
> Fixes: d08ac70b1e0d (Wire UFFD up to SELinux)
> Suggested-by: Al Viro 
> Reported-by: syzbot+75867c44841cb6373...@syzkaller.appspotmail.com
> Signed-off-by: Lokesh Gidra 
> ---
>  fs/userfaultfd.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index ae859161908f..e15eb8fdc083 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -2042,24 +2042,18 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
>   O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS),
>   NULL);
>   if (IS_ERR(file)) {
> - fd = PTR_ERR(file);
> - goto out;
> + userfaultfd_ctx_put(ctx);
> + return PTR_ERR(file);
>   }
>  
> - fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
> + fd = get_unused_fd_flags(O_RDONLY);
>   if (fd < 0) {
>   fput(file);
> - goto out;
> + return fd;
>   }
>  
>   ctx->owner = file_inode(file);
>   fd_install(fd, file);
> -
> -out:
> - if (fd < 0) {
> -     mmdrop(ctx->mm);
> - kmem_cache_free(userfaultfd_ctx_cachep, ctx);
> - }
>   return fd;
>  }
>  
> -- 
> 2.28.0.163.g6104cc2f0b6-goog
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-28 Thread Aleksa Sarai
On 2020-07-27, Eric W. Biederman  wrote:
> To the best of my knowledge processes with more than one thread
> calling exec are not common, and as all of the threads will be killed
> by exec there does not appear to be any useful work a thread can
> reliably do during exec.

Every Go program which calls exec (this includes runc, Docker, LXD,
Kubernetes, et al) fills the niche of "multi-threaded program that calls
exec" -- all Go programs are multi-threaded and there's no way of
disabling this. This will most likely cause a pretty bad performance
regression for basically all container workloads.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: strace of io_uring events?

2020-07-16 Thread Aleksa Sarai
On 2020-07-15, Kees Cook  wrote:
> Earlier Andy Lutomirski wrote:
> > Let’s add some seccomp folks. We probably also want to be able to run
> > seccomp-like filters on io_uring requests. So maybe io_uring should call 
> > into
> > seccomp-and-tracing code for each action.
> 
> Okay, I'm finally able to spend time looking at this. And thank you to
> the many people that CCed me into this and earlier discussions (at least
> Jann, Christian, and Andy).
> 
> It *seems* like there is a really clean mapping of SQE OPs to syscalls.
> To that end, yes, it should be trivial to add ptrace and seccomp support
> (sort of). The trouble comes for doing _interception_, which is how both
> ptrace and seccomp are designed.
> 
> In the basic case of seccomp, various syscalls are just being checked
> for accept/reject. It seems like that would be easy to wire up. For the
> more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
> I think any such results would need to be "upgraded" to "reject". Things
> are a bit complex in that seccomp's form of "reject" can be "return
> errno" (easy) or it can be "kill thread (or thread_group)" which ...
> becomes less clear. (More on this later.)
> 
> In the basic case of "I want to run strace", this is really just a
> creative use of ptrace in that interception is being used only for
> reporting. Does ptrace need to grow a way to create/attach an io_uring
> eventfd? Or should there be an entirely different tool for
> administrative analysis of io_uring events (kind of how disk IO can be
> monitored)?

I would hope that we wouldn't introduce ptrace to io_uring, because
unless we plan to attach to io_uring events via GDB it's simply the
wrong tool for the job. strace does use ptrace, but that's mostly
because Linux's dynamic tracing was still in its infancy at the time
(and even today it requires more privileges than ptrace) -- but you can
emulate strace using bpftrace these days fairly easily.

So really what is being asked here is "can we make it possible to debug
io_uring programs as easily as traditional I/O programs". And this does
not require ptrace, nor should ptrace be part of this discussion IMHO. I
believe this issue (along with seccomp-style filtering) has been
mentioned informally in the past, but I am happy to finally see a thread
about this appear.

> For io_uring generally, I have a few comments/questions:
> 
> - Why did a new syscall get added that couldn't be extended? All new
>   syscalls should be using Extended Arguments. :(

io_uring was introduced in Linux 5.1, predating clone3() and openat2().
My larger concern is that io_uring operations aren't extensible-structs
-- but we can resolve that issue with some slight ugliness if we ever
run into the problem.
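
For anyone unfamiliar with what "extensible struct" means in the clone3()
and openat2() sense: the caller passes a struct plus its size, and the
receiving side copies it with forwards and backwards compatibility rules.
The following userspace model sketches those rules; copy_struct_compat() is
a made-up name, not the kernel helper.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-supplied struct of usize bytes into a ksize-byte struct.
 *  - older caller (usize < ksize): missing trailing fields read as zero
 *  - newer caller (usize > ksize): extra trailing bytes must all be zero,
 *    otherwise the caller asked for something we do not understand
 */
int copy_struct_compat(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;

	/* usize - size is zero unless the caller's struct is larger. */
	for (size_t i = 0; i < usize - size; i++)
		if (tail[i] != 0)
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

The point of the convention is that new fields can be appended later
without a new syscall, which is the property io_uring's fixed-size
operations lack.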

> - Why aren't the io_uring syscalls in the man-page git? (It seems like
>   they're in liburing, but that should document the _library_ not the
>   syscalls, yes?)

I imagine because using the syscall requires specific memory barriers
which we probably don't want most C programs to be fiddling with
directly. Sort of similar to how iptables doesn't have a syscall-style
man page.

> Speaking to Stefano's proposal[1]:
> 
> - There appear to be three classes of desired restrictions:
>   - opcodes for io_uring_register() (which can be enforced entirely with
> seccomp right now).
>   - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> not currently written)
>   - opcodes of the types of restrictions to restrict... for making sure
> things can't be changed after being set? seccomp already enforces
> that kind of "can only be made stricter"

Unless I misunderstood the patch cover-letter, Stefano's proposal is to
have a mechanism for adding restrictions to individual io_urings -- so
we still need a separate mechanism (or an extended version of Stefano's
proposal) to allow for the "reduce attack surface" usecase of seccomp.
It seems to me like Stefano's proposal is more related to cases where
you might SCM_RIGHTS-send an io_uring to an unprivileged process.

> Solving the mapping of seccomp interception types into CQEs (or anything
> more severe) will likely inform what it would mean to map ptrace events
> to CQEs. So, I think they're related, and we should get seccomp hooked
> up right away, and that might help us see how (if) ptrace should be
> attached.

We could just emulate the seccomp-bpf API with the pseudo-syscalls done
as a result of CQEs, though I'm not sure how happy folks will be with
this kind of glue code in "seccomp-uring" (though in theory it would
allow us to attach existing filters to io_uring...).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH 0/5] RFC: connector: Add network namespace awareness

2020-07-13 Thread Aleksa Sarai
On 2020-07-13, Eric W. Biederman  wrote:
> Matt Bennett  writes:
> 
> > On Thu, 2020-07-02 at 21:10 +0200, Christian Brauner wrote:
> >> On Thu, Jul 02, 2020 at 08:17:38AM -0500, Eric W. Biederman wrote:
> >> > Matt Bennett  writes:
> >> > 
> >> > > Previously the connector functionality could only be used by processes 
> >> > > running in the
> >> > > default network namespace. This meant that any process that uses the 
> >> > > connector functionality
> >> > > could not operate correctly when run inside a container. This is a 
> >> > > draft patch series that
> >> > > attempts to now allow this functionality outside of the default 
> >> > > network namespace.
> >> > > 
> >> > > I see this has been discussed previously [1], but am not sure how my 
> >> > > changes relate to all
> >> > > of the topics discussed there and/or if there are any unintended side 
> >> > > effects from my draft
> >> > > changes.
> >> > 
> >> > Is there a piece of software that uses connector that you want to get
> >> > working in containers?
> >
> > We have an IPC system [1] where processes can register their socket
> > details (unix, tcp, tipc, ...) to a 'monitor' process. Processes can
> > then get notified when other processes they are interested in
> > start/stop their servers and use the registered details to connect to
> > them. Everything works unless a process crashes, in which case the
> > monitoring process never removes their details. Therefore the
> > monitoring process uses the connector functionality with
> > PROC_EVENT_EXIT to detect when a process crashes and removes the
> > details if it is a previously registered PID.
> >
> > This was working for us until we tried to run our system in a container.
> >
> >> > 
> >> > I am curious what the motivation is because up until now there has been
> >> > nothing very interesting using this functionality.  So it hasn't been
> >> > worth anyone's time to make the necessary changes to the code.
> >> 
> >> Imho, we should just state once and for all that the proc connector will
> >> not be namespaced. This is such a corner-case thing and has been
> >> non-namespaced for such a long time without consistent push for it to be
> >> namespaced combined with the fact that this needs quite some code to
> >> make it work correctly that I fear we end up buying more bugs than we're
> >> selling features. And realistically, you and I will end up maintaining
> >> this and I feel this is not worth the time(?). Maybe I'm being too
> >> pessimistic though.
> >> 
> >
> > Fair enough. I can certainly look for another way to detect process
> > crashes. Interestingly I found a patch set [2] on the mailing list
> > that attempts to solve the problem I wish to solve, but it doesn't
> > look like the patches were ever developed further. From reading the
> > discussion thread on that patch set it appears that I should be doing
> > some form of polling on the /proc files.
> 
> Recently Christian Brauner implemented pidfd complete with a poll
> operation that reports when a process terminates.
> 
> If you are willing to change your userspace code switching to pidfd
> should be all that you need.

While this does solve the problem of getting exit notifications in
general, you cannot get the exit code. But if they don't care about that
then we can solve that problem another time. :D
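
A minimal sketch of the pidfd approach suggested above, assuming Linux 5.3
or later (where pidfd_open(2) exists) and that it is not blocked by a
seccomp filter. The helper name wait_for_exit() and the fallback syscall
number are assumptions made for illustration; as noted, poll() only
reports that the process terminated, not its exit code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumption: 434 is the pidfd_open number on most architectures. It is
 * only used when the libc headers are too old to define SYS_pidfd_open.
 */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Hypothetical helper: block until the process `pid` has terminated.
 * Returns 0 on termination, -1 with errno set on failure. The exit
 * status is NOT reported; poll() only says that the process is gone.
 */
static int wait_for_exit(pid_t pid)
{
	struct pollfd pfd = { .events = POLLIN };
	int ret, saved_errno;

	pfd.fd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pfd.fd < 0)
		return -1;

	/* A pidfd becomes readable once the process has terminated. */
	do {
		ret = poll(&pfd, 1, -1);
	} while (ret < 0 && errno == EINTR);

	/* close() may clobber errno, so preserve it for the caller. */
	saved_errno = errno;
	close(pfd.fd);
	errno = saved_errno;

	return ret == 1 ? 0 : -1;
}
```

A monitor that tracks many registered PIDs would more likely keep the
pidfds open and add them to a single epoll set instead of blocking on
one process at a time.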

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


signature.asc
Description: PGP signature


Re: [PATCH 0/5] RFC: connector: Add network namespace awareness

2020-07-02 Thread Aleksa Sarai
On 2020-07-02, Christian Brauner  wrote:
> On Thu, Jul 02, 2020 at 08:17:38AM -0500, Eric W. Biederman wrote:
> > Matt Bennett  writes:
> > 
> > > > Previously the connector functionality could only be used by processes
> > > > running in the default network namespace. This meant that any process
> > > > that uses the connector functionality could not operate correctly when
> > > > run inside a container. This is a draft patch series that attempts to
> > > > now allow this functionality outside of the default network namespace.
> > > >
> > > > I see this has been discussed previously [1], but am not sure how my
> > > > changes relate to all of the topics discussed there and/or if there are
> > > > any unintended side effects from my draft changes.
> > 
> > Is there a piece of software that uses connector that you want to get
> > working in containers?
> > 
> > I am curious what the motivation is because up until now there has been
> > nothing very interesting using this functionality.  So it hasn't been
> > worth anyone's time to make the necessary changes to the code.
> 
> Imho, we should just state once and for all that the proc connector will
> not be namespaced. This is such a corner-case thing and has been
> non-namespaced for such a long time without consistent push for it to be
> namespaced combined with the fact that this needs quite some code to
> make it work correctly that I fear we end up buying more bugs than we're
> selling features. And realistically, you and I will end up maintaining
> this and I feel this is not worth the time(?). Maybe I'm being too
> pessimistic though.

It would be nice to have the proc connector be namespaced, because it
would allow you to have init systems that don't depend on cgroups to
operate -- and it would allow us to have a subset of FreeBSD's kqueue
functionality that doesn't exist today under Linux. However, arguably
pidfds might be a better path forward toward implementing such events
these days -- and is maybe something we should look into.

All of that being said, I agree that doing this is going to be
particularly hairy and likely not worth the effort. In particular, the
proc connector is:

 * Almost entirely unused (and largely unknown) by userspace.

 * Fairly fundamentally broken right now (the "security feature" of
   PROC_CN_MCAST_LISTEN doesn't work because once there is one listener,
   anyone who opens a cn_proc socket can get all events on the system
   -- and if the process which opened the socket dies without calling
   PROC_CN_MCAST_IGNORE then that information is now always streaming).
   So if we end up supporting this, we'll need to fix those bugs too.

 * So deeply intertwined with netlink, and thus so deeply tied to
   network namespaces (rather than pid namespaces), that getting it to
   correctly handle shared-network-namespace cases is going to be a
   nightmare. I agree with Eric that this patchset looks like it doesn't
   approach the problem from the right angle (and thinking about how you
   could fix it makes me a little nervous).
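
For readers unfamiliar with the interface, the listen/ignore toggle in
the second item is a small netlink message. The following is a sketch
only, assuming the uapi headers <linux/connector.h> and <linux/cn_proc.h>
are installed; the helper name build_proc_listen() is hypothetical, and
the caller is assumed to pass a buffer suitably aligned for a struct
nlmsghdr.

```c
#include <linux/cn_proc.h>
#include <linux/connector.h>
#include <linux/netlink.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill `buf` with a proc connector control message
 * carrying `op` (PROC_CN_MCAST_LISTEN or PROC_CN_MCAST_IGNORE).
 * Returns the message length, or -1 if the buffer is too small.
 */
static ssize_t build_proc_listen(void *buf, size_t buflen,
				 enum proc_cn_mcast_op op)
{
	const size_t payload = sizeof(struct cn_msg) + sizeof(op);
	const size_t total = NLMSG_LENGTH(payload);
	struct nlmsghdr *nlh = buf;
	struct cn_msg *cn;

	if (buflen < total)
		return -1;
	memset(buf, 0, total);

	/* Outer netlink header. */
	nlh->nlmsg_len = (__u32)total;
	nlh->nlmsg_type = NLMSG_DONE;
	nlh->nlmsg_pid = (__u32)getpid();

	/* Connector header addressed to the proc connector. */
	cn = NLMSG_DATA(nlh);
	cn->id.idx = CN_IDX_PROC;
	cn->id.val = CN_VAL_PROC;
	cn->len = sizeof(op);
	memcpy(cn->data, &op, sizeof(op));

	return (ssize_t)total;
}
```

The message would then be sent over a NETLINK_CONNECTOR socket bound to
the CN_IDX_PROC multicast group. Nothing in this exchange carries a
pid-namespace or per-listener identity, which is part of why the
listener accounting described above is so fragile.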

Not to mention that when I brought this up with the maintainer listed in
MAINTAINERS a few years ago (soon after I posted [1]), I was told that
they no longer maintain this code -- so whoever touches it next is the
new maintainer.

In 2017, I wrote that GNU Shepherd uses cn_proc, but I'm pretty sure
(looking at the code now) that it wasn't true then and isn't true now
(Shepherd seems to just do basic pidfile liveness checks). So even the
niche example I used then doesn't actually use cn_proc.

[1]: https://lore.kernel.org/lkml/a2fa1602-2280-c5e8-cac9-b718eaea5...@suse.de/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




[PATCH] symlink.7: document magic-links more completely

2020-06-09 Thread Aleksa Sarai
Hi Michael,

Sorry for the delay; here is the patch I promised in this thread.

--8<-8<--

Traditionally, magic-links have not been a well-understood topic in
Linux. This helps clarify some of the terminology used in openat2.2.

Signed-off-by: Aleksa Sarai 
---
 man7/symlink.7 | 31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/man7/symlink.7 b/man7/symlink.7
index 07b1db3a3764..ed99bc4236f1 100644
--- a/man7/symlink.7
+++ b/man7/symlink.7
@@ -84,6 +84,21 @@ as they are implemented on Linux and other systems,
 are outlined here.
 It is important that site-local applications also conform to these rules,
 so that the user interface can be as consistent as possible.
+.SS Magic-links
+There is a special class of symlink-like objects known as "magic-links" which
+can be found in certain pseudo-filesystems such as
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Unlike normal symlinks, magic-links are not resolved through
+pathname-expansion, but instead act as direct references to the kernel's own
+representation of a file handle. As such, these magic-links allow users to
+access files which cannot be referenced with normal paths (such as unlinked
+files still referenced by a running program.)
+.PP
+Because they can bypass ordinary
+.BR mount_namespaces (7)-based
+restrictions, magic-links have been used as attack vectors in various exploits.
 .SS Symbolic link ownership, permissions, and timestamps
 The owner and group of an existing symbolic link can be changed
 using
@@ -99,16 +114,14 @@ of a symbolic link can be changed using
 or
 .BR lutimes (3).
 .PP
-On Linux, the permissions of a symbolic link are not used
-in any operations; the permissions are always
-0777 (read, write, and execute for all user categories),
 .\" Linux does not currently implement an lchmod(2).
-and can't be changed.
-(Note that there are some "magic" symbolic links in the
-.I /proc
-directory tree\(emfor example, the
-.IR /proc/[pid]/fd/*
-files\(emthat have different permissions.)
+On Linux, the permissions of an ordinary symbolic link are not used in any
+operations; the permissions are always 0777 (read, write, and execute for all
+user categories), and can't be changed.
+.PP
+However, magic-links do not follow this rule. They can have a non-0777 mode,
+though this mode is not currently used in any permission checks.
+
 .\"
 .\" The
 .\" 4.4BSD
-- 
2.26.2



Re: seccomp feature development

2020-05-20 Thread Aleksa Sarai
On 2020-05-19, Alexei Starovoitov  wrote:
> On Wed, May 20, 2020 at 11:20:45AM +1000, Aleksa Sarai wrote:
> > No it won't become copy_from_user(), nor will there be a TOCTOU race.
> > 
> > The idea is that seccomp will proactively copy the struct (and
> > recursively any of the struct pointers inside) before the syscall runs
> > -- as this is done by seccomp it doesn't require any copy_from_user()
> > primitives in cBPF. We then run the cBPF filter on the copied struct,
> > just like how cBPF programs currently operate on seccomp_data (how this
> > would be exposed to the cBPF program as part of the seccomp ABI is the
> > topic of discussion here).
> > 
> > Then, when the actual syscall code runs, the struct will have already
> > been copied and the syscall won't copy it again.
> 
> Let's take bpf syscall as an example.
> Are you suggesting that all of syscall logic of conditionally parsing
> the arguments will be copy-pasted into seccomp-syscall infra, then
> it will do copy_from_user() all the data and replace all aligned_u64
> in "union bpf_attr" with kernel copied pointers instead of user pointers
> and make all of bpf syscall's copy_from_user() actions to be conditional ?
> If seccomp is on, use kernel pointers... if seccomp is off, do copy_from_user?
> And the same idea will be replicated for all syscalls?

This would be done optionally per-syscall. Only syscalls which want to
opt in to such a mechanism (such as clone3 and openat2) would be
affected. Also, bpf is possibly the least-friendly syscall to pick as an
example of these types of filters -- openat2 and clone3 are much simpler
to consider.

The point is that if we both agree that seccomp needs to have a way to
do "deep argument inspection" (filtering based on the struct argument to
a syscall), then some sort of caching mechanism is simply necessary to
solve the problem. Otherwise there's a trivial TOCTOU and seccomp
filtering for such syscalls would be rendered almost useless.
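
To make the TOCTOU concrete, here is a userspace model (not kernel code)
of the two designs. Everything in it is hypothetical: struct how_sketch
stands in for an extensible argument struct, FLAG_FORBIDDEN for whatever
the filter is meant to block, and the racer callback for a second thread
rewriting user memory between the check and the use.

```c
#include <stdint.h>

/* Hypothetical stand-in for an extensible syscall argument struct. */
struct how_sketch {
	uint64_t flags;
};

#define FLAG_FORBIDDEN UINT64_C(0x1)

/* Models another thread rewriting user memory mid-syscall. */
typedef void (*racer_fn)(struct how_sketch *user);

static void set_forbidden(struct how_sketch *user)
{
	user->flags |= FLAG_FORBIDDEN;
}

/* Filter policy: refuse any request with FLAG_FORBIDDEN set. */
static int filter_allows(const struct how_sketch *how)
{
	return !(how->flags & FLAG_FORBIDDEN);
}

/*
 * Racy model: the filter reads user memory, then the syscall reads it
 * again. Returns the flags the syscall acted on, or UINT64_MAX if the
 * filter refused the call.
 */
static uint64_t run_double_fetch(struct how_sketch *user, racer_fn racer)
{
	struct how_sketch seen_by_filter = *user;	/* first fetch */

	if (!filter_allows(&seen_by_filter))
		return UINT64_MAX;
	if (racer)
		racer(user);		/* window for the attacker */
	return user->flags;		/* second fetch */
}

/* Cached model: one copy is taken and shared by filter and syscall. */
static uint64_t run_cached(struct how_sketch *user, racer_fn racer)
{
	struct how_sketch cached = *user;	/* the only fetch */

	if (!filter_allows(&cached))
		return UINT64_MAX;
	if (racer)
		racer(user);		/* too late to matter */
	return cached.flags;
}
```

With set_forbidden as the racer, run_double_fetch() ends up acting on
the forbidden flag even though the filter approved the call, while
run_cached() acts only on what the filter actually inspected.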

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: seccomp feature development

2020-05-19 Thread Aleksa Sarai
On 2020-05-19, Alexei Starovoitov  wrote:
> On Mon, May 18, 2020 at 7:53 PM Aleksa Sarai  wrote:
> >
> > On 2020-05-19, Jann Horn  wrote:
> > > On Mon, May 18, 2020 at 11:05 PM Kees Cook  wrote:
> > > > ## deep argument inspection
> > > >
> > > > Background: seccomp users would like to write filters that traverse
> > > > the user pointers passed into many syscalls, but seccomp can't do this
> > > > dereference for a variety of reasons (mostly involving race conditions
> > > > and rearchitecting the entire kernel syscall and copy_from_user() code
> > > > flows).
> > >
> > > Also, other than for syscall entry, it might be worth thinking about
> > > whether we want to have a special hook into seccomp for io_uring.
> > > io_uring is growing support for more and more syscalls, including
> > > things like openat2, connect, sendmsg, splice and so on, and that list
> > > is probably just going to grow in the future. If people start wanting
> > > to use io_uring in software with seccomp filters, it might be
> > > necessary to come up with some mechanism to prevent io_uring from
> > > permitting access to almost everything else...
> > >
> > > Probably not a big priority for now, but something to keep in mind for
> > > the future.
> >
> > Indeed. Quite a few people have raised concerns about io_uring and its
> > debug-ability, but I agree that another less-commonly-mentioned concern
> > should be how you restrict io_uring(2) from doing operations you've
> > disallowed through seccomp. Though obviously user_notif shouldn't be
> > allowed. :D
> >
> > > > The argument caching bit is, I think, rather mechanical in nature since
> > > > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > > > allocates seccomp_data (maybe going so far as to have it split across
> > > > two pages with the syscall argument struct always starting on the 2nd
> > > > page boundary), and copying the EA struct into that page, which will be
> > > > both used by the filter and by the syscall.
> > >
> > > We could also do the same kind of thing the eBPF verifier does in
> > > convert_ctx_accesses(), and rewrite the context accesses to actually
> > > go through two different pointers depending on the (constant) offset
> > > into seccomp_data.
> >
> > My main worry with this is that we'll need to figure out what kind of
> > offset mathematics are necessary to deal with pointers inside the
> > extensible struct. As a very ugly proposal, you could make it so that
> > you multiply the offset by PAGE_SIZE each time you want to dereference
> > the pointer at that offset (unless we want to add new opcodes to cBPF to
> > allow us to represent this).
> 
> Please don't. cbpf is frozen.

I have an alternative proposal in another mail[1].

> > This might even be needed for seccomp user_notif -- given one of the
> > recent proposals was basically to just add two (extensible) struct
> > pointers inside the main user_notif struct.
> >
> > > > I imagine state tracking ("is
> > > > there a cached EA?", "what is the address of seccomp_data?", "what is
> > > > the address of the EA?") can be associated with the thread struct.
> > >
> > > You probably mean the task struct?
> > >
> > > > The growing size of the EA struct will need some API design. For filters
> > > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > > need to know how large seccomp_data is (more on this later), and how
> > > > large the EA struct is. When the filter is written in userspace, it can
> > > > do the math, point into the expected offsets, and get what it needs. For
> > > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > > needs to know the size of the EA struct as well, so it can correctly
> > > > perform the offset checking (as it currently does for just the
> > > > seccomp_data struct size).
> > > >
> > > > Since there is not really any caller-based "seccomp state" associated
> > > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > > the kernel doesn't track who "I" is besides just being "current", which
> > &

Re: seccomp feature development

2020-05-19 Thread Aleksa Sarai
On 2020-05-19, Aleksa Sarai  wrote:
> On 2020-05-19, Christian Brauner  wrote:
> > On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> > > On 2020-05-18, Kees Cook  wrote:
> > > > - the sizes of these EA structs are, by design, growable over time.
> > > > >   seccomp and its users need to handle this in a forward and backward
> > > >   compatible way, similar to the design of the EA syscall interface
> > > >   itself.
> > > > 
> > > > The growing size of the EA struct will need some API design. For filters
> > > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > > need to know how large seccomp_data is (more on this later), and how
> > > > large the EA struct is. When the filter is written in userspace, it can
> > > > do the math, point into the expected offsets, and get what it needs. For
> > > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > > needs to know the size of the EA struct as well, so it can correctly
> > > > perform the offset checking (as it currently does for just the
> > > > seccomp_data struct size).
> > > > 
> > > > Since there is not really any caller-based "seccomp state" associated
> > > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > > the kernel doesn't track who "I" is besides just being "current", which
> > > > doesn't take into account the thread lifetime -- if a process launcher
> > > > knows about one size and the child knows about another, things will get
> > > > confused. The sizes really are just associated with individual filters,
> > > > based on the syscalls they're examining. So, I have thoughts on possible
> > > > solutions:
> > > > 
> > > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > > >   EA style so we can pass in more than a filter and include also an
> > > >   array of syscall to size mappings. (I don't like this...)
> > > > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > > >   the meaning of the uarg from "filter" to a EA-style structure with
> > > >   sizes and pointers to the filter and an array of syscall to size
> > > >   mappings. (I like this slightly better, but I still don't like it.)
> > > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > > >   the "max offset" value seen during filter verification, and zero-fill
> > > >   the EA struct with zeros to that size when constructing the
> > > > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > > >   care what any of the sizes are. (I like this as it makes the problems
> > > >   to solve contained entirely by the seccomp infrastructure and does not
> > > >   touch user API, but I worry I'm missing some gotcha I haven't
> > > >   considered.)
> > > 
> > > Okay, so here is my view on this. I think that the third option is
> > > closest to what I'd like to see. Based on Jann's email, I think we're on
> > > the same page but I'd just like to elaborate it a bit further:
> > > 
> > > First of all -- ideally, the backward and forward compatibility that EA
> > > syscalls give us should be reflected with seccomp filters being
> > > similarly compatible. Otherwise we're going to run into issues where all
> > > of the hard work with ensuring EA syscalls behave when extended will be
> > > less valuable if seccomp cannot handle it sufficiently. This means that
> > > I would hope that every combination of {old,new} filter/kernel/program
> > > would work on a best-effort (but fail-safe) basis.
> > > 
> > > In my view, the simplest way (from the kernel side) would be to simply
> > > do what you outlined in (3) -- make all accesses past usize (and even
> > > ksize) be zeroed.
> > > 
> > > However in order to make an old filter fail-safe on a new kernel with a
> > > new program, we'd need a new opcode which basically does
> > > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> > > would punt the fail-safe problem to userspace (libseccomp would 

Re: seccomp feature development

2020-05-19 Thread Aleksa Sarai
On 2020-05-19, Christian Brauner  wrote:
> On Tue, May 19, 2020 at 05:09:29PM +1000, Aleksa Sarai wrote:
> > On 2020-05-18, Kees Cook  wrote:
> > > - the sizes of these EA structs are, by design, growable over time.
> > > >   seccomp and its users need to handle this in a forward and backward
> > >   compatible way, similar to the design of the EA syscall interface
> > >   itself.
> > > 
> > > The growing size of the EA struct will need some API design. For filters
> > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > need to know how large seccomp_data is (more on this later), and how
> > > large the EA struct is. When the filter is written in userspace, it can
> > > do the math, point into the expected offsets, and get what it needs. For
> > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > needs to know the size of the EA struct as well, so it can correctly
> > > perform the offset checking (as it currently does for just the
> > > seccomp_data struct size).
> > > 
> > > Since there is not really any caller-based "seccomp state" associated
> > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > the kernel doesn't track who "I" is besides just being "current", which
> > > doesn't take into account the thread lifetime -- if a process launcher
> > > knows about one size and the child knows about another, things will get
> > > confused. The sizes really are just associated with individual filters,
> > > based on the syscalls they're examining. So, I have thoughts on possible
> > > solutions:
> > > 
> > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > >   EA style so we can pass in more than a filter and include also an
> > >   array of syscall to size mappings. (I don't like this...)
> > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > >   the meaning of the uarg from "filter" to a EA-style structure with
> > >   sizes and pointers to the filter and an array of syscall to size
> > >   mappings. (I like this slightly better, but I still don't like it.)
> > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > >   the "max offset" value seen during filter verification, and zero-fill
> > >   the EA struct with zeros to that size when constructing the
> > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > >   care what any of the sizes are. (I like this as it makes the problems
> > >   to solve contained entirely by the seccomp infrastructure and does not
> > >   touch user API, but I worry I'm missing some gotcha I haven't
> > >   considered.)
> > 
> > Okay, so here is my view on this. I think that the third option is
> > closest to what I'd like to see. Based on Jann's email, I think we're on
> > the same page but I'd just like to elaborate it a bit further:
> > 
> > First of all -- ideally, the backward and forward compatibility that EA
> > syscalls give us should be reflected with seccomp filters being
> > similarly compatible. Otherwise we're going to run into issues where all
> > of the hard work with ensuring EA syscalls behave when extended will be
> > less valuable if seccomp cannot handle it sufficiently. This means that
> > I would hope that every combination of {old,new} filter/kernel/program
> > would work on a best-effort (but fail-safe) basis.
> > 
> > In my view, the simplest way (from the kernel side) would be to simply
> > do what you outlined in (3) -- make all accesses past usize (and even
> > ksize) be zeroed.
> > 
> > However in order to make an old filter fail-safe on a new kernel with a
> > new program, we'd need a new opcode which basically does
> > bpf_check_uarg_tail_zero() after a given offset into the EA struct. This
> > would punt the fail-safe problem to userspace (libseccomp would need to
> > generate a check that any unknown-to-the-filter fields are zero). I
> > don't think this is a decision we can make in-kernel because it might be
> > that the filter doesn't care about the last field in a struct (and thus
> > doesn't access it) but we don't know the difference between a field the
> > filter doesn't care about and a field it doesn't know abo

Re: seccomp feature development

2020-05-19 Thread Aleksa Sarai
PI then we can just make the old API have an
implied set of requested-fields that match whatever fields were present
before we added the requested-fields approach.

> If we did a requested-fields approach, what would the user_notif event
> block of bytes look like? Would it be entirely dynamic based on the
> initial ioctl()? Another design consideration here is that we don't want
> the kernel doing tons of work (especially copying) and tossing tons
> of stuff into a huge structure that the user doesn't care about. In
> addition to explicit fields, maybe the EA struct could be included,
> perhaps with specified offset/size, so only the portion the user_notif
> user wanted to inspect was copied?

I would be cautious about adding the EA struct -- there is still the
problem of nested pointers (and it might be even crazier to apply
whatever hacks we have for deep argument inspection into the user_notif
API).

> ## syscall bitmasks

I like this idea. :D

> So how would the API for this work? I have two thoughts, and I don't
> think they're exclusive:
> 
> - new API for "add this syscall to the reject bitmask". We can't really
>   do an "accept" bitmask addition without processing the attached
>   filters...

You could use the accept mask -- take the logical AND of all the
filters' masks, and the resulting set is the syscalls you can skip and
auto-accept.
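
A userspace sketch of that combination step, assuming each attached
filter already has a bitmap of syscalls it unconditionally accepts (how
that bitmap is derived is the open question in this thread). The type
and function names and the 512-entry bound are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_SYSCALLS_SKETCH 512	/* hypothetical upper bound */
#define MASK_WORDS (NR_SYSCALLS_SKETCH / 64)

struct syscall_mask {
	uint64_t bits[MASK_WORDS];
};

/* Mark syscall `nr` as always accepted by one filter. */
static void mask_set(struct syscall_mask *m, unsigned int nr)
{
	if (nr < NR_SYSCALLS_SKETCH)
		m->bits[nr / 64] |= UINT64_C(1) << (nr % 64);
}

static int mask_test(const struct syscall_mask *m, unsigned int nr)
{
	if (nr >= NR_SYSCALLS_SKETCH)
		return 0;
	return (int)((m->bits[nr / 64] >> (nr % 64)) & 1);
}

/*
 * out = AND of every filter's accept mask. A syscall can only skip
 * filtering if all attached filters would accept it unconditionally.
 * With zero filters, every bit stays set.
 */
static void combine_accept(struct syscall_mask *out,
			   const struct syscall_mask *filters,
			   size_t nfilters)
{
	size_t i, w;

	for (w = 0; w < MASK_WORDS; w++) {
		out->bits[w] = UINT64_MAX;
		for (i = 0; i < nfilters; i++)
			out->bits[w] &= filters[i].bits[w];
	}
}
```

Because seccomp filters can only be added and never removed, the
combined mask could be maintained incrementally by ANDing in each new
filter's mask at attach time.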

> - process attached filters! Each time a filter is added, have the
>   BPF verifier do an analysis to determine if there are any static
>   results, and set bits in the various bitmasks to represent it.
>   i.e. when seccomp is first enabled on a thread, the "accept"
>   bitmask is populated with all syscalls, and for each filter, do
>   [math,simulation,magic] and knock each syscall out of "accept" if
>   it isn't always accepted, and further see if there are any syscalls
>   that are always rejected, and mark those in the "reject" bitmask.

I reckon that both approaches (in parallel) are worthwhile, as the
latter proposal will allow for existing filters to become faster while
the former allows new filters to become much smaller.

The only possible issue I see with a deny bitmask is that we'd have to
make sure that filters generated by libseccomp et al still handle
unknown syscalls correctly (so we don't end up with this feature causing
new filters to become black-lists).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: seccomp feature development

2020-05-18 Thread Aleksa Sarai
s consider seccomp_data. If we grow it, the EA struct offset
> > will move, based on the deep arg inspection design above. Alternatively,
> > we could instead put seccomp_data offset 0, and EA struct at offset
> > PAGE_SIZE, and treat seccomp_data itself as an EA struct where we let
> > the filter access whatever it thinks is there, with it being zero-filled
> > by the kernel. For any values where 0 is valid, there will just need to
> > be a "is that field valid?" bit before it:
> >
> > unsigned long feature_bits;
> > unsigned long interesting_thing_1;
> > unsigned long interesting_thing_2;
> > unsigned long interesting_thing_3;
> > ...
> >
> > and the filter would check feature_bits...
> 
> (Apart from the user_notif stuff, those feature bits would not
> actually have to exist in memory; they could be inlined while loading
> the program. Actually, not even the registers would have to exist in a
> seccomp_data struct in memory, we could just replace the loads with
> reads from the pt_regs, too.)
> 
> > (However, this needs to be carefully considered given that seccomp_data
> > is embedded in user_notif... should the EA struct from userspace also be
> > copied into user_notif? More thoughts on this below...)
> >
> > For user_notif, I think we need something in and around these options:
> >
> > - make a new API that explicitly follows EA struct design
> >   (and while read()/write() might be easier[4], I tend to agree with
> >   Jann and we need to stick to ioctl(): as Tycho noted, "read/write is
> >   for data". Though I wonder if read() could be used for the notifications,
> >   which ARE data, and use ioctl() for the responses?)
> 
> Just as a note: If we use read() there, we'll never be able to
> transfer things like FDs through that API.

And we run into the age-old "read() for management can be a bit hairy"
problem.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: How about just O_EXEC? (was Re: [PATCH v5 3/6] fs: Enable to enforce noexec mounts or file exec through O_MAYEXEC)

2020-05-18 Thread Aleksa Sarai
On 2020-05-15, Kees Cook  wrote:
> On Fri, May 15, 2020 at 04:43:37PM +0200, Florian Weimer wrote:
> > * Kees Cook:
> > 
> > > On Fri, May 15, 2020 at 10:43:34AM +0200, Florian Weimer wrote:
> > >> * Kees Cook:
> > >> 
> > > >> > Maybe I've missed some earlier discussion that ruled this out, but I
> > > >> > couldn't find it: let's just add O_EXEC and be done with it. It
> > > >> > actually makes the execve() path more like openat2() and is much
> > > >> > cleaner after a little refactoring. Here are the results, though I
> > > >> > haven't emailed it yet since I still want to do some more testing:
> > >> > https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/o_exec/v1
> > >> 
> > >> I think POSIX specifies O_EXEC in such a way that it does not confer
> > >> read permissions.  This seems incompatible with what we are trying to
> > >> achieve here.
> > >
> > > I was trying to retain this behavior, since we already make this
> > > distinction between execve() and uselib() with the MAY_* flags:
> > >
> > > > execve():
> > > >     struct open_flags open_exec_flags = {
> > > >         .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
> > > >         .acc_mode = MAY_EXEC,
> > > >
> > > > uselib():
> > > >     static const struct open_flags uselib_flags = {
> > > >         .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
> > > >         .acc_mode = MAY_READ | MAY_EXEC,
> > >
> > > I tried to retain this in my proposal, in the O_EXEC does not imply
> > > MAY_READ:
> > 
> > That doesn't quite parse for me, sorry.
> > 
> > The point is that the script interpreter actually needs to *read* those
> > files in order to execute them.
> 
> I think I misunderstood what you meant (Mickaël got me sorted out
> now). If O_EXEC is already meant to be "EXEC and _not_ READ nor WRITE",
> then yes, this new flag can't be O_EXEC. I was reading the glibc
> documentation (which treats it as a permission bit flag, not POSIX,
> which treats it as a complete mode description).

On the other hand, if we had O_EXEC (or O_EXONLY a-la O_RDONLY) then the
interpreter could re-open the file descriptor as O_RDONLY after O_EXEC
succeeds. Not ideal, but I don't think it's a deal-breaker.

Regarding O_MAYEXEC, I do feel a little conflicted.

I do understand that its goal is not to be what O_EXEC was supposed to
be (which is loosely what O_PATH has effectively become), so I think
that this is not really a huge problem -- especially since you could
just do O_MAYEXEC|O_PATH if you wanted to disallow reading explicitly.
It would be nice to have an O_EXONLY concept, but it's several decades
too late to make it mandatory (and making it optional has questionable
utility IMHO).

However, the thing I still feel mildly conflicted about is the sysctl. I
do understand the argument for it (ultimately, whether O_MAYEXEC is
usable on a system depends on the distribution) but it means that any
program which uses O_MAYEXEC cannot rely on it to provide the security
guarantees it expects. Even if the program goes and reads the sysctl
value, it could change underneath it. If this is just meant to be a
best-effort protection then this doesn't matter too much, but I just
feel uneasy about these kinds of best-effort protections.

I do wonder if we could require that fexecve(3) can only be done with
file descriptors that have been opened with O_MAYEXEC (obviously this
would also need to be a sysctl -- *sigh*). This would tie in to some of
the magic-link changes I wanted to push (namely, upgrade_mask).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH] seccomp: Add group_leader pid to seccomp_notif

2020-05-17 Thread Aleksa Sarai
On 2020-05-17, Christian Brauner  wrote:
> Or... And that's more invasive but ultimately cleaner: we v2 the whole
> thing so e.g. SECCOMP_IOCTL_NOTIF_RECV2, SECCOMP_IOCTL_NOTIF_SEND2, and
> embed the size argument in the structs. Userspace sets the size
> argument, we use get_user() to get the size first and then
> copy_struct_from_user() to handle it cleanly based on that. A similar
> model as with sched (has other unrelated quirks because they messed up
> something too):
> 
> static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr)
> {
> 	u32 size;
> 	int ret;
> 
> 	/* Zero the full structure, so that a short copy will be nice: */
> 	memset(attr, 0, sizeof(*attr));
> 
> 	ret = get_user(size, &uattr->size);
> 	if (ret)
> 		return ret;
> 
> 	/* ABI compatibility quirk: */
> 	if (!size)
> 		size = SCHED_ATTR_SIZE_VER0;
> 	if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
> 		goto err_size;
> 
> 	ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
> 	if (ret) {
> 		if (ret == -E2BIG)
> 			goto err_size;
> 		return ret;
> 	}
> 
> We're probably the biggest user of this right now and I'd be ok with
> that change. If it's a v2 then whatever. :)

I'm :+1: on a new version and on switching to copy_struct_from_user(). I
was a little surprised when I found out a while ago that user_notif
doesn't do it this way (and although in theory it is userspace's fault,
ideally we would have an API that doesn't have built-in footguns).
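
To make the size rules concrete, here is a minimal userspace sketch of
what copy_struct_from_user() does with the two sizes. It is a model for
illustration only, not the kernel helper: model_copy_struct() is a
made-up name and it works on plain memory rather than __user pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the copy_struct_from_user() size rules.
 * dst/ksize: the struct as this "kernel" knows it.
 * src/usize: the struct as the caller passed it.
 * Returns 0 on success, or -E2BIG if the caller set bytes we do not know.
 */
int model_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *bytes = src;
	size_t i;

	/* Newer caller, older kernel: unknown trailing bytes must be zero. */
	for (i = size; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Older caller, newer kernel: fields that were not passed read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

An old caller passing a shorter struct gets zeroes in the fields it does
not know about, and a new caller only gets -E2BIG if it actually uses a
field the kernel lacks.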

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH v5 0/6] Add support for O_MAYEXEC

2020-05-06 Thread Aleksa Sarai
On 2020-05-06, Lev R. Oshvang .  wrote:
> On Tue, May 5, 2020 at 6:36 PM Mickaël Salaün  wrote:
> >
> >
> > On 05/05/2020 17:31, Mickaël Salaün wrote:
> > > Hi,
> > >
> > > > This fifth patch series adds new kernel configurations (OMAYEXEC_STATIC,
> > > > OMAYEXEC_ENFORCE_MOUNT, and OMAYEXEC_ENFORCE_FILE) so that the
> > > > security policy can be configured at kernel build time.  As requested by
> > > > Mimi Zohar, I completed the series with one of her patches for IMA.
> > > >
> > > > The goal of this patch series is to allow script execution to be
> > > > controlled with the interpreters' help.  A new O_MAYEXEC flag, usable
> > > > through openat2(2), is added so that userspace script interpreters can
> > > > delegate to the kernel (and thus the system security policy) the
> > > > permission to interpret/execute scripts or other files containing what
> > > > can be seen as commands.
> > >
> > > A simple system-wide security policy can be enforced by the system
> > > administrator through a sysctl configuration consistent with the mount
> > > points or the file access rights.  The documentation patch explains the
> > > prerequisites.
> > >
> > > Furthermore, the security policy can also be delegated to an LSM, either
> > > a MAC system or an integrity system.  For instance, the new kernel
> > > MAY_OPENEXEC flag closes a major IMA measurement/appraisal interpreter
> > > integrity gap by bringing the ability to check the use of scripts [1].
> > > Other uses are expected, such as for openat2(2) [2], SGX integration
> > > [3], bpffs [4] or IPE [5].
> > >
> > > Userspace needs to adapt to take advantage of this new feature.  For
> > > example, the PEP 578 [6] (Runtime Audit Hooks) enables Python 3.8 to be
> > > extended with policy enforcement points related to code interpretation,
> > > which can be used to align with the PowerShell audit features.
> > > > Additional Python security improvements (e.g. a limited interpreter
> > > > without -c, stdin piping of code) are on their way.
> > > >
> > > > The initial idea comes from CLIP OS 4 and the original implementation has
> > > > been used for more than 12 years:
> > > > https://github.com/clipos-archive/clipos4_doc
> > > >
> > > > An introduction to O_MAYEXEC was given at the Linux Security Summit
> > > > Europe 2018 - Linux Kernel Security Contributions by ANSSI:
> > > > https://www.youtube.com/watch?v=chNjCRtPKQY&t=17m15s
> > > > The "write xor execute" principle was explained at Kernel Recipes 2018 -
> > > > CLIP OS: a defense-in-depth OS:
> > > > https://www.youtube.com/watch?v=PjRE0uBtkHU&t=11m14s
> > >
> > > This patch series can be applied on top of v5.7-rc4.  This can be tested
> > > with CONFIG_SYSCTL.  I would really appreciate constructive comments on
> > > this patch series.
> > >
> > > Previous version:
> > > https://lore.kernel.org/lkml/20200428175129.634352-1-...@digikod.net/
> >
> > The previous version (v4) is
> > https://lore.kernel.org/lkml/20200430132320.699508-1-...@digikod.net/
> 
> 
> Hi Michael
> 
> I have a couple of questions.
> 1. Why did you not add O_MAYEXEC to open()?
> Some time ago (around v4.14) open() did not return EINVAL when the
> VALID_OPEN_FLAGS check failed.
> Now it does, so I do not see a problem with the interpreter using a
> simple open() (although the path might be manipulated, the file
> contents will be verified by IMA).

You don't get -EINVAL from open() for unknown flags; within the open*()
family, only openat2() rejects them. That's why the flag is only being
introduced for openat2().
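
A rough model of the difference, in case it helps. The mask value and
both function names are invented for this sketch; the real kernel checks
live in fs/open.c and are more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical set of flag bits this "kernel" understands. */
#define MODEL_KNOWN_FLAGS 0x0000ffffu

/* open(2)-style handling: unknown bits are silently dropped. */
uint32_t model_legacy_open_flags(uint32_t flags)
{
	return flags & MODEL_KNOWN_FLAGS;
}

/* openat2(2)-style handling: any unknown bit is an error. */
int model_strict_open_flags(uint32_t flags, uint32_t *out)
{
	if (flags & ~MODEL_KNOWN_FLAGS)
		return -EINVAL;
	*out = flags;
	return 0;
}
```

With the legacy behaviour an interpreter cannot tell whether an older
kernel honoured a new flag or quietly ignored it, which is the problem
for a security-related flag.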

> 2. When you apply a new flag to a mount, it means that IMA will check
> all files under this mount, regardless of whether the file in
> question is a script or not.
> IMHO that is too much overhead.
> 
> Regards,
> LEv


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH v3 4/7] linux/signal.h: Ignore SIGINFO by default in new tasks

2020-04-30 Thread Aleksa Sarai
On 2020-04-30, Christian Brauner  wrote:
> On Thu, Apr 30, 2020 at 08:53:56AM +0200, Jiri Slaby wrote:
> > On 30. 04. 20, 8:42, Arseny Maslennikov wrote:
> > > This matches the behaviour of other Unix-like systems that have SIGINFO
> > > and causes less harm to processes that do not install handlers for this
> > > signal, making the keyboard status character non-fatal for them.
> > > 
> > > This is implemented with the assumption that SIGINFO is defined
> > > to be equivalent to SIGPWR; still, there is no reason for PWR to
> > > result in termination of the signal recipient anyway — it does not
> > > indicate there is a fatal problem with the recipient's execution
> > > context (like e.g. FPE/ILL do), and we have TERM/KILL for explicit
> > > termination requests.
> > > 
> > > To put it another way:
> > > The only scenario where system behaviour actually changes is when the
> > > signal recipient has default disposition for SIGPWR. If a process
> > > chose to interpret a SIGPWR as an incentive to cleanly terminate, it
> > > would supply its own handler — and this commit does not affect processes
> > > with non-default handlers.
> > > 
> > > Signed-off-by: Arseny Maslennikov 
> > > ---
> > >  include/linux/signal.h | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > > index 05bacd2ab..dc31da8fc 100644
> > > --- a/include/linux/signal.h
> > > +++ b/include/linux/signal.h
> > > @@ -369,7 +369,7 @@ extern bool unhandled_signal(struct task_struct *tsk, 
> > > int sig);
> > >   *   |  SIGSYS/SIGUNUSED  |  coredump|
> > >   *   |  SIGSTKFLT |  terminate   |
> > >   *   |  SIGWINCH  |  ignore  |
> > > - *   |  SIGPWR|  terminate   |
> > > + *   |  SIGPWR|  ignore  |
> > 
> > You need to update signal.7 too:
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man7/signal.7#n285
> 
> (I fail this whole thread via b4 and it appears that a bunch of messages
> are missing on lore. Might just be delay though.)
> 
> How is this not going to break userspace? Just for a start,
> SIGPWR (for better or worse) was used for a long time by some
> sandboxing/container runtimes to shut down a process and still is.

To play Devil's advocate -- pid1 has also always had a default-ignore
signal mask (which included SIGPWR), so any pid1 that obeyed SIGPWR
already had a non-default signal mask (and thus wouldn't be affected by
this patch).

But I do agree that this seems like a strange change to make (SIGPWR
seems like a signal you don't want to ignore by default). Unfortunately
the fact that it appears to always be equal to SIGINFO means that while
SIGINFO (to me at least) seems like it should be a no-op, the necessary
SIGPWR change makes it harder to justify IMHO.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH v3 0/5] Add support for RESOLVE_MAYEXEC

2020-04-29 Thread Aleksa Sarai
On 2020-04-28, Mickaël Salaün  wrote:
> The goal of this patch series is to allow script execution to be
> controlled with the interpreters' help.  A new RESOLVE_MAYEXEC flag,
> usable through openat2(2), is added so that userspace script
> interpreters can delegate to the kernel (and thus the system security
> policy) the permission to interpret/execute scripts or other files
> containing what can be seen as commands.
> 
> This third patch series mainly differs from the previous one by relying
> on the new openat2(2) system call to get rid of the undefined behavior
> of the open(2) flags.  Thus, the previous O_MAYEXEC flag is now replaced
> with the new RESOLVE_MAYEXEC flag and benefits from the openat2(2)
> strict check of this kind of flag.

My only strong upfront objection is with this being a RESOLVE_ flag.

RESOLVE_ flags have a specific meaning (they generally apply to all
components, and affect the rules of path resolution). RESOLVE_MAYEXEC
does neither of these things and so seems out of place among the other
RESOLVE_ flags.

I would argue this should be an O_ flag, but not supported for the
old-style open(2). This is what the O_SPECIFIC_FD patchset does[1] and I
think it's a reasonable way of solving such problems.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [LTP] [fs] ce436509a8: ltp.openat203.fail

2020-04-28 Thread Aleksa Sarai
On 2020-04-28, Cyril Hrubis  wrote:
> Hi!
> > > > commit: ce436509a8e109330c56bb4d8ec87d258788f5f4 ("[PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds")
> > > > url: https://github.com/0day-ci/linux/commits/Josh-Triplett/Support-userspace-selected-fds/20200414-102939
> > > > base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
> > > 
> > > This commit adds fd parameter to the how structure where LTP test was
> > > previously passing garbage, which obviously causes the difference in
> > > errno.
> > > 
> > > This could be safely ignored for now, if the patch gets merged the test
> > > needs to be updated.
> > 
> > It wouldn't be a bad idea to switch the test to figure out the ksize of
> > the struct, so that you only add bad padding after that. But then again,
> > this would be a bit ugly -- having CHECK_FIELDS would make this simpler.
> 
> Any pointers on how the size can be figured out without relying on the
> E2BIG we are testing for? Does the kernel export it somewhere?

No, you would have to effectively binary search on -E2BIG at the moment.
CHECK_FIELDS is a proposal I have which would allow you to get the
size of the in-kernel struct, but it's still a proposal.

In theory you could get the size through BTF, but it's probably more
effort than it's worth to implement that.
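
For what it's worth, the binary search could look roughly like the
sketch below. Everything here is hypothetical: probe_fn stands for "call
the syscall with a struct of usize bytes that is all zeroes except for a
non-zero last byte", and fake_probe_24() is a stand-in for a kernel
whose struct is 24 bytes (a made-up number). Any result other than
-E2BIG is treated as "the kernel knows this byte", since a non-zero byte
inside a known field may well yield -EINVAL instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical probe: errno-style result for a struct of `usize` bytes. */
typedef int (*probe_fn)(size_t usize);

/*
 * Largest size in [lo, hi] that the probe does not reject with -E2BIG.
 * Assumes `lo` itself is accepted and that acceptance is monotonic.
 */
size_t find_ksize(probe_fn probe, size_t lo, size_t hi)
{
	while (lo < hi) {
		/* Round up so that lo = mid always makes progress. */
		size_t mid = lo + (hi - lo + 1) / 2;

		if (probe(mid) == -E2BIG)
			hi = mid - 1;	/* last byte is past the kernel's struct */
		else
			lo = mid;	/* last byte is in a field the kernel knows */
	}
	return lo;
}

/* Stand-in for a kernel whose struct is 24 bytes long. */
int fake_probe_24(size_t usize)
{
	return usize > 24 ? -E2BIG : -EINVAL;
}
```

A real probe would also have to respect whatever minimum and maximum
sizes the syscall enforces, which is part of why this is ugly.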

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



[PATCH v2] cgroup: pids: use atomic64_t for pids->limit

2019-10-16 Thread Aleksa Sarai
Because pids->limit can be changed concurrently (but we don't want to
take a lock because it would be needlessly expensive), use atomic64_ts
instead.

Fixes: commit 49b786ea146f ("cgroup: implement the PIDs subsystem")
Cc: sta...@vger.kernel.org # v4.3+
Signed-off-by: Aleksa Sarai 
---
v2:
  * Switch to atomic64_t instead of using {READ,WRITE}_ONCE().
v1: <https://lore.kernel.org/lkml/20191012010539.6131-1-cyp...@cyphar.com/>

 kernel/cgroup/pids.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c
index 8e513a573fe9..138059eb730d 100644
--- a/kernel/cgroup/pids.c
+++ b/kernel/cgroup/pids.c
@@ -45,7 +45,7 @@ struct pids_cgroup {
 	 * %PIDS_MAX = (%PID_MAX_LIMIT + 1).
 	 */
 	atomic64_t			counter;
-	int64_t				limit;
+	atomic64_t			limit;
 
 	/* Handle for "pids.events" */
 	struct cgroup_file		events_file;
@@ -73,8 +73,8 @@ pids_css_alloc(struct cgroup_subsys_state *parent)
 	if (!pids)
 		return ERR_PTR(-ENOMEM);
 
-	pids->limit = PIDS_MAX;
 	atomic64_set(&pids->counter, 0);
+	atomic64_set(&pids->limit, PIDS_MAX);
 	atomic64_set(&pids->events_limit, 0);
 	return &pids->css;
 }
@@ -146,13 +146,14 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
 
 	for (p = pids; parent_pids(p); p = parent_pids(p)) {
 		int64_t new = atomic64_add_return(num, &p->counter);
+		int64_t limit = atomic64_read(&p->limit);
 
 		/*
 		 * Since new is capped to the maximum number of pid_t, if
 		 * p->limit is %PIDS_MAX then we know that this test will never
 		 * fail.
 		 */
-		if (new > p->limit)
+		if (new > limit)
 			goto revert;
 	}
 
@@ -277,7 +278,7 @@ static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
 	 * Limit updates don't need to be mutex'd, since it isn't
 	 * critical that any racing fork()s follow the new limit.
 	 */
-	pids->limit = limit;
+	atomic64_set(&pids->limit, limit);
 	return nbytes;
 }
 
@@ -285,7 +286,7 @@ static int pids_max_show(struct seq_file *sf, void *v)
 {
 	struct cgroup_subsys_state *css = seq_css(sf);
 	struct pids_cgroup *pids = css_pids(css);
-	int64_t limit = pids->limit;
+	int64_t limit = atomic64_read(&pids->limit);
 
 	if (limit >= PIDS_MAX)
 		seq_printf(sf, "%s\n", PIDS_MAX_STR);
-- 
2.23.0



Re: [PATCH] cgroup: pids: use {READ,WRITE}_ONCE for pids->limit operations

2019-10-16 Thread Aleksa Sarai
On 2019-10-17, Aleksa Sarai  wrote:
> On 2019-10-16, Tejun Heo  wrote:
> > Hello, Aleksa.
> > 
> > On Wed, Oct 16, 2019 at 07:32:19PM +1100, Aleksa Sarai wrote:
> > > Maybe I'm misunderstanding Documentation/atomic_t.txt, but it looks to
> > > me like it's explicitly saying that I shouldn't use atomic64_t if I'm
> > > just using it for fetching and assignment.
> > 
> > Hah, where is it saying that?
> 
> Isn't that what this says:
> 
> > Therefore, if you find yourself only using the Non-RMW operations of
> > atomic_t, you do not in fact need atomic_t at all and are doing it
> > wrong.
> 
> Doesn't using just atomic64_read() and atomic64_set() fall under "only
> using the non-RMW operations of atomic_t"? But yes, I agree that any
> locking is overkill.
> 
> > > As for 64-bit on 32-bit machines -- that is a separate issue, but from
> > > [1] it seems to me like there are more problems that *_ONCE() fixes than
> > > just split reads and writes.
> > 
> > Your explanations are too wishy washy.  If you wanna fix it, please do
> > it correctly.  R/W ONCE isn't the right solution here.
> 
> Sure, I will switch it to use atomic64_read() and atomic64_set() instead
> if that's what you'd prefer. Though I will mention that on quite a few
> architectures atomic64_read() is defined as:
> 
> >   #define atomic64_read(v)	READ_ONCE((v)->counter)

Though I guess that's because on those architectures it turns out that
READ_ONCE is properly atomic?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH] cgroup: pids: use {READ,WRITE}_ONCE for pids->limit operations

2019-10-16 Thread Aleksa Sarai
On 2019-10-16, Tejun Heo  wrote:
> Hello, Aleksa.
> 
> On Wed, Oct 16, 2019 at 07:32:19PM +1100, Aleksa Sarai wrote:
> > Maybe I'm misunderstanding Documentation/atomic_t.txt, but it looks to
> > me like it's explicitly saying that I shouldn't use atomic64_t if I'm
> > just using it for fetching and assignment.
> 
> Hah, where is it saying that?

Isn't that what this says:

> Therefore, if you find yourself only using the Non-RMW operations of
> atomic_t, you do not in fact need atomic_t at all and are doing it
> wrong.

Doesn't using just atomic64_read() and atomic64_set() fall under "only
using the non-RMW operations of atomic_t"? But yes, I agree that any
locking is overkill.

> > As for 64-bit on 32-bit machines -- that is a separate issue, but from
> > [1] it seems to me like there are more problems that *_ONCE() fixes than
> > just split reads and writes.
> 
> Your explanations are too wishy washy.  If you wanna fix it, please do
> it correctly.  R/W ONCE isn't the right solution here.

Sure, I will switch it to use atomic64_read() and atomic64_set() instead
if that's what you'd prefer. Though I will mention that on quite a few
architectures atomic64_read() is defined as:

  #define atomic64_read(v)	READ_ONCE((v)->counter)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH v2] usercopy: Avoid soft lockups in test_check_nonzero_user()

2019-10-16 Thread Aleksa Sarai
On 2019-10-16, Michael Ellerman  wrote:
> On a machine with a 64K PAGE_SIZE, the nested for loops in
> test_check_nonzero_user() can lead to soft lockups, eg:
> 
>   watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [modprobe:611]
>   Modules linked in: test_user_copy(+) vmx_crypto gf128mul crc32c_vpmsum virtio_balloon ip_tables x_tables autofs4
>   CPU: 4 PID: 611 Comm: modprobe Tainted: G L 5.4.0-rc1-gcc-8.2.0-1-gf5a1a536fa14-dirty #1151
>   ...
>   NIP __might_sleep+0x20/0xc0
>   LR  __might_fault+0x40/0x60
>   Call Trace:
> check_zeroed_user+0x12c/0x200
> test_user_copy_init+0x67c/0x1210 [test_user_copy]
> do_one_initcall+0x60/0x340
> do_init_module+0x7c/0x2f0
> load_module+0x2d94/0x30e0
> __do_sys_finit_module+0xc8/0x150
> system_call+0x5c/0x68
> 
> Even with a 4K PAGE_SIZE the test takes multiple seconds. Instead
> tweak it to only scan a 1024 byte region, but make it cross the
> page boundary.
> 
> Fixes: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
> Suggested-by: Aleksa Sarai 
> Signed-off-by: Michael Ellerman 

Thanks Michael.

Reviewed-by: Aleksa Sarai 

> ---
>  lib/test_user_copy.c | 22 +++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> v2: Rework calculation to just use PAGE_SIZE directly.
> Rebase onto Christian's tree.
> 
> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> index ad2372727b1b..5ff04d8fe971 100644
> --- a/lib/test_user_copy.c
> +++ b/lib/test_user_copy.c
> @@ -47,9 +47,25 @@ static bool is_zeroed(void *from, size_t size)
>  static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
>  {
>  	int ret = 0;
> -	size_t start, end, i;
> -	size_t zero_start = size / 4;
> -	size_t zero_end = size - zero_start;
> +	size_t start, end, i, zero_start, zero_end;
> +
> +	if (test(size < 2 * PAGE_SIZE, "buffer too small"))
> +		return -EINVAL;
> +
> +	/*
> +	 * We want to cross a page boundary to exercise the code more
> +	 * effectively. We also don't want to make the size we scan too large,
> +	 * otherwise the test can take a long time and cause soft lockups. So
> +	 * scan a 1024 byte region across the page boundary.
> +	 */
> +	size = 1024;
> +	start = PAGE_SIZE - (size / 2);
> +
> +	kmem += start;
> +	umem += start;
> +
> +	zero_start = size / 4;
> +	zero_end = size - zero_start;
>  
>  	/*
>  	 * We conduct a series of check_nonzero_user() tests on a block of
> -- 
> 2.21.0
> 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH] cgroup: pids: use {READ,WRITE}_ONCE for pids->limit operations

2019-10-16 Thread Aleksa Sarai
On 2019-10-14, Tejun Heo  wrote:
> Hello, Aleksa.
> 
> On Tue, Oct 15, 2019 at 02:59:31AM +1100, Aleksa Sarai wrote:
> > On 2019-10-14, Tejun Heo  wrote:
> > > On Sat, Oct 12, 2019 at 12:05:39PM +1100, Aleksa Sarai wrote:
> > > > Because pids->limit can be changed concurrently (but we don't want to
> > > > take a lock because it would be needlessly expensive), use the
> > > > appropriate memory barriers.
> > > 
> > > I can't quite tell what problem it's fixing.  Can you elaborate a
> > > scenario where the current code would break that your patch fixes?
> > 
> > As far as I can tell, not using *_ONCE() here means that if you had a
> > process changing pids->limit from A to B, a process might be able to
> > temporarily exceed pids->limit -- because pids->limit accesses are not
> > protected by mutexes and the C compiler can produce confusing
> > intermediate values for pids->limit[1].
> >
> > But this is more of a correctness fix than one fixing an actually
> > exploitable bug -- given the kernel memory model work, it seems like a
> > good idea to just use READ_ONCE() and WRITE_ONCE() for shared memory
> > access.
> 
> READ/WRITE_ONCE provides protection against compiler generating
> multiple accesses for a single operation.  It won't prevent split
> writes / reads of 64bit variables on 32bit machines.  For that, you'd
> have to switch them to atomic64_t's.

Maybe I'm misunderstanding Documentation/atomic_t.txt, but it looks to
me like it's explicitly saying that I shouldn't use atomic64_t if I'm
just using it for fetching and assignment.

> The non-RMW ops are (typically) regular LOADs and STOREs and are
> canonically implemented using READ_ONCE(), WRITE_ONCE(),
> smp_load_acquire() and smp_store_release() respectively. Therefore, if
> you find yourself only using the Non-RMW operations of atomic_t, you
> do not in fact need atomic_t at all and are doing it wrong.

As for 64-bit on 32-bit machines -- that is a separate issue, but from
[1] it seems to me like there are more problems that *_ONCE() fixes than
just split reads and writes.

[1]: https://lwn.net/Articles/793253/
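
To illustrate the split read/write case Tejun mentions, here is a small
deterministic model of a 64-bit store done as two 32-bit stores. The
struct and function names are invented for this sketch, and the values
are picked to make the tearing visible rather than taken from pids.c.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value as a 32-bit machine would hold it: two separate words. */
struct split64 {
	uint32_t lo;
	uint32_t hi;
};

uint64_t split64_read(const struct split64 *v)
{
	return ((uint64_t)v->hi << 32) | v->lo;
}

/*
 * Value a reader would see if it ran between the two word stores of an
 * update from old_val to new_val (low word stored first).
 */
uint64_t torn_value(uint64_t old_val, uint64_t new_val)
{
	struct split64 v = { (uint32_t)old_val, (uint32_t)(old_val >> 32) };

	v.lo = (uint32_t)new_val;	/* first store has landed ... */
	return split64_read(&v);	/* ... the second has not yet */
}
```

Going from 0xffffffff to 0x100000000 this way, the reader can observe 0,
which is neither the old nor the new value. Marking the access with
*_ONCE() does not change that when the hardware needs two stores.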

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH] cgroup: pids: use {READ,WRITE}_ONCE for pids->limit operations

2019-10-14 Thread Aleksa Sarai
On 2019-10-14, Tejun Heo  wrote:
> On Sat, Oct 12, 2019 at 12:05:39PM +1100, Aleksa Sarai wrote:
> > Because pids->limit can be changed concurrently (but we don't want to
> > take a lock because it would be needlessly expensive), use the
> > appropriate memory barriers.
> 
> I can't quite tell what problem it's fixing.  Can you elaborate a
> scenario where the current code would break that your patch fixes?

As far as I can tell, not using *_ONCE() here means that if you had a
process changing pids->limit from A to B, a process might be able to
temporarily exceed pids->limit -- because pids->limit accesses are not
protected by mutexes and the C compiler can produce confusing
intermediate values for pids->limit[1].

But this is more of a correctness fix than one fixing an actually
exploitable bug -- given the kernel memory model work, it seems like a
good idea to just use READ_ONCE() and WRITE_ONCE() for shared memory
access.

[1]: https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH] usercopy: Avoid soft lockups in test_check_nonzero_user()

2019-10-12 Thread Aleksa Sarai
On 2019-10-12, Michael Ellerman  wrote:
> Aleksa Sarai  writes:
> > On 2019-10-11, Michael Ellerman  wrote:
> >> On a machine with a 64K PAGE_SIZE, the nested for loops in
> >> test_check_nonzero_user() can lead to soft lockups, eg:
> ...
> >> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> >> index 950ee88cd6ac..9fb6bc609d4c 100644
> >> --- a/lib/test_user_copy.c
> >> +++ b/lib/test_user_copy.c
> >> @@ -47,9 +47,26 @@ static bool is_zeroed(void *from, size_t size)
> >>  static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
> >>  {
> >>  	int ret = 0;
> >> -	size_t start, end, i;
> >> -	size_t zero_start = size / 4;
> >> -	size_t zero_end = size - zero_start;
> >> +	size_t start, end, i, zero_start, zero_end;
> >> +
> >> +	if (test(size < 1024, "buffer too small"))
> >> +		return -EINVAL;
> >> +
> >> +	/*
> >> +	 * We want to cross a page boundary to exercise the code more
> >> +	 * effectively. We assume the buffer we're passed has a page boundary at
> >> +	 * size / 2. We also don't want to make the size we scan too large,
> >> +	 * otherwise the test can take a long time and cause soft lockups. So
> >> +	 * scan a 1024 byte region across the page boundary.
> >> +	 */
> >> +	start = size / 2 - 512;
> >> +	size = 1024;
> >
> > I don't think it's necessary to do "size / 2" here -- you can just use
> > PAGE_SIZE directly and check above that "size == 2*PAGE_SIZE" (not that
> > this check is exceptionally necessary -- since there's only one caller
> > of this function and it's in the same file).
> 
> OK, like this?

Yup -- that looks good. I'll give it a Reviewed-by once you resend it.
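
As a sanity check on the arithmetic, the window placement can be tried
out in isolation. This is a standalone sketch with made-up helper names,
assuming the buffer starts page-aligned and is at least two pages long.

```c
#include <assert.h>
#include <stddef.h>

/* Start offset that centres a `window`-byte scan on the first page boundary. */
size_t window_start(size_t page_size, size_t window)
{
	return page_size - window / 2;
}

/* True if [start, start + window) has bytes on both sides of page_size. */
int crosses_boundary(size_t page_size, size_t start, size_t window)
{
	return start < page_size && start + window > page_size;
}
```

With a 1024 byte window this gives 3584 for 4K pages and 65024 for 64K
pages, so the scanned region stays the same size regardless of
PAGE_SIZE.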

> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> index 950ee88cd6ac..48bc669b2549 100644
> --- a/lib/test_user_copy.c
> +++ b/lib/test_user_copy.c
> @@ -47,9 +47,25 @@ static bool is_zeroed(void *from, size_t size)
>  static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
>  {
>  	int ret = 0;
> -	size_t start, end, i;
> -	size_t zero_start = size / 4;
> -	size_t zero_end = size - zero_start;
> +	size_t start, end, i, zero_start, zero_end;
> +
> +	if (test(size < 2 * PAGE_SIZE, "buffer too small"))
> +		return -EINVAL;
> +
> +	/*
> +	 * We want to cross a page boundary to exercise the code more
> +	 * effectively. We also don't want to make the size we scan too large,
> +	 * otherwise the test can take a long time and cause soft lockups. So
> +	 * scan a 1024 byte region across the page boundary.
> +	 */
> +	size = 1024;
> +	start = PAGE_SIZE - (size / 2);
> +
> +	kmem += start;
> +	umem += start;
> +
> +	zero_start = size / 4;
> +	zero_end = size - zero_start;
>  
>  	/*
>  	 * We conduct a series of check_nonzero_user() tests on a block of memory


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



[PATCH] cgroup: pids: use {READ,WRITE}_ONCE for pids->limit operations

2019-10-11 Thread Aleksa Sarai
Because pids->limit can be changed concurrently (but we don't want to
take a lock because it would be needlessly expensive), use the
appropriate memory barriers.

Fixes: commit 49b786ea146f ("cgroup: implement the PIDs subsystem")
Cc: sta...@vger.kernel.org # v4.3+
Signed-off-by: Aleksa Sarai 
---
 kernel/cgroup/pids.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c
index 8e513a573fe9..a726e4a20177 100644
--- a/kernel/cgroup/pids.c
+++ b/kernel/cgroup/pids.c
@@ -152,7 +152,7 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
 		 * p->limit is %PIDS_MAX then we know that this test will never
 		 * fail.
 		 */
-		if (new > p->limit)
+		if (new > READ_ONCE(p->limit))
 			goto revert;
 	}
 
@@ -277,7 +277,7 @@ static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
 	 * Limit updates don't need to be mutex'd, since it isn't
 	 * critical that any racing fork()s follow the new limit.
 	 */
-	pids->limit = limit;
+	WRITE_ONCE(pids->limit, limit);
 	return nbytes;
 }
 
@@ -285,7 +285,7 @@ static int pids_max_show(struct seq_file *sf, void *v)
 {
 	struct cgroup_subsys_state *css = seq_css(sf);
 	struct pids_cgroup *pids = css_pids(css);
-	int64_t limit = pids->limit;
+	int64_t limit = READ_ONCE(pids->limit);
 
 	if (limit >= PIDS_MAX)
 		seq_printf(sf, "%s\n", PIDS_MAX_STR);
-- 
2.23.0



Re: [PATCH 1/2] clone3: add CLONE3_CLEAR_SIGHAND

2019-10-11 Thread Aleksa Sarai
On 2019-10-11, Michael Kerrisk  wrote:
> Why CLONE3_CLEAR_SIGHAND rather than just CLONE_CLEAR_SIGHAND?

There are no more flag bits left for the classic clone()/clone2() (the
last one was used up by CLONE_PIDFD) -- thus this flag is clone3()-only.
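
Put differently, the flag has to live above the 32-bit range. A tiny
sketch of that constraint follows; the MODEL_* names are local to this
example, and the (1ULL << 32) value for the new flag is an assumption
made for illustration, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_CSIGNAL		0x000000ffULL	/* low byte: exit signal */
#define MODEL_CLONE_PIDFD	0x00001000ULL	/* an existing 32-bit flag */
#define MODEL_CLEAR_SIGHAND	(1ULL << 32)	/* assumed: first bit above 32 */

/* Returns 1 if `flags` survives being passed through a 32-bit flags word. */
int fits_legacy_clone(uint64_t flags)
{
	return (flags >> 32) == 0;
}
```

Truncating such a flag to 32 bits yields 0, so an interface limited to a
32-bit flags word could not even see that it was requested.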

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH] usercopy: Avoid soft lockups in test_check_nonzero_user()

2019-10-10 Thread Aleksa Sarai
On 2019-10-11, Michael Ellerman  wrote:
> On a machine with a 64K PAGE_SIZE, the nested for loops in
> test_check_nonzero_user() can lead to soft lockups, eg:
> 
>   watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [modprobe:611]
>   Modules linked in: test_user_copy(+) vmx_crypto gf128mul crc32c_vpmsum virtio_balloon ip_tables x_tables autofs4
>   CPU: 4 PID: 611 Comm: modprobe Tainted: G L 5.4.0-rc1-gcc-8.2.0-1-gf5a1a536fa14-dirty #1151
>   ...
>   NIP __might_sleep+0x20/0xc0
>   LR  __might_fault+0x40/0x60
>   Call Trace:
> check_zeroed_user+0x12c/0x200
> test_user_copy_init+0x67c/0x1210 [test_user_copy]
> do_one_initcall+0x60/0x340
> do_init_module+0x7c/0x2f0
> load_module+0x2d94/0x30e0
> __do_sys_finit_module+0xc8/0x150
> system_call+0x5c/0x68
> 
> Even with a 4K PAGE_SIZE the test takes multiple seconds. Instead
> tweak it to only scan a 1024 byte region, but make it cross the
> page boundary.
> 
> Fixes: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
> Suggested-by: Aleksa Sarai 
> Signed-off-by: Michael Ellerman 
> ---
>  lib/test_user_copy.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> How does this look? It runs in < 1s on my machine here.
> 
> cheers
> 
> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> index 950ee88cd6ac..9fb6bc609d4c 100644
> --- a/lib/test_user_copy.c
> +++ b/lib/test_user_copy.c
> @@ -47,9 +47,26 @@ static bool is_zeroed(void *from, size_t size)
>  static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
>  {
>  	int ret = 0;
> -	size_t start, end, i;
> -	size_t zero_start = size / 4;
> -	size_t zero_end = size - zero_start;
> +	size_t start, end, i, zero_start, zero_end;
> +
> +	if (test(size < 1024, "buffer too small"))
> +		return -EINVAL;
> +
> +	/*
> +	 * We want to cross a page boundary to exercise the code more
> +	 * effectively. We assume the buffer we're passed has a page boundary at
> +	 * size / 2. We also don't want to make the size we scan too large,
> +	 * otherwise the test can take a long time and cause soft lockups. So
> +	 * scan a 1024 byte region across the page boundary.
> +	 */
> +	start = size / 2 - 512;
> +	size = 1024;

I don't think it's necessary to do "size / 2" here -- you can just use
PAGE_SIZE directly and check above that "size == 2*PAGE_SIZE" (not that
this check is exceptionally necessary -- since there's only one caller
of this function and it's in the same file).

> +
> +	kmem += start;
> +	umem += start;
> +
> +	zero_start = size / 4;
> +	zero_end = size - zero_start;
>  
>  	/*
>  	 * We conduct a series of check_nonzero_user() tests on a block of memory

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH v4 1/4] lib: introduce copy_struct_from_user() helper

2019-10-10 Thread Aleksa Sarai
On 2019-10-10, Michael Ellerman  wrote:
> Aleksa Sarai  writes:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases).
> >
> > While this interface exists for communication in both directions, only
> > one direction has straightforward, reasonable semantics (userspace
> > passing a struct to the kernel). For kernel returns to userspace, the
> > correct semantics (whether there should be an error if userspace is
> > unaware of a new extension) are very syscall-dependent and thus
> > probably cannot be unified between syscalls (a good example of this
> > problem is [1]).
> >
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[2]). Future
> > patches replace common uses of this pattern to make use of
> > copy_struct_from_user().
> >
> > Some in-kernel selftests are included to ensure that alignment and
> > various byte patterns are all handled identically to memchr_inv() usage.
> ...
> > diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> > index 67bcd5dfd847..950ee88cd6ac 100644
> > --- a/lib/test_user_copy.c
> > +++ b/lib/test_user_copy.c
> > @@ -31,14 +31,133 @@
> ...
> > +static int test_check_nonzero_user(char *kmem, char __user *umem, size_t 
> > size)
> > +{
> > +   int ret = 0;
> > +   size_t start, end, i;
> > +   size_t zero_start = size / 4;
> > +   size_t zero_end = size - zero_start;
> > +
> > +   /*
> > +* We conduct a series of check_nonzero_user() tests on a block of 
> > memory
> > +* with the following byte-pattern (trying every possible [start,end]
> > +* pair):
> > +*
> > +*   [ 00 ff 00 ff ... 00 00 00 00 ... ff 00 ff 00 ]
> > +*
> > +* And we verify that check_nonzero_user() acts identically to 
> > memchr_inv().
> > +*/
> > +
> > +   memset(kmem, 0x0, size);
> > +   for (i = 1; i < zero_start; i += 2)
> > +   kmem[i] = 0xff;
> > +   for (i = zero_end; i < size; i += 2)
> > +   kmem[i] = 0xff;
> > +
> > +   ret |= test(copy_to_user(umem, kmem, size),
> > +   "legitimate copy_to_user failed");
> > +
> > +   for (start = 0; start <= size; start++) {
> > +   for (end = start; end <= size; end++) {
> > +   size_t len = end - start;
> > +   int retval = check_zeroed_user(umem + start, len);
> > +   int expected = is_zeroed(kmem + start, len);
> > +
> > +   ret |= test(retval != expected,
> > +   "check_nonzero_user(=%d) != memchr_inv(=%d) 
> > mismatch (start=%zu, end=%zu)",
> > +   retval, expected, start, end);
> > +   }
> > +   }
> 
> This is causing soft lockups for me on powerpc, eg:
> 
>   [  188.208315] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! 
> [modprobe:611]
>   [  188.208782] Modules linked in: test_user_copy(+) vmx_crypto gf128mul 
> crc32c_vpmsum virtio_balloon ip_tables x_tables autofs4
>   [  188.209594] CPU: 4 PID: 611 Comm: modprobe Tainted: G L
> 5.4.0-rc1-gcc-8.2.0-1-gf5a1a536fa14-dirty #1151
>   [  188.210392] NIP:  c0173650 LR: c0379cb0 CTR: 
> c07b20d0
>   [  188.210612] REGS: c000ec213560 TRAP: 0901   Tainted: G L 
> (5.4.0-rc1-gcc-8.2.0-1-gf5a1a536fa14-dirty)
>   [  188.210876] MSR:  80009033   CR: 28222422  
> XER: 2000
>   [  188.211060] CFAR: c0379cac IRQMASK: 0 
>   [  188.211060] GPR00: c0379cb0 c000ec2137f0 c13bbb00 
> c0f527f0 
>   [  188.211060] GPR04: 004b  85f5 
> cfffb780 
>   [  188.211060] GPR08:   c000fb9a3080 
> c00800411478 
>   [  188.211060] GPR12: c07b20d0 cfffb780 
>   [  188.211802] NIP [c0173650] __might_sleep+0x20/0xc0
>   [  188.211924] LR [c0379cb0] __might_fault+0x40/0x60
>   [  188.212037] Call Trace:
>   [  188.212101] [c000ec2137f0] [c01b99b4] 
> vprintk_func+0xc4/0x230 (unreliable)
>   [  188.212274] [c000ec213810] [c07b21fc] 
> check_zeroed_user+0x12c/0x200

Re: [PATCH 3/3] bpf: use copy_struct_from_user() in bpf() syscall

2019-10-10 Thread Aleksa Sarai
On 2019-10-09, Christian Brauner  wrote:
> In v5.4-rc2 we added a new helper (cf. [1]) copy_struct_from_user().
> This helper is intended for all codepaths that copy structs from
> userspace that are versioned by size. The bpf() syscall does exactly
> what copy_struct_from_user() is doing.
> Note that copy_struct_from_user() is calling min() already. So
> technically, the min_t() call could go. But the size is used further
> below so leave it.
> 
> [1]: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
> Signed-off-by: Christian Brauner 

Acked-by: Aleksa Sarai 
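
For readers unfamiliar with the size-versioned struct pattern, the
following is a rough userspace model of the semantics described above. It
is only a sketch: sketch_copy_struct() is a hypothetical name, the "user"
buffer is ordinary memory (so the -EFAULT path cannot occur), and it is not
the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model only: copy a size-versioned struct of usize bytes from
 * src into a destination of ksize bytes.
 */
static int sketch_copy_struct(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	const unsigned char *tail = (const unsigned char *)src + size;
	size_t i;

	if (usize < ksize) {
		/* Older userspace: fields it does not know about read as zero. */
		memset((unsigned char *)dst + size, 0, ksize - size);
	} else {
		/* Newer userspace: unknown trailing fields must all be zero. */
		for (i = 0; i < usize - size; i++)
			if (tail[i] != 0)
				return -E2BIG;
	}

	/* Copy the part both sides understand. */
	memcpy(dst, src, size);
	return 0;
}
```

A smaller struct is zero-extended, a larger struct with a zeroed tail is
truncated, and a larger struct with a non-zero tail is rejected with
-E2BIG.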

> ---
>  kernel/bpf/syscall.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 6f4f9097b1fe..6fdcbdb27501 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2819,14 +2819,11 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user 
> *, uattr, unsigned int, siz
>   if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>   return -EPERM;
>  
> - err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
> - if (err)
> - return err;
>   size = min_t(u32, size, sizeof(attr));
> -
>   /* copy attributes from user space, may be less than sizeof(bpf_attr) */
> - if (copy_from_user(&attr, uattr, size) != 0)
> - return -EFAULT;
> + err = copy_struct_from_user(&attr, sizeof(attr), uattr, size);
> + if (err)
> + return err;
>  
>   err = security_bpf(cmd, &attr, size);
>   if (err < 0)
> -- 
> 2.23.0

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH 2/3] bpf: use copy_struct_from_user() in bpf_prog_get_info_by_fd()

2019-10-10 Thread Aleksa Sarai
On 2019-10-09, Christian Brauner  wrote:
> In v5.4-rc2 we added a new helper (cf. [1]) copy_struct_from_user().
> This helper is intended for all codepaths that copy structs from
> userspace that are versioned by size. bpf_prog_get_info_by_fd() does
> exactly what copy_struct_from_user() is doing.
> Note that copy_struct_from_user() is calling min() already. So
> technically, the min_t() call could go. But the info_len is used further
> below so leave it.
> 
> [1]: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
> Signed-off-by: Christian Brauner 

Acked-by: Aleksa Sarai 

> ---
>  kernel/bpf/syscall.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 78790778f101..6f4f9097b1fe 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2312,13 +2312,10 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
> *prog,
>   u32 ulen;
>   int err;
>  
> - err = bpf_check_uarg_tail_zero(uinfo, sizeof(info), info_len);
> + info_len = min_t(u32, sizeof(info), info_len);
> + err = copy_struct_from_user(&info, sizeof(info), uinfo, info_len);
>   if (err)
>   return err;
> - info_len = min_t(u32, sizeof(info), info_len);
> -
> - if (copy_from_user(&info, uinfo, info_len))
> - return -EFAULT;
>  
>   info.type = prog->type;
>   info.id = prog->aux->id;
> -- 
> 2.23.0

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH 1/3] bpf: use check_zeroed_user() in bpf_check_uarg_tail_zero()

2019-10-10 Thread Aleksa Sarai
On 2019-10-09, Christian Brauner  wrote:
> In v5.4-rc2 we added a new helper (cf. [1]) check_zeroed_user() which
> does what bpf_check_uarg_tail_zero() is doing generically. We're slowly
> switching such codepaths over to use check_zeroed_user() instead of
> using their own hand-rolled version.
> 
> [1]: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
> Signed-off-by: Christian Brauner 

Acked-by: Aleksa Sarai 
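
As a rough illustration of the tail check being converted here, a userspace
model might look like the sketch below. It assumes the helper's convention
is "1 if the range is all zero, 0 if not, negative on error" (the same
convention the selftest compares against is_zeroed()), so a positive return
means the tail is clean. All sketch_* names and SKETCH_PAGE_SIZE are
hypothetical, and plain memory stands in for the user pointer.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's PAGE_SIZE; 4096 is an assumption. */
#define SKETCH_PAGE_SIZE 4096UL

/* Returns 1 if every byte in [buf, buf + len) is zero, 0 otherwise. */
static int sketch_check_zeroed(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i])
			return 0;
	return 1;
}

/*
 * Accept a struct of actual_size bytes when only expected_size bytes are
 * understood, provided the bytes past expected_size are all zero.
 */
static int sketch_check_tail_zero(const unsigned char *uaddr,
				  size_t expected_size, size_t actual_size)
{
	int ret;

	if (actual_size > SKETCH_PAGE_SIZE)	/* silly large */
		return -E2BIG;

	if (actual_size <= expected_size)
		return 0;

	ret = sketch_check_zeroed(uaddr + expected_size,
				  actual_size - expected_size);
	if (ret < 0)
		return ret;

	/* 1 means "tail is zeroed", so only 0 maps to -E2BIG. */
	return ret ? 0 : -E2BIG;
}
```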

> ---
>  kernel/bpf/syscall.c | 22 +++---
>  1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 82eabd4e38ad..78790778f101 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -63,30 +63,22 @@ int bpf_check_uarg_tail_zero(void __user *uaddr,
>size_t expected_size,
>size_t actual_size)
>  {
> - unsigned char __user *addr;
> - unsigned char __user *end;
> - unsigned char val;
> + size_t size = min(expected_size, actual_size);
> + size_t rest = max(expected_size, actual_size) - size;
>   int err;
>  
>   if (unlikely(actual_size > PAGE_SIZE))  /* silly large */
>   return -E2BIG;
>  
> - if (unlikely(!access_ok(uaddr, actual_size)))
> - return -EFAULT;
> -
>   if (actual_size <= expected_size)
>   return 0;
>  
> - addr = uaddr + expected_size;
> - end  = uaddr + actual_size;
> + err = check_zeroed_user(uaddr + expected_size, rest);
> + if (err < 0)
> + return err;
>  
> - for (; addr < end; addr++) {
> - err = get_user(val, addr);
> - if (err)
> - return err;
> - if (val)
> - return -E2BIG;
> - }
> + if (err)
> + return -E2BIG;
>  
>   return 0;
>  }

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall

2019-10-09 Thread Aleksa Sarai
On 2019-10-09, Michael Kerrisk (man-pages)  wrote:
> Hello Aleksa,
> 
> Thanks for this. It's a great piece of documentation work!
> 
> I would prefer the path_resolution(7) piece as a separate patch.

Thanks, and will do.

> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Rather than trying to merge the new syscall documentation into open.2
> > (which would probably result in the man-page being incomprehensible),
> > instead the new syscall gets its own dedicated page with links between
> > open(2) and openat2(2) to avoid duplicating information such as the list
> > of O_* flags or common errors.
> 
> Yes, looking at the size of the proposed openat2(2) page,
> this seems best.
> > 
> > Signed-off-by: Aleksa Sarai 
> > ---
> >  man2/open.2|   5 +
> >  man2/openat2.2 | 381 +
> >  man7/path_resolution.7 |  57 --
> >  3 files changed, 426 insertions(+), 17 deletions(-)
> >  create mode 100644 man2/openat2.2
> > 
> > diff --git a/man2/open.2 b/man2/open.2
> > index 7217fe056e5e..a0b43394bbee 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
> >  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
> >  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
> >  ", mode_t " mode );
> > +.PP
> > +/* Docuented separately, in \fBopenat2\fP(2). */
> 
> Documented
> 
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> >  .fi
> >  .PP
> >  .in -4n
> > @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
> >  .B O_DIRECTORY
> >  is ignored).
> >  .SH SEE ALSO
> > +.BR openat2 (2),
> 
> > Entries here should go in alphabetical order (within
> > sections).
> 
> >  .BR chmod (2),
> >  .BR chown (2),
> >  .BR close (2),
> > diff --git a/man2/openat2.2 b/man2/openat2.2
> > new file mode 100644
> > index ..c43c76046243
> > --- /dev/null
> > +++ b/man2/openat2.2
> > @@ -0,0 +1,381 @@
> > +.\" Copyright (C) 2019 Aleksa Sarai 
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +openat2 \- open and possibly create a file (extended)
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include 
> > +.B #include 
> > +.B #include 
> > +.PP
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> > +.fi
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call; see NOTES.
> > +.SH DESCRIPTION
> > +The
> > +.BR openat2 ()
> > +system call is an extension of
> > +.BR openat (2)
> > +and provides a superset of its functionality. Rather than taking a single
> 
> Please start new sentences on new source lines. I recently added this
> text in man-pages(7):
> 
>Use semantic newlines
>In the source of a manual page, new

Re: [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation

2019-10-09 Thread Aleksa Sarai
On 2019-10-09, Michael Kerrisk (man-pages)  wrote:
> Hello Aleksa,
> 
> You write "5.FOO" in these patches. When do you expect these changes to 
> land in the kernel?

Probably 5.6 (I'd hope for 5.5, but I don't know how the v14 review will
go). I'm not too sure though, and the magic-link changes (plus
O_EMPTYPATH) will probably land after openat2(2) since there is some
remaining work to do.

> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Some of the wording around empty paths in path_resolution(7) also needed
> > to be reworked since it's now legal (if you pass O_EMPTYPATH).
> > 
> > Signed-off-by: Aleksa Sarai 
> > ---
> >  man2/open.2| 42 +-
> >  man7/path_resolution.7 | 17 -
> >  2 files changed, 57 insertions(+), 2 deletions(-)
> > 
> > diff --git a/man2/open.2 b/man2/open.2
> > index b0f485b41589..7217fe056e5e 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -48,7 +48,7 @@
> >  .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
> >  .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
> >  .\"
> > -.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
> > +.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> 
> No need to update the timestamp. I have scripts that handle this
> automatically.
> 
> >  .SH NAME
> >  open, openat, creat \- open and possibly create a file
> >  .SH SYNOPSIS
> > @@ -421,6 +421,21 @@ was followed by a call to
> >  .BR fdatasync (2)).
> >  .IR "See NOTES below" .
> >  .TP
> > +.BR O_EMPTYPATH " (since Linux 5.FOO)"
> > > +If \fIpathname\fP is an empty string, re-open the file descriptor 
> > given as
> 
> In general, I prefer the general form
> 
> .I pathname
> 
> over \fIpathname\fP. 
> 
> > If you would be willing to change that, it would save me a little work.
> (And likewise throughout the rest of the patch.)
> 
> > +the \fIdirfd\fP argument to
> > +.BR openat (2).
> > +This can be used with both ordinary (file and directory) and \fBO_PATH\fP 
> > file
> > +descriptors, but cannot be used with
> > +.BR AT_FDCWD
> > +(or as an argument to plain
> > +.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same 
> > "link
> 
> There's a formatting problem here which can be fixed by inserting a 
> newline before "When".
> 
> > +mode" restrictions apply as with re-opening through
> > +.BR proc (5)
> > +(see
> > +.BR path_resolution "(7) and " symlink (7)
> > +for more details.)
> > +.TP
> >  .B O_EXCL
> >  Ensure that this call creates the file:
> >  if this flag is specified in conjunction with
> > @@ -668,6 +683,13 @@ with
> >  (or via procfs using
> >  .BR AT_SYMLINK_FOLLOW )
> >  even if the file is not a directory.
> > +You can even "re-open" (or upgrade) an
> > +.BR O_PATH
> > +file descriptor by using
> > +.BR O_EMPTYPATH
> > +(see the section for
> > +.BR O_EMPTYPATH
> > +for more details.)
> >  .IP *
> >  Passing the file descriptor to another process via a UNIX domain socket
> >  (see
> > @@ -958,6 +980,15 @@ is not allowed.
> >  (See also
> >  .BR path_resolution (7).)
> >  .TP
> > +.B EBADF
> > +.I pathname
> > +was an empty string (and
> > +.B O_EMPTYPATH
> > +was passed) with
> > +.BR open (2)
> > +(instead of
> > +.BR openat (2).)
> > +.TP
> >  .B EDQUOT
> >  Where
> >  .B O_CREAT
> > @@ -1203,6 +1234,15 @@ The following additional errors can occur for
> >  .I dirfd
> >  is not a valid file descriptor.
> >  .TP
> > +.B EBADF
> > +.I pathname
> > +was an empty string (and
> > +.B O_EMPTYPATH
> > +was passed), but the provided
> > +.I dirfd
> > +was an invalid file descriptor (or was
> > +.BR AT_FDCWD .)
> > +.TP
> >  .B ENOTDIR
> >  .I pathname
> >  is a relative pathname and
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 46f25ec4cdfa..85dd354e9a93 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -22,7 +22,7 @@
> >  .\" the source, must acknowledge the copyright and authors of this work.
> >  .\" %%%LICENSE_END
> >  .\"
> > -.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
> > +.TH PATH_RESOLUTION 7 2019-10-03 "Linux

Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely

2019-10-09 Thread Aleksa Sarai
On 2019-10-09, Michael Kerrisk (man-pages)  wrote:
> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Traditionally, magic-links have not been a well-understood topic in
> > Linux. Given the new changes in their semantics (related to the link
> > mode of trailing magic-links), it seems like a good opportunity to shine
> > more light on magic-links and their semantics.
> > 
> > Signed-off-by: Aleksa Sarai 
> 
> Thanks for doing this. Some comments below.

No problem -- just a heads-up that I'm going to split off the magic-link
changes from the openat2(2) series (there are quite a few things that
need to be done). So I will drop this man page for now.

> > ---
> >  man7/path_resolution.7 | 15 +++
> >  man7/symlink.7 | 39 ++-
> >  2 files changed, 45 insertions(+), 9 deletions(-)
> > 
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 07664ed8faec..46f25ec4cdfa 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -136,6 +136,21 @@ we are just creating it.
> >  The details on the treatment
> >  of the final entry are described in the manual pages of the specific
> >  system calls.
> > +.PP
> > +Since Linux 5.FOO, if the final entry is a "magic-link" (see
> 
> "magic link". As Jann points out, this is more normal English usage.
> 
> > +.BR symlink (7)),
> > +and the user is attempting to
> > +.BR open (2)
> > +it, then there is an additional permission-related restriction applied to 
> > the
> > +operation: the requested access mode must not exceed the "link mode" of the
> > +magic-link (unlike ordinary symlinks, magic-links have their own file 
> > mode.)
> 
> Remove the hyphens (magic link). And also, as someone else pointed out,
> > manual pages fairly consistently use the term "symbolic link"
> (written in full).

Will do.

> You use the term "file mode" here. Do you mean the file permissions bits?

Yes.

> If yes, it is a bit misleading to suggest that symbolic links don't
> have these mode bits. They do, but--as noted in the existing symlink(7)
> manual page text--these bits are ignored. I suggest just removing the
> parenthesized text.

I was trying to say that their file mode can be non-0777 -- but I can
just drop the entire thing.
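
To make the proposed link-mode rule concrete (the requested access mode
must not exceed the link mode), here is a small userspace model. It is a
sketch of the rule as worded in the quoted text, not of any kernel code:
sketch_link_mode_permits() is a hypothetical helper, and consulting only
the owner permission bits is an assumption made for this example.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical model of the proposed rule: an open is permitted only if
 * every access it requests is present in the link mode. Only the owner
 * bits are consulted, which is an assumption of this sketch.
 */
static bool sketch_link_mode_permits(mode_t link_mode, int open_flags)
{
	int acc = open_flags & O_ACCMODE;
	bool want_read = (acc == O_RDONLY || acc == O_RDWR);
	bool want_write = (acc == O_WRONLY || acc == O_RDWR);

	if (want_read && !(link_mode & S_IRUSR))
		return false;
	if (want_write && !(link_mode & S_IWUSR))
		return false;
	return true;
}
```

Under this model a link mode of 0500 permits O_RDONLY but refuses
O_WRONLY and O_RDWR.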

> > +For example, if
> > +.I /proc/[pid]/fd/[num]
> > +has a link mode of
> > +.BR 0500 ,
> > +unprivileged users are not permitted to
> > +.BR open ()
> > +the magic-link for writing.
> >  .SS . and ..
> >  By convention, every directory has the entries "." and "..",
> >  which refer to the directory itself and to its parent directory,
> > diff --git a/man7/symlink.7 b/man7/symlink.7
> > index 9f5bddd5dc21..33f0ec703acd 100644
> > --- a/man7/symlink.7
> > +++ b/man7/symlink.7
> > @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
> >  are outlined here.
> >  It is important that site-local applications also conform to these rules,
> >  so that the user interface can be as consistent as possible.
> > +.SS Magic-links
> > +There is a special class of symlink-like objects known as "magic-links" 
> > which
> 
> "magic links" (and through the rest of the page).
> 
> > +can be found in certain pseudo-filesystems such as
> 
> pseudofilesystems
> 
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Unlike normal symlinks, magic-links are not resolved through
> 
> symbolic links
> 
> > +pathname-expansion, but instead act as direct references to the kernel's 
> > own
> 
> pathname expansion

Will do all of the above.

> > +representation of a file handle. As such, these magic-links allow users to
> > +access files which cannot be referenced with normal paths (such as unlinked
> > +files still referenced by a running program.)
> > +.PP
> > +Because they can bypass ordinary
> > +.BR mount_namespaces (7)-based
> > +restrictions, magic-links have been used as attack vectors in various 
> > exploits.
> > +As such (since Linux 5.FOO), there are additional restrictions placed on 
> > the
> > +re-opening of magic-links (see
> > +.BR path_resolution (7)
> > +for more details.)
> >  .SS Symbolic link ownership, permissions, and timestamps
> >  The owner and group of an existing symbolic link can be changed
> >  using
> > @@ -99,16 +118,18 @@ of a symbolic link can be changed using
> >  or
> >  .BR lutim

Re: [PATCH] proc:fix confusing macro arg name

2019-10-08 Thread Aleksa Sarai
On 2019-10-08, linmiaohe  wrote:
> Add suitable additional cc's as Andrew Morton suggested.
> Get cc list from get_maintainer script:
> [root@localhost mm]# ./scripts/get_maintainer.pl 
> 0001-proc-fix-confusing-macro-arg-name.patch 
> Alexey Dobriyan  (reviewer:PROC FILESYSTEM)
> linux-kernel@vger.kernel.org (open list:PROC FILESYSTEM)
> linux-fsde...@vger.kernel.org (open list:PROC FILESYSTEM)
> 
> --
> From: Miaohe Lin 
> Subject: fix confusing macro arg name
> 
> state_size and ops are in the wrong position, fix it.
> 
> Signed-off-by: Miaohe Lin 
> Reviewed-by: Andrew Morton 
> Cc: Alexey Dobriyan 
> Signed-off-by: Andrew Morton 

Looks reasonable.

Acked-by: Aleksa Sarai 

> ---
> 
>  include/linux/proc_fs.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index a705aa2d03f9..0640be56dcbd 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -58,8 +58,8 @@ extern int remove_proc_subtree(const char *, struct proc_dir_entry *);
>  struct proc_dir_entry *proc_create_net_data(const char *name, umode_t mode,
>   struct proc_dir_entry *parent, const struct seq_operations *ops,
>   unsigned int state_size, void *data);
> -#define proc_create_net(name, mode, parent, state_size, ops) \
> - proc_create_net_data(name, mode, parent, state_size, ops, NULL)
> +#define proc_create_net(name, mode, parent, ops, state_size) \
> + proc_create_net_data(name, mode, parent, ops, state_size, NULL)
>  struct proc_dir_entry *proc_create_net_single(const char *name, umode_t mode,
>   struct proc_dir_entry *parent,
>   int (*show)(struct seq_file *, void *), void *data);
> --
> 2.21.GIT
> 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH RFC 1/3] symlink.7: document magic-links more completely

2019-10-07 Thread Aleksa Sarai
On 2019-10-07, Jann Horn  wrote:
> On Thu, Oct 3, 2019 at 4:56 PM Aleksa Sarai  wrote:
> > Traditionally, magic-links have not been a well-understood topic in
> > Linux. Given the new changes in their semantics (related to the link
> > mode of trailing magic-links), it seems like a good opportunity to shine
> > more light on magic-links and their semantics.
> [...]
> > +++ b/man7/symlink.7
> > @@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
> >  are outlined here.
> >  It is important that site-local applications also conform to these rules,
> >  so that the user interface can be as consistent as possible.
> > +.SS Magic-links
> > +There is a special class of symlink-like objects known as "magic-links" 
> > which
> 
> > I think names like that normally aren't hyphenated in English, and
> instead of "magic-links", it'd be "magic links"? Just like how you
> wouldn't write "symbolic-link", but "symbolic link". But this is
> bikeshedding, and if you disagree, feel free to ignore this comment.

Looking at it now, I think you're right -- I hyphenated it here because
that's how I wrote it when documenting the feature in comments. But I
think that's because "symlink" and "magic-link" (the "abbreviated"
versions) seem to match better than "symlink" and "magic link".

I'll use "magic link" in documentation, but "magic-link" for all cases
where I would normally write "symlink".

> > +can be found in certain pseudo-filesystems such as
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Unlike normal symlinks, magic-links are not resolved through
> 
> nit: AFAICS symlinks are always referred to as "symbolic links"
> throughout the manpages.

:+1:

> > +pathname-expansion, but instead act as direct references to the kernel's 
> > own
> > +representation of a file handle. As such, these magic-links allow users to
> > +access files which cannot be referenced with normal paths (such as unlinked
> > +files still referenced by a running program.)
> 
> Could maybe add "and files in different mount namespaces" as another
> > example here; at least for me, that's the main use case for
> > /proc/*/root.

Will do.
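
For anyone following along, the "unlinked file" case can be demonstrated
from userspace with something like the sketch below. It assumes a Linux
system with /proc mounted and a writable /tmp; sketch_reopen_unlinked() and
the temporary file name are made up for this example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it, then re-open it through its
 * /proc/self/fd/<n> magic link. Returns 0 if the data written through the
 * first descriptor is readable through the second, -1 otherwise.
 */
static int sketch_reopen_unlinked(void)
{
	char path[] = "/tmp/magic-link-sketch-XXXXXX";
	char link[64];
	char buf[6] = { 0 };
	int fd, fd2, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	/* After this, no ordinary pathname refers to the file. */
	if (unlink(path) < 0)
		goto out;
	if (write(fd, "hello", 5) != 5)
		goto out;

	/* The magic link still resolves to the (now nameless) file. */
	snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
	fd2 = open(link, O_RDONLY);
	if (fd2 < 0)
		goto out;

	/* fd2 is a new open file description, so it starts at offset 0. */
	if (read(fd2, buf, 5) == 5 && strcmp(buf, "hello") == 0)
		ret = 0;
	close(fd2);
out:
	close(fd);
	return ret;
}
```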

> [...]
> > +However, magic-links do not follow this rule. They can have a non-0777 
> > mode,
> > +which is used for permission checks when the final
> > +component of an
> > +.BR open (2)'s
> 
> Maybe leave out the "open" part, since the same restriction has to
> also apply to other syscalls operating on files, like truncate() and
> so on?

Yes (though I've just realised I hadn't implemented that -- oops.) Given
how expansive this patchset will get -- I might end up splitting it into
the magic-link stuff (and O_EMPTYPATH) and a separate series for
openat2(2) and the path resolution restrictions.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH] lib: test_user_copy: style cleanup

2019-10-06 Thread Aleksa Sarai
On 2019-10-06, Christian Brauner  wrote:
> On Sun, Oct 06, 2019 at 10:30:28AM +1100, Aleksa Sarai wrote:
> > While writing the tests for copy_struct_from_user(), I used a construct
> > that Linus doesn't appear to be too fond of:
> > 
> > On 2019-10-04, Linus Torvalds  wrote:
> > > Hmm. That code is ugly, both before and after the fix.
> > >
> > > This just doesn't make sense for so many reasons:
> > >
> > > if ((ret |= test(umem_src == NULL, "kmalloc failed")))
> > >
> > > where the insanity comes from
> > >
> > >  - why "|=" when you know that "ret" was zero before (and it had to
> > >be, for the test to make sense)
> > >
> > >  - why do this as a single line anyway?
> > >
> > >  - don't do the stupid "double parenthesis" to hide a warning. Make it
> > >use an actual comparison if you add a layer of parentheses.
> > 
> > So instead, use a bog-standard check that isn't nearly as ugly.
> > 
> > Fixes: 341115822f88 ("usercopy: Add parentheses around assignment in 
> > test_copy_struct_from_user")
> > Fixes: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
> > Signed-off-by: Aleksa Sarai 
> 
> Fwiw, I think the commit message doesn't necessarily need to mention
> stylistic preferences nor a specific mail. It's sufficient enough to say
> that the new way makes things way more obvious. But ok. :)
> 
> I'll pick this up now.

Thanks, and feel free to rewrite the commit message to whatever you'd
prefer.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




[PATCH] lib: test_user_copy: style cleanup

2019-10-05 Thread Aleksa Sarai
While writing the tests for copy_struct_from_user(), I used a construct
that Linus doesn't appear to be too fond of:

On 2019-10-04, Linus Torvalds  wrote:
> Hmm. That code is ugly, both before and after the fix.
>
> This just doesn't make sense for so many reasons:
>
> if ((ret |= test(umem_src == NULL, "kmalloc failed")))
>
> where the insanity comes from
>
>  - why "|=" when you know that "ret" was zero before (and it had to
>be, for the test to make sense)
>
>  - why do this as a single line anyway?
>
>  - don't do the stupid "double parenthesis" to hide a warning. Make it
>use an actual comparison if you add a layer of parentheses.

So instead, use a bog-standard check that isn't nearly as ugly.

Fixes: 341115822f88 ("usercopy: Add parentheses around assignment in 
test_copy_struct_from_user")
Fixes: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
Signed-off-by: Aleksa Sarai 
---
 lib/test_user_copy.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
index e365ace06538..ad2372727b1b 100644
--- a/lib/test_user_copy.c
+++ b/lib/test_user_copy.c
@@ -52,13 +52,14 @@ static int test_check_nonzero_user(char *kmem, char __user 
*umem, size_t size)
size_t zero_end = size - zero_start;
 
/*
-* We conduct a series of check_nonzero_user() tests on a block of 
memory
-* with the following byte-pattern (trying every possible [start,end]
-* pair):
+* We conduct a series of check_nonzero_user() tests on a block of
+* memory with the following byte-pattern (trying every possible
+* [start,end] pair):
 *
 *   [ 00 ff 00 ff ... 00 00 00 00 ... ff 00 ff 00 ]
 *
-* And we verify that check_nonzero_user() acts identically to 
memchr_inv().
+* And we verify that check_nonzero_user() acts identically to
+* memchr_inv().
 */
 
memset(kmem, 0x0, size);
@@ -93,11 +94,13 @@ static int test_copy_struct_from_user(char *kmem, char 
__user *umem,
size_t ksize, usize;
 
umem_src = kmalloc(size, GFP_KERNEL);
-   if ((ret |= test(umem_src == NULL, "kmalloc failed")))
+   ret = test(umem_src == NULL, "kmalloc failed");
+   if (ret)
goto out_free;
 
expected = kmalloc(size, GFP_KERNEL);
-   if ((ret |= test(expected == NULL, "kmalloc failed")))
+   ret = test(expected == NULL, "kmalloc failed");
+   if (ret)
goto out_free;
 
/* Fill umem with a fixed byte pattern. */
-- 
2.23.0



Re: [GIT PULL] usercopy structs for v5.4-rc2

2019-10-04 Thread Aleksa Sarai
On 2019-10-04, Linus Torvalds  wrote:
> On Fri, Oct 4, 2019 at 3:42 AM Christian Brauner
>  wrote:
> >
> > > The only separate fix we had to apply
> > > was for a warning by clang when building the tests for using the result of
> > > an assignment as a condition without parentheses.
> 
> Hmm. That code is ugly, both before and after the fix.
> 
> This just doesn't make sense for so many reasons:
> 
> if ((ret |= test(umem_src == NULL, "kmalloc failed")))
> 
> where the insanity comes from
> 
>  - why "|=" when you know that "ret" was zero before (and it had to
> be, for the test to make sense)
> 
>  - why do this as a single line anyway?
> 
>  - don't do the stupid "double parenthesis" to hide a warning. Make it
> use an actual comparison if you add a layer of parentheses.

You're quite right -- I was mindlessly copying the "ret |=" logic the
rest of test_user_copy.c does without thinking about it. I'll include a
cleanup for it in the openat2(2) series.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH] usercopy: Add parentheses around assignment in test_copy_struct_from_user

2019-10-03 Thread Aleksa Sarai
On 2019-10-03, Christian Brauner  wrote:
> On Thu, Oct 3, 2019, 19:11 Nathan Chancellor 
> wrote:
> 
> > Clang warns:
> >
> > lib/test_user_copy.c:96:10: warning: using the result of an assignment
> > as a condition without parentheses [-Wparentheses]
> > if (ret |= test(umem_src == NULL, "kmalloc failed"))
> > ^~~
> > lib/test_user_copy.c:96:10: note: place parentheses around the
> > assignment to silence this warning
> > if (ret |= test(umem_src == NULL, "kmalloc failed"))
> > ^
> > (  )
> > lib/test_user_copy.c:96:10: note: use '!=' to turn this compound
> > assignment into an inequality comparison
> > if (ret |= test(umem_src == NULL, "kmalloc failed"))
> > ^~
> > !=
> >
> > Add the parentheses as it suggests because this is intentional.
> >
> > Fixes: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
> > Link: https://github.com/ClangBuiltLinux/linux/issues/731
> > Signed-off-by: Nathan Chancellor 
> >
> 
> I'll take this. Aleksa, can I get your ack too, please?
> 
> Acked-by: Christian Brauner 

Acked-by: Aleksa Sarai 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH] Documentation: update about adding syscalls

2019-10-03 Thread Aleksa Sarai
On 2019-10-03, Jonathan Corbet  wrote:
> [Expanding CC a bit; this is the sort of change I'm reluctant to take
> without being sure it reflects what the community thinks.]
> 
> On Wed,  2 Oct 2019 17:14:37 +0200
> Christian Brauner  wrote:
> 
> > Add additional information on how to ensure that syscalls with structure
> > arguments are extensible and add a section about naming conventions to
> > follow when adding revised versions of already existing syscalls.
> > 
> > Co-Developed-by: Aleksa Sarai 
> > Signed-off-by: Aleksa Sarai 
> > Signed-off-by: Christian Brauner 
> > ---
> >  Documentation/process/adding-syscalls.rst | 82 +++
> >  1 file changed, 70 insertions(+), 12 deletions(-)
> > 
> > diff --git a/Documentation/process/adding-syscalls.rst 
> > b/Documentation/process/adding-syscalls.rst
> > index 1c3a840d06b9..93e0221fbb9a 100644
> > --- a/Documentation/process/adding-syscalls.rst
> > +++ b/Documentation/process/adding-syscalls.rst
> > @@ -79,7 +79,7 @@ flags, and reject the system call (with ``EINVAL``) if it 
> > does::
> >  For more sophisticated system calls that involve a larger number of 
> > arguments,
> >  it's preferred to encapsulate the majority of the arguments into a 
> > structure
> >  that is passed in by pointer.  Such a structure can cope with future 
> > extension
> > -by including a size argument in the structure::
> > +by either including a size argument in the structure::
> >  
> >  struct xyzzy_params {
> >  u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) 
> > */
> > @@ -87,20 +87,56 @@ by including a size argument in the structure::
> >  u64 param_2;
> >  u64 param_3;
> >  };
> > +int sys_xyzzy(struct xyzzy_params __user *uarg);
> > +/* in case of -E2BIG, p->size is set to the in-kernel size and thus all
> > +   extensions after that offset are unsupported. */
> 
> That comment kind of threw me for a loop - this is the first mention of
> E2BIG and readers may not just know what's being talked about.  Especially
> since the comment suggests *not* actually returning an error.

I probably could've worded this better -- this comment describes what
userspace sees when they use the API (sched_setattr(2) is an example of
this style of API).

In the case where the kernel doesn't support a requested extension
(usize > ksize, and there are non-zero bytes past ksize) then the kernel
returns -E2BIG *but also* sets p->size to ksize so that userspace knows
what extensions the kernel supports.
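
To make that concrete, here is a small userspace model of the embedded-size
style. It is only an illustration of the behaviour described above, not the
kernel code; struct kparams and model_xyzzy() are names invented for the
sketch, and the "kernel" here only knows about param_1.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Everything this (older) model kernel knows about. */
struct kparams {
	uint32_t size;
	uint32_t param_1;
};

/*
 * uarg points at a userspace struct whose first field is its own size.
 * Returns 0 on success. If userspace set bytes the kernel does not know
 * about, returns -E2BIG and writes the kernel's size back into the
 * userspace size field.
 */
static int model_xyzzy(void *uarg, struct kparams *out)
{
	unsigned char *bytes = uarg;
	uint32_t ksize = sizeof(*out);
	uint32_t usize, i;

	memcpy(&usize, bytes, sizeof(usize));
	if (usize < sizeof(usize))
		return -EINVAL;

	/* usize > ksize: every byte past ksize must be zero. */
	for (i = ksize; i < usize; i++) {
		if (bytes[i] != 0) {
			memcpy(bytes, &ksize, sizeof(ksize));
			return -E2BIG;
		}
	}

	/* usize < ksize: fields userspace did not supply read as zero. */
	memset(out, 0, sizeof(*out));
	memcpy(out, bytes, usize < ksize ? usize : ksize);
	return 0;
}
```

So a newer userspace struct with all of its extension fields zeroed is
accepted, and one with a non-zero extension field gets -E2BIG plus the
kernel's size.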

Maybe I should've replicated more of the details from the kernel-doc for
copy_struct_from_user().

> > -As long as any subsequently added field, say ``param_4``, is designed so 
> > that a
> > -zero value gives the previous behaviour, then this allows both directions 
> > of
> > -version mismatch:
> > +or by including a separate argument that specifies the size::
> >  
> > - - To cope with a later userspace program calling an older kernel, the 
> > kernel
> > -   code should check that any memory beyond the size of the structure that 
> > it
> > -   expects is zero (effectively checking that ``param_4 == 0``).
> > - - To cope with an older userspace program calling a newer kernel, the 
> > kernel
> > -   code can zero-extend a smaller instance of the structure (effectively
> > -   setting ``param_4 = 0``).
> > +struct xyzzy_params {
> > +u32 param_1;
> > +u64 param_2;
> > +u64 param_3;
> > +};
> > +/* userspace sets @usize = sizeof(struct xyzzy_params) */
> > +int sys_xyzzy(struct xyzzy_params __user *uarg, size_t usize);
> > +/* in case of -E2BIG, userspace has to attempt smaller @usize values
> > +   to figure out which extensions are unsupported. */
> 
> Here too.  But what I'm really wondering about now is: you're describing
> different behavior for what are essentially two cases of the same thing.
> Why should the kernel simply accept the smaller size if the size is
> embedded in the structure itself, but return an error and force user space
> to retry if it's a separate argument?
> 
> I guess maybe because in the latter case the kernel can't easily return
> the size it's actually using?  I think that should be explicit if so.

As above, the -E2BIG only happens if userspace is trying to use an
extension that the kernel doesn't support (usize > ksize, non-zero bytes
after ksize). The main difference between the two API styles is whether
or not userspace gets told what ksize is explicitly in the case of an
-E2BIG.
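
For the separate-argument style, the userspace side ends up looking
something like the sketch below. model_sys_xyzzy(), MODEL_KSIZE and
probe_supported_size() are hypothetical names; the "kernel" is modelled in
userspace and is assumed to know only param_1 and param_2.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace's (newer) view of the struct; param_3 is the extension. */
struct xyzzy_params {
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

/* Assumed size of the struct as known to the model kernel. */
#define MODEL_KSIZE offsetof(struct xyzzy_params, param_3)

/* Model syscall: reject non-zero bytes past what the kernel knows. */
static int model_sys_xyzzy(const void *uarg, size_t usize)
{
	const unsigned char *bytes = uarg;
	size_t i;

	for (i = MODEL_KSIZE; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;
	return 0;
}

/*
 * Try the full struct first, then each older layout in turn. Returns the
 * first size the kernel accepts, or 0 if none is accepted.
 */
static size_t probe_supported_size(const struct xyzzy_params *p)
{
	static const size_t sizes[] = {
		sizeof(struct xyzzy_params),
		offsetof(struct xyzzy_params, param_3),
	};
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		if (model_sys_xyzzy(p, sizes[i]) == 0)
			return sizes[i];
	return 0;
}
```

Note that falling back to a smaller size means the extension is silently
not applied, so userspace has to decide whether that is acceptable.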

Maybe it would be less confusing to only mention one of the ways of doing
it, but then we have to pick one (and while the newer syscalls [clone3
and openat2] use a separate argument, there are more syscalls which
embed it in the struct).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH RFC 3/3] openat2.2: document new syscall

2019-10-03 Thread Aleksa Sarai
Ignore this one (it's an older version of the openat2.2 patch) -- I sent
it by accident.

On 2019-10-04, Aleksa Sarai  wrote:
> Signed-off-by: Aleksa Sarai 
> ---
>  man2/open.2|   5 +
>  man2/openat2.2 | 381 +
>  man7/path_resolution.7 |  57 --
>  3 files changed, 426 insertions(+), 17 deletions(-)
>  create mode 100644 man2/openat2.2
> 
> diff --git a/man2/open.2 b/man2/open.2
> index 7217fe056e5e..a0b43394bbee 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
>  .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
>  ", mode_t " mode );
> +.PP
> +/* Documented separately, in \fBopenat2\fP(2). */
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
>  .fi
>  .PP
>  .in -4n
> @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
>  .B O_DIRECTORY
>  is ignored).
>  .SH SEE ALSO
> +.BR openat2 (2),
>  .BR chmod (2),
>  .BR chown (2),
>  .BR close (2),
> diff --git a/man2/openat2.2 b/man2/openat2.2
> new file mode 100644
> index ..c43c76046243
> --- /dev/null
> +++ b/man2/openat2.2
> @@ -0,0 +1,381 @@
> +.\" Copyright (C) 2019 Aleksa Sarai 
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +openat2 \- open and possibly create a file (extended)
> +.SH SYNOPSIS
> +.nf
> +.B #include 
> +.B #include 
> +.B #include 
> +.PP
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR openat2 ()
> +system call is an extension of
> +.BR openat (2)
> +and provides a superset of its functionality. Rather than taking a single
> +.I flag
> +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> +seamless future extensions.
> +.PP
> +.I size
> +must be set to
> +.IR "sizeof(struct open_how)" ,
> +to facilitate future extensions (see the "Extensibility" section of the
> +\fBNOTES\fP for more detail on how extensions are handled.)
> +
> +.SS The open_how structure
> +The following structure indicates how
> +.I pathname
> +should be opened, and acts as a superset of the
> +.IR flag " and " mode
> +arguments to
> +.BR openat (2).
> +.PP
> +.in +4n
> +.EX
> +struct open_how {
> +uint32_t flags;  /* open(2)-style O_* flags. */
> +union {
> +uint16_t mode;   /* File mode bits for new file creation. */
> +uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
> +};
> +uint32_t resolve;/* RESOLVE_* path-resolution flags. */
> +};
> +.EE
> +.in
> +.PP
> +Any future extensions to
> +.BR openat2 ()
> +will be implemented as new fields appended to the above structure, with the
> +zero value of the new fields acting as though the extension were not present.
> +.PP
> +The meaning of each field is as follows:
> +.RS
>

[PATCH RFC 0/3] document openat2(2) patch series

2019-10-03 Thread Aleksa Sarai
This is a first draft of the man-page changes for the openat2(2) patch
series I'm working on[1]. It includes information about the magic-link
changes as well as the primary new features (O_EMPTYPATH and openat2).

Let me know what you think. I might go into too much detail about how
extension of openat2(2) will work -- let me know if that section should
be dropped (while it is useful for userspace to understand, it isn't
really that necessary to explain exactly what the semantics are -- it
will usually just transparently work).

[1]: https://lore.kernel.org/lkml/20190930183316.10190-1-cyp...@cyphar.com/

Aleksa Sarai (3):
  symlink.7: document magic-links more completely
  open.2: add O_EMPTYPATH documentation
  openat2.2: document new openat2(2) syscall

 man2/open.2|  47 -
 man2/openat2.2 | 381 +
 man7/path_resolution.7 |  89 --
 man7/symlink.7 |  39 -
 4 files changed, 528 insertions(+), 28 deletions(-)
 create mode 100644 man2/openat2.2

-- 
2.23.0



[PATCH RFC 3/3] openat2.2: document new syscall

2019-10-03 Thread Aleksa Sarai
Signed-off-by: Aleksa Sarai 
---
 man2/open.2|   5 +
 man2/openat2.2 | 381 +
 man7/path_resolution.7 |  57 --
 3 files changed, 426 insertions(+), 17 deletions(-)
 create mode 100644 man2/openat2.2

diff --git a/man2/open.2 b/man2/open.2
index 7217fe056e5e..a0b43394bbee 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Documented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
 .B O_DIRECTORY
 is ignored).
 .SH SEE ALSO
+.BR openat2 (2),
 .BR chmod (2),
 .BR chown (2),
 .BR close (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 0000..c43c76046243
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,381 @@
+.\" Copyright (C) 2019 Aleksa Sarai 
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include 
+.B #include 
+.B #include 
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality. Rather than taking a single
+.I flag
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+seamless future extensions.
+.PP
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+\fBNOTES\fP for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flag " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+uint32_t flags;  /* open(2)-style O_* flags. */
+union {
+uint16_t mode;   /* File mode bits for new file creation. */
+uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
+};
+uint32_t resolve;/* RESOLVE_* path-resolution flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation. All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+.RE
+
+.I upgrade_mask
+.RS
+Restrict with which
+.I access modes
+the returned
+.B O_PATH
+descriptor may be re-opened (either through
+.B O_EMPTYPATH
+or
+.IR /proc/self/fd/ .)
+This field may only be set to a non-zero value if
+.I flags
+contains
+.BR O_PATH .
+By default, an
+.B O_PATH
+file descriptor of an ordinary file may be re-opened with any access mode (but an
+.B O_PATH
+file descriptor of a magic-link may only be re-opened with access modes that
+the original magic-link possessed). The full list of
+.I upgrade_mask
+flags is

[PATCH RFC 3/3] openat2.2: document new openat2(2) syscall

2019-10-03 Thread Aleksa Sarai
Rather than trying to merge the new syscall documentation into open.2
(which would probably result in the man-page being incomprehensible),
instead the new syscall gets its own dedicated page with links between
open(2) and openat2(2) to avoid duplicating information such as the list
of O_* flags or common errors.

Signed-off-by: Aleksa Sarai 
---
 man2/open.2|   5 +
 man2/openat2.2 | 381 +
 man7/path_resolution.7 |  57 --
 3 files changed, 426 insertions(+), 17 deletions(-)
 create mode 100644 man2/openat2.2

diff --git a/man2/open.2 b/man2/open.2
index 7217fe056e5e..a0b43394bbee 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Documented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
 .B O_DIRECTORY
 is ignored).
 .SH SEE ALSO
+.BR openat2 (2),
 .BR chmod (2),
 .BR chown (2),
 .BR close (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 0000..c43c76046243
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,381 @@
+.\" Copyright (C) 2019 Aleksa Sarai 
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include 
+.B #include 
+.B #include 
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality. Rather than taking a single
+.I flag
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+seamless future extensions.
+.PP
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+\fBNOTES\fP for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flag " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+uint32_t flags;  /* open(2)-style O_* flags. */
+union {
+uint16_t mode;   /* File mode bits for new file creation. */
+uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
+};
+uint32_t resolve;/* RESOLVE_* path-resolution flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation. All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+.RE
+
+.I upgrade_mask
+.RS
+Restrict with which
+.I access modes
+the returned
+.B O_PATH
+descriptor may be re-opened (either through
+.B O_EMPTYPATH
+or
+.IR /proc/self/fd/ .)
+This field may only be set to a non-zero value if

[PATCH RFC 1/3] symlink.7: document magic-links more completely

2019-10-03 Thread Aleksa Sarai
Traditionally, magic-links have not been a well-understood topic in
Linux. Given the new changes in their semantics (related to the link
mode of trailing magic-links), it seems like a good opportunity to shine
more light on magic-links and their semantics.

Signed-off-by: Aleksa Sarai 
---
 man7/path_resolution.7 | 15 +++
 man7/symlink.7 | 39 ++-
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 07664ed8faec..46f25ec4cdfa 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -136,6 +136,21 @@ we are just creating it.
 The details on the treatment
 of the final entry are described in the manual pages of the specific
 system calls.
+.PP
+Since Linux 5.FOO, if the final entry is a "magic-link" (see
+.BR symlink (7)),
+and the user is attempting to
+.BR open (2)
+it, then there is an additional permission-related restriction applied to the
+operation: the requested access mode must not exceed the "link mode" of the
+magic-link (unlike ordinary symlinks, magic-links have their own file mode.)
+For example, if
+.I /proc/[pid]/fd/[num]
+has a link mode of
+.BR 0500 ,
+unprivileged users are not permitted to
+.BR open ()
+the magic-link for writing.
 .SS . and ..
 By convention, every directory has the entries "." and "..",
 which refer to the directory itself and to its parent directory,
diff --git a/man7/symlink.7 b/man7/symlink.7
index 9f5bddd5dc21..33f0ec703acd 100644
--- a/man7/symlink.7
+++ b/man7/symlink.7
@@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
 are outlined here.
 It is important that site-local applications also conform to these rules,
 so that the user interface can be as consistent as possible.
+.SS Magic-links
+There is a special class of symlink-like objects known as "magic-links" which
+can be found in certain pseudo-filesystems such as
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Unlike normal symlinks, magic-links are not resolved through
+pathname-expansion, but instead act as direct references to the kernel's own
+representation of a file handle. As such, these magic-links allow users to
+access files which cannot be referenced with normal paths (such as unlinked
+files still referenced by a running program.)
+.PP
+Because they can bypass ordinary
+.BR mount_namespaces (7)-based
+restrictions, magic-links have been used as attack vectors in various exploits.
+As such (since Linux 5.FOO), there are additional restrictions placed on the
+re-opening of magic-links (see
+.BR path_resolution (7)
+for more details.)
 .SS Symbolic link ownership, permissions, and timestamps
 The owner and group of an existing symbolic link can be changed
 using
@@ -99,16 +118,18 @@ of a symbolic link can be changed using
 or
 .BR lutimes (3).
 .PP
-On Linux, the permissions of a symbolic link are not used
-in any operations; the permissions are always
-0777 (read, write, and execute for all user categories),
 .\" Linux does not currently implement an lchmod(2).
-and can't be changed.
-(Note that there are some "magic" symbolic links in the
-.I /proc
-directory tree\(emfor example, the
-.IR /proc/[pid]/fd/*
-files\(emthat have different permissions.)
+On Linux, the permissions of an ordinary symbolic link are not used in any
+operations; the permissions are always 0777 (read, write, and execute for all
+user categories), and can't be changed.
+.PP
+However, magic-links do not follow this rule. They can have a non-0777 mode,
+which is used for permission checks when the final
+component of an
+.BR open (2)'s
+path is a magic-link (see
+.BR path_resolution (7).)
+
 .\"
 .\" The
 .\" 4.4BSD
-- 
2.23.0



[PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation

2019-10-03 Thread Aleksa Sarai
Some of the wording around empty paths in path_resolution(7) also needed
to be reworked since it's now legal (if you pass O_EMPTYPATH).
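
For comparison, the existing procfs route that O_EMPTYPATH is approximately
equivalent to can be sketched as follows. This deliberately does not use
O_EMPTYPATH (which is only proposed in this series); reopen_fd() is a
made-up helper name, and the sketch assumes procfs is mounted at /proc.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Re-open an existing file descriptor with new flags by opening its
 * /proc/self/fd magic-link. Returns a new descriptor, or -1 with errno
 * set (for instance if /proc is not mounted).
 */
static int reopen_fd(int fd, int flags)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, flags);
}
```

The returned descriptor refers to a new open file description, so it has
its own file offset and status flags.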

Signed-off-by: Aleksa Sarai 
---
 man2/open.2| 42 +-
 man7/path_resolution.7 | 17 -
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/man2/open.2 b/man2/open.2
index b0f485b41589..7217fe056e5e 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -48,7 +48,7 @@
 .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
 .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
 .\"
-.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
+.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"
 .SH NAME
 open, openat, creat \- open and possibly create a file
 .SH SYNOPSIS
@@ -421,6 +421,21 @@ was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.BR O_EMPTYPATH " (since Linux 5.FOO)"
+If \fIpathname\fP is an empty string, re-open the file descriptor given as
+the \fIdirfd\fP argument to
+.BR openat (2).
+This can be used with both ordinary (file and directory) and \fBO_PATH\fP file
+descriptors, but cannot be used with
+.BR AT_FDCWD
+(or as an argument to plain
+.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same "link
+mode" restrictions apply as with re-opening through
+.BR proc (5)
+(see
+.BR path_resolution "(7) and " symlink (7)
+for more details.)
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -668,6 +683,13 @@ with
 (or via procfs using
 .BR AT_SYMLINK_FOLLOW )
 even if the file is not a directory.
+You can even "re-open" (or upgrade) an
+.BR O_PATH
+file descriptor by using
+.BR O_EMPTYPATH
+(see the section for
+.BR O_EMPTYPATH
+for more details.)
 .IP *
 Passing the file descriptor to another process via a UNIX domain socket
 (see
@@ -958,6 +980,15 @@ is not allowed.
 (See also
 .BR path_resolution (7).)
 .TP
+.B EBADF
+.I pathname
+was an empty string (and
+.B O_EMPTYPATH
+was passed) with
+.BR open (2)
+(instead of
+.BR openat (2).)
+.TP
 .B EDQUOT
 Where
 .B O_CREAT
@@ -1203,6 +1234,15 @@ The following additional errors can occur for
 .I dirfd
 is not a valid file descriptor.
 .TP
+.B EBADF
+.I pathname
+was an empty string (and
+.B O_EMPTYPATH
+was passed), but the provided
+.I dirfd
+was an invalid file descriptor (or was
+.BR AT_FDCWD .)
+.TP
 .B ENOTDIR
 .I pathname
 is a relative pathname and
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 46f25ec4cdfa..85dd354e9a93 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -22,7 +22,7 @@
 .\" the source, must acknowledge the copyright and authors of this work.
 .\" %%%LICENSE_END
 .\"
-.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
+.TH PATH_RESOLUTION 7 2019-10-03 "Linux" "Linux Programmer's Manual"
 .SH NAME
 path_resolution \- how a pathname is resolved to a file
 .SH DESCRIPTION
@@ -198,6 +198,21 @@ successfully.
 Linux returns
 .B ENOENT
 in this case.
+.PP
+As of Linux 5.FOO, an empty path argument can be used to "re-open"
+an existing file descriptor if
+.B O_EMPTYPATH
+is passed as a flag argument to
+.BR openat (2),
+with the
+.I dirfd
+argument indicating which file descriptor to "re-open". This is approximately
+equivalent to opening
+.I /proc/self/fd/$fd
+where
+.I $fd
+is the open file descriptor to be "re-opened".
+
 .SS Permissions
 The permission bits of a file consist of three groups of three bits; see
 .BR chmod (1)
-- 
2.23.0



[PATCH v4 3/4] sched_setattr: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

Reviewed-by: Kees Cook 
Signed-off-by: Aleksa Sarai 
---
 kernel/sched/core.c | 43 +++
 1 file changed, 7 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f64d0e..dd05a378631a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5106,9 +5106,6 @@ static int sched_copy_attr(struct sched_attr __user 
*uattr, struct sched_attr *a
u32 size;
int ret;
 
-   if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0))
-   return -EFAULT;
-
/* Zero the full structure, so that a short copy will be nice: */
memset(attr, 0, sizeof(*attr));
 
@@ -5116,45 +5113,19 @@ static int sched_copy_attr(struct sched_attr __user 
*uattr, struct sched_attr *a
if (ret)
return ret;
 
-   /* Bail out on silly large: */
-   if (size > PAGE_SIZE)
-   goto err_size;
-
/* ABI compatibility quirk: */
if (!size)
size = SCHED_ATTR_SIZE_VER0;
-
-   if (size < SCHED_ATTR_SIZE_VER0)
+   if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
@@ -5354,7 +5325,7 @@ sched_attr_copy_to_user(struct sched_attr __user *uattr,
  * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
- * @usize: sizeof(attr) that user-space knows about, for forwards and 
backwards compatibility.
+ * @usize: sizeof(attr) for fwd/bwd comp.
  * @flags: for future extension.
  */
 SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-- 
2.23.0



[PATCH v4 4/4] perf_event_open: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Reviewed-by: Kees Cook 
Signed-off-by: Aleksa Sarai 
---
 kernel/events/core.c | 47 +---
 1 file changed, 9 insertions(+), 38 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4655adbbae10..3f0cb82e4fbc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10586,55 +10586,26 @@ static int perf_copy_attr(struct perf_event_attr 
__user *uattr,
u32 size;
int ret;
 
-   if (!access_ok(uattr, PERF_ATTR_SIZE_VER0))
-   return -EFAULT;
-
-   /*
-* zero the full structure, so that a short copy will be nice.
-*/
+   /* Zero the full structure, so that a short copy will be nice. */
memset(attr, 0, sizeof(*attr));
 
ret = get_user(size, &uattr->size);
if (ret)
return ret;
 
-   if (size > PAGE_SIZE)   /* silly large */
-   goto err_size;
-
-   if (!size)  /* abi compat */
+   /* ABI compatibility quirk: */
+   if (!size)
size = PERF_ATTR_SIZE_VER0;
-
-   if (size < PERF_ATTR_SIZE_VER0)
+   if (size < PERF_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
attr->size = size;
 
if (attr->__reserved_1)
-- 
2.23.0



[PATCH v4 2/4] clone3: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.

Reviewed-by: Kees Cook 
Signed-off-by: Aleksa Sarai 
---
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 34 +++---
 2 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..0945805982b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,6 +47,8 @@ struct clone_args {
__aligned_u64 tls;
 };
 
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..2ef529869c64 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2525,39 +2525,19 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
unsigned long, newsp,
 #ifdef __ARCH_WANT_SYS_CLONE3
 noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
  struct clone_args __user *uargs,
- size_t size)
+ size_t usize)
 {
+   int err;
struct clone_args args;
 
-   if (unlikely(size > PAGE_SIZE))
+   if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
-
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
return -EINVAL;
 
-   if (unlikely(!access_ok(uargs, size)))
-   return -EFAULT;
-
-   if (size > sizeof(struct clone_args)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uargs + sizeof(struct clone_args);
-   end = (void __user *)uargs + size;
-
-   for (; addr < end; addr++) {
-   if (get_user(val, addr))
-   return -EFAULT;
-   if (val)
-   return -E2BIG;
-   }
-
-   size = sizeof(struct clone_args);
-   }
-
-   if (copy_from_user(&args, uargs, size))
-   return -EFAULT;
+   err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+   if (err)
+   return err;
 
/*
 * Verify that higher 32bits of exit_signal are unset and that
-- 
2.23.0



[PATCH v4 1/4] lib: introduce copy_struct_from_user() helper

2019-09-30 Thread Aleksa Sarai
A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

While this pattern is used for communication in both directions, only
one direction has straightforwardly reasonable semantics (userspace
passing a struct to the kernel). For kernel returns to userspace, the
correct semantics (for instance, whether there should be an error if
userspace is unaware of a new extension) are very syscall-dependent and
thus probably cannot be unified between syscalls (a good example of this
problem is [1]).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[2]). Future
patches replace common uses of this pattern to make use of
copy_struct_from_user().

Some in-kernel selftests are included, which ensure that alignment and
various byte patterns are all handled identically to memchr_inv() usage.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
---
 include/linux/bitops.h  |   7 +++
 include/linux/uaccess.h |  70 +
 lib/strnlen_user.c  |   8 +--
 lib/test_user_copy.c| 136 ++--
 lib/usercopy.c  |  55 
 5 files changed, 263 insertions(+), 13 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index cf074bce3eb3..c94a9ff9f082 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+/* Set bits in the first 'n' bytes when loaded from memory */
+#ifdef __LITTLE_ENDIAN
+#  define aligned_byte_mask(n) ((1UL << 8*(n))-1)
+#else
+#  define aligned_byte_mask(n) (~0xffUL << (BITS_PER_LONG - 8 - 8*(n)))
+#endif
+
 #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
 #define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 70bbdc38dc37..8abbc713f7fb 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -231,6 +231,76 @@ __copy_from_user_inatomic_nocache(void *to, const void 
__user *from,
 
 #endif /* ARCH_HAS_NOCACHE_UACCESS */
 
+extern int check_zeroed_user(const void __user *from, size_t size);
+
+/**
+ * copy_struct_from_user: copy a struct from userspace
+ * @dst:   Destination address, in kernel space. This buffer must be @ksize
+ * bytes long.
+ * @ksize: Size of @dst struct.
+ * @src:   Source address, in userspace.
+ * @usize: (Alleged) size of @src struct.
+ *
+ * Copies a struct from userspace to kernel space, in a way that guarantees
+ * backwards-compatibility for struct syscall arguments (as long as future
+ * struct extensions are made such that all new fields are *appended* to the
+ * old struct, and zeroed-out new fields have the same meaning as the old
+ * struct).
+ *
+ * @ksize is just sizeof(*dst), and @usize should've been passed by userspace.
+ * The recommended usage is something like the following:
+ *
+ *   SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
+ *   {
+ *  int err;
+ *  struct foo karg = {};
+ *
+ *  if (usize > PAGE_SIZE)
+ *return -E2BIG;
+ *  if (usize < FOO_SIZE_VER0)
+ *return -EINVAL;
+ *
+ *  err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
+ *  if (err)
+ *return err;
+ *
+ *  // ...
+ *   }
+ *
+ * There are three cases to consider:
+ *  * If @usize == @ksize, then it's copied verbatim.
+ *  * If @usize < @ksize, then the userspace has passed an old struct to a
+ *newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
+ *are to be zero-filled.
+ *  * If @usize > @ksize, then the userspace has passed a new struct to an
+ *older kernel. The trailing bytes unknown to the kernel (@usize - @ksize)
+ *are checked to ensure they are zeroed, otherwise -E2BIG is returned.
+ *
+ * Returns (in all cases, some data may have been copied):
+ *  * -E2BIG:  (@usize > @ksize) and there are non-zero trailing bytes in @src.
+ *  * -EFAULT: access to userspace failed.
+ */
+static __always_inline
+int copy_struct_from_user(void *dst, size_t ksize,
+ const void __user *src, size_t usize)
+{
+   size_t size = min(ksize, usize);
+   size_t rest = max(ksize, usize) - size;
+
+   /* Deal with trailing bytes. */
+   if (usize < ksize) {
+   memset(dst + 

[PATCH v4 0/4] lib: introduce copy_struct_from_user() helper

2019-09-30 Thread Aleksa Sarai
Patch changelog:
 v4:
  * __always_inline copy_struct_from_user(). [Kees Cook]
  * Rework test_user_copy.ko changes. [Kees Cook]
 v3: <https://lore.kernel.org/lkml/20190930182810.6090-1-cyp...@cyphar.com/>
 <https://lore.kernel.org/lkml/20190930191526.19544-1-asa...@suse.de/>
 v2: <https://lore.kernel.org/lkml/20190925230332.18690-1-cyp...@cyphar.com/>
 v1: <https://lore.kernel.org/lkml/20190925165915.8135-1-cyp...@cyphar.com/>

This series was split off from the openat2(2) syscall discussion[1].
However, the copy_struct_to_user() helper has been dropped, because
after some discussion it appears that there are no really obvious
semantics for how copy_struct_to_user() should work across mixed vintages
(for instance, whether [2] has the correct semantics for all syscalls).

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[3]). This series
implements the helper and ports several syscalls to use it.
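
To make the intended semantics concrete, here is a minimal userspace
sketch of the three size cases. It is only a model under stated
assumptions: plain pointers stand in for __user memory (so there is no
-EFAULT path), and model_copy_struct(), struct foo_v0 and struct foo_v1
are hypothetical names used purely for illustration, not the helper
added by this series.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct versions: v1 appends one field to v0. */
struct foo_v0 { unsigned long long flags; };
struct foo_v1 { unsigned long long flags; unsigned long long extra; };

/*
 * Userspace model of the copy semantics (an assumption-laden sketch):
 * ordinary memory replaces __user memory, so no fault handling is shown.
 */
int model_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	const unsigned char *tail = (const unsigned char *)src + size;
	size_t i;

	if (usize < ksize) {
		/* Old userspace, newer kernel: zero-fill the unknown fields. */
		memset((unsigned char *)dst + size, 0, ksize - size);
	} else if (usize > ksize) {
		/* Newer userspace, older kernel: the extra bytes must be zero. */
		for (i = 0; i < usize - ksize; i++)
			if (tail[i])
				return -E2BIG;
	}

	/* Copy the part that both sides understand. */
	memcpy(dst, src, size);
	return 0;
}
```

With this model, passing a struct foo_v1 whose extra field is zero to a
"kernel" that only knows struct foo_v0 succeeds, while a non-zero extra
field is rejected with -E2BIG.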

Some in-kernel selftests are included in this patch. More complete
self-tests for copy_struct_from_user() are included in the openat2()
patchset.

[1]: https://lore.kernel.org/lkml/20190904201933.10736-1-cyp...@cyphar.com/

[2]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[3]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Aleksa Sarai (4):
  lib: introduce copy_struct_from_user() helper
  clone3: switch to copy_struct_from_user()
  sched_setattr: switch to copy_struct_from_user()
  perf_event_open: switch to copy_struct_from_user()

 include/linux/bitops.h |   7 ++
 include/linux/uaccess.h|  70 +++
 include/uapi/linux/sched.h |   2 +
 kernel/events/core.c   |  47 +++--
 kernel/fork.c  |  34 ++
 kernel/sched/core.c|  43 ++--
 lib/strnlen_user.c |   8 +--
 lib/test_user_copy.c   | 136 +++--
 lib/usercopy.c |  55 +++
 9 files changed, 288 insertions(+), 114 deletions(-)

-- 
2.23.0



Re: [PATCH RESEND v3 2/4] clone3: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
On 2019-09-30, Kees Cook  wrote:
> On Tue, Oct 01, 2019 at 05:15:24AM +1000, Aleksa Sarai wrote:
> > From: Aleksa Sarai 
> > 
> > The change is very straightforward, and helps unify the syscall
> > interface for struct-from-userspace syscalls. Additionally, explicitly
> > define CLONE_ARGS_SIZE_VER0 to match the other users of the
> > struct-extension pattern.
> > 
> > Signed-off-by: Aleksa Sarai 
> > ---
> >  include/uapi/linux/sched.h |  2 ++
> >  kernel/fork.c  | 34 +++---
> >  2 files changed, 9 insertions(+), 27 deletions(-)
> > 
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index b3105ac1381a..0945805982b4 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -47,6 +47,8 @@ struct clone_args {
> > __aligned_u64 tls;
> >  };
> >  
> > +#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> > +
> >  /*
> >   * Scheduling policies
> >   */
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index f9572f416126..2ef529869c64 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2525,39 +2525,19 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
> > unsigned long, newsp,
> >  #ifdef __ARCH_WANT_SYS_CLONE3
> >  noinline static int copy_clone_args_from_user(struct kernel_clone_args 
> > *kargs,
> >   struct clone_args __user *uargs,
> > - size_t size)
> > + size_t usize)
> >  {
> > +   int err;
> > struct clone_args args;
> >  
> > -   if (unlikely(size > PAGE_SIZE))
> > +   if (unlikely(usize > PAGE_SIZE))
> > return -E2BIG;
> 
> I quickly looked through the earlier threads and couldn't find it, but
> I have a memory of some discussion about moving this test into the
> copy_struct_from_user() function itself? That would seem like a
> reasonable idea? ("4k should be enough for any structure!")

Yes (and this also seemed the most reasonable way to do it to me), but
the main counter-arguments which swayed me were:

 1. Putting it in the hands of the caller allows them to decide if they
want to have a limit, because if you institute a limit in one kernel
vintage then expanding it later will be less-than-ideally-smooth.

 2. There is no amplification, so doing copy_struct_from_user() for a
really big usize boils down to the userspace program blocking while
the kernel checks whether some of its memory is zeroed. Thus there
doesn't seem to be much DoS potential.

Not to mention that users of copy_struct_from_user() will end up doing
some kind of usize comparison anyway (to check if it's smaller than
the version-0 size).
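
As a rough illustration of what stays with the caller, the size checks
amount to something like the sketch below. The names foo_check_usize(),
FOO_SIZE_VER0 and FOO_SIZE_MAX are hypothetical, and the 4096 cap merely
stands in for PAGE_SIZE; a real syscall would pick its own limits.

```c
#include <errno.h>
#include <stddef.h>

#define FOO_SIZE_VER0 16u   /* hypothetical size of the first published struct */
#define FOO_SIZE_MAX  4096u /* hypothetical upper bound, standing in for PAGE_SIZE */

/*
 * Caller-side policy checks performed before the actual copy. Each
 * syscall chooses its own bounds, which is why they are not baked
 * into the copy helper itself.
 */
int foo_check_usize(size_t usize)
{
	if (usize > FOO_SIZE_MAX)
		return -E2BIG;   /* larger than this caller is willing to scan */
	if (usize < FOO_SIZE_VER0)
		return -EINVAL;  /* smaller than any struct version ever published */
	return 0;
}
```

Since the lower bound is needed regardless, adding the upper bound next
to it costs the caller one extra comparison.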

> Either way:
> 
> Reviewed-by: Kees Cook 
> 
> 
> > -
> > -   if (unlikely(size < sizeof(struct clone_args)))
> > +   if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
> > return -EINVAL;
> >  
> > -   if (unlikely(!access_ok(uargs, size)))
> > -   return -EFAULT;
> > -
> > -   if (size > sizeof(struct clone_args)) {
> > -   unsigned char __user *addr;
> > -   unsigned char __user *end;
> > -   unsigned char val;
> > -
> > -   addr = (void __user *)uargs + sizeof(struct clone_args);
> > -   end = (void __user *)uargs + size;
> > -
> > -   for (; addr < end; addr++) {
> > -   if (get_user(val, addr))
> > -   return -EFAULT;
> > -   if (val)
> > -   return -E2BIG;
> > -   }
> > -
> > -   size = sizeof(struct clone_args);
> > -   }
> > -
> > -   if (copy_from_user(&args, uargs, size))
> > -   return -EFAULT;
> > +   err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
> > +   if (err)
> > +   return err;
> >  
> > /*
> >  * Verify that higher 32bits of exit_signal are unset and that
> > -- 
> > 2.23.0
> > 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH RESEND v3 1/4] lib: introduce copy_struct_from_user() helper

2019-09-30 Thread Aleksa Sarai
On 2019-09-30, Kees Cook  wrote:
> On Tue, Oct 01, 2019 at 05:15:23AM +1000, Aleksa Sarai wrote:
> > From: Aleksa Sarai 
> > 
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases).
> > 
> > While this interface exists for communication in both directions, only
> > one interface is straightforward to have reasonable semantics for
> > (userspace passing a struct to the kernel). For kernel returns to
> > userspace, what the correct semantics are (whether there should be an
> > error if userspace is unaware of a new extension) is very
> > syscall-dependent and thus probably cannot be unified between syscalls
> > (a good example of this problem is [1]).
> > 
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[2]). Future
> > patches replace common uses of this pattern to make use of
> > copy_struct_from_user().
> > 
> > Some in-kernel selftests are included, which ensure that alignment and
> > various byte patterns are all handled identically to memchr_inv() usage.
> > 
> > [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
> >  robustify sched_read_attr() ABI logic and code")
> > 
> > [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
> >  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> >  always rejects differently-sized struct arguments.
> > 
> > Suggested-by: Rasmus Villemoes 
> > Signed-off-by: Aleksa Sarai 
> > ---
> >  include/linux/bitops.h  |   7 +++
> >  include/linux/uaccess.h |   4 ++
> >  lib/strnlen_user.c  |   8 +--
> >  lib/test_user_copy.c| 133 ++--
> >  lib/usercopy.c  | 123 +
> >  5 files changed, 262 insertions(+), 13 deletions(-)
> > 
> > diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> > index cf074bce3eb3..c94a9ff9f082 100644
> > --- a/include/linux/bitops.h
> > +++ b/include/linux/bitops.h
> > @@ -4,6 +4,13 @@
> >  #include 
> >  #include 
> >  
> > +/* Set bits in the first 'n' bytes when loaded from memory */
> > +#ifdef __LITTLE_ENDIAN
> > +#  define aligned_byte_mask(n) ((1UL << 8*(n))-1)
> > +#else
> > +#  define aligned_byte_mask(n) (~0xffUL << (BITS_PER_LONG - 8 - 8*(n)))
> > +#endif
> > +
> >  #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
> >  #define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
> >  
> > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > index 70bbdc38dc37..94f20e6ec6ab 100644
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -231,6 +231,10 @@ __copy_from_user_inatomic_nocache(void *to, const void 
> > __user *from,
> >  
> >  #endif /* ARCH_HAS_NOCACHE_UACCESS */
> >  
> > +extern int check_zeroed_user(const void __user *from, size_t size);
> > +extern int copy_struct_from_user(void *dst, size_t ksize,
> > +const void __user *src, size_t usize);
> > +
> >  /*
> >   * probe_kernel_read(): safely attempt to read from a location
> >   * @dst: pointer to the buffer that shall take the data
> > diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
> > index 28ff554a1be8..6c0005d5dd5c 100644
> > --- a/lib/strnlen_user.c
> > +++ b/lib/strnlen_user.c
> > @@ -3,16 +3,10 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  
> > -/* Set bits in the first 'n' bytes when loaded from memory */
> > -#ifdef __LITTLE_ENDIAN
> > -#  define aligned_byte_mask(n) ((1ul << 8*(n))-1)
> > -#else
> > -#  define aligned_byte_mask(n) (~0xfful << (BITS_PER_LONG - 8 - 8*(n)))
> > -#endif
> > -
> >  /*
> >   * Do a strnlen, return length of string *with* final '\0'.
> >   * 'count' is the user-supplied count, while 'max' is the
> > diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> > index 67bcd5dfd847..3a17f71029bb 100644
> > --- a/lib/test_user_copy.c
> > +++ b/lib/test_user_copy.c
> > @@ -16,6 +16,7 @@
>

Re: [PATCH v13 3/9] open: O_EMPTYPATH: procfs-less file descriptor re-opening

2019-09-30 Thread Aleksa Sarai
On 2019-10-01, kbuild test robot  wrote:
> Hi Aleksa,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [cannot apply to v5.4-rc1 next-20190930]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see 
> https://stackoverflow.com/a/37406982]
> 
> url:
> https://github.com/0day-ci/linux/commits/Aleksa-Sarai/namei-openat2-2-path-resolution-restrictions/20191001-025628
> config: sparc-allyesconfig (attached as .config)
> compiler: sparc64-linux-gcc (GCC) 7.4.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.4.0 make.cross ARCH=sparc 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
> 
> All error/warnings (new ones prefixed by >>):
> 
>In file included from include/linux/kernel.h:11:0,
> from include/linux/list.h:9,
> from include/linux/wait.h:7,
> from include/linux/wait_bit.h:8,
> from include/linux/fs.h:6,
> from include/uapi/linux/aio_abi.h:31,
> from include/linux/syscalls.h:74,
> from fs/fcntl.c:8:
>fs/fcntl.c: In function 'fcntl_init':
> >> include/linux/compiler.h:350:38: error: call to 
> >> '__compiletime_assert_1037' declared with attribute error: BUILD_BUG_ON 
> >> failed: 22 - 1 != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) 
> >> | __FMODE_EXEC | __FMODE_NONOTIFY)
>  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>  ^
>include/linux/compiler.h:331:4: note: in definition of macro 
> '__compiletime_assert'
>prefix ## suffix();\
>^~
>include/linux/compiler.h:350:2: note: in expansion of macro 
> '_compiletime_assert'
>  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>  ^~~
>include/linux/build_bug.h:39:37: note: in expansion of macro 
> 'compiletime_assert'
> #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
> ^~
>include/linux/build_bug.h:50:2: note: in expansion of macro 
> 'BUILD_BUG_ON_MSG'
>  BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>  ^~~~
> >> fs/fcntl.c:1034:2: note: in expansion of macro 'BUILD_BUG_ON'
>  BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
>  ^~~~

This is because 0x400 is used by FMODE_NONOTIFY. The fix is simple,
and I'll include it in the next version.
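
For readers unfamiliar with this check: the BUILD_BUG_ON quoted above
compares the number of distinct bits in the combined flag mask with the
expected number of flags, so it fires when, for example, a new flag
reuses a bit that is already taken. A minimal sketch of the idea in
plain C11 follows. The DEMO_* values and the HW32() macro are made up
for illustration (HW32() only stands in for HWEIGHT32()) and are not the
real open flags.

```c
#include <stdint.h>

/* Hypothetical flag values, chosen only for illustration. */
#define DEMO_FLAG_A   0x1u
#define DEMO_FLAG_B   0x2u
#define DEMO_FLAG_C   0x4u
#define DEMO_NR_FLAGS 3

/* Constant-expression population count for a 32-bit value. */
#define HW2(w)  (((w) & 1u) + (((w) >> 1) & 1u))
#define HW4(w)  (HW2(w) + HW2((w) >> 2))
#define HW8(w)  (HW4(w) + HW4((w) >> 4))
#define HW16(w) (HW8(w) + HW8((w) >> 8))
#define HW32(w) (HW16(w) + HW16((w) >> 16))

/* Breaks the build if two flags share a bit or the count is stale. */
_Static_assert(HW32(DEMO_FLAG_A | DEMO_FLAG_B | DEMO_FLAG_C) == DEMO_NR_FLAGS,
	       "flag bits collide or DEMO_NR_FLAGS is out of date");

/* Runtime variant of the same check: non-zero means a mismatch. */
int demo_mask_mismatch(uint32_t mask, unsigned int nr_flags)
{
	return HW32(mask) != nr_flags;
}
```

If DEMO_FLAG_C were changed to 0x2u, the mask would contain only two
distinct bits and the static assertion would fail at compile time.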

> vim +/__compiletime_assert_1037 +350 include/linux/compiler.h
> 
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  336  
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  337  #define 
> _compiletime_assert(condition, msg, prefix, suffix) \
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  338  
> __compiletime_assert(condition, msg, prefix, suffix)
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  339  
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  340  /**
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  341   * compiletime_assert - break 
> build and emit msg if condition is false
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  342   * @condition: a compile-time 
> constant condition to check
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  343   * @msg:   a message to 
> emit if condition is false
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  344   *
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  345   * In tradition of POSIX 
> assert, this macro will break the build if the
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  346   * supplied condition is 
> *false*, emitting the supplied error message if the
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  347   * compiler has support to do 
> so.
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  348   */
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  349  #define 
> compiletime_assert(condition, msg) \
> 9a8ab1c39970a4 Daniel Santos 2013-02-21 @350  
> _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
> 9a8ab1c39970a4 Daniel Santos 2013-02-21  351  
> 
> :: The code at line 350 was first introduced by commit
> :: 9a8ab1c39970a4938a72d94e6fd13be88a797590 bug.h, compiler.h: introduce 
> compiletime_assert & BUILD_BUG_ON_MSG
> 
> :: TO: Daniel Santos 
> :: CC: Linus Torvalds 
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




[PATCH RESEND v3 0/4] lib: introduce copy_struct_from_user() helper

2019-09-30 Thread Aleksa Sarai
From: Aleksa Sarai 

Patch changelog:
 v3: [<https://lore.kernel.org/lkml/20190930182810.6090-1-cyp...@cyphar.com/>]
  * Rename is_zeroed_user() to check_zeroed_user(). [Christian Brauner]
  * Various minor cleanups. [Christian Brauner]
  * Add copy_struct_from_user() tests.
 v2: <https://lore.kernel.org/lkml/20190925230332.18690-1-cyp...@cyphar.com/>
 v1: <https://lore.kernel.org/lkml/20190925165915.8135-1-cyp...@cyphar.com/>

This series was split off from the openat2(2) syscall discussion[1].
However, the copy_struct_to_user() helper has been dropped, because
after some discussion it appears that there are no really obvious
semantics for how copy_struct_to_user() should work across mixed vintages
(for instance, whether [2] has the correct semantics for all syscalls).

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[3]). This series
implements the helper and ports several syscalls to use it.

Some in-kernel selftests are included in this patch. More complete
self-tests for copy_struct_from_user() are included in the openat2()
patchset.

[1]: https://lore.kernel.org/lkml/20190904201933.10736-1-cyp...@cyphar.com/

[2]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[3]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Aleksa Sarai (4):
  lib: introduce copy_struct_from_user() helper
  clone3: switch to copy_struct_from_user()
  sched_setattr: switch to copy_struct_from_user()
  perf_event_open: switch to copy_struct_from_user()

 include/linux/bitops.h |   7 ++
 include/linux/uaccess.h|   4 ++
 include/uapi/linux/sched.h |   2 +
 kernel/events/core.c   |  47 +++--
 kernel/fork.c  |  34 ++
 kernel/sched/core.c|  43 ++--
 lib/strnlen_user.c |   8 +--
 lib/test_user_copy.c   | 133 +++--
 lib/usercopy.c | 123 ++
 9 files changed, 287 insertions(+), 114 deletions(-)

-- 
2.23.0



[PATCH RESEND v3 4/4] perf_event_open: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
From: Aleksa Sarai 

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Signed-off-by: Aleksa Sarai 
---
 kernel/events/core.c | 47 +---
 1 file changed, 9 insertions(+), 38 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4655adbbae10..3f0cb82e4fbc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10586,55 +10586,26 @@ static int perf_copy_attr(struct perf_event_attr 
__user *uattr,
u32 size;
int ret;
 
-   if (!access_ok(uattr, PERF_ATTR_SIZE_VER0))
-   return -EFAULT;
-
-   /*
-* zero the full structure, so that a short copy will be nice.
-*/
+   /* Zero the full structure, so that a short copy will be nice. */
memset(attr, 0, sizeof(*attr));
 
ret = get_user(size, &uattr->size);
if (ret)
return ret;
 
-   if (size > PAGE_SIZE)   /* silly large */
-   goto err_size;
-
-   if (!size)  /* abi compat */
+   /* ABI compatibility quirk: */
+   if (!size)
size = PERF_ATTR_SIZE_VER0;
-
-   if (size < PERF_ATTR_SIZE_VER0)
+   if (size < PERF_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
attr->size = size;
 
if (attr->__reserved_1)
-- 
2.23.0



[PATCH RESEND v3 1/4] lib: introduce copy_struct_from_user() helper

2019-09-30 Thread Aleksa Sarai
From: Aleksa Sarai 

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

While this pattern is used for communication in both directions, only
one direction has straightforwardly reasonable semantics (userspace
passing a struct to the kernel). For kernel returns to userspace, the
correct semantics (for instance, whether there should be an error if
userspace is unaware of a new extension) are very syscall-dependent and
thus probably cannot be unified between syscalls (a good example of this
problem is [1]).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[2]). Future
patches replace common uses of this pattern to make use of
copy_struct_from_user().

Some in-kernel selftests are included, which ensure that alignment and
various byte patterns are all handled identically to memchr_inv() usage.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
---
 include/linux/bitops.h  |   7 +++
 include/linux/uaccess.h |   4 ++
 lib/strnlen_user.c  |   8 +--
 lib/test_user_copy.c| 133 ++--
 lib/usercopy.c  | 123 +
 5 files changed, 262 insertions(+), 13 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index cf074bce3eb3..c94a9ff9f082 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+/* Set bits in the first 'n' bytes when loaded from memory */
+#ifdef __LITTLE_ENDIAN
+#  define aligned_byte_mask(n) ((1UL << 8*(n))-1)
+#else
+#  define aligned_byte_mask(n) (~0xffUL << (BITS_PER_LONG - 8 - 8*(n)))
+#endif
+
 #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
 #define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 70bbdc38dc37..94f20e6ec6ab 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -231,6 +231,10 @@ __copy_from_user_inatomic_nocache(void *to, const void 
__user *from,
 
 #endif /* ARCH_HAS_NOCACHE_UACCESS */
 
+extern int check_zeroed_user(const void __user *from, size_t size);
+extern int copy_struct_from_user(void *dst, size_t ksize,
+const void __user *src, size_t usize);
+
 /*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 28ff554a1be8..6c0005d5dd5c 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -3,16 +3,10 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-/* Set bits in the first 'n' bytes when loaded from memory */
-#ifdef __LITTLE_ENDIAN
-#  define aligned_byte_mask(n) ((1ul << 8*(n))-1)
-#else
-#  define aligned_byte_mask(n) (~0xfful << (BITS_PER_LONG - 8 - 8*(n)))
-#endif
-
 /*
  * Do a strnlen, return length of string *with* final '\0'.
  * 'count' is the user-supplied count, while 'max' is the
diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
index 67bcd5dfd847..3a17f71029bb 100644
--- a/lib/test_user_copy.c
+++ b/lib/test_user_copy.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Several 32-bit architectures support 64-bit {get,put}_user() calls.
@@ -31,14 +32,129 @@
 # define TEST_U64
 #endif
 
-#define test(condition, msg)   \
-({ \
-   int cond = (condition); \
-   if (cond)   \
-   pr_warn("%s\n", msg);   \
-   cond;   \
+#define test(condition, msg, ...)  \
+({ \
+   int cond = (condition); \
+   if (cond)   \
+   pr_warn("[%d] " msg "\n", __LINE__, ##__VA_ARGS__); \
+   cond;   \
 })
 
+static bool is_zeroed(void *from, size_t size)
+{
+   return memchr_inv(from, 0x0, size) == NULL;
+}
+
+static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
+{
+   int ret = 0;
+   size_t start, end, i;
+   size_t ze

[PATCH RESEND v3 2/4] clone3: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
From: Aleksa Sarai 

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.

Signed-off-by: Aleksa Sarai 
---
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 34 +++---
 2 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..0945805982b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,6 +47,8 @@ struct clone_args {
__aligned_u64 tls;
 };
 
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..2ef529869c64 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2525,39 +2525,19 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, 
unsigned long, newsp,
 #ifdef __ARCH_WANT_SYS_CLONE3
 noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
  struct clone_args __user *uargs,
- size_t size)
+ size_t usize)
 {
+   int err;
struct clone_args args;
 
-   if (unlikely(size > PAGE_SIZE))
+   if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
-
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
return -EINVAL;
 
-   if (unlikely(!access_ok(uargs, size)))
-   return -EFAULT;
-
-   if (size > sizeof(struct clone_args)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uargs + sizeof(struct clone_args);
-   end = (void __user *)uargs + size;
-
-   for (; addr < end; addr++) {
-   if (get_user(val, addr))
-   return -EFAULT;
-   if (val)
-   return -E2BIG;
-   }
-
-   size = sizeof(struct clone_args);
-   }
-
-   if (copy_from_user(&args, uargs, size))
-   return -EFAULT;
+   err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+   if (err)
+   return err;
 
/*
 * Verify that higher 32bits of exit_signal are unset and that
-- 
2.23.0



[PATCH RESEND v3 3/4] sched_setattr: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
From: Aleksa Sarai 

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

Signed-off-by: Aleksa Sarai 
---
 kernel/sched/core.c | 43 +++
 1 file changed, 7 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f64d0e..dd05a378631a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5106,9 +5106,6 @@ static int sched_copy_attr(struct sched_attr __user 
*uattr, struct sched_attr *a
u32 size;
int ret;
 
-   if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0))
-   return -EFAULT;
-
/* Zero the full structure, so that a short copy will be nice: */
memset(attr, 0, sizeof(*attr));
 
@@ -5116,45 +5113,19 @@ static int sched_copy_attr(struct sched_attr __user 
*uattr, struct sched_attr *a
if (ret)
return ret;
 
-   /* Bail out on silly large: */
-   if (size > PAGE_SIZE)
-   goto err_size;
-
/* ABI compatibility quirk: */
if (!size)
size = SCHED_ATTR_SIZE_VER0;
-
-   if (size < SCHED_ATTR_SIZE_VER0)
+   if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
@@ -5354,7 +5325,7 @@ sched_attr_copy_to_user(struct sched_attr __user *uattr,
  * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
- * @usize: sizeof(attr) that user-space knows about, for forwards and 
backwards compatibility.
+ * @usize: sizeof(attr) for fwd/bwd comp.
  * @flags: for future extension.
  */
 SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-- 
2.23.0



[PATCH v3 1/4] lib: introduce copy_struct_from_user() helper

2019-09-30 Thread Aleksa Sarai
A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
results in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

While this pattern exists for communication in both directions, only
one direction has straightforward semantics (userspace passing a struct
to the kernel). For kernel returns to userspace, the correct semantics
(whether there should be an error if userspace is unaware of a new
extension) are very syscall-dependent and thus probably cannot be
unified between syscalls (a good example of this problem is [1]).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[2]). Future
patches replace common uses of this pattern to make use of
copy_struct_from_user().
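
To make the extension-checking semantics concrete, here is a minimal
userspace model of what such a helper is expected to do. It is only a
sketch: model_copy_struct() is a hypothetical name, plain memory stands
in for the userspace buffer, and the -EFAULT case of a real
copy_from_user() is not modelled.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the struct-extension copy: "src" stands in for the userspace
 * buffer of usize bytes, "dst" is the kernel's struct of ksize bytes.
 */
static int model_copy_struct(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	size_t i;

	if (usize > ksize) {
		/* Newer userspace: fields we do not know about must be zero. */
		const unsigned char *tail = (const unsigned char *)src + ksize;

		for (i = 0; i < usize - ksize; i++)
			if (tail[i])
				return -E2BIG;
	} else if (usize < ksize) {
		/* Older userspace: fields it does not know about read as zero. */
		memset((unsigned char *)dst + usize, 0, ksize - usize);
	}
	/* Copy the part both sides agree on. */
	memcpy(dst, src, size);
	return 0;
}
```

With this shape, an old binary on a new kernel gets the old behaviour
(new fields are zero), and a new binary on an old kernel only fails if
it actually sets a field the kernel does not understand.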

Some in-kernel selftests are also included, which ensure that alignment
and various byte patterns are all handled identically to memchr_inv()
usage.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
---
 include/linux/bitops.h  |   7 +++
 include/linux/uaccess.h |   4 ++
 lib/strnlen_user.c  |   8 +--
 lib/test_user_copy.c| 133 ++--
 lib/usercopy.c  | 123 +
 5 files changed, 262 insertions(+), 13 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index cf074bce3eb3..c94a9ff9f082 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+/* Set bits in the first 'n' bytes when loaded from memory */
+#ifdef __LITTLE_ENDIAN
+#  define aligned_byte_mask(n) ((1UL << 8*(n))-1)
+#else
+#  define aligned_byte_mask(n) (~0xffUL << (BITS_PER_LONG - 8 - 8*(n)))
+#endif
+
 #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
 #define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 70bbdc38dc37..94f20e6ec6ab 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -231,6 +231,10 @@ __copy_from_user_inatomic_nocache(void *to, const void 
__user *from,
 
 #endif /* ARCH_HAS_NOCACHE_UACCESS */
 
+extern int check_zeroed_user(const void __user *from, size_t size);
+extern int copy_struct_from_user(void *dst, size_t ksize,
+const void __user *src, size_t usize);
+
 /*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 28ff554a1be8..6c0005d5dd5c 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -3,16 +3,10 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-/* Set bits in the first 'n' bytes when loaded from memory */
-#ifdef __LITTLE_ENDIAN
-#  define aligned_byte_mask(n) ((1ul << 8*(n))-1)
-#else
-#  define aligned_byte_mask(n) (~0xfful << (BITS_PER_LONG - 8 - 8*(n)))
-#endif
-
 /*
  * Do a strnlen, return length of string *with* final '\0'.
  * 'count' is the user-supplied count, while 'max' is the
diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
index 67bcd5dfd847..3a17f71029bb 100644
--- a/lib/test_user_copy.c
+++ b/lib/test_user_copy.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Several 32-bit architectures support 64-bit {get,put}_user() calls.
@@ -31,14 +32,129 @@
 # define TEST_U64
 #endif
 
-#define test(condition, msg)   \
-({ \
-   int cond = (condition); \
-   if (cond)   \
-   pr_warn("%s\n", msg);   \
-   cond;   \
+#define test(condition, msg, ...)  \
+({ \
+   int cond = (condition); \
+   if (cond)   \
+   pr_warn("[%d] " msg "\n", __LINE__, ##__VA_ARGS__); \
+   cond;   \
 })
 
+static bool is_zeroed(void *from, size_t size)
+{
+   return memchr_inv(from, 0x0, size) == NULL;
+}
+
+static int test_check_nonzero_user(char *kmem, char __user *umem, size_t size)
+{
+   int ret = 0;
+   size_t start, end, i;
+   size_t zero_start = size / 4;
+   size_t 

[PATCH v13 3/9] open: O_EMPTYPATH: procfs-less file descriptor re-opening

2019-09-30 Thread Aleksa Sarai
Userspace has made use of /proc/self/fd very liberally to allow for
descriptors to be re-opened. There are a wide variety of uses for this
feature, but it has always required constructing a pathname and could
not be done without procfs mounted. The obvious solution for this is to
extend openat(2) to have an AT_EMPTY_PATH-equivalent -- O_EMPTYPATH.

Now that descriptor re-opening has been made safe through the new
magic-link resolution restrictions, we can replicate these restrictions
for O_EMPTYPATH. In particular, we only allow "upgrading" the file
descriptor if the corresponding FMODE_PATH_* bit is set (or the
FMODE_{READ,WRITE} cases for non-O_PATH file descriptors).

When doing openat(O_EMPTYPATH|O_PATH), O_PATH takes precedence and
O_EMPTYPATH is ignored. Very few users ever have a need to O_PATH
re-open an existing file descriptor, and so accommodating them at the
expense of further complicating O_PATH makes little sense. Ultimately,
if users ask for this we can always add RESOLVE_EMPTY_PATH to
resolveat(2) in the future.

Signed-off-by: Aleksa Sarai 
---
 arch/alpha/include/uapi/asm/fcntl.h  |  1 +
 arch/parisc/include/uapi/asm/fcntl.h | 39 ++--
 arch/sparc/include/uapi/asm/fcntl.h  |  1 +
 fs/fcntl.c   |  2 +-
 fs/namei.c   | 20 ++
 fs/open.c|  7 -
 include/linux/fcntl.h|  2 +-
 include/uapi/asm-generic/fcntl.h |  4 +++
 8 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h 
b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..1f879bade68b 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
 
 #define O_PATH 04000
 #define __O_TMPFILE01
+#define O_EMPTYPATH02
 
 #define F_GETLK7
 #define F_SETLK8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h 
b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..5d709058a76f 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
 #ifndef _PARISC_FCNTL_H
 #define _PARISC_FCNTL_H
 
-#define O_APPEND   00010
-#define O_BLKSEEK  00100 /* HPUX only */
-#define O_CREAT00400 /* not fcntl */
-#define O_EXCL 02000 /* not fcntl */
-#define O_LARGEFILE04000
-#define __O_SYNC   00010
+#define O_APPEND   10
+#define O_BLKSEEK  000100 /* HPUX only */
+#define O_CREAT000400 /* not fcntl */
+#define O_EXCL 002000 /* not fcntl */
+#define O_LARGEFILE004000
+#define __O_SYNC   10
 #define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_NONBLOCK 00024 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY   00040 /* not fcntl */
-#define O_DSYNC00100 /* HPUX only */
-#define O_RSYNC00200 /* HPUX only */
-#define O_NOATIME  00400
-#define O_CLOEXEC  01000 /* set close_on_exec */
-
-#define O_DIRECTORY1 /* must be a directory */
-#define O_NOFOLLOW 00200 /* don't follow links */
-#define O_INVISIBLE00400 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH 02000
-#define __O_TMPFILE04000
+#define O_NONBLOCK 24 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY   40 /* not fcntl */
+#define O_DSYNC000100 /* HPUX only */
+#define O_RSYNC000200 /* HPUX only */
+#define O_NOATIME  000400
+#define O_CLOEXEC  001000 /* set close_on_exec */
+
+#define O_DIRECTORY01 /* must be a directory */
+#define O_NOFOLLOW 000200 /* don't follow links */
+#define O_INVISIBLE000400 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH 002000
+#define __O_TMPFILE004000
+#define O_EMPTYPATH01
 
 #define F_GETLK64  8
 #define F_SETLK64  9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h 
b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..dc86c9eaf950 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH 0x100
 #define __O_TMPFILE0x200
+#define O_EMPTYPATH0x400
 
 #define F_GETOWN   5   /*  for sockets. */
 #define F_SETOWN   6   /*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..4cf05a2fd162 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 * is defined as O_NONBLOCK on some platforms and not on others.
 */
-   BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+   BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
   

[PATCH v13 7/9] open: openat2(2) syscall

2019-09-30 Thread Aleksa Sarai
The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2). However, there are a few reasons why this is
not the best course of action:

 * The new LOOKUP_* flags are intended to be security features, and
   openat(2) will silently ignore all unknown flags. This means that
   users would need to avoid foot-gunning themselves constantly when
   using this interface if it were part of openat(2). This can be fixed
   by having userspace libraries handle this for users[1], but should be
   avoided if possible.

 * Resolution scoping feels like a different operation to the existing
   O_* flags. And since openat(2) has limited flag space, it seems to be
   quite wasteful to clutter it with 5 flags that are all
   resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
   its entire purpose is to error out if you encounter a trailing
   symlink -- not to scope resolution.

 * Other systems would be able to reimplement this syscall allowing for
   cross-OS standardisation rather than being hidden amongst O_* flags
   which may result in it not being used by all the parties that might
   want to use it (file servers, web servers, container runtimes, etc).

 * It gives us the opportunity to iterate on the O_PATH interface. In
   particular, the new @how->upgrade_mask field for fd re-opening is
   only possible because we have a clean slate without needing to re-use
   the ACC_MODE flag design nor the existing openat(2) @mode semantics.

To this end, we introduce the openat2(2) syscall. It provides all of the
features of openat(2) through the @how->flags argument, but also
provides a new @how->resolve argument which exposes RESOLVE_* flags
that map to our new LOOKUP_* flags. It also eliminates the long-standing
ugliness of variadic-open(2) by embedding it in a struct.

In order to allow for userspace to lock down their usage of file
descriptor re-opening, openat2(2) has the ability for users to disallow
certain re-opening modes through @how->upgrade_mask. At the moment,
there is no UPGRADE_NOEXEC.

[1]: https://github.com/openSUSE/libpathrs

Suggested-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 arch/alpha/kernel/syscalls/syscall.tbl  |  1 +
 arch/arm/tools/syscall.tbl  |  1 +
 arch/arm64/include/asm/unistd.h |  2 +-
 arch/arm64/include/asm/unistd32.h   |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl   |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl   |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl|  1 +
 arch/s390/kernel/syscalls/syscall.tbl   |  1 +
 arch/sh/kernel/syscalls/syscall.tbl |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl  |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl  |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl |  1 +
 fs/open.c   | 94 -
 include/linux/fcntl.h   | 19 -
 include/linux/fs.h  |  4 +-
 include/linux/syscalls.h| 14 ++-
 include/uapi/asm-generic/unistd.h   |  5 +-
 include/uapi/linux/fcntl.h  | 42 +
 24 files changed, 168 insertions(+), 30 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..9f374f7d9514 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543common  fspick  sys_fspick
 544common  pidfd_open  sys_pidfd_open
 # 545 reserved for clone3
+547common  openat2 sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 435common  clone3  sys_clone3
+437common  openat2 sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   436
+#define __NR_compat_syscalls   438
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unis

[PATCH v13 5/9] namei: LOOKUP_IN_ROOT: chroot-like path resolution

2019-09-30 Thread Aleksa Sarai
The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing
LOOKUP_NO_XDEV and LOOKUP_NO_MAGICLINKS help defend against other
potential attacks in a malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and magic-link jumping are disallowed for the
same reason they are disabled for LOOKUP_BENEATH -- currently it is not
safe to allow them. Future patches may enable them unconditionally once
we have resolved the possible races (for "..") and semantics (for
magic-link jumping).

The most significant *at(2) semantic change with LOOKUP_IN_ROOT is that
absolute pathnames no longer cause the dirfd to be ignored completely.

The rationale is that LOOKUP_IN_ROOT must necessarily chroot-scope
symlinks with absolute paths to dirfd, and so doing it for the base path
seems to be the most consistent behaviour (and also avoids foot-gunning
users who want to scope paths that are absolute).
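
The handling of absolute paths boils down to skipping leading slashes
before the walk starts, so that the remainder resolves relative to
dirfd. A standalone sketch of that step (the function name is
hypothetical; the loop mirrors the one added to path_init() below):

```c
/* Skip leading '/' characters so the path is resolved relative to dirfd. */
static const char *sketch_strip_leading_slashes(const char *s)
{
	while (*s == '/')
		s++;
	return s;
}
```

So "/etc/passwd" and "etc/passwd" name the same file under the dirfd
when LOOKUP_IN_ROOT is in effect.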

[1]: https://github.com/cyphar/filepath-securejoin

Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 5 +
 include/linux/namei.h | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index b80efc0ae0f3..efed62c6136e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2274,6 +2274,11 @@ static const char *path_init(struct nameidata *nd, 
unsigned flags)
 
nd->m_seq = read_seqbegin(&mount_lock);
 
+   /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
+   if (flags & LOOKUP_IN_ROOT)
+   while (*s == '/')
+   s++;
+
/* Figure out the starting path and root (if needed). */
if (*s == '/') {
error = nd_jump_root(nd);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 88b610ca4d83..1ace31052237 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -48,8 +48,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS   0x08 /* No /proc/$pid/fd/ "symlink" 
crossing. */
 #define LOOKUP_NO_SYMLINKS 0x10 /* No symlink crossing *at all*.
Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT 0x20 /* Treat dirfd as %current->fs->root. 
*/
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
-#define LOOKUP_DIRFD_SCOPE_FLAGS LOOKUP_BENEATH
+#define LOOKUP_DIRFD_SCOPE_FLAGS (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
 
 extern int path_pts(struct path *path);
 
-- 
2.23.0



[PATCH v13 2/9] procfs: switch magic-link modes to be more sane

2019-09-30 Thread Aleksa Sarai
Now that magic-link modes are obeyed for file re-opening purposes, some
of the pre-existing magic-link modes need to be adjusted to be more
semantically correct.

The most blatant example of this is /proc/self/exe, which had a mode of
a+rwx even though tautologically the file could never be opened for
writing (because it is the current->mm of a live process).

With the new O_PATH restrictions, changing the default mode of these
magic-links allows us to avoid delayed-access attacks such as we saw in
CVE-2019-5736.
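
The mode of a magic-link can be inspected from userspace with lstat(2).
The helper below is only a sketch with a hypothetical name; what it
reports for /proc/self/exe depends on the running kernel, so no
particular value is claimed here.

```c
#include <sys/stat.h>

/* Returns the permission bits of a symlink itself, or -1 on error. */
static int sketch_link_mode(const char *path)
{
	struct stat st;

	/* lstat() does not follow the link, so st describes the link. */
	if (lstat(path, &st) != 0 || !S_ISLNK(st.st_mode))
		return -1;
	return (int)(st.st_mode & 0777);
}
```

Calling sketch_link_mode("/proc/self/exe") before and after this patch
is one way to observe the change from a+rwx to a+rx.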

Signed-off-by: Aleksa Sarai 
---
 fs/proc/base.c   | 20 ++--
 fs/proc/namespaces.c |  2 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 96c9ec66846f..908edd0e875e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -133,9 +133,9 @@ struct pid_entry {
 
 #define DIR(NAME, MODE, iops, fops)\
NOD(NAME, (S_IFDIR|(MODE)), &iops, &fops, {} )
-#define LNK(NAME, get_link)\
-   NOD(NAME, (S_IFLNK|S_IRWXUGO),  \
-   &proc_pid_link_inode_operations, NULL,  \
+#define LNK(NAME, MODE, get_link)  \
+   NOD(NAME, (S_IFLNK|(MODE)), \
+   &proc_pid_link_inode_operations, NULL,  \
{ .proc_get_link = get_link } )
 #define REG(NAME, MODE, fops)  \
NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {})
@@ -3047,9 +3047,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
 #endif
REG("mem",S_IRUSR|S_IWUSR, proc_mem_operations),
-   LNK("cwd",proc_cwd_link),
-   LNK("root",   proc_root_link),
-   LNK("exe",proc_exe_link),
+   LNK("cwd",S_IRWXUGO, proc_cwd_link),
+   LNK("root",   S_IRWXUGO, proc_root_link),
+   LNK("exe",S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
REG("mountinfo",  S_IRUGO, proc_mountinfo_operations),
REG("mountstats", S_IRUSR, proc_mountstats_operations),
@@ -3448,11 +3448,11 @@ static const struct pid_entry tid_base_stuff[] = {
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
 #endif
REG("mem",   S_IRUSR|S_IWUSR, proc_mem_operations),
-   LNK("cwd",   proc_cwd_link),
-   LNK("root",  proc_root_link),
-   LNK("exe",   proc_exe_link),
+   LNK("cwd",   S_IRWXUGO, proc_cwd_link),
+   LNK("root",  S_IRWXUGO, proc_root_link),
+   LNK("exe",   S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts",S_IRUGO, proc_mounts_operations),
-   REG("mountinfo",  S_IRUGO, proc_mountinfo_operations),
+   REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
 #ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 0142992eceea..cadf0ae796a2 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -94,7 +94,7 @@ static struct dentry *proc_ns_instantiate(struct dentry 
*dentry,
struct inode *inode;
struct proc_inode *ei;
 
-   inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+   inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRUGO);
if (!inode)
return ERR_PTR(-ENOENT);
 
-- 
2.23.0



[PATCH v13 6/9] namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH}

2019-09-30 Thread Aleksa Sarai
This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely[*].

The need for this patch (and the original no-".." restriction) is
explained by observing there is a fairly easy-to-exploit race condition
with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH
if ".." is allowed) where a rename(2) of a path can be used to "skip
over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
for (;;)
  renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
for (;;)
  openat2(dirb, "b/c/../../etc/shadow",
  { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

This is a first-pass implementation, where -EAGAIN will be returned if
any rename or mount occurs anywhere on the host (in any namespace). This
will result in spurious errors, but there isn't a satisfactory
alternative (other than denying ".." altogether).
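
The detection logic can be modelled as follows. This is a deliberately
simplified, single-threaded sketch: the sketch_* names are hypothetical,
and a plain counter stands in for the kernel's seqlocks (which also deal
with in-progress writers and memory ordering).

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a seqlock: every rename (or mount) bumps the count. */
struct sketch_seq {
	unsigned count;
};

/* Sample the counter at the start of the path walk. */
static unsigned sketch_read_begin(const struct sketch_seq *s)
{
	return s->count;
}

/* True if a writer has run since the sample was taken. */
static bool sketch_read_retry(const struct sketch_seq *s, unsigned start)
{
	return s->count != start;
}

/*
 * Called after each ".." step: if any rename or mount happened since the
 * walk began, the result cannot be trusted and the caller must retry.
 */
static int sketch_check_dotdot(const struct sketch_seq *renames,
			       const struct sketch_seq *mounts,
			       unsigned r_seq, unsigned m_seq)
{
	if (sketch_read_retry(renames, r_seq) ||
	    sketch_read_retry(mounts, m_seq))
		return -EAGAIN;
	return 0;
}
```

Because the counters are global, an unrelated rename elsewhere on the
system is enough to trigger -EAGAIN, which is the source of the spurious
errors mentioned above.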

One other possible alternative (which previous versions of this patch
used) would be to check with path_is_under() if there was a racing
rename or mount (after re-taking the relevant seqlocks). While this does
work, it results in possible O(n*m) behaviour if there are many renames
or mounts occurring *anywhere on the system*.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check (as
with the alternative solution for "..") for magic-links after they
are resolved. However this seems unlikely to be a feature that
people *really* need -- it can be added later if it turns out a lot
of people want it.

Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 43 +--
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index efed62c6136e..9c35768fcf4f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode*inode; /* path.dentry.d_inode */
unsigned intflags;
-   unsignedseq, m_seq;
+   unsignedseq, m_seq, r_seq;
int last_type;
unsigneddepth;
int total_link_count;
@@ -1766,22 +1766,35 @@ static inline int handle_dots(struct nameidata *nd, int 
type)
if (type == LAST_DOTDOT) {
int error = 0;
 
-   /*
-* Scoped-lookup flags resolving ".." is not currently safe --
-* races can cause our parent to have moved outside of the root
-* and us to skip over it.
-*/
-   if (unlikely(nd->flags & LOOKUP_DIRFD_SCOPE_FLAGS))
-   return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
return error;
}
-   if (nd->flags & LOOKUP_RCU) {
-   return follow_dotdot_rcu(nd);
-   } else
-   return follow_dotdot(nd);
+   if (nd->flags & LOOKUP_RCU)
+   error = follow_dotdot_rcu(nd);
+   else
+   error = follow_dotdot(nd);
+   if (error)
+   return error;
+
+   if (unlikely(nd->flags & LOOKUP_DIRFD_SCOPE_FLAGS)) {
+   bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+   bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+   

[PATCH v13 9/9] Documentation: update path-lookup to mention trailing magic-links

2019-09-30 Thread Aleksa Sarai
We've introduced new (somewhat subtle) behaviour regarding trailing
magic-links, so it's best to make sure everyone can follow along with
the reasoning behind trailing_magiclink().

Signed-off-by: Aleksa Sarai 
---
 Documentation/filesystems/path-lookup.rst | 80 ++-
 1 file changed, 63 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst 
b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..c30145b3d9ba 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -405,6 +405,10 @@ is requested.  Keeping a reference in the ``nameidata`` 
ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.
 
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,22 +1153,61 @@ so ``NULL`` is returned to indicate that the symlink 
can be released and
 the stack frame discarded.
 
 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
 
  $ ls -l /proc/self/fd/1
  lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
 
 Every open file descriptor in any process is represented in ``/proc`` by
-something that looks like a symlink.  It is really a reference to the
-target file, not just the name of it.  When you ``readlink`` these
-objects you get a name that might refer to the same file - unless it
-has been unlinked or mounted over.  When ``walk_component()`` follows
-one of these, the ``->follow_link()`` method in "procfs" doesn't return
-a string name, but instead calls ``nd_jump_link()`` which updates the
-``nameidata`` in place to point to that target.  ``->follow_link()`` then
-returns ``NULL``.  Again there is no final component and ``get_link()``
-reports this by leaving the ``last_type`` field of ``nameidata`` as
-``LAST_BIND``.
+a magic-link.  It is really a reference to the target file, not just the
+name of it (hence making them "magical" compared to ordinary symlinks).
+When you ``readlink`` these objects you get a name that might refer to
+the same file - unless it has been unlinked or mounted over.  When
+``walk_component()`` follows one of these, the ``->follow_link()`` method
+in "procfs" doesn't return a string name, but instead calls
+``nd_jump_link()`` which updates the ``nameidata`` in place to point to
+that target.  ``->follow_link()`` then returns ``NULL``. Again there is
+no final component and ``get_link()`` reports this by leaving the
+``last_type`` field of ``nameidata`` as ``LAST_BIND``.
+
+In order to avoid potential re-opening attacks (especially in the context
+of containers), it is necessary to restrict the ability for a trailing
+magic-link to be opened. The restrictions are as follows (and are
+implemented in ``trailing_magiclink()``):
+
+* If the ``open()`` is an "ordinary open" (without ``O_PATH``), the
+  access-mode of the ``open()`` call must be permitted by one of the
+  octets in the magic-link's file mode (elsewhere in Linux, ordinary
+  symlinks have a file mode of ``0777`` but this doesn't apply to
+  magic-links). Each "ordinary" file in ``/proc/self/fd/$n`` has the user
+  octet of its file mode set to correspond to the access-mode it was
+  opened with.
+
+  This restriction means that you cannot re-open an ``O_RDONLY`` file
+  descriptor through ``/proc/self/fd/$n`` with ``O_RDWR``.
+
+With a "half-open" (with ``O_PATH``), there are no ``-EACCES``-enforced
+restrictions on ``open()``, but there are rules about the mode shown in
+``/proc/self/fd/$n``:
+
+* If the target of the ``open()`` is not a magic-link, then the group
+  octet of the file mode is set to permit all access modes.
+
+* Otherwise, the mode of the new ``O_PATH`` descriptor is set to
+  effectively the same mode as the magic-link (though the permissions are
+  set in the group octet of the mode). This means that an ``O_PATH`` of a
+  magic-link gives you no more re-open permissions than the magic-link
+  itself.
+
+With these ``O_PATH`` restrictions, it is still possible to re-open an
+``O_PATH`` file descriptor but you cannot use ``O_PATH`` to work around
+the above restrictions on "ordinary opens" of magic-links.
+
+In order to avoid certain race conditions (where a file descriptor
+associated with a magic-link is swapped, causing the ``link_inode`` of
+``nameidata`` to become stale during magic-link traver

[PATCH v13 8/9] selftests: add openat2(2) selftests

2019-09-30 Thread Aleksa Sarai
Test all of the various openat2(2) flags, as well as how file
descriptor re-opening works. A small stress-test of a symlink-rename
attack is included to show that the protections against ".."-based
attacks are sufficient. In addition, the memfd selftest is fixed to no
longer depend on the now-disallowed functionality of upgrading an
O_RDONLY descriptor to O_RDWR.

The main things these self-tests are enforcing are:

  * The struct+usize ABI for openat2(2) and copy_struct_from_user() to
    ensure that upgrades will be handled gracefully (in addition,
    ensuring that misaligned structures are also handled correctly).

  * All of the RESOLVE_* semantics (including errno values) are
    correctly handled with various combinations of paths and flags.

  * RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
    attack that has been responsible for several CVEs (and likely will
    be responsible for several more).

  * The magic-link trailing mode semantics correctly block re-opens in
    all of the relevant cases, as well as checking that the "flip-flop"
    attack is correctly protected against.

  * O_PATH has the correct semantics (the mode is g+rwx for ordinary
    files, but for trailing magic-links the mode gets inherited).
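
To make the magic-link mode rule more concrete, here is a toy userspace
model of it as the documentation patch words it: a re-open is allowed only
if a single octet of the magic-link's mode grants every bit the requested
access mode needs. The helper names needed_bits() and
magiclink_may_reopen() are invented for this sketch; it is not the
kernel's trailing_magiclink() and it ignores everything else the real
check may consider.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical helper: r/w bits (4 = read, 2 = write) an access mode needs. */
unsigned int needed_bits(int open_flags)
{
	switch (open_flags & O_ACCMODE) {
	case O_RDONLY:
		return 04;
	case O_WRONLY:
		return 02;
	case O_RDWR:
		return 06;
	default:
		return 07; /* unknown access mode: require everything */
	}
}

/*
 * Hypothetical model: true if any one octet (user, group or other) of
 * link_mode grants every bit that open_flags needs.
 */
bool magiclink_may_reopen(mode_t link_mode, int open_flags)
{
	unsigned int need = needed_bits(open_flags);

	for (int shift = 6; shift >= 0; shift -= 3) {
		unsigned int octet = (link_mode >> shift) & 07;

		if ((octet & need) == need)
			return true;
	}
	return false;
}
```

Under this model a descriptor opened O_RDONLY (user octet r--) cannot be
re-opened O_RDWR, while an O_PATH descriptor of an ordinary file (g+rwx)
can be re-opened with any access mode.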

Signed-off-by: Aleksa Sarai 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/memfd/memfd_test.c|   7 +-
 tools/testing/selftests/openat2/.gitignore|   1 +
 tools/testing/selftests/openat2/Makefile  |   8 +
 tools/testing/selftests/openat2/helpers.c |  98 +++
 tools/testing/selftests/openat2/helpers.h | 114 
 .../testing/selftests/openat2/linkmode_test.c | 590 ++
 .../testing/selftests/openat2/openat2_test.c  | 152 +
 .../selftests/openat2/rename_attack_test.c| 149 +
 .../testing/selftests/openat2/resolve_test.c  | 522 
 10 files changed, 1640 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index c3feccb99ff5..7e91d7f03afb 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += openat2
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index c67d32eeb668..e71df3d3e55d 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -925,7 +925,7 @@ static void test_share_mmap(char *banner, char *b_suffix)
  */
 static void test_share_open(char *banner, char *b_suffix)
 {
-   int fd, fd2;
+   int procfd, fd, fd2;
 
printf("%s %s %s\n", memfd_str, banner, b_suffix);
 
@@ -950,13 +950,16 @@ static void test_share_open(char *banner, char *b_suffix)
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
 
+   /* We cannot do a MAY_WRITE re-open of an O_RDONLY fd. */
+   procfd = mfd_assert_open(fd2, O_PATH, 0);
close(fd2);
-   fd2 = mfd_assert_open(fd, O_RDWR, 0);
+   fd2 = mfd_assert_open(procfd, O_WRONLY, 0);
 
mfd_assert_add_seals(fd2, F_SEAL_SEAL);
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
 
+   close(procfd);
close(fd2);
close(fd);
 }
diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index ..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index ..bd6ce6cfaa59
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
+TEST_GEN_PROGS := linkmode_test openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index ..5a9d6e36357f
--- /dev/null
+++ b/tools/testing/selftests/o

[PATCH v13 0/9] namei: openat2(2) path resolution restrictions

2019-09-30 Thread Aleksa Sarai
round the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.

In addition, two new flags are added that expand on the above ideas:

  * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
    resolution is allowed at all, including magic-links. Just as with
    LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
    fd for the symlink as long as no parent path had a symlink
    component.

  * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
    blocking attempts to move past the root, forces all such movements
    to be scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that chroot(2)
    is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to cross
    magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[6] when opening
    paths in a potentially malicious container. There is a long list of
    CVEs that could have been mitigated by having RESOLVE_THIS_ROOT
    (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
    CVE-2019-5736, just to name a few).
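
To illustrate what "scoped to the starting point" means, here is a purely
lexical userspace sketch in which ".." at the root stays at the root
instead of escaping it. scoped_normalise() is a name invented for this
example. It deliberately ignores symlinks and races, which are exactly the
hard parts that the in-kernel flag is meant to handle, so it should be
read as a picture of the intended semantics rather than a safe
replacement.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: resolve `path` lexically as if "/" were the scoping
 * root and write a normalised absolute path into `out`. A ".." at the root
 * stays at the root. Symlinks are NOT handled.
 * Returns 0 on success, -1 if `out` is too small.
 */
int scoped_normalise(const char *path, char *out, size_t outsz)
{
	size_t len = 0;

	if (outsz < 2)
		return -1;
	out[0] = '\0';

	while (*path) {
		size_t n = strcspn(path, "/");

		if (n == 0 || (n == 1 && path[0] == '.')) {
			/* Empty component or ".": nothing to do. */
		} else if (n == 2 && path[0] == '.' && path[1] == '.') {
			/* "..": drop the last component, clamping at the root. */
			while (len > 0 && out[--len] != '/')
				;
			out[len] = '\0';
		} else {
			if (len + 1 + n + 1 > outsz)
				return -1;
			out[len++] = '/';
			memcpy(out + len, path, n);
			len += n;
			out[len] = '\0';
		}
		path += n;
		if (*path == '/')
			path++;
	}
	if (len == 0) {
		out[0] = '/';
		out[1] = '\0';
	}
	return 0;
}
```

With this sketch "../../etc/passwd" normalises to "/etc/passwd" relative
to the scoping root, rather than climbing out of it.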

And further, several semantics of file descriptor "re-opening" are now
changed to prevent attacks like CVE-2019-5736 by restricting how
magic-links can be resolved (based on their mode). This required some
other changes to the semantics of the modes of O_PATH file descriptors'
associated /proc/self/fd magic-links. openat2(2) has the ability to
further restrict re-opening of its own O_PATH fds, so that users can
make even better use of this feature.

Finally, O_EMPTYPATH was added so that users can do /proc/self/fd-style
re-opening without depending on procfs. The new restricted semantics for
magic-links are applied here too.

In order to make all of the above more usable, I'm working on
libpathrs[7] which is a C-friendly library for safe path resolution. It
features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.

[1]: https://lwn.net/Articles/721443/
[2]: https://lore.kernel.org/patchwork/patch/784221/
[3]: https://lwn.net/Articles/619151/
[4]: https://lwn.net/Articles/603929/
[5]: https://lwn.net/Articles/723057/
[6]: https://github.com/cyphar/filepath-securejoin
[7]: https://github.com/openSUSE/libpathrs

Aleksa Sarai (9):
  namei: obey trailing magic-link DAC permissions
  procfs: switch magic-link modes to be more sane
  open: O_EMPTYPATH: procfs-less file descriptor re-opening
  namei: O_BENEATH-style path resolution flags
  namei: LOOKUP_IN_ROOT: chroot-like path resolution
  namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH}
  open: openat2(2) syscall
  selftests: add openat2(2) selftests
  Documentation: update path-lookup to mention trailing magic-links

 Documentation/filesystems/path-lookup.rst |  80 ++-
 arch/alpha/include/uapi/asm/fcntl.h   |   1 +
 arch/alpha/kernel/syscalls/syscall.tbl|   1 +
 arch/arm/tools/syscall.tbl|   1 +
 arch/arm64/include/asm/unistd.h   |   2 +-
 arch/arm64/include/asm/unistd32.h |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl |   1 +
 arch/parisc/include/uapi/asm/fcntl.h  |  39 +-
 arch/parisc/kernel/syscalls/syscall.tbl   |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl  |   1 +
 arch/s390/kernel/syscalls/syscall.tbl |   1 +
 arch/sh/kernel/syscalls/syscall.tbl   |   1 +
 arch/sparc/include/uapi/asm/fcntl.h   |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl|   1 +
 arch/x86/entry/syscalls/syscall_32.tbl|   1 +
 arch/x86/entry/syscalls/syscall_64.tbl|   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl   |   1 +
 fs/fcntl.c|   2 +-
 fs/internal.h |   1 +
 fs/namei.c| 286 +++--
 fs/open.c | 100 ++-
 fs/proc/base.c|  69 +-
 fs/proc/fd.c  |  45 +-
 fs/proc/internal.h|   2 +-
 fs/proc/namespaces.c  |   4 +-
 include/linux/fcntl.h |  21 +-
 include/linux/fs.h|   8 +-
 include/li

[PATCH v3 2/4] clone3: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.
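
As a rough illustration of why a fixed VER0 constant matters: once the
struct grows, sizeof(struct clone_args) grows with it, so a lower bound
based on sizeof() would start rejecting older userspace that still passes
the original 64-byte layout. The sketch below models only the size check;
the MODEL_* names and check_args_size() are invented for this example, and
4096 is just an assumed page size.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE      4096 /* assumption: stand-in for PAGE_SIZE */
#define MODEL_ARGS_SIZE_VER0 64   /* mirrors CLONE_ARGS_SIZE_VER0 above */

/* Hypothetical helper: accept any size from VER0 up to one page. */
int check_args_size(size_t usize)
{
	if (usize > MODEL_PAGE_SIZE)
		return -E2BIG;
	if (usize < MODEL_ARGS_SIZE_VER0)
		return -EINVAL;
	return 0;
}
```

Sizes above VER0 are not rejected here; whether the extra bytes are
acceptable is left to the copy helper.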

Signed-off-by: Aleksa Sarai 
---
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 34 +++---
 2 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..0945805982b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,6 +47,8 @@ struct clone_args {
__aligned_u64 tls;
 };
 
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..2ef529869c64 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2525,39 +2525,19 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 #ifdef __ARCH_WANT_SYS_CLONE3
 noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
  struct clone_args __user *uargs,
- size_t size)
+ size_t usize)
 {
+   int err;
struct clone_args args;
 
-   if (unlikely(size > PAGE_SIZE))
+   if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
-
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
return -EINVAL;
 
-   if (unlikely(!access_ok(uargs, size)))
-   return -EFAULT;
-
-   if (size > sizeof(struct clone_args)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uargs + sizeof(struct clone_args);
-   end = (void __user *)uargs + size;
-
-   for (; addr < end; addr++) {
-   if (get_user(val, addr))
-   return -EFAULT;
-   if (val)
-   return -E2BIG;
-   }
-
-   size = sizeof(struct clone_args);
-   }
-
-   if (copy_from_user(&args, uargs, size))
-   return -EFAULT;
+   err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+   if (err)
+   return err;
 
/*
 * Verify that higher 32bits of exit_signal are unset and that
-- 
2.23.0



[PATCH v3 3/4] sched_setattr: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

Signed-off-by: Aleksa Sarai 
---
 kernel/sched/core.c | 43 +++
 1 file changed, 7 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f64d0e..dd05a378631a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5106,9 +5106,6 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
u32 size;
int ret;
 
-   if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0))
-   return -EFAULT;
-
/* Zero the full structure, so that a short copy will be nice: */
memset(attr, 0, sizeof(*attr));
 
@@ -5116,45 +5113,19 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return ret;
 
-   /* Bail out on silly large: */
-   if (size > PAGE_SIZE)
-   goto err_size;
-
/* ABI compatibility quirk: */
if (!size)
size = SCHED_ATTR_SIZE_VER0;
-
-   if (size < SCHED_ATTR_SIZE_VER0)
+   if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
@@ -5354,7 +5325,7 @@ sched_attr_copy_to_user(struct sched_attr __user *uattr,
  * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
- * @usize: sizeof(attr) that user-space knows about, for forwards and backwards compatibility.
+ * @usize: sizeof(attr) for fwd/bwd comp.
  * @flags: for future extension.
  */
 SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-- 
2.23.0



[PATCH v3 0/4] lib: introduce copy_struct_from_user() helper

2019-09-30 Thread Aleksa Sarai
Patch changelog:
 v3:
  * Rename is_zeroed_user() to check_zeroed_user(). [Christian Brauner]
  * Various minor cleanups. [Christian Brauner]
  * Add tests for check_zeroed_user() and copy_struct_from_user() to
lib/test_user_copy.ko (and thus EXPORT_SYMBOL them both).
 v2: <https://lore.kernel.org/lkml/20190925230332.18690-1-cyp...@cyphar.com/>
 v1: <https://lore.kernel.org/lkml/20190925165915.8135-1-cyp...@cyphar.com/>

This series was split off from the openat2(2) syscall discussion[1].
However, the copy_struct_to_user() helper has been dropped, because
after some discussion it appears that there are no really obvious
semantics for how copy_struct_to_user() should work on mixed-vintages
(for instance, whether [2] is the correct semantics for all syscalls).

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
results in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[3]). This series
implements the helper and ports several syscalls to use it.
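
For readers who want to see the extension-checking semantics spelled out,
here is a userspace model of them. copy_struct_model() is a name made up
for this sketch and operates on ordinary memory, so unlike the kernel
helper it has no -EFAULT path; it is meant to show the rules, not the
actual implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the extensible-struct copy rules:
 *  - usize < ksize: copy what was given, zero-fill the rest of dst;
 *  - usize > ksize: the extra bytes must all be zero, else -E2BIG.
 */
int copy_struct_model(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	const unsigned char *tail = (const unsigned char *)src + size;

	/* Newer userspace: every byte the "kernel" does not know must be zero. */
	for (size_t i = 0; i < usize - size; i++)
		if (tail[i])
			return -E2BIG;

	/* Older userspace: fields it did not supply read as zero. */
	memset((unsigned char *)dst + size, 0, ksize - size);
	memcpy(dst, src, size);
	return 0;
}
```

The zero-fill is what lets an old caller behave as if it had passed the
default value for every field added since, and the zero-check is what lets
an old kernel refuse a request for a feature it does not know about.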

Some in-kernel selftests are included in this patch. More complete
self-tests for copy_struct_from_user() are included in the openat2()
patchset.

[1]: https://lore.kernel.org/lkml/20190904201933.10736-1-cyp...@cyphar.com/

[2]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[3]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Aleksa Sarai (4):
  lib: introduce copy_struct_from_user() helper
  clone3: switch to copy_struct_from_user()
  sched_setattr: switch to copy_struct_from_user()
  perf_event_open: switch to copy_struct_from_user()

 include/linux/bitops.h |   7 ++
 include/linux/uaccess.h|   4 ++
 include/uapi/linux/sched.h |   2 +
 kernel/events/core.c   |  47 +++--
 kernel/fork.c  |  34 ++
 kernel/sched/core.c|  43 ++--
 lib/strnlen_user.c |   8 +--
 lib/test_user_copy.c   | 133 +++--
 lib/usercopy.c | 123 ++
 9 files changed, 287 insertions(+), 114 deletions(-)

-- 
2.23.0



[PATCH v3 4/4] perf_event_open: switch to copy_struct_from_user()

2019-09-30 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Signed-off-by: Aleksa Sarai 
---
 kernel/events/core.c | 47 +---
 1 file changed, 9 insertions(+), 38 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4655adbbae10..3f0cb82e4fbc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10586,55 +10586,26 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
u32 size;
int ret;
 
-   if (!access_ok(uattr, PERF_ATTR_SIZE_VER0))
-   return -EFAULT;
-
-   /*
-* zero the full structure, so that a short copy will be nice.
-*/
+   /* Zero the full structure, so that a short copy will be nice. */
memset(attr, 0, sizeof(*attr));
 
ret = get_user(size, &uattr->size);
if (ret)
return ret;
 
-   if (size > PAGE_SIZE)   /* silly large */
-   goto err_size;
-
-   if (!size)  /* abi compat */
+   /* ABI compatibility quirk: */
+   if (!size)
size = PERF_ATTR_SIZE_VER0;
-
-   if (size < PERF_ATTR_SIZE_VER0)
+   if (size < PERF_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
attr->size = size;
 
if (attr->__reserved_1)
-- 
2.23.0



Re: [PATCH v2 1/4] lib: introduce copy_struct_from_user() helper

2019-09-26 Thread Aleksa Sarai
On 2019-09-26, Christian Brauner  wrote:
> On Thu, Sep 26, 2019 at 01:03:29AM +0200, Aleksa Sarai wrote:
> > +int is_zeroed_user(const void __user *from, size_t size)
> > +{
> > +   unsigned long val;
> > +   uintptr_t align = (uintptr_t) from % sizeof(unsigned long);
> > +
> > +   if (unlikely(!size))
> > +   return true;
> 
> You're returning "true" and another implicit boolean with (val == 0)
> down below but -EFAULT in other places. But that function is int
> is_zeroed_user() Would probably be good if you either switch to bool
> is_zeroed_user() as the name suggests or rename the function and have
> it return an int everywhere.

I just checked, and in C11 (and presumably in older specs) it is
guaranteed that "true" and "false" from <stdbool.h> have the values 1
and 0 (respectively) [§7.18]. So this is perfectly well-defined.

Personally, I think it's more readable to have:

  if (unlikely(size == 0))
return true;
  /* ... */
  return (val == 0);

compared to:

  if (unlikely(size == 0))
return 1;
  /* ... */
  return val ? 0 : 1;

But I will change the function name (to check_zeroed_user) to make it
clearer that it isn't returning a boolean and that you need to check for
negative returns.
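
To spell out what that means for callers, here is a userspace stand-in for
the tri-state convention (1 if zeroed, 0 if not, negative errno on
failure). check_zeroed() and reject_nonzero_tail() are names invented for
this sketch, and treating a NULL buffer as the "faulting pointer" case is
purely an assumption made so the error path can be shown outside the
kernel.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Userspace stand-in: 1 if all bytes are zero, 0 if not, negative errno on
 * failure. A NULL buffer plays the role of a faulting user pointer.
 */
int check_zeroed(const void *buf, size_t size)
{
	const unsigned char *p = buf;

	if (size == 0)
		return 1;
	if (!p)
		return -EFAULT;
	for (size_t i = 0; i < size; i++)
		if (p[i])
			return 0;
	return 1;
}

/* Callers must test for errors before treating the result as a truth value. */
int reject_nonzero_tail(const void *tail, size_t size)
{
	int ret = check_zeroed(tail, size);

	if (ret < 0)
		return ret;		/* propagate the failure */
	return ret ? 0 : -E2BIG;	/* zeroed is fine, otherwise too big */
}
```

Writing "if (check_zeroed(...))" alone would treat -EFAULT as "zeroed",
which is the mistake the rename is meant to discourage.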

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH v2 1/4] lib: introduce copy_struct_from_user() helper

2019-09-25 Thread Aleksa Sarai
(Damn, I forgot to add Kees to Cc.)

On 2019-09-26, Aleksa Sarai  wrote:
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> results in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases).
> 
> While this interface exists for communication in both directions, only
> one interface is straightforward to have reasonable semantics for
> (userspace passing a struct to the kernel). For kernel returns to
> userspace, what the correct semantics are (whether there should be an
> error if userspace is unaware of a new extension) is very
> syscall-dependent and thus probably cannot be unified between syscalls
> (a good example of this problem is [1]).
> 
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[2]). Future
> patches replace common uses of this pattern to make use of
> copy_struct_from_user().
> 
> Some in-kernel selftests are included, which ensure that alignment and
> various byte patterns are all handled identically to memchr_inv() usage.
> 
> [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
>  robustify sched_read_attr() ABI logic and code")
> 
> [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
>  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
>  always rejects differently-sized struct arguments.
> 
> Suggested-by: Rasmus Villemoes 
> Signed-off-by: Aleksa Sarai 
> ---
>  include/linux/bitops.h  |   7 +++
>  include/linux/uaccess.h |   4 ++
>  lib/strnlen_user.c  |   8 +--
>  lib/test_user_copy.c|  59 ++---
>  lib/usercopy.c  | 115 
>  5 files changed, 180 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> index cf074bce3eb3..a23f4c054768 100644
> --- a/include/linux/bitops.h
> +++ b/include/linux/bitops.h
> @@ -4,6 +4,13 @@
>  #include 
>  #include 
>  
> +/* Set bits in the first 'n' bytes when loaded from memory */
> +#ifdef __LITTLE_ENDIAN
> +#  define aligned_byte_mask(n) ((1ul << 8*(n))-1)
> +#else
> +#  define aligned_byte_mask(n) (~0xfful << (BITS_PER_LONG - 8 - 8*(n)))
> +#endif
> +
>  #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
>  #define BITS_TO_LONGS(nr)DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
>  
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 34a038563d97..824569e309e4 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -230,6 +230,10 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
>  
>  #endif   /* ARCH_HAS_NOCACHE_UACCESS */
>  
> +extern int is_zeroed_user(const void __user *from, size_t count);
> +extern int copy_struct_from_user(void *dst, size_t ksize,
> +  const void __user *src, size_t usize);
> +
>  /*
>   * probe_kernel_read(): safely attempt to read from a location
>   * @dst: pointer to the buffer that shall take the data
> diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
> index 7f2db3fe311f..39d588aaa8cd 100644
> --- a/lib/strnlen_user.c
> +++ b/lib/strnlen_user.c
> @@ -2,16 +2,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> -/* Set bits in the first 'n' bytes when loaded from memory */
> -#ifdef __LITTLE_ENDIAN
> -#  define aligned_byte_mask(n) ((1ul << 8*(n))-1)
> -#else
> -#  define aligned_byte_mask(n) (~0xfful << (BITS_PER_LONG - 8 - 8*(n)))
> -#endif
> -
>  /*
>   * Do a strnlen, return length of string *with* final '\0'.
>   * 'count' is the user-supplied count, while 'max' is the
> diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
> index 67bcd5dfd847..f7cde3845ccc 100644
> --- a/lib/test_user_copy.c
> +++ b/lib/test_user_copy.c
> @@ -31,14 +31,58 @@
>  # define TEST_U64
>  #endif
>  
> -#define test(condition, msg) \
> -({   \
> - int cond = (condition); \
> - if (cond)   \
> - pr_warn("%s\n", msg);   \
> - cond;   \
> +#define test(condition, msg, ...)\
> +({   \
> + int cond = (condition); \
> + if (cond)   

[PATCH v2 4/4] perf_event_open: switch to copy_struct_from_user()

2019-09-25 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Signed-off-by: Aleksa Sarai 
---
 kernel/events/core.c | 47 +---
 1 file changed, 9 insertions(+), 38 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..038ed126bc1b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10498,55 +10498,26 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
u32 size;
int ret;
 
-   if (!access_ok(uattr, PERF_ATTR_SIZE_VER0))
-   return -EFAULT;
-
-   /*
-* zero the full structure, so that a short copy will be nice.
-*/
+   /* Zero the full structure, so that a short copy will be nice. */
memset(attr, 0, sizeof(*attr));
 
ret = get_user(size, &uattr->size);
if (ret)
return ret;
 
-   if (size > PAGE_SIZE)   /* silly large */
-   goto err_size;
-
-   if (!size)  /* abi compat */
+   /* ABI compatibility quirk: */
+   if (!size)
size = PERF_ATTR_SIZE_VER0;
-
-   if (size < PERF_ATTR_SIZE_VER0)
+   if (size < PERF_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
attr->size = size;
 
if (attr->__reserved_1)
-- 
2.23.0



[PATCH v2 0/4] lib: introduce copy_struct_from_user() helper

2019-09-25 Thread Aleksa Sarai
Patch changelog:
 v2:
  * Switch to less buggy handling of alignment. [Linus Torvalds, Al Viro]
  * Move is_zeroed_user() to lib/usercopy.c. [kbuild test robot]
  * Move copy_struct_to_user() to lib/usercopy.c. [Christian Brauner]
  * Add self-tests for is_zeroed_user() to lib/test_user_copy.c.
[Christian Brauner]
 v1: <https://lore.kernel.org/lkml/20190925165915.8135-1-cyp...@cyphar.com/>

This series was split off from the openat2(2) syscall discussion[1].
However, the copy_struct_to_user() helper has been dropped, because
after some discussion it appears that there are no really obvious
semantics for how copy_struct_to_user() should work on mixed-vintages
(for instance, whether [2] is the correct semantics for all syscalls).

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
results in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[3]). This series
implements the helper and ports several syscalls to use it.

[1]: https://lore.kernel.org/lkml/20190904201933.10736-1-cyp...@cyphar.com/

[2]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[3]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Aleksa Sarai (4):
  lib: introduce copy_struct_from_user() helper
  clone3: switch to copy_struct_from_user()
  sched_setattr: switch to copy_struct_from_user()
  perf_event_open: switch to copy_struct_from_user()

 include/linux/bitops.h |   7 +++
 include/linux/uaccess.h|   4 ++
 include/uapi/linux/sched.h |   2 +
 kernel/events/core.c   |  47 +++
 kernel/fork.c  |  34 +++
 kernel/sched/core.c|  43 +++---
 lib/strnlen_user.c |   8 +--
 lib/test_user_copy.c   |  59 +--
 lib/usercopy.c | 115 +
 9 files changed, 205 insertions(+), 114 deletions(-)

-- 
2.23.0



[PATCH v2 2/4] clone3: switch to copy_struct_from_user()

2019-09-25 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.

Signed-off-by: Aleksa Sarai 
---
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 34 +++---
 2 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..0945805982b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,6 +47,8 @@ struct clone_args {
__aligned_u64 tls;
 };
 
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 541fd805fb88..a86e3841ee4e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2530,39 +2530,19 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 #ifdef __ARCH_WANT_SYS_CLONE3
 noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
  struct clone_args __user *uargs,
- size_t size)
+ size_t usize)
 {
+   int err;
struct clone_args args;
 
-   if (unlikely(size > PAGE_SIZE))
+   if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
-
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
return -EINVAL;
 
-   if (unlikely(!access_ok(uargs, size)))
-   return -EFAULT;
-
-   if (size > sizeof(struct clone_args)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uargs + sizeof(struct clone_args);
-   end = (void __user *)uargs + size;
-
-   for (; addr < end; addr++) {
-   if (get_user(val, addr))
-   return -EFAULT;
-   if (val)
-   return -E2BIG;
-   }
-
-   size = sizeof(struct clone_args);
-   }
-
-   if (copy_from_user(&args, uargs, size))
-   return -EFAULT;
+   err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+   if (err)
+   return err;
 
/*
 * Verify that higher 32bits of exit_signal are unset and that
-- 
2.23.0



[PATCH v2 1/4] lib: introduce copy_struct_from_user() helper

2019-09-25 Thread Aleksa Sarai
A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
results in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

While this interface exists for communication in both directions, only
one direction has straightforward, reasonable semantics
(userspace passing a struct to the kernel). For kernel returns to
userspace, what the correct semantics are (whether there should be an
error if userspace is unaware of a new extension) is very
syscall-dependent and thus probably cannot be unified between syscalls
(a good example of this problem is [1]).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[2]). Future
patches replace common uses of this pattern to make use of
copy_struct_from_user().
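
To make the extension-checking semantics concrete, here is a minimal userspace sketch of the behaviour described above. It is not the kernel implementation: copy_struct_sim() and is_zeroed() are hypothetical names, plain memcpy() stands in for the user copy, and the -E2BIG return for non-zero trailing bytes is taken from how the callers in this series handle that error.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if all n bytes at p are zero, 0 otherwise. */
static int is_zeroed(const unsigned char *p, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (p[i])
			return 0;
	return 1;
}

/*
 * Sketch of the struct-extension copy:
 *  - usize < ksize: old userspace; fields it does not know about are zeroed.
 *  - usize > ksize: new userspace; the bytes we do not understand must be 0.
 */
static int copy_struct_sim(void *dst, size_t ksize,
			   const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;
	size_t rest = (ksize > usize ? ksize : usize) - size;

	if (usize < ksize) {
		memset((unsigned char *)dst + size, 0, rest);
	} else if (usize > ksize) {
		if (!is_zeroed((const unsigned char *)src + size, rest))
			return -E2BIG;
	}
	memcpy(dst, src, size);
	return 0;
}
```

With this shape, a caller only has to pick its own minimum and maximum sizes; the mixed-vintage handling is the same everywhere.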

Some in-kernel selftests are included to ensure that alignment and
various byte patterns are all handled identically to memchr_inv() usage.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
---
 include/linux/bitops.h  |   7 +++
 include/linux/uaccess.h |   4 ++
 lib/strnlen_user.c  |   8 +--
 lib/test_user_copy.c|  59 ++---
 lib/usercopy.c  | 115 
 5 files changed, 180 insertions(+), 13 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index cf074bce3eb3..a23f4c054768 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+/* Set bits in the first 'n' bytes when loaded from memory */
+#ifdef __LITTLE_ENDIAN
+#  define aligned_byte_mask(n) ((1ul << 8*(n))-1)
+#else
+#  define aligned_byte_mask(n) (~0xfful << (BITS_PER_LONG - 8 - 8*(n)))
+#endif
+
 #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
 #define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
 
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 34a038563d97..824569e309e4 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -230,6 +230,10 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
 
 #endif /* ARCH_HAS_NOCACHE_UACCESS */
 
+extern int is_zeroed_user(const void __user *from, size_t count);
+extern int copy_struct_from_user(void *dst, size_t ksize,
+const void __user *src, size_t usize);
+
 /*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 7f2db3fe311f..39d588aaa8cd 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -2,16 +2,10 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-/* Set bits in the first 'n' bytes when loaded from memory */
-#ifdef __LITTLE_ENDIAN
-#  define aligned_byte_mask(n) ((1ul << 8*(n))-1)
-#else
-#  define aligned_byte_mask(n) (~0xfful << (BITS_PER_LONG - 8 - 8*(n)))
-#endif
-
 /*
  * Do a strnlen, return length of string *with* final '\0'.
  * 'count' is the user-supplied count, while 'max' is the
diff --git a/lib/test_user_copy.c b/lib/test_user_copy.c
index 67bcd5dfd847..f7cde3845ccc 100644
--- a/lib/test_user_copy.c
+++ b/lib/test_user_copy.c
@@ -31,14 +31,58 @@
 # define TEST_U64
 #endif
 
-#define test(condition, msg)   \
-({ \
-   int cond = (condition); \
-   if (cond)   \
-   pr_warn("%s\n", msg);   \
-   cond;   \
+#define test(condition, msg, ...)  \
+({ \
+   int cond = (condition); \
+   if (cond)   \
+   pr_warn("[%d] " msg "\n", __LINE__, ##__VA_ARGS__); \
+   cond;   \
 })
 
+static int test_is_zeroed_user(char *kmem, char __user *umem, size_t size)
+{
+   int ret = 0;
+   size_t start, end, i;
+   size_t zero_start = size / 4;
+   size_t zero_end = size - zero_start;
+
+   /*
+* We conduct a series of is_zeroed_user() tests on a block of memory
+* with the following byte-pattern (trying every possible [start,end]
+* pair):
+*
+*   [ 00 ff 00 ff ... 0

[PATCH v2 3/4] sched_setattr: switch to copy_struct_from_user()

2019-09-25 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.
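
As an illustration of the size handling that stays in sched_copy_attr() after this change, a userspace sketch might look like the following. The names are hypothetical, the value 48 for SCHED_ATTR_SIZE_VER0 and the 4096-byte page are assumptions, and returning -E2BIG is my reading of the err_size path.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_SCHED_ATTR_SIZE_VER0 48 /* assumption: size of the first sched_attr */
#define SKETCH_PAGE_SIZE 4096          /* assumption: 4K pages */

/*
 * Normalises the user-supplied size in place. A size of 0 is treated as
 * "VER0" (the ABI compatibility quirk); anything smaller than VER0 or
 * larger than a page is rejected before the struct copy is attempted.
 */
static int sched_attr_normalize_size(uint32_t *size)
{
	if (!*size)
		*size = SKETCH_SCHED_ATTR_SIZE_VER0;
	if (*size < SKETCH_SCHED_ATTR_SIZE_VER0 || *size > SKETCH_PAGE_SIZE)
		return -E2BIG;
	return 0;
}
```

Everything past this point (zero-filling short structs, rejecting non-zero trailing bytes) is what the helper takes over.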

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

Signed-off-by: Aleksa Sarai 
---
 kernel/sched/core.c | 43 +++
 1 file changed, 7 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df9f1fe5689b..cdb2f5e29b88 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4900,9 +4900,6 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
u32 size;
int ret;
 
-   if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0))
-   return -EFAULT;
-
/* Zero the full structure, so that a short copy will be nice: */
memset(attr, 0, sizeof(*attr));
 
@@ -4910,45 +4907,19 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return ret;
 
-   /* Bail out on silly large: */
-   if (size > PAGE_SIZE)
-   goto err_size;
-
/* ABI compatibility quirk: */
if (!size)
size = SCHED_ATTR_SIZE_VER0;
-
-   if (size < SCHED_ATTR_SIZE_VER0)
+   if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
@@ -5148,7 +5119,7 @@ sched_attr_copy_to_user(struct sched_attr __user *uattr,
  * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
- * @usize: sizeof(attr) that user-space knows about, for forwards and backwards compatibility.
+ * @usize: sizeof(attr) for fwd/bwd comp.
  * @flags: for future extension.
  */
 SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-- 
2.23.0



Re: [PATCH v1 1/4] lib: introduce copy_struct_from_user() helper

2019-09-25 Thread Aleksa Sarai
On 2019-09-25, Linus Torvalds  wrote:
> On Wed, Sep 25, 2019 at 10:00 AM Aleksa Sarai  wrote:
> >
> > +int is_zeroed_user(const void __user *from, size_t size)
> 
> I like how you've done this, but it's buggy and only works on 64-bit.
> 
> All the "u64" and "8" cases need to be "unsigned long" and
> "sizeof(unsigned long)".
> 
> Part of that requirement is:
> 
> > +   unsafe_get_user(val, (u64 __user *) from, err_fault);
> 
> This part works fine - although 64-bit accesses might be much more
> expensive and the win of unrolling might not be sufficient - but:
> 
> > +   if (align) {
> > +   /* @from is unaligned. */
> > +   val &= ~aligned_byte_mask(align);
> > +   align = 0;
> > +   }
> 
> This part fundamentally only works on 'unsigned long'.

Just to make sure I understand, would the following diff solve the
problem? If so, I'll apply it, and re-send in a few hours.

--8<--

 int is_zeroed_user(const void __user *from, size_t size)
 {
-   u64 val;
-   uintptr_t align = (uintptr_t) from % 8;
+   unsigned long val;
+   uintptr_t align = (uintptr_t) from % sizeof(unsigned long);
 
if (unlikely(!size))
return true;
@@ -150,8 +150,8 @@ int is_zeroed_user(const void __user *from, size_t size)
if (!user_access_begin(from, size))
return -EFAULT;
 
-   while (size >= 8) {
-   unsafe_get_user(val, (u64 __user *) from, err_fault);
+   while (size >= sizeof(unsigned long)) {
+   unsafe_get_user(val, (unsigned long __user *) from, err_fault);
if (align) {
/* @from is unaligned. */
val &= ~aligned_byte_mask(align);
@@ -159,12 +159,12 @@ int is_zeroed_user(const void __user *from, size_t size)
}
if (val)
goto done;
-   from += 8;
-   size -= 8;
+   from += sizeof(unsigned long);
+   size -= sizeof(unsigned long);
}
if (size) {
/* (@from + @size) is unaligned. */
-   unsafe_get_user(val, (u64 __user *) from, err_fault);
+   unsafe_get_user(val, (unsigned long __user *) from, err_fault);
val &= aligned_byte_mask(size);
}
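
To illustrate the word-at-a-time masking being discussed, here is a userspace sketch that works in units of unsigned long. It is not the code from the patch: is_zeroed_sim() and first_bytes_mask() are hypothetical names, the mask is built with memset() so the sketch does not depend on endianness, and the buffer is assumed to be a word-aligned backing array large enough that rounding the range out to whole words stays in bounds. Unlike the diff above, it applies both the head and the tail mask to the same word, so a range that starts and ends inside one word is also covered.

```c
#include <stddef.h>
#include <string.h>

/*
 * Endian-independent stand-in for aligned_byte_mask(n): all bits set in
 * the first n bytes of a word as laid out in memory. Only called with
 * n < sizeof(unsigned long).
 */
static unsigned long first_bytes_mask(size_t n)
{
	unsigned long mask = 0;

	memset(&mask, 0xff, n);
	return mask;
}

/* Returns 1 if buf[off, off + size) is all zero, 0 otherwise. */
static int is_zeroed_sim(const unsigned char *buf, size_t off, size_t size)
{
	const size_t w = sizeof(unsigned long);
	size_t align = off % w;

	if (size == 0)
		return 1;

	/* Round the start down to a word boundary, as the patch does. */
	off -= align;
	size += align;

	for (;;) {
		unsigned long val, mask = ~0ul;

		memcpy(&val, buf + off, w);
		if (align) {
			/* Ignore the bytes before the real start. */
			mask &= ~first_bytes_mask(align);
			align = 0;
		}
		if (size < w) {
			/* Ignore the bytes after the real end. */
			mask &= first_bytes_mask(size);
		}
		if (val & mask)
			return 0;
		if (size <= w)
			return 1;
		off += w;
		size -= w;
	}
}
```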

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




[PATCH v1 1/4] lib: introduce copy_struct_from_user() helper

2019-09-25 Thread Aleksa Sarai
A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
result in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

While this interface exists for communication in both directions, only
one direction has straightforward, reasonable semantics
(userspace passing a struct to the kernel). For kernel returns to
userspace, what the correct semantics are (whether there should be an
error if userspace is unaware of a new extension) is very
syscall-dependent and thus probably cannot be unified between syscalls
(a good example of this problem is [1]).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[2]). Future
patches replace common uses of this pattern to make use of
copy_struct_from_user().

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Suggested-by: Rasmus Villemoes 
Signed-off-by: Aleksa Sarai 
---
 include/linux/uaccess.h |  4 +++
 lib/Makefile|  2 +-
 lib/strnlen_user.c  | 52 +
 lib/struct_user.c   | 73 +
 4 files changed, 130 insertions(+), 1 deletion(-)
 create mode 100644 lib/struct_user.c

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 34a038563d97..824569e309e4 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -230,6 +230,10 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
 
 #endif /* ARCH_HAS_NOCACHE_UACCESS */
 
+extern int is_zeroed_user(const void __user *from, size_t count);
+extern int copy_struct_from_user(void *dst, size_t ksize,
+const void __user *src, size_t usize);
+
 /*
  * probe_kernel_read(): safely attempt to read from a location
  * @dst: pointer to the buffer that shall take the data
diff --git a/lib/Makefile b/lib/Makefile
index 29c02a924973..d86c71feaf0a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,7 +28,7 @@ endif
 CFLAGS_string.o := $(call cc-option, -fno-stack-protector)
 endif
 
-lib-y := ctype.o string.o vsprintf.o cmdline.o \
+lib-y := ctype.o string.o struct_user.o vsprintf.o cmdline.o \
 rbtree.o radix-tree.o timerqueue.o xarray.o \
 idr.o extable.o \
 sha1.o chacha.o irq_regs.o argv_split.o \
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 7f2db3fe311f..7eb665732954 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -123,3 +123,55 @@ long strnlen_user(const char __user *str, long count)
return 0;
 }
 EXPORT_SYMBOL(strnlen_user);
+
+/**
+ * is_zeroed_user: check if a userspace buffer is full of zeros
+ * @from:  Source address, in userspace.
+ * @size: Size of buffer.
+ *
+ * This is effectively shorthand for "memchr_inv(from, 0, size) == NULL" for
+ * userspace addresses. If there are non-zero bytes present then false is
+ * returned, otherwise true is returned.
+ *
+ * Returns:
+ *  * -EFAULT: access to userspace failed.
+ */
+int is_zeroed_user(const void __user *from, size_t size)
+{
+   u64 val;
+   uintptr_t align = (uintptr_t) from % 8;
+
+   if (unlikely(!size))
+   return true;
+
+   from -= align;
+   size += align;
+
+   if (!user_access_begin(from, size))
+   return -EFAULT;
+
+   while (size >= 8) {
+   unsafe_get_user(val, (u64 __user *) from, err_fault);
+   if (align) {
+   /* @from is unaligned. */
+   val &= ~aligned_byte_mask(align);
+   align = 0;
+   }
+   if (val)
+   goto done;
+   from += 8;
+   size -= 8;
+   }
+   if (size) {
+   /* (@from + @size) is unaligned. */
+   unsafe_get_user(val, (u64 __user *) from, err_fault);
+   val &= aligned_byte_mask(size);
+   }
+
+done:
+   user_access_end();
+   return (val == 0);
+err_fault:
+   user_access_end();
+   return -EFAULT;
+}
diff --git a/lib/struct_user.c b/lib/struct_user.c
new file mode 100644
index ..57d79eb53bfa
--- /dev/null
+++ b/lib/struct_user.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2019 SUSE LLC
+ * Copyright (C) 2019 Aleksa Sarai 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * copy_struct_from_user: copy a struct from userspace
+ * @dst:   Destination address, in kernel space. This buffer m

[PATCH v1 3/4] sched_setattr: switch to copy_struct_from_user()

2019-09-25 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.

[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

Signed-off-by: Aleksa Sarai 
---
 kernel/sched/core.c | 43 +++
 1 file changed, 7 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df9f1fe5689b..cdb2f5e29b88 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4900,9 +4900,6 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
u32 size;
int ret;
 
-   if (!access_ok(uattr, SCHED_ATTR_SIZE_VER0))
-   return -EFAULT;
-
/* Zero the full structure, so that a short copy will be nice: */
memset(attr, 0, sizeof(*attr));
 
@@ -4910,45 +4907,19 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return ret;
 
-   /* Bail out on silly large: */
-   if (size > PAGE_SIZE)
-   goto err_size;
-
/* ABI compatibility quirk: */
if (!size)
size = SCHED_ATTR_SIZE_VER0;
-
-   if (size < SCHED_ATTR_SIZE_VER0)
+   if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
@@ -5148,7 +5119,7 @@ sched_attr_copy_to_user(struct sched_attr __user *uattr,
  * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
- * @usize: sizeof(attr) that user-space knows about, for forwards and backwards compatibility.
+ * @usize: sizeof(attr) for fwd/bwd comp.
  * @flags: for future extension.
  */
 SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-- 
2.23.0



[PATCH v1 2/4] clone3: switch to copy_struct_from_user()

2019-09-25 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.

Signed-off-by: Aleksa Sarai 
---
 include/uapi/linux/sched.h |  2 ++
 kernel/fork.c  | 34 +++---
 2 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..0945805982b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,6 +47,8 @@ struct clone_args {
__aligned_u64 tls;
 };
 
+#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 541fd805fb88..a86e3841ee4e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2530,39 +2530,19 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 #ifdef __ARCH_WANT_SYS_CLONE3
 noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
  struct clone_args __user *uargs,
- size_t size)
+ size_t usize)
 {
+   int err;
struct clone_args args;
 
-   if (unlikely(size > PAGE_SIZE))
+   if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
-
-   if (unlikely(size < sizeof(struct clone_args)))
+   if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
return -EINVAL;
 
-   if (unlikely(!access_ok(uargs, size)))
-   return -EFAULT;
-
-   if (size > sizeof(struct clone_args)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uargs + sizeof(struct clone_args);
-   end = (void __user *)uargs + size;
-
-   for (; addr < end; addr++) {
-   if (get_user(val, addr))
-   return -EFAULT;
-   if (val)
-   return -E2BIG;
-   }
-
-   size = sizeof(struct clone_args);
-   }
-
-   if (copy_from_user(&args, uargs, size))
-   return -EFAULT;
+   err = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+   if (err)
+   return err;
 
/*
 * Verify that higher 32bits of exit_signal are unset and that
-- 
2.23.0



[PATCH v1 0/4] lib: introduce copy_struct_from_user() helper

2019-09-25 Thread Aleksa Sarai
This series was split off from the openat2(2) syscall discussion[1].
However, the copy_struct_to_user() helper has been dropped, because
after some discussion it appears that there are no really obvious
semantics for how copy_struct_to_user() should work on mixed-vintages
(for instance, whether [2] is the correct semantics for all syscalls).

A common pattern for syscall extensions is increasing the size of a
struct passed from userspace, such that the zero-value of the new fields
results in the old kernel behaviour (allowing for a mix of userspace and
kernel vintages to operate on one another in most cases).

Previously there was no common lib/ function that implemented
the necessary extension-checking semantics (and different syscalls
implemented them slightly differently or incompletely[3]). This series
implements the helper and ports several syscalls to use it.

[1]: https://lore.kernel.org/lkml/20190904201933.10736-1-cyp...@cyphar.com/

[2]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
 robustify sched_read_attr() ABI logic and code")

[3]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
 similar checks to copy_struct_from_user() while rt_sigprocmask(2)
 always rejects differently-sized struct arguments.

Aleksa Sarai (4):
  lib: introduce copy_struct_from_user() helper
  clone3: switch to copy_struct_from_user()
  sched_setattr: switch to copy_struct_from_user()
  perf_event_open: switch to copy_struct_from_user()

 include/linux/uaccess.h|  4 +++
 include/uapi/linux/sched.h |  2 ++
 kernel/events/core.c   | 47 +---
 kernel/fork.c  | 34 --
 kernel/sched/core.c| 43 --
 lib/Makefile   |  2 +-
 lib/strnlen_user.c | 52 +++
 lib/struct_user.c  | 73 ++
 8 files changed, 155 insertions(+), 102 deletions(-)
 create mode 100644 lib/struct_user.c

-- 
2.23.0



[PATCH v1 4/4] perf_event_open: switch to copy_struct_from_user()

2019-09-25 Thread Aleksa Sarai
The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Signed-off-by: Aleksa Sarai 
---
 kernel/events/core.c | 47 +---
 1 file changed, 9 insertions(+), 38 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c1151bae..038ed126bc1b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10498,55 +10498,26 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
u32 size;
int ret;
 
-   if (!access_ok(uattr, PERF_ATTR_SIZE_VER0))
-   return -EFAULT;
-
-   /*
-* zero the full structure, so that a short copy will be nice.
-*/
+   /* Zero the full structure, so that a short copy will be nice. */
memset(attr, 0, sizeof(*attr));
 
ret = get_user(size, &uattr->size);
if (ret)
return ret;
 
-   if (size > PAGE_SIZE)   /* silly large */
-   goto err_size;
-
-   if (!size)  /* abi compat */
+   /* ABI compatibility quirk: */
+   if (!size)
size = PERF_ATTR_SIZE_VER0;
-
-   if (size < PERF_ATTR_SIZE_VER0)
+   if (size < PERF_ATTR_SIZE_VER0 || size > PAGE_SIZE)
goto err_size;
 
-   /*
-* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
-   if (size > sizeof(*attr)) {
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-
-   addr = (void __user *)uattr + sizeof(*attr);
-   end  = (void __user *)uattr + size;
-
-   for (; addr < end; addr++) {
-   ret = get_user(val, addr);
-   if (ret)
-   return ret;
-   if (val)
-   goto err_size;
-   }
-   size = sizeof(*attr);
+   ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size);
+   if (ret) {
+   if (ret == -E2BIG)
+   goto err_size;
+   return ret;
}
 
-   ret = copy_from_user(attr, uattr, size);
-   if (ret)
-   return -EFAULT;
-
attr->size = size;
 
if (attr->__reserved_1)
-- 
2.23.0



Re: [PATCH v12 05/12] namei: obey trailing magic-link DAC permissions

2019-09-18 Thread Aleksa Sarai
On 2019-09-17, Jann Horn  wrote:
> On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai  wrote:
> > The ability for userspace to "re-open" file descriptors through
> > /proc/self/fd has been a very useful tool for all sorts of usecases
> > (container runtimes are one common example). However, the current
> > interface for doing this has resulted in some pretty subtle security
> > holes. Userspace can re-open a file descriptor with more permissions
> > than the original, which can result in cases such as /proc/$pid/exe
> > being re-opened O_RDWR at a later date even though (by definition)
> > /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> > the results can get even more confusing.
> [...]
> > Instead we have to restrict it in such a way that it doesn't break
> > (good) users but does block potential attackers. The solution applied in
> > this patch is to restrict *re-opening* (not resolution through)
> > magic-links by requiring that the mode of the link be obeyed. Normal
> > symlinks have modes of a+rwx but magic-links have other modes. These
> > magic-link modes were historically ignored during path resolution, but
> > they've now been re-purposed for more useful ends.
> 
> Thanks for dealing with this issue!
> 
> [...]
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 209c51a5226c..54d57dad0f91 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
> >
> > nd->path = *path;
> > nd->inode = nd->path.dentry->d_inode;
> > -   nd->flags |= LOOKUP_JUMPED;
> > +   nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> >  }
> [...]
> > +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> > + fmode_t *opath_mask)
> > +{
> > +   struct inode *inode = nd->link_inode;
> > +   fmode_t upgrade_mask = 0;
> > +
> > +   /* Was the trailing_symlink() a magic-link? */
> > +   if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> > +   return 0;
> > +
> > +   /*
> > +* Figure out the upgrade-mask of the link_inode. Since these aren't
> > +* strictly POSIX semantics we don't do an acl_permission_check() here,
> > +* so we only care that at least one bit is set for each upgrade-mode.
> > +*/
> > +   if (inode->i_mode & S_IRUGO)
> > +   upgrade_mask |= FMODE_PATH_READ;
> > +   if (inode->i_mode & S_IWUGO)
> > +   upgrade_mask |= FMODE_PATH_WRITE;
> > +   /* Restrict the O_PATH upgrade-mask of the caller. */
> > +   if (opath_mask)
> > +   *opath_mask &= upgrade_mask;
> > +   return may_open_magiclink(upgrade_mask, acc_mode);
> >  }
> 
> This looks racy because entries in the file descriptor table can be
> switched out as long as task->files->file_lock isn't held. Unless I'm
> missing something, something like the following (untested) would
> bypass this restriction:

You're absolutely right -- good catch!

> Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
> path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
> link_mode through from an out-argument of .proc_get_link()? Then
> proc_fd_link() could grab the proper mode in a race-free manner. And
> nd_jump_link() could stash the mode in the nameidata.

This indeed does appear to be the simplest solution -- I'm currently
testing a variation of the patch you proposed (with a few extra bits to
deal with nd_jump_link and proc_get_link being used elsewhere).

I'll include this change (assuming it fixes the flaw you found) in the
v13 series I'll send around next week. Thanks, Jann!

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()

2019-09-09 Thread Aleksa Sarai
On 2019-09-09, Mickaël Salaün  wrote:
> On 09/09/2019 12:12, James Morris wrote:
> > On Mon, 9 Sep 2019, Mickaël Salaün wrote:
> >> As I said, O_MAYEXEC should be ignored if it is not supported by the
> >> kernel, which perfectly fit with the current open(2) flags behavior, and
> >> should also behave the same with openat2(2).
> >
> > The problem here is programs which are already using the value of
> > O_MAYEXEC, which will break.  Hence, openat2(2).
> 
> Well, it still depends on the sysctl, which doesn't enforce anything by
> default, hence doesn't break existing behavior, and this unused flags
> could be fixed/removed or reported by sysadmins or distro developers.

Okay, but then this means that new programs which really want to enforce
O_MAYEXEC (and know that they really do want this feature) won't be able
to unless an admin has set the relevant sysctl. Not to mention that the
old-kernel fallback will not cover the "it's disabled by the sysctl"
case -- so the fallback handling would need to be:

    int fd = open("foo", O_MAYEXEC|O_RDONLY);
    if (!(fcntl(fd, F_GETFL) & O_MAYEXEC))
        fallback();
    if (!sysctl_feature_is_enabled)
        fallback();

However, there is still a race here -- if an administrator enables
O_MAYEXEC after the program gets the fd, then you still won't hit the
fallback (and you can't tell that O_MAYEXEC checks weren't done).

You could fix the issue with the sysctl by clearing O_MAYEXEC from
f_flags if the sysctl is disabled. You could also avoid some of the
problems with it being a global setting by making it a prctl(2) which
processes can opt-in to (though this has its own major problems).

Sorry, but I'm just really not a fan of this.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>




Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()

2019-09-09 Thread Aleksa Sarai
On 2019-09-09, Mickaël Salaün  wrote:
> On 06/09/2019 21:03, James Morris wrote:
> > On Fri, 6 Sep 2019, Jeff Layton wrote:
> >
> >> The fact that open and openat didn't vet unknown flags is really a bug.
> >>
> >> Too late to fix it now, of course, and as Aleksa points out, we've
> >> worked around that in the past. Now though, we have a new openat2
> >> syscall on the horizon. There's little need to continue these sorts of
> >> hacks.
> >>
> >> New open flags really have no place in the old syscalls, IMO.
> >
> > Agree here. It's unfortunate but a reality and Linus will reject any such
> > changes which break existing userspace.
> 
> Do you mean that adding new flags to open(2) is not possible?

It is possible, as long as there is no case where a program that works
today (and passes garbage to the unused bits in flags) breaks with the
change.

O_TMPFILE was okay because it's actually two flags (one is O_DIRECTORY)
and no working program does file IO to a directory (there are also some
other tricky things done there, I'll admit I don't fully understand it).

O_EMPTYPATH works because it's a no-op with non-empty path strings, and
empty path strings have always given an error (so no working program
does it today).

However, O_MAYEXEC could cause programs that pass garbage bits (and
worked previously) to get -EACCES.

> As I said, O_MAYEXEC should be ignored if it is not supported by the
> kernel, which perfectly fit with the current open(2) flags behavior, and
> should also behave the same with openat2(2).

NACK on having that behaviour with openat2(2). -EINVAL on unknown flags
is how all other syscalls work (any new syscall proposed today that
didn't do that would be rightly rejected); silently ignoring them is a
quirk of open(2) which unfortunately cannot be fixed. The fact that
*every new O_ flag needs to work around this problem* should be an
indication that this interface mis-design should not be allowed to
infect any more syscalls.

Note that this point is regardless of the fact that O_MAYEXEC is a
*security* flag -- if userspace wants to have a secure fallback on
old kernels (which is "the right thing" to do) they would have to do
more work than necessary. And programs that don't care don't have to do
anything special.

However with -EINVAL, the programs doing "the right thing" get an easy
-EINVAL check. And programs that don't care can just un-set O_MAYEXEC
and retry. You should be forced to deal with the case where a flag is
not supported -- and this is doubly true of security flags!
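
To spell out the two userspace strategies, here is a rough sketch. Everything in it is hypothetical: the flag bit value, open_with_flag(), and old_kernel_open() (a stand-in for a kernel that rejects unknown flags with -EINVAL, which is the behaviour being argued for, not what open(2) does today).

```c
#include <errno.h>

#define SKETCH_O_MAYEXEC 0x10000000UL /* hypothetical bit, illustration only */

typedef int (*open_fn)(const char *path, unsigned long flags);

enum missing_flag_policy {
	REQUIRE_FLAG, /* "the right thing": caller must use a secure fallback */
	DROP_FLAG,    /* "don't care": un-set the flag and retry */
};

static int open_with_flag(open_fn do_open, const char *path,
			  unsigned long flags, unsigned long new_flag,
			  enum missing_flag_policy policy)
{
	int fd = do_open(path, flags | new_flag);

	if (fd >= 0 || errno != EINVAL)
		return fd;
	if (policy == DROP_FLAG)
		return do_open(path, flags);
	/* errno is still EINVAL: the flag is unsupported, so fall back. */
	return -1;
}

/* Stand-in for an old kernel: any bit above the low byte is unknown. */
static int old_kernel_open(const char *path, unsigned long flags)
{
	(void)path;
	if (flags & ~0xffUL) {
		errno = EINVAL;
		return -1;
	}
	return 3;
}
```

Either way, the program is told that the flag was not honoured, which is not the case when unknown flags are silently ignored.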

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH v2 0/5] Add support for O_MAYEXEC

2019-09-06 Thread Aleksa Sarai
On 2019-09-06, Andy Lutomirski  wrote:
> > On Sep 6, 2019, at 12:07 PM, Steve Grubb  wrote:
> > 
> >> On Friday, September 6, 2019 2:57:00 PM EDT Florian Weimer wrote:
> >> * Steve Grubb:
> >>> Now with LD_AUDIT
> >>> $ LD_AUDIT=/home/sgrubb/test/openflags/strip-flags.so.0 strace ./test 2>&1 | grep passwd
> >>> openat(3, "passwd", O_RDONLY)   = 4
> >>> 
> >>> No O_CLOEXEC flag.
> >> 
> >> I think you need to explain in detail why you consider this a problem.
> > 
> > Because you can strip the O_MAYEXEC flag from being passed into the kernel.
> > Once you do that, you defeat the security mechanism because it never gets
> > invoked. The issue is that the only thing that knows _why_ something is
> > being opened is user space. With this mechanism, you can attempt to pass
> > this reason to the kernel so that it may see if policy permits this. But
> > you can just remove the flag.
> 
> I’m with Florian here. Once you are executing code in a process, you
> could just emulate some other unapproved code. This series is not
> intended to provide the kind of absolute protection you’re imagining.

I also agree, though I think that there is a separate argument to be
made that there are two possible problems with O_MAYEXEC (which might
not be really big concerns):

  * It's very footgun-prone if a descriptor is passed to code that
    didn't do the O_MAYEXEC open itself: the recipient needs to check
    f_flags to see if it contains O_MAYEXEC. Maybe there is an argument
    to be made that passing O_MAYEXEC descriptors around isn't a valid
    use-case, but in that case there should be some warnings about
    that.

  * There's effectively a TOCTOU flaw (even if you are sure O_MAYEXEC is
    in f_flags) -- if the filesystem is re-mounted noexec (or the file
    is chmod'ed a-x) after you've done the check, you won't get hit
    with an error when you go to use the file descriptor later.
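
For the first point, the check a recipient would have to do might look
roughly like this. It's only a sketch: the helper name is mine, and
whether O_MAYEXEC would actually be reported by F_GETFL is an assumption
on my part, so the flag is left as a parameter:

```c
#include <fcntl.h>

/*
 * Returns 1 if every bit of `flag` is set in the file status flags of
 * `fd`, 0 if not, and -1 (with errno set by fcntl) if the flags can't
 * be read.
 */
static int fd_has_status_flag(int fd, int flag)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl < 0)
		return -1;
	return (fl & flag) == flag;
}
```

Note that this does nothing for the second point: the answer only
reflects how the file was opened, not whether the mount or the file
mode has changed since.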

To fix both you'd need to do what you mention later:

> What the kernel *could* do is prevent mmapping a non-FMODE_EXEC file
> with PROT_EXEC, which would indeed have a real effect (in an iOS-like
> world, for example) but would break many, many things.

And I think this would be useful (with the two possible ways of
executing .text split into FMODE_EXEC and FMODE_MAP_EXEC, as mentioned
in a sister subthread), but would have to be opt-in for the obvious
reason you outlined. However, we could make it the default for
openat2(2) -- assuming we can agree on what the semantics of a
theoretical FMODE_EXEC should be.

And of course we'd need to do FMODE_UPGRADE_EXEC (which would need to
also permit fexecve(2) though probably not PROT_EXEC -- I don't think
you can mmap() an O_PATH descriptor).
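
To illustrate the split I have in mind, here is a toy userspace model.
It is not kernel code. The MODEL_* bits and both helpers are
hypothetical, and the semantics are just one possible reading of what
FMODE_EXEC and FMODE_MAP_EXEC could mean:

```c
#include <sys/mman.h>

/* Hypothetical mode bits for this model; not real kernel values. */
#define MODEL_FMODE_EXEC     0x1u /* descriptor may be given to fexecve(2) */
#define MODEL_FMODE_MAP_EXEC 0x2u /* descriptor may be mapped PROT_EXEC */

/* mmap() is only restricted when PROT_EXEC is requested. */
static int model_may_mmap(unsigned int fmode, int prot)
{
	if (prot & PROT_EXEC)
		return (fmode & MODEL_FMODE_MAP_EXEC) != 0;
	return 1;
}

/* fexecve() needs the separate exec bit. */
static int model_may_fexecve(unsigned int fmode)
{
	return (fmode & MODEL_FMODE_EXEC) != 0;
}
```

With two independent bits, a descriptor could be allowed for fexecve(2)
without being mappable as executable, and vice versa.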

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()

2019-09-06 Thread Aleksa Sarai
On 2019-09-07, Aleksa Sarai  wrote:
> On 2019-09-06, Jeff Layton  wrote:
> > On Sat, 2019-09-07 at 03:13 +1000, Aleksa Sarai wrote:
> > > On 2019-09-06, Jeff Layton  wrote:
> > > > On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
> > > > > On 06/09/2019 17:56, Florian Weimer wrote:
> > > > > > > Let's assume I want to add support for this to the glibc dynamic
> > > > > > > loader, while still being able to run on older kernels.
> > > > > > > 
> > > > > > > Is it safe to try the open call first, with O_MAYEXEC, and if that
> > > > > > > fails with EINVAL, try again without O_MAYEXEC?
> > > > > 
> > > > > > The kernel ignores unknown open(2) flags, so yes, it is safe to use
> > > > > > O_MAYEXEC even on older kernels.
> > > > > 
> > > > 
> > > > Well...maybe. What about existing programs that are sending down bogus
> > > > open flags? Once you turn this on, they may break...or provide a way to
> > > > circumvent the protections this gives.
> > > 
> > > It should be noted that this has been a valid concern for every new O_*
> > > flag introduced (and yet we still introduced new flags, despite the
> > > concern) -- though to be fair, O_TMPFILE actually does have a
> > > work-around with the O_DIRECTORY mask setup.
> > > 
> > > > The openat2() set adds O_EMPTYPATH -- though in fairness it's also
> > > > backwards compatible because empty path strings have always given ENOENT
> > > > (or EINVAL?) while O_EMPTYPATH is a no-op for non-empty strings.
> > > 
> > > > Maybe this should be a new flag that is only usable in the new openat2()
> > > > syscall that's still under discussion? That syscall will enforce that
> > > > all flags are recognized. You presumably wouldn't need the sysctl if you
> > > > went that route too.
> > > 
> > > I'm also interested in whether we could add an UPGRADE_NOEXEC flag to
> > > how->upgrade_mask for the openat2(2) patchset (I reserved a flag bit for
> > > it, since I'd heard about this work through the grape-vine).
> > > 
> > 
> > I rather like the idea of having openat2 fds be non-executable by
> > default, and having userland request it specifically via O_MAYEXEC (or
> > some similar openat2 flag) if it's needed. Then you could add an
> > UPGRADE_EXEC flag instead?
> > 
> > That seems like something reasonable to do with a brand new API, and
> > might be very helpful for preventing certain classes of attacks.
> 
> In that case, maybe openat2(2) should not allow any upgrades by
> default? The reason I pitched UPGRADE_NOEXEC is because
> UPGRADE_NO{READ,WRITE} are the existing @how->upgrade_mask flags.

Sorry, another issue is that there isn't a current way to really
restrict fexecve() permissions (from my [limited] understanding,
__FMODE_EXEC isn't the right thing to use) -- so we can't blanket block
exec through openat2() O_PATH descriptors and add UPGRADE_EXEC later.

We would have to implement FMODE_EXEC (and FMODE_MAP_EXEC as you
suggested) in order to implement FMODE_UPGRADE_EXEC before we could even
get a first version of openat2(2) in. Though, I do (a little
begrudgingly) agree that we should have a safe default if possible
(magical O_PATH reopening trickery is something that most people don't
know about and probably wouldn't want to happen if they did).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

