Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic

2019-10-26 Thread Mimi Zohar
On Fri, 2019-10-25 at 10:32 -0700, Lakshmi Ramasubramanian wrote:
> 
> On 10/25/2019 10:24 AM, Nayna Jain wrote:
> > 
> > On 10/24/19 10:20 AM, Lakshmi Ramasubramanian wrote:
> >> On 10/23/19 8:47 PM, Nayna Jain wrote:
> >>
> >> Hi Nayna,
> >>
> >>> +void process_buffer_measurement(const void *buf, int size,
> >>> +    const char *eventname, enum ima_hooks func,
> >>> +    int pcr)
> >>>   {
> >>>   int ret = 0;
> >>>   struct ima_template_entry *entry = NULL;
> >>
> >>> +    if (func) {

Let's comment this line.  Perhaps something like /*Unnecessary for
auxiliary buffer measurements */
> >>> +    security_task_getsecid(current, );
> >>> +    action = ima_get_action(NULL, current_cred(), secid, 0, func,
> >>> +    , );
> >>> +    if (!(action & IMA_MEASURE))
> >>> +    return;
> >>> +    }
> >>
> >> In your change set process_buffer_measurement is called with NONE for 
> >> the parameter func. So ima_get_action (the above if block) will not be 
> >> executed.
> >>
> >> Wouldn't it better to update ima_get_action (and related functions) to 
> >> handle the ima policy (func param)?
> > 
> > 
> > The idea is to use ima-buf template for the auxiliary measurement 
> > record. The auxiliary measurement record is an additional record to the 
> > one already created based on the existing policy. When func is passed as 
> > NONE, it represents it is an additional record. I am not sure what you 
> > mean by updating ima_get_action, it is already handling the ima policy.
> >
> 
> I was referring to using "func" in process_buffer_measurement to 
> determine ima action. In my opinion, process_buffer_measurement should 
> be generic.
> 
> ima_get_action() should instead determine the required ima action, 
> template, pcr, etc. based on "func" passed to it.

Nayna's original patch moved ima_get_action() into the caller, but
that resulted in code duplication in each of the callers.  This
solution differentiates between the initial, which requires calling
ima_get_action(), and auxiliary buffer measurement records.

Mimi 



Re: [PATCH v9 2/8] powerpc/ima: add support to initialize ima policy rules

2019-10-26 Thread Mimi Zohar
On Fri, 2019-10-25 at 12:02 -0500, Nayna Jain wrote:
> On 10/24/19 12:35 PM, Lakshmi Ramasubramanian wrote:
> > On 10/23/2019 8:47 PM, Nayna Jain wrote:
> >
> >> +/*
> >> + * The "secure_rules" are enabled only on "secureboot" enabled systems.
> >> + * These rules verify the file signatures against known good values.
> >> + * The "appraise_type=imasig|modsig" option allows the known good 
> >> signature
> >> + * to be stored as an xattr or as an appended signature.
> >> + *
> >> + * To avoid duplicate signature verification as much as possible, 
> >> the IMA
> >> + * policy rule for module appraisal is added only if 
> >> CONFIG_MODULE_SIG_FORCE
> >> + * is not enabled.
> >> + */
> >> +static const char *const secure_rules[] = {
> >> +    "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
> >> +#ifndef CONFIG_MODULE_SIG_FORCE
> >> +    "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
> >> +#endif
> >> +    NULL
> >> +};
> >
> > Is there any way to not use conditional compilation in the above array 
> > definition? Maybe define different functions to get "secure_rules" for 
> > when CONFIG_MODULE_SIG_FORCE is defined and when it is not defined.
> 
> How will you decide which function to be called ?

You could call "is_module_sig_enforced()".

Mimi



Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-26 Thread Segher Boessenkool
On Sat, Oct 26, 2019 at 08:48:27PM +0200, Thomas Gleixner wrote:
> On Sat, 26 Oct 2019, Christophe Leroy wrote:
> Let's look at the code:
> 
> __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
> {
> const struct vdso_data *vd = __arch_get_vdso_data();
> 
> if (likely(tv != NULL)) {
>   struct __kernel_timespec ts;
> 
> if (do_hres([CS_HRES_COARSE], CLOCK_REALTIME, ))
> return gettimeofday_fallback(tv, tz);
> 
> tv->tv_sec = ts.tv_sec;
> tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC;
> 
> IIRC PPC did some magic math tricks to avoid that. Could you just for the
> fun of it replace this division with
> 
>(u32)ts.tv_nsec >> 10;

On this particular CPU (the 885, right?) a division by 1000 is just 9
cycles.  On other CPUs it can be more, say 19 cycles like on the 750; not
cheap at all, but not hugely expensive either, comparatively.

(A 64/32->32 division is expensive on all 32-bit PowerPC: there is no
hardware help for it at all, so it's all done in software.)

Of course the compiler won't do a division by a constant with a division
instruction at all, so it's somewhat cheaper even, 5 or 6 cycles or so.

> One thing which might be worth to try as well is to mark all functions in
> that file as inline. The speedup by the do_hres() inlining was impressive
> on PPC.

The hand-optimised asm code will pretty likely win handsomely, whatever
you do.  Especially on cores like the 885 (no branch prediction, single
issue, small caches, etc.: every instruction counts).

Is there any reason to replace this hand-optimised code?  It was written
for exacty this reason?  These functions are critical and should be as
fast as possible.


Segher


[Bug 205327] kmemleak reports various leaks in "swapper/0"

2019-10-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205327

--- Comment #2 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 285661
  --> https://bugzilla.kernel.org/attachment.cgi?id=285661=edit
dmesg (kernel 5.4.0-rc4, PowerMac G5 7,3)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 205327] kmemleak reports various leaks in "swapper/0"

2019-10-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205327

--- Comment #1 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 285659
  --> https://bugzilla.kernel.org/attachment.cgi?id=285659=edit
5.4.0-rc4 kernel .config (PowerMac G5 7,3)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 205327] New: kmemleak reports various leaks in "swapper/0"

2019-10-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205327

Bug ID: 205327
   Summary: kmemleak reports various leaks in "swapper/0"
   Product: Platform Specific/Hardware
   Version: 2.5
Kernel Version: 5.4-rc4
  Hardware: PPC-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: PPC-64
  Assignee: platform_ppc...@kernel-bugs.osdl.org
  Reporter: erhar...@mailbox.org
Regression: No

Created attachment 285655
  --> https://bugzilla.kernel.org/attachment.cgi?id=285655=edit
kmemleak output

kmemleak reported various leaks in "swapper/0" while I was building some stuff
via distcc:

unreferenced object 0xc0027eea64e0 (size 32):
  comm "swapper/0", pid 1, jiffies 4294877673 (age 1568.254s)
  hex dump (first 32 bytes):
2f 5f 5f 6c 6f 63 61 6c 5f 66 69 78 75 70 73 5f  /__local_fixups_
5f 00 00 00 00 ca ad c8 00 00 00 00 00 00 00 00  _...
  backtrace:
[] .kvasprintf+0x64/0xe0
[] .kasprintf+0x2c/0x50
[<00808425>] .attach_node_and_children+0x2c/0x270
[] .of_unittest+0x1f4/0x379c
[] .do_one_initcall+0x7c/0x430
[] .kernel_init_freeable+0x2d0/0x3cc
[<5adf1660>] .kernel_init+0x1c/0x138
[<6adcb060>] .ret_from_kernel_thread+0x58/0x64
[...]

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH RESEND v14 6/6] Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED

2019-10-26 Thread Aleksa Sarai
Now that we have a special flag to signify magic-link jumps, mention it
within the path-lookup docs. And now that "magic link" is the correct
term for nd_jump_link()-style symlinks, clean up references to this type
of "symlink".

Signed-off-by: Aleksa Sarai 
---
 Documentation/filesystems/path-lookup.rst | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst 
b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..2c32795389bd 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -405,6 +405,10 @@ is requested.  Keeping a reference in the ``nameidata`` 
ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.
 
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1153,7 @@ so ``NULL`` is returned to indicate that the symlink can 
be released and
 the stack frame discarded.
 
 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
 
  $ ls -l /proc/self/fd/1
  lrwx-- 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1310,12 +1314,14 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason.  This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink.  In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``).  In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``).  In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate 
component.
+final component or, when creating, unlinking, or renaming, at the
+penultimate component. ``LOOKUP_MAGICLINK_JUMPED`` is set alongside
+``LOOKUP_JUMPED`` if a magic-link was traversed.
 
 Final-component flags
 ~
-- 
2.23.0



[PATCH RESEND v14 5/6] selftests: add openat2(2) selftests

2019-10-26 Thread Aleksa Sarai
Test all of the various openat2(2) flags. A small stress-test of a
symlink-rename attack is included to show that the protections against
".."-based attacks are sufficient.

The main things these self-tests are enforcing are:

  * The struct+usize ABI for openat2(2) and copy_struct_from_user() to
ensure that upgrades will be handled gracefully (in addition,
ensuring that misaligned structures are also handled correctly).

  * The -EINVAL checks for openat2(2) are all correctly handled to avoid
userspace passing unknown or conflicting flag sets (most
importantly, ensuring that invalid flag combinations are checked).

  * All of the RESOLVE_* semantics (including errno values) are
correctly handled with various combinations of paths and flags.

  * RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
attack that has been responsible for several CVEs (and likely will
be responsible for several more).

Signed-off-by: Aleksa Sarai 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/openat2/.gitignore|   1 +
 tools/testing/selftests/openat2/Makefile  |   8 +
 tools/testing/selftests/openat2/helpers.c | 109 
 tools/testing/selftests/openat2/helpers.h | 107 
 .../testing/selftests/openat2/openat2_test.c  | 297 ++
 .../selftests/openat2/rename_attack_test.c| 160 ++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++
 8 files changed, 1206 insertions(+)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4cdbae6f4e61..28996856ed5e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += openat2
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/openat2/.gitignore 
b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index ..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile 
b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index ..4b93b1417b86
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c 
b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index ..e9a6557ab16f
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai 
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "helpers.h"
+
+bool needs_openat2(const struct open_how *how)
+{
+   return how->resolve != 0;
+}
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size)
+{
+   int ret = syscall(__NR_openat2, dfd, path, how, size);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, struct open_how *how)
+{
+   return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, struct open_how *how)
+{
+   int ret = openat(dfd, path, how->flags, how->mode);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+   int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+ newdirfd, newpath, flags);
+   return ret >= 0 ? ret : -errno;
+}
+
+int touchat(int dfd, const char *path)
+{
+   int fd = openat(dfd, path, O_CREAT);
+   if (fd >= 0)
+   close(fd);
+   return fd;
+}
+
+char *fdreadlink(int fd)
+{
+   char *target, *tmp;
+
+   E_asprintf(, "/proc/self/fd/%d", fd);
+
+   target = malloc(PATH_MAX);
+   if (!target)
+   ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+   memset(target, 0, PATH_MAX);
+
+   E_readlink(tmp, target, PATH_MAX);
+   free(tmp);
+   return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+   char *fdpath, *dfdpath, *other;
+   bool cmp;
+
+  

[PATCH RESEND v14 4/6] open: introduce openat2(2) syscall

2019-10-26 Thread Aleksa Sarai
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
Thus it's necessary to (at the very least) add an additional flag.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
  struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64-bits wide to
allow for more O_ flags than currently permitted with openat(2).

  mode
The file mode for O_CREAT or O_TMPFILE.

Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  __padding
Must be set to all zeroes.

  resolve
Restrict path resolution (in contrast to O_* flags they affect all
path components). The current set of flags are as follows (at the
moment, all of the RESOLVE_ flags are implemented as just passing
the corresponding LOOKUP_ flag).

RESOLVE_NO_XDEV   => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH   => LOOKUP_BENEATH
RESOLVE_IN_ROOT   => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it).

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking automount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[4]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

[1]: https://lwn.net/Articles/588444/
[2]: 
https://lore.kernel.org/lkml/ca+55afyyxjl1lyxzebsf2ypriraj5ut1xkndsunrbqgvjzu...@mail.gmail.com
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[4]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyp...@cyphar.com/

Suggested-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 CREDITS |   4 +-
 arch/alpha/kernel/syscalls/syscall.tbl  |   1 +
 arch/arm/tools/syscall.tbl  |   1 +
 arch/arm64/include/asm/unistd.h |   2 +-
 arch/arm64/include/asm/unistd32.h   |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl   |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl   |   1 +
 

[PATCH RESEND v14 3/6] namei: permit ".." resolution with LOOKUP_{IN_ROOT, BENEATH}

2019-10-26 Thread Aleksa Sarai
This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely[*].

The need for this patch (and the original no-".." restriction) is
explained by observing there is a fairly easy-to-exploit race condition
with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH
if ".." is allowed) where a rename(2) of a path can be used to "skip
over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
for (;;)
  renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
for (;;)
  openat2(dirb, "b/c/../../etc/shadow",
  { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

This is a first-pass implementation, where -EAGAIN will be returned if
any rename or mount occurs anywhere on the host (in any namespace). This
will result in spurious errors, but there isn't a satisfactory
alternative (other than denying ".." altogether).

One other possible alternative (which previous versions of this patch
used) would be to check with path_is_under() if there was a racing
rename or mount (after re-taking the relevant seqlocks). While this does
work, it results in possible O(n*m) behaviour if there are many renames
or mounts occuring *anywhere on the system*.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check (as
with the alternative solution for "..") for magic-links after they
are resolved. However this seems unlikely to be a feature that
people *really* need -- it can be added later if it turns out a lot
of people want it.

Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 43 +--
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 9d00b138f54c..0d6857ac4e5b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode*inode; /* path.dentry.d_inode */
unsigned intflags;
-   unsignedseq, m_seq;
+   unsignedseq, m_seq, r_seq;
int last_type;
unsigneddepth;
int total_link_count;
@@ -1769,22 +1769,35 @@ static inline int handle_dots(struct nameidata *nd, int 
type)
if (type == LAST_DOTDOT) {
int error = 0;
 
-   /*
-* Scoped-lookup flags resolving ".." is not currently safe --
-* races can cause our parent to have moved outside of the root
-* and us to skip over it.
-*/
-   if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
-   return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
return error;
}
-   if (nd->flags & LOOKUP_RCU) {
-   return follow_dotdot_rcu(nd);
-   } else
-   return follow_dotdot(nd);
+   if (nd->flags & LOOKUP_RCU)
+   error = follow_dotdot_rcu(nd);
+   else
+   error = follow_dotdot(nd);
+   if (error)
+   return error;
+
+   if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+   bool m_retry = read_seqretry(_lock, nd->m_seq);
+   bool r_retry = read_seqretry(_lock, nd->r_seq);
+
+   /*
+* If there was a racing rename or mount along our
+* path, then we can't be sure that ".." hasn't jumped
+* above nd->root (and so userspace should 

[PATCH RESEND v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution

2019-10-26 Thread Aleksa Sarai
/* Background. */
Container runtimes or other administrative management processes will
often interact with root filesystems while in the host mount namespace,
because the cost of doing a chroot(2) on every operation is too
prohibitive (especially in Go, which cannot safely use vfork). However,
a malicious program can trick the management process into doing
operations on files outside of the root filesystem through careful
crafting of symlinks.

Most programs that need this feature have attempted to make this process
safe, by doing all of the path resolution in userspace (with symlinks
being scoped to the root of the malicious root filesystem).
Unfortunately, this method is prone to foot-guns and usually such
implementations have subtle security bugs.

Thus, what userspace needs is a way to resolve a path as though it were
in a chroot(2) -- with all absolute symlinks being resolved relative to
the dirfd root (and ".." components being stuck under the dirfd root[1])
It is much simpler and more straight-forward to provide this
functionality in-kernel (because it can be done far more cheaply and
correctly).

More classical applications that also have this problem (which have
their own potentially buggy userspace path sanitisation code) include
web servers, archive extraction tools, network file servers, and so on.

[1]: At the moment, ".." and magic-link jumping are disallowed for the
 same reason it is disabled for LOOKUP_BENEATH -- currently it is
 not safe to allow it. Future patches may enable it unconditionally
 once we have resolved the possible races (for "..") and semantics
 (for magic-link jumping).

/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).

There is a slight change in behaviour regarding pathnames -- if the
pathname is absolute then the dirfd is still used as the root of
resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
foot-guns, at the cost of a minor API inconsistency).

Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 5 +
 include/linux/namei.h | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 54fdbdfbeb94..9d00b138f54c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2277,6 +2277,11 @@ static const char *path_init(struct nameidata *nd, 
unsigned flags)
 
nd->m_seq = read_seqbegin(_lock);
 
+   /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
+   if (flags & LOOKUP_IN_ROOT)
+   while (*s == '/')
+   s++;
+
/* Figure out the starting path and root (if needed). */
if (*s == '/') {
error = nd_jump_root(nd);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 35a1bf074ff1..c7a010570d05 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -47,8 +47,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS   0x08 /* No /proc/$pid/fd/ "symlink" 
crossing. */
 #define LOOKUP_NO_SYMLINKS 0x10 /* No symlink crossing *at all*.
Implies LOOKUP_NO_MAGICLINKS. */
+#define LOOKUP_IN_ROOT 0x20 /* Treat dirfd as %current->fs->root. 
*/
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
-#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
+#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
 
 extern int path_pts(struct path *path);
 
-- 
2.23.0



[PATCH RESEND v14 1/6] namei: O_BENEATH-style resolution restriction flags

2019-10-26 Thread Aleksa Sarai
/* Background. */
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications throughout the
history of Unix. While some improvements have been made (such as
O_NOFOLLOW or AT_NO_AUTOMOUNT), most of the new APIs have involved
restricting the final component in a path's lookup -- completely
ignoring the rest of the path components.

Userspace programs have thus been forced to implement their own (and
usually subtly broken) methods of ensuring path components they don't
wish to resolve are detected. Aside from making it more complicated to
write such programs safely, there are some things which are effectively
impossible to safely handle correctly (for instance, magic-links cannot
reliably be differentiated from symlinks on filesystems that may contain
magic-links). It would be a massive improvement to provide these types
of resolution restriction features to userspace.

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

/* Userspace API. */
These flags will be exposed to userspace through openat2(2).

/* Semantics. */
The following new LOOKUP flags are defined, and (in contrast to most
other LOOKUP flags, they apply to all components of path resolution as
opposed to only the final component).

  LOOKUP_NO_XDEV
Disallow mount-point crossing (both *down* into one, or *up* from
one). Both bind-mounts and cross-filesystem mounts are blocked by
this flag. The naming is based on "find -xdev" as well as -EXDEV
(though find(1) doesn't walk upwards, the semantics seem obvious).

  LOOKUP_NO_MAGICLINKS
Disallows "magic-link" resolution ("symlinks" that are resolved
through nd_jump_link()). This is important to provide explicitly,
because magic-links can be used to trick privileged programs into
bypassing normal path resolution restriction mechanisms (such as
mount namespaces). Such programs likely want to permit ordinary
symlink resolution, but don't wish to permit magic-links.

It should be noted that prior to this, there was no way for
userspace to unambiguously verify whether a symlink was a
magic-link.

  LOOKUP_NO_SYMLINKS
Disallows resolution through symlinks (includes magic-links).

  LOOKUP_BENEATH
Disallow "escapes" from the starting point of the filesystem tree
during resolution (you must stay "beneath" the starting point at all
times). Currently this is done by disallowing ".." and absolute
paths (either in the given path or found during symlink resolution)
entirely, as well as all magic-link jumping.

The wholesale banning of ".." is because it is currently not safe to
allow ".." resolution (races can cause the path to be moved outside
of the root -- this is conceptually similar to historical chroot(2)
escape attacks). Future patches in this series will address this,
and will re-enable ".." resolution once it is safe. With those
patches, ".." resolution will only be allowed if it remains in the
root throughout resolution (such as "a/../b" not
"a/../../outside/b").

The banning of magic-link jumping is done because it is not clear
whether semantically they should be allowed -- while some
magic-links are safe there are many that can cause escapes (and once
a resolution is outside of the root, O_BENEATH will no longer detect
it). Future patches may re-enable magic-link jumping when such jumps
would remain inside the root.

The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violates their requirement, while the others all return -EXDEV.

[1]: https://lore.kernel.org/lkml/20170429220414.gt29...@zeniv.linux.org.uk/
[2]: 
https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysd...@google.com/
[3]: 
https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysd...@google.com/
[4]: https://lwn.net/Articles/723057/

Cc: Christian Brauner 
Suggested-by: David Drysdale 
Suggested-by: Al Viro 
Suggested-by: Andy Lutomirski 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 135 +++---
 include/linux/namei.h |  11 
 2 files changed, 125 insertions(+), 21 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..54fdbdfbeb94 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -504,6 +504,9 @@ struct nameidata {
struct filename *name;
struct nameidata *saved;
struct inode*link_inode;
+   struct {
+   bool same_mnt;
+   } last_magiclink;
unsignedroot_seq;
int dfd;
 } __randomize_layout;
@@ -641,6 +644,14 @@ static bool legitimize_links(struct 

[PATCH RESEND v14 0/6] open: introduce openat2(2) syscall

2019-10-26 Thread Aleksa Sarai
This patchset is being developed here:
  

Patch changelog:
 v14: []
  * The magic-link changes (and O_EMPTYPATH) have been dropped from this series
-- they will be developed and sent separately. The main reason is that we
need to restrict things other than open(2) (examples include truncate(2) as
well as mount(MS_BIND)). This will require a fair amount of extra work, and
there's no point stalling openat2(2) for that work to be completed.
  * Minor rework of 'struct open_how':
* To avoid future headaches, make it a non-const argument.
* Expand ->flags and ->resolve to 64-bit fields to allow for more flag
  extensions without needing to add separate fields too early. This
  requires adding a bit of explicit padding (32 bits) to avoid userspace
  putting garbage in the alignment padding -- this can be repurposed for
  future extensions.
* upgrade_mask is dropped (and will be a separate field when we add it
  again in the future) to avoid userspace foot-guns.
* Expand -EINVAL checks in build_open_flags(). Rather than silently
  ignoring silly flag combinations (such as O_TMPFILE|O_PATH or
  O_PATH|), give an -EINVAL. All of the silent ignore semantics
  were added to open(2) because we couldn't return -EINVAL -- but we can
  now!
  * open(2) and openat(2) clean up their flags before passing them to
build_open_flags(), so all mixed flags will continue to work. There is
one exception which is (O_PATH|O_TMPFILE) -- this is no longer
permitted (as far as I can tell this appears to be a bug, and there are
no userspace users that I've hit after running this code for a few
days). If it turns out that userspace does depend on (O_PATH|O_TMPFILE)
working, we can only disallow it for openat2(2).
  * Don't zero out nd->root in complete_walk() for RCU-walk if we're doing a
scoped-lookup (this prevents a needless REF-walk retry).
  * Attempt all tests on kernels that don't have openat2(2), rather than just
skipping everything.
 v13: 
 v12: 
 v11: 
  
 v10: 
 v09: 
 v08: 
 v07: 
 v06: 
 v05: 
 v04: 
 v03: 
 v02: 
 v01: 

For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:

  * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not

Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-26 Thread Thomas Gleixner
On Sat, 26 Oct 2019, Christophe Leroy wrote:
> Le 26/10/2019 à 17:53, Thomas Gleixner a écrit :
> > > > > > gettimeofday:    vdso: 750 nsec/call
> > > > > > 
> > > > > > gettimeofday:    vdso: 1533 nsec/call
> > > > 
> > > > Small improvement (3%) with the proposed change:
> > > > 
> > > > gettimeofday:    vdso: 1485 nsec/call
> > > 
> > > By inlining do_hres() I get the following:
> > > 
> > > gettimeofday:vdso: 1072 nsec/call
> > 
> > What's the effect for clock_gettime()?
> > 
> > gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra
> > division, but clock_gettime() should be 1:1 comparable.
> > 
> 
> Original PPC asm:
> clock-gettime-realtime:vdso: 928 nsec/call
> 
> My original RFC:
> clock-gettime-realtime:vdso: 1570 nsec/call
> 
> With your suggested vdso_calc_delta():
> clock-gettime-realtime:vdso: 1512 nsec/call
> 
> With your vdso_calc_delta() and inlined do_hres():
> clock-gettime-realtime:vdso: 1302 nsec/call

That does not make any sense at all.

gettimeofday() is basically the same as clock_gettime(REALTIME) and does
an extra division. So I would expect it to be slower. 

Let's look at the code:

__cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
{
const struct vdso_data *vd = __arch_get_vdso_data();

if (likely(tv != NULL)) {
struct __kernel_timespec ts;

if (do_hres([CS_HRES_COARSE], CLOCK_REALTIME, ))
return gettimeofday_fallback(tv, tz);

tv->tv_sec = ts.tv_sec;
tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC;

IIRC PPC did some magic math tricks to avoid that. Could you just for the
fun of it replace this division with

   (u32)ts.tv_nsec >> 10;

That's obviously incorrect, but it would tell us how heavy the division
is. If that brings us close we might do something special for
gettimeofday().

OTOH, last time I checked clock_gettime() was by far more used than
gettimeofday() but that obviously depends on the use cases.

}
...
}

and

__cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
{
const struct vdso_data *vd = __arch_get_vdso_data();
u32 msk;

/* Check for negative values or invalid clocks */
if (unlikely((u32) clock >= MAX_CLOCKS))
return -1;

/*
 * Convert the clockid to a bitmask and use it to check which
 * clocks are handled in the VDSO directly.
 */
msk = 1U << clock;
if (likely(msk & VDSO_HRES)) {
return do_hres([CS_HRES_COARSE], clock, ts);

So this is the extra code which is executed for clock_gettime(REAL) which
is pure logic and certainly not as heavyweight as the division in the
gettimeofday() code path.

}

static __maybe_unused int
__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
{
int ret = __cvdso_clock_gettime_common(clock, ts);

if (unlikely(ret))
return clock_gettime_fallback(clock, ts);
return 0;
}

One thing which might be worth to try as well is to mark all functions in
that file as inline. The speedup by the do_hres() inlining was impressive
on PPC.

Thanks,

tglx


Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-26 Thread Christophe Leroy




Le 26/10/2019 à 17:53, Thomas Gleixner a écrit :

On Tue, 22 Oct 2019, Christophe Leroy wrote:

Le 22/10/2019 à 11:01, Christophe Leroy a écrit :

Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :

On Mon, 21 Oct 2019, Christophe Leroy wrote:


This is a tentative to switch powerpc/32 vdso to generic C
implementation.
It will likely not work on 64 bits or even build properly at the moment.

powerpc is a bit special for VDSO as well as system calls in the
way that it requires setting CR SO bit which cannot be done in C.
Therefore, entry/exit and fallback needs to be performed in ASM.

To allow that, C fallbacks just return -1 and the ASM entry point
performs the system call when the C function returns -1.

The performance is rather disappoiting. That's most likely all
calculation in the C implementation are based on 64 bits math and
converted to 32 bits at the very end. I guess C implementation should
use 32 bits math like the assembly VDSO does as of today.



gettimeofday:    vdso: 750 nsec/call

gettimeofday:    vdso: 1533 nsec/call


Small improvement (3%) with the proposed change:

gettimeofday:    vdso: 1485 nsec/call


By inlining do_hres() I get the following:

gettimeofday:vdso: 1072 nsec/call


What's the effect for clock_gettime()?

gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra
division, but clock_gettime() should be 1:1 comparable.



Original PPC asm:
clock-gettime-realtime:vdso: 928 nsec/call

My original RFC:
clock-gettime-realtime:vdso: 1570 nsec/call

With your suggested vdso_calc_delta():
clock-gettime-realtime:vdso: 1512 nsec/call

With your vdso_calc_delta() and inlined do_hres():
clock-gettime-realtime:vdso: 1302 nsec/call

Christophe


Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-26 Thread Christophe Leroy




Le 26/10/2019 à 15:55, Andy Lutomirski a écrit :

On Tue, Oct 22, 2019 at 6:56 AM Christophe Leroy
 wrote:




The performance is rather disappoiting. That's most likely all
calculation in the C implementation are based on 64 bits math and
converted to 32 bits at the very end. I guess C implementation should
use 32 bits math like the assembly VDSO does as of today.



gettimeofday:vdso: 750 nsec/call

gettimeofday:vdso: 1533 nsec/call


Small improvement (3%) with the proposed change:

gettimeofday:vdso: 1485 nsec/call


By inlining do_hres() I get the following:

gettimeofday:vdso: 1072 nsec/call



A perf report might be informative.



Not sure there is much to learn from perf report:

With the original RFC:

51.83%  test_vdso  [vdso] [.] do_hres
24.86%  test_vdso  [vdso] [.] __c_kernel_gettimeofday
 7.33%  test_vdso  [vdso] [.] __kernel_gettimeofday
 5.77%  test_vdso  test_vdso  [.] main
 1.55%  test_vdso  [kernel.kallsyms]  [k] copy_page
 0.67%  test_vdso  libc-2.23.so   [.] _dl_addr
 0.55%  test_vdso  ld-2.23.so [.] do_lookup_x

With vdso_calc_delta() optimised as suggested by Thomas + inlined do_hres():

68.00%  test_vdso  [vdso] [.] __c_kernel_gettimeofday
12.59%  test_vdso  [vdso] [.] __kernel_gettimeofday
 6.22%  test_vdso  test_vdso  [.] main
 2.07%  test_vdso  [kernel.kallsyms]  [k] copy_page
 1.04%  test_vdso  ld-2.23.so [.] _dl_relocate_object
 0.89%  test_vdso  ld-2.23.so [.] do_lookup_x

I've tried 'perf annotate', but I have not found how to tell perf to use 
vdso32.so.dbg file for annotate [vdso].


Test app:

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static int (*gettimeofday_vdso)(struct timeval *tv, struct timezone *tz);

int main(int argc, char **argv)
{
void *handle = dlopen("linux-vdso32.so.1", RTLD_NOW | RTLD_GLOBAL);
struct timeval tv;
struct timezone tz;
int i;

(void)dlerror();

gettimeofday_vdso = dlsym(handle, "__kernel_gettimeofday");
if (dlerror())
gettimeofday_vdso = NULL;

for (i = 0; i < 10; i++)
gettimeofday_vdso(, );
}


Christophe


Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-26 Thread Thomas Gleixner
On Tue, 22 Oct 2019, Christophe Leroy wrote:
> Le 22/10/2019 à 11:01, Christophe Leroy a écrit :
> > Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :
> > > On Mon, 21 Oct 2019, Christophe Leroy wrote:
> > > 
> > > > This is a tentative to switch powerpc/32 vdso to generic C
> > > > implementation.
> > > > It will likely not work on 64 bits or even build properly at the moment.
> > > > 
> > > > powerpc is a bit special for VDSO as well as system calls in the
> > > > way that it requires setting CR SO bit which cannot be done in C.
> > > > Therefore, entry/exit and fallback needs to be performed in ASM.
> > > > 
> > > > To allow that, C fallbacks just return -1 and the ASM entry point
> > > > performs the system call when the C function returns -1.
> > > > 
> > > > The performance is rather disappoiting. That's most likely all
> > > > calculation in the C implementation are based on 64 bits math and
> > > > converted to 32 bits at the very end. I guess C implementation should
> > > > use 32 bits math like the assembly VDSO does as of today.
> > > 
> > > > gettimeofday:    vdso: 750 nsec/call
> > > > 
> > > > gettimeofday:    vdso: 1533 nsec/call
> > 
> > Small improvement (3%) with the proposed change:
> > 
> > gettimeofday:    vdso: 1485 nsec/call
> 
> By inlining do_hres() I get the following:
> 
> gettimeofday:vdso: 1072 nsec/call

What's the effect for clock_gettime()?

gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra
division, but clock_gettime() should be 1:1 comparable.

Thanks,

tglx

Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-26 Thread Andy Lutomirski
On Tue, Oct 22, 2019 at 6:56 AM Christophe Leroy
 wrote:
>
>
> >>> The performance is rather disappoiting. That's most likely all
> >>> calculation in the C implementation are based on 64 bits math and
> >>> converted to 32 bits at the very end. I guess C implementation should
> >>> use 32 bits math like the assembly VDSO does as of today.
> >>
> >>> gettimeofday:vdso: 750 nsec/call
> >>>
> >>> gettimeofday:vdso: 1533 nsec/call
> >
> > Small improvement (3%) with the proposed change:
> >
> > gettimeofday:vdso: 1485 nsec/call
>
> By inlining do_hres() I get the following:
>
> gettimeofday:vdso: 1072 nsec/call
>

A perf report might be informative.


[PATCH AUTOSEL 4.19 40/59] net/ibmvnic: Fix EOI when running in XIVE mode.

2019-10-26 Thread Sasha Levin
From: Cédric Le Goater 

[ Upstream commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab ]

pSeries machines on POWER9 processors can run with the XICS (legacy)
interrupt mode or with the XIVE exploitation interrupt mode. These
interrupt contollers have different interfaces for interrupt
management : XICS uses hcalls and XIVE loads and stores on a page.
H_EOI being a XICS interface the enable_scrq_irq() routine can fail
when the machine runs in XIVE mode.

Fix that by calling the EOI handler of the interrupt chip.

Fixes: f23e0643cd0b ("ibmvnic: Clear pending interrupt after device reset")
Signed-off-by: Cédric Le Goater 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index aa067a7a72d40..8fa14736449bc 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2731,12 +2731,10 @@ static int enable_scrq_irq(struct ibmvnic_adapter 
*adapter,
 
if (adapter->resetting &&
adapter->reset_reason == VNIC_RESET_MOBILITY) {
-   u64 val = (0xff00) | scrq->hw_irq;
+   struct irq_desc *desc = irq_to_desc(scrq->irq);
+   struct irq_chip *chip = irq_desc_get_chip(desc);
 
-   rc = plpar_hcall_norets(H_EOI, val);
-   if (rc)
-   dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
-   val, rc);
+   chip->irq_eoi(>irq_data);
}
 
rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
-- 
2.20.1



[PATCH AUTOSEL 5.3 58/99] net/ibmvnic: Fix EOI when running in XIVE mode.

2019-10-26 Thread Sasha Levin
From: Cédric Le Goater 

[ Upstream commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab ]

pSeries machines on POWER9 processors can run with the XICS (legacy)
interrupt mode or with the XIVE exploitation interrupt mode. These
interrupt contollers have different interfaces for interrupt
management : XICS uses hcalls and XIVE loads and stores on a page.
H_EOI being a XICS interface the enable_scrq_irq() routine can fail
when the machine runs in XIVE mode.

Fix that by calling the EOI handler of the interrupt chip.

Fixes: f23e0643cd0b ("ibmvnic: Clear pending interrupt after device reset")
Signed-off-by: Cédric Le Goater 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 5cb55ea671e35..964e7d62f4b13 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2772,12 +2772,10 @@ static int enable_scrq_irq(struct ibmvnic_adapter 
*adapter,
 
if (adapter->resetting &&
adapter->reset_reason == VNIC_RESET_MOBILITY) {
-   u64 val = (0xff00) | scrq->hw_irq;
+   struct irq_desc *desc = irq_to_desc(scrq->irq);
+   struct irq_chip *chip = irq_desc_get_chip(desc);
 
-   rc = plpar_hcall_norets(H_EOI, val);
-   if (rc)
-   dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
-   val, rc);
+   chip->irq_eoi(>irq_data);
}
 
rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
-- 
2.20.1



Re: loop nesting in alignment exception and machine check

2019-10-26 Thread Christophe Leroy

Hi,

Le 26/10/2019 à 09:23, Wangshaobo (bobo) a écrit :

Hi,

I encountered a problem about a loop nesting occurred in manufacturing 
the alignment exception in machine check, trigger background is :


problem:

machine checkout or critical interrupt ->…->kbox_write[for recording 
last words] -> memcpy(irremap_addr, src,size):_GLOBAL(memcpy)…


when we enter memcpy,a command ‘dcbz r11,r6’ will cause a alignment 
exception, in this situation,r11 loads the ioremap address,which leads 
to the alignment exception,


You can't use memcpy() on something else than memory.

For an ioremapped area, you have to use memcpy_toio()

Christophe



then the command can not be process successfully,as we still in machine 
check.at the end ,it triggers a new irq machine check in irq handler 
function,a loop nesting begins.


analysis:

We have analysed a lot,but it still can not come to a reasonable 
description,in common,the alignment triggered in machine check context 
can still be collected into the Kbox


after alignment exception be handled by handler function, but how does 
the machine checkout can be triggered in the handler fucntion for any 
causes? We print relevant registers


as follow when first enter machine check and alignment exception handler 
function:


  MSR:0x2  MSR:0x0

  SRR1:0x2  SRR1:0x21002

  But the manual says SRR1 should be set to MSR(0x2),why that 
happened ?


  Then a branch in handler function copy the SRR1 to MSR,this 
enble MSR[ME] and MSR[CE],system collapses.


Conclusion:

  1)  why the alignment exception can not be handled in machine 
check ?


  2)  besides memcpy,any other function can cause the alignment 
exception ?


We still recurrent it, the line as follows:

  Cpu dead lock->watch log->trigger 
fiq->kbox_write->memcpy->alignment exception->print last words.


  but for those problems as below,what the kbox printed is empty.

--kbox restart:[   10.147594]

kbox verify fs magic fail

kbox mem mabye destroyed, format it

kbox: load OK

lock-task: major[249] minor[0]

-start show_destroyed_kbox_mem_head

:      

0010:      

0020:      

0030:      

0040:      

0050:      

0060:      

0070:      

0080:      

0090:      



loop nesting in alignment exception and machine check

2019-10-26 Thread Wangshaobo (bobo)
Hi,
I encountered a problem about a loop nesting occurred in manufacturing the 
alignment exception in machine check, trigger background is :

problem:
machine checkout or critical interrupt ->...->kbox_write[for recording last 
words] -> memcpy(irremap_addr, src,size):_GLOBAL(memcpy)...
when we enter memcpy,a command 'dcbz r11,r6' will cause a alignment exception, 
in this situation,r11 loads the ioremap address,which leads to the alignment 
exception,
then the command can not be process successfully,as we still in machine 
check.at the end ,it triggers a new irq machine check in irq handler function,a 
loop nesting begins.

analysis:
We have analysed a lot,but it still can not come to a reasonable description,in 
common,the alignment triggered in machine check context can still be collected 
into the Kbox
after alignment exception be handled by handler function, but how does the 
machine checkout can be triggered in the handler fucntion for any causes? We 
print relevant registers
as follow when first enter machine check and alignment exception handler 
function:
 MSR:0x2  MSR:0x0
 SRR1:0x2  SRR1:0x21002
 But the manual says SRR1 should be set to MSR(0x2),why that happened ?
 [cid:image001.jpg@01D58C0D.E496CFD0]
 Then a branch in handler function copy the SRR1 to MSR,this enble 
MSR[ME] and MSR[CE],system collapses.

Conclusion:
 1)  why the alignment exception can not be handled in machine check ?
 2)  besides memcpy,any other function can cause the alignment 
exception ?

We still recurrent it, the line as follows:
 Cpu dead lock->watch log->trigger fiq->kbox_write->memcpy->alignment 
exception->print last words.
 but for those problems as below,what the kbox printed is empty.
--kbox restart:[   10.147594]
kbox verify fs magic fail
kbox mem mabye destroyed, format it
kbox: load OK
lock-task: major[249] minor[0]
-start show_destroyed_kbox_mem_head
:      
0010:      
0020:      
0030:      
0040:      
0050:      
0060:      
0070:      
0080:      
0090:      



[PATCH] ide: remove set but not used variable 'hwif'

2019-10-26 Thread Wang Hai
Fix the following gcc warning:

drivers/ide/pmac.c: In function pmac_ide_setup_device:
drivers/ide/pmac.c:1027:14: warning: variable hwif set but not used
[-Wunused-but-set-variable]

Fixes: d58b0c39e32f ("powerpc/macio: Rework hotplug media bay support")
Reported-by: Hulk Robot 
Signed-off-by: Wang Hai 
---
 drivers/ide/pmac.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/ide/pmac.c b/drivers/ide/pmac.c
index b5647e3..ea0b064 100644
--- a/drivers/ide/pmac.c
+++ b/drivers/ide/pmac.c
@@ -1019,7 +1019,6 @@ static int pmac_ide_setup_device(pmac_ide_hwif_t *pmif, 
struct ide_hw *hw)
struct device_node *np = pmif->node;
const int *bidp;
struct ide_host *host;
-   ide_hwif_t *hwif;
struct ide_hw *hws[] = { hw };
struct ide_port_info d = pmac_port_info;
int rc;
@@ -1075,7 +1074,7 @@ static int pmac_ide_setup_device(pmac_ide_hwif_t *pmif, 
struct ide_hw *hw)
rc = -ENOMEM;
goto bail;
}
-   hwif = pmif->hwif = host->ports[0];
+   pmif->hwif = host->ports[0];
 
if (on_media_bay(pmif)) {
/* Fixup bus ID for media bay */
-- 
1.8.3.1



Re: [PATCH 10/10] ocxl: Conditionally bind SCM devices to the generic OCXL driver

2019-10-26 Thread Christoph Hellwig
On Fri, Oct 25, 2019 at 03:47:05PM +1100, Alastair D'Silva wrote:
> From: Alastair D'Silva 
> 
> This patch allows the user to bind OpenCAPI SCM devices to the generic OCXL
> driver.

This completely misses any explanation of why you'd want that.  The
what is rather obvious from the patch.

> +config OCXL_SCM_GENERIC
> + bool "Treat OpenCAPI Storage Class Memory as a generic OpenCAPI device"
> + default n

n is the default default.


Re: Onboard SD card doesn't work anymore after the 'mmc-v5.4-2' updates

2019-10-26 Thread Christoph Hellwig
On Fri, Oct 25, 2019 at 05:28:45PM -0500, Rob Herring wrote:
> This doesn't work?:
> 
> if (IS_ENABLED(CONFIG_PPC) || of_dma_is_coherent(dev->of_node))
> value |= ESDHC_DMA_SNOOP;
> else
> value &= ~ESDHC_DMA_SNOOP;
> 
> While I said use the compatibles, using the kconfig symbol is easier
> than sorting out which compatibles are PPC SoCs. Though if that's
> already done elsewhere in the driver, you could set a flag and use
> that here. I'd be surprised if this was the only difference between
> ARM and PPC SoCs for this block.

I think the right thing is a Kconfig variable that the architectures
selects which says if OF is by default coherent or incoherent, and then
use that in of_dma_is_coherent.