[PATCH v17 13/13] Documentation: path-lookup: include new LOOKUP flags

2019-11-16 Thread Aleksa Sarai
Now that we have new LOOKUP flags, we should document them in the
relevant path-walking documentation. And now that we've settled on a
common name for nd_jump_link() style symlinks ("magic links"), use that
term where magic-link semantics are described.

Signed-off-by: Aleksa Sarai 
---
 Documentation/filesystems/path-lookup.rst | 68 +--
 1 file changed, 62 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst 
b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..a3216979298b 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -13,6 +13,7 @@ It has subsequently been updated to reflect changes in the 
kernel
 including:
 
 - per-directory parallel name lookup.
+- ``openat2()`` resolution restriction flags.
 
 Introduction to pathname lookup
 ===
@@ -235,6 +236,13 @@ renamed.  If ``d_lookup`` finds that a rename happened 
while it
 unsuccessfully scanned a chain in the hash table, it simply tries
 again.
 
+``rename_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``rename_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
 inode->i_rwsem
 ~~
 
@@ -348,6 +356,13 @@ any changes to any mount points while stepping up.  This 
locking is
 needed to stabilize the link to the mounted-on dentry, which the
 refcount on the mount itself doesn't ensure.
 
+``mount_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``mount_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
 RCU
 ~~~
 
@@ -405,6 +420,10 @@ is requested.  Keeping a reference in the ``nameidata`` 
ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.
 
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1168,7 @@ so ``NULL`` is returned to indicate that the symlink can 
be released and
 the stack frame discarded.
 
 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
 
  $ ls -l /proc/self/fd/1
  lrwx-- 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1286,7 +1305,9 @@ A few flags
 A suitable way to wrap up this tour of pathname walking is to list
 the various flags that can be stored in the ``nameidata`` to guide the
 lookup process.  Many of these are only meaningful on the final
-component, others reflect the current state of the pathname lookup.
+component, others reflect the current state of the pathname lookup, and some
+apply restrictions to all path components encountered in the path lookup.
+
 And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
 the others.  If this is not set, an empty pathname causes an error
 very early on.  If it is set, empty pathnames are not considered to be
@@ -1310,13 +1331,48 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason.  This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink.  In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``).  In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``).  In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
 final component or, when creating, unlinking, or renaming, at the penultimate 
component.
 
+Resolution-restriction flags
+
+
+In order to allow userspace to protect itself against certain race conditions
+and attack scenarios involving changing path components, a series of 

[PATCH v17 12/13] selftests: add openat2(2) selftests

2019-11-16 Thread Aleksa Sarai
Test all of the various openat2(2) flags. A small stress-test of a
symlink-rename attack is included to show that the protections against
".."-based attacks are sufficient.

The main things these self-tests are enforcing are:

  * The struct+usize ABI for openat2(2) and copy_struct_from_user() to
ensure that upgrades will be handled gracefully (in addition,
ensuring that misaligned structures are also handled correctly).

  * The -EINVAL checks for openat2(2) are all correctly handled to avoid
userspace passing unknown or conflicting flag sets (most
importantly, ensuring that invalid flag combinations are checked).

  * All of the RESOLVE_* semantics (including errno values) are
correctly handled with various combinations of paths and flags.

  * RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
attack that has been responsible for several CVEs (and likely will
be responsible for several more).

Cc: Shuah Khan 
Signed-off-by: Aleksa Sarai 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/openat2/.gitignore|   1 +
 tools/testing/selftests/openat2/Makefile  |   8 +
 tools/testing/selftests/openat2/helpers.c | 109 
 tools/testing/selftests/openat2/helpers.h | 107 
 .../testing/selftests/openat2/openat2_test.c  | 316 +++
 .../selftests/openat2/rename_attack_test.c| 160 ++
 .../testing/selftests/openat2/resolve_test.c  | 523 ++
 8 files changed, 1225 insertions(+)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/openat2_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 4cdbae6f4e61..28996856ed5e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += openat2
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/openat2/.gitignore 
b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index ..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile 
b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index ..4b93b1417b86
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c 
b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index ..e9a6557ab16f
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai 
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "helpers.h"
+
+bool needs_openat2(const struct open_how *how)
+{
+   return how->resolve != 0;
+}
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size)
+{
+   int ret = syscall(__NR_openat2, dfd, path, how, size);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, struct open_how *how)
+{
+   return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, struct open_how *how)
+{
+   int ret = openat(dfd, path, how->flags, how->mode);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+   int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+ newdirfd, newpath, flags);
+   return ret >= 0 ? ret : -errno;
+}
+
+int touchat(int dfd, const char *path)
+{
+   int fd = openat(dfd, path, O_CREAT);
+   if (fd >= 0)
+   close(fd);
+   return fd;
+}
+
+char *fdreadlink(int fd)
+{
+   char *target, *tmp;
+
+   E_asprintf(, "/proc/self/fd/%d", fd);
+
+   target = malloc(PATH_MAX);
+   if (!target)
+   ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+   memset(target, 0, PATH_MAX);
+
+   E_readlink(tmp, target, PATH_MAX);
+   free(tmp);
+   return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+   char *fdpath, *dfdpath, *other;
+ 

[PATCH v17 11/13] open: introduce openat2(2) syscall

2019-11-16 Thread Aleksa Sarai
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
  struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64-bits wide to
allow for more O_ flags than currently permitted with openat(2).

  mode
The file mode for O_CREAT or O_TMPFILE.

Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  __padding
Must be set to all zeroes.

  resolve
Restrict path resolution (in contrast to O_* flags they affect all
path components). The current set of flags are as follows (at the
moment, all of the RESOLVE_ flags are implemented as just passing
the corresponding LOOKUP_ flag).

RESOLVE_NO_XDEV   => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH   => LOOKUP_BENEATH
RESOLVE_IN_ROOT   => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it).

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

[1]: https://lwn.net/Articles/588444/
[2]: 
https://lore.kernel.org/lkml/ca+55afyyxjl1lyxzebsf2ypriraj5ut1xkndsunrbqgvjzu...@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyp...@cyphar.com/

Suggested-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 

[PATCH v17 10/13] namei: LOOKUP_{IN_ROOT, BENEATH}: permit limited ".." resolution

2019-11-16 Thread Aleksa Sarai
Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
(in the case of LOOKUP_BENEATH the resolution will still fail if ".."
resolution would resolve a path outside of the root -- while
LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
still disallowed entirely[*].

As Jann explains[1,2], the need for this patch (and the original no-".."
restriction) is explained by observing there is a fairly easy-to-exploit
race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
used to "skip over" nd->root and thus escape to the filesystem above
nd->root.

  thread1 [attacker]:
for (;;)
  renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
for (;;)
  openat2(dirb, "b/c/../../etc/shadow",
  { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

The primary reason for deferring to userspace with -EAGAIN is that an
in-kernel retry loop (or doing a path_is_under() check after re-taking
the relevant seqlocks) can become unreasonably expensive on machines
with lots of VFS activity (nfsd can cause lots of rename_lock updates).
Thus it should be up to userspace how many times they wish to retry the
lookup -- the selftests for this attack indicate that there is a ~35%
chance of the lookup succeeding on the first try even with an attacker
thrashing rename_lock.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check for
magic-links after they are resolved. However this seems unlikely to
be a feature that people *really* need -- it can be added later if
it turns out a lot of people want it.

[1]: 
https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=omm0vhwuld6eulmne3r6xa...@mail.gmail.com/
[2]: 
https://lore.kernel.org/lkml/cag48ez30wjhbsro2hoc_dr7v91m+hnfzbp5ogrmzaxbaorv...@mail.gmail.com/

Cc: Christian Brauner 
Suggested-by: Jann Horn 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 37 -
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a6196786db13..88c706e459f6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode*inode; /* path.dentry.d_inode */
unsigned intflags;
-   unsignedseq, m_seq;
+   unsignedseq, m_seq, r_seq;
int last_type;
unsigneddepth;
int total_link_count;
@@ -1781,22 +1781,30 @@ static inline int handle_dots(struct nameidata *nd, int 
type)
if (type == LAST_DOTDOT) {
int error = 0;
 
-   /*
-* Scoped-lookup flags resolving ".." is not currently safe --
-* races can cause our parent to have moved outside of the root
-* and us to skip over it.
-*/
-   if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
-   return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
return error;
}
-   if (nd->flags & LOOKUP_RCU) {
-   return follow_dotdot_rcu(nd);
-   } else
-   return follow_dotdot(nd);
+   if (nd->flags & LOOKUP_RCU)
+   error = follow_dotdot_rcu(nd);
+   else
+   error = follow_dotdot(nd);
+   if (error)
+   return error;
+
+   if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+   /*
+* If there was a racing rename or mount along our
+* path, then we can't be sure that ".." hasn't jumped
+* above nd->root 

[PATCH v17 09/13] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution

2019-11-16 Thread Aleksa Sarai
/* Background. */
Container runtimes or other administrative management processes will
often interact with root filesystems while in the host mount namespace,
because the cost of doing a chroot(2) on every operation is too
prohibitive (especially in Go, which cannot safely use vfork). However,
a malicious program can trick the management process into doing
operations on files outside of the root filesystem through careful
crafting of symlinks.

Most programs that need this feature have attempted to make this process
safe, by doing all of the path resolution in userspace (with symlinks
being scoped to the root of the malicious root filesystem).
Unfortunately, this method is prone to foot-guns and usually such
implementations have subtle security bugs.

Thus, what userspace needs is a way to resolve a path as though it were
in a chroot(2) -- with all absolute symlinks being resolved relative to
the dirfd root (and ".." components being stuck under the dirfd root).
It is much simpler and more straight-forward to provide this
functionality in-kernel (because it can be done far more cheaply and
correctly).

More classical applications that also have this problem (which have
their own potentially buggy userspace path sanitisation code) include
web servers, archive extraction tools, network file servers, and so on.

/* Userspace API. */
LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_IN_ROOT applies to all components of the path.

With LOOKUP_IN_ROOT, any path component which attempts to cross the
starting point of the pathname lookup (the dirfd passed to openat) will
remain at the starting point. Thus, all absolute paths and symlinks will
be scoped within the starting point.

There is a slight change in behaviour regarding pathnames -- if the
pathname is absolute then the dirfd is still used as the root of
resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
foot-guns, at the cost of a minor API inconsistency).

As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
will be lifted in a future patch, but requires more work to ensure that
permitting ".." is done safely.

Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.

/* Testing. */
LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.

[1]: 
https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=omm0vhwuld6eulmne3r6xa...@mail.gmail.com/

Cc: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 10 +++---
 include/linux/namei.h |  3 ++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 3f7bb22c375d..a6196786db13 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2289,13 +2289,16 @@ static const char *path_init(struct nameidata *nd, 
unsigned flags)
 
nd->m_seq = read_seqbegin(_lock);
 
-   /* Figure out the starting path and root (if needed). */
-   if (*s == '/') {
+   /* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
+   if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
return s;
-   } else if (nd->dfd == AT_FDCWD) {
+   }
+
+   /* Relative pathname -- get the starting-point it is relative to. */
+   if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;
@@ -2335,6 +2338,7 @@ static const char *path_init(struct nameidata *nd, 
unsigned flags)
}
fdput(f);
}
+
/* For scoped-lookups we need to set the root to the dirfd as well. */
if (flags & LOOKUP_IS_SCOPED) {
nd->root = nd->path;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 93dad378f1e8..93151e47ec47 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -45,8 +45,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_NO_MAGICLINKS   0x02 /* No nd_jump_link() crossing. */
 #define LOOKUP_NO_XDEV 0x04 /* No mountpoint crossing. */
 #define LOOKUP_BENEATH 0x08 /* No escaping from starting point. */
+#define LOOKUP_IN_ROOT 0x10 /* Treat dirfd as fs root. */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
-#define LOOKUP_IS_SCOPED LOOKUP_BENEATH
+#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | 

[PATCH v17 08/13] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution

2019-11-16 Thread Aleksa Sarai
/* Background. */
There are many circumstances when userspace wants to resolve a path and
ensure that it doesn't go outside of a particular root directory during
resolution. Obvious examples include archive extraction tools, as well as
other security-conscious userspace programs. FreeBSD spun out O_BENEATH
from their Capsicum project[1,2], so it also seems reasonable to
implement similar functionality for Linux.

This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
variation on David Drysdale's O_BENEATH patchset[4], which in turn was
based on the Capsicum project[5]).

/* Userspace API. */
LOOKUP_BENEATH will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_BENEATH applies to all components of the path.

With LOOKUP_BENEATH, any path component which attempts to "escape" the
starting point of the filesystem lookup (the dirfd passed to openat)
will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.

Due to a security concern brought up by Jann[6], any ".." path
components are also blocked. This restriction will be lifted in a future
patch, but requires more work to ensure that permitting ".." is done
safely.

Magic-link jumps are also blocked, because they can beam the path lookup
across the starting point. It would be possible to detect and block
only the "bad" crossings with path_is_under() checks, but it's unclear
whether it makes sense to permit magic-links at all. However, userspace
is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
magic-link crossing is entirely disabled.

/* Testing. */
LOOKUP_BENEATH is tested as part of the openat2(2) selftests.

[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/20170429220414.gt29...@zeniv.linux.org.uk/
[4]: 
https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysd...@google.com/
[5]: 
https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysd...@google.com/
[6]: 
https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=omm0vhwuld6eulmne3r6xa...@mail.gmail.com/

Cc: Christian Brauner 
Suggested-by: David Drysdale 
Suggested-by: Al Viro 
Suggested-by: Andy Lutomirski 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 70 +++
 include/linux/namei.h |  4 +++
 2 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 321c8ad5d6b3..3f7bb22c375d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -641,6 +641,14 @@ static bool legitimize_links(struct nameidata *nd)
 
 static bool legitimize_root(struct nameidata *nd)
 {
+   /*
+* For scoped-lookups (where nd->root has been zeroed), we need to
+* restart the whole lookup from scratch -- because set_root() is wrong
+* for these lookups (nd->dfd is the root, not the filesystem root).
+*/
+   if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
+   return false;
+   /* Nothing to do if nd->root is zero or is managed by the VFS user. */
if (!nd->root.mnt || (nd->flags & LOOKUP_ROOT))
return true;
nd->flags |= LOOKUP_ROOT_GRABBED;
@@ -776,12 +784,27 @@ static int complete_walk(struct nameidata *nd)
int status;
 
if (nd->flags & LOOKUP_RCU) {
-   if (!(nd->flags & LOOKUP_ROOT))
+   /*
+* We don't want to zero nd->root for scoped-lookups or
+* externally-managed nd->root.
+*/
+   if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
nd->root.mnt = NULL;
if (unlikely(unlazy_walk(nd)))
return -ECHILD;
}
 
+   if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
+   /*
+* Do a final check to ensure that the path didn't escape. Note
+* that this should already be guaranteed by all of the other
+* LOOKUP_IS_SCOPED checks (and delaying this check this late
+* does open the door to some possible timing-based attacks).
+*/
+   if (WARN_ON(!path_is_under(>path, >root)))
+   return -EXDEV;
+   }
+
if (likely(!(nd->flags & LOOKUP_JUMPED)))
return 0;
 
@@ -802,6 +825,14 @@ static int set_root(struct nameidata *nd)
 {
struct fs_struct *fs = current->fs;
 
+   /*
+* Jumping to the real root in a scoped-lookup is a BUG in namei, but we
+* still have to ensure it doesn't happen because it will cause a 
breakout
+* from the dirfd.
+*/
+   if (WARN_ON(nd->flags & LOOKUP_IS_SCOPED))
+   return -ENOTRECOVERABLE;
+
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
 
@@ -838,6 +869,8 @@ static inline void path_to_nameidata(const struct path 
*path,
 
 

[PATCH v17 07/13] namei: LOOKUP_NO_XDEV: block mountpoint crossing

2019-11-16 Thread Aleksa Sarai
/* Background. */
The need to contain path operations within a mountpoint has been a
long-standing usecase that userspace has historically implemented
manually with liberal usage of stat(). find, rsync, tar and
many other programs implement these semantics -- but it'd be much
simpler to have a fool-proof way of refusing to open a path if it
crosses a mountpoint.

This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).

/* Userspace API. */
LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_XDEV applies to all components of the path.

With LOOKUP_NO_XDEV, any path component which crosses a mount-point
during path resolution (including "..") will yield an -EXDEV. Absolute
paths, absolute symlinks, and magic-links will only yield an -EXDEV if
the jump involved changing mount-points.

/* Testing. */
LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/20170429220414.gt29...@zeniv.linux.org.uk/
[2]: 
https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysd...@google.com/
[3]: 
https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysd...@google.com/

Cc: Christian Brauner 
Suggested-by: David Drysdale 
Suggested-by: Al Viro 
Suggested-by: Andy Lutomirski 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 31 +++
 include/linux/namei.h |  1 +
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 415a897729c8..321c8ad5d6b3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -838,6 +838,11 @@ static inline void path_to_nameidata(const struct path 
*path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+   /* Absolute path arguments to path_init() are allowed. */
+   if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+   return -EXDEV;
+   }
if (!nd->root.mnt) {
int error = set_root(nd);
if (error)
@@ -873,6 +878,12 @@ int nd_jump_link(struct path *path)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
goto err;
 
+   error = -EXDEV;
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+   if (nd->path.mnt != path->mnt)
+   goto err;
+   }
+
path_put(>path);
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
@@ -1280,12 +1291,16 @@ static int follow_managed(struct path *path, struct 
nameidata *nd)
break;
}
 
-   if (need_mntput && path->mnt == mnt)
-   mntput(path->mnt);
+   if (need_mntput) {
+   if (path->mnt == mnt)
+   mntput(path->mnt);
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+   ret = -EXDEV;
+   else
+   nd->flags |= LOOKUP_JUMPED;
+   }
if (ret == -EISDIR || !ret)
ret = 1;
-   if (need_mntput)
-   nd->flags |= LOOKUP_JUMPED;
if (unlikely(ret < 0))
path_put_conditional(path, nd);
return ret;
@@ -1342,6 +1357,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, 
struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+   return false;
path->mnt = >mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
@@ -1388,6 +1405,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (>mnt == nd->path.mnt)
break;
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+   return -ECHILD;
/* we know that mountpoint was pinned */
nd->path.dentry = mountpoint;
nd->path.mnt = >mnt;
@@ -1402,6 +1421,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return -ECHILD;
if (!mounted)
break;
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+   return -ECHILD;
nd->path.mnt = >mnt;
nd->path.dentry = mounted->mnt.mnt_root;
inode = nd->path.dentry->d_inode;
@@ -1500,6 +1521,8 @@ static int follow_dotdot(struct nameidata *nd)
}
if (!follow_up(>path))
break;
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+   return -EXDEV;
 

[PATCH v17 06/13] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution

2019-11-16 Thread Aleksa Sarai
/* Background. */
There has always been a special class of symlink-like objects in procfs
(and a few other pseudo-filesystems) which allow for non-lexical
resolution of paths using nd_jump_link(). These "magic-links" do not
follow traditional mount namespace boundaries, and have been used
consistently in container escape attacks because they can be used to
trick unsuspecting privileged processes into resolving unexpected paths.

It is also non-trivial for userspace to unambiguously avoid resolving
magic-links, because they do not have a reliable indication that they
are a magic-link (in order to verify them you'd have to manually open
the path given by readlink(2) and then verify that the two file
descriptors reference the same underlying file, which is plagued with
possible race conditions or supplementary attack scenarios).

It would therefore be very helpful for userspace to be able to avoid
these symlinks easily, thus hopefully removing a tool from attackers'
toolboxes.

This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).

/* Userspace API. */
LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_MAGICLINKS applies to all components of the path.

With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.

LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.

/* Testing. */
LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/20170429220414.gt29...@zeniv.linux.org.uk/
[2]: 
https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysd...@google.com/
[3]: 
https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysd...@google.com/

Cc: Christian Brauner 
Suggested-by: David Drysdale 
Suggested-by: Al Viro 
Suggested-by: Andy Lutomirski 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 10 +-
 include/linux/namei.h |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 303731935eb2..415a897729c8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -867,13 +867,21 @@ static int nd_jump_root(struct nameidata *nd)
  */
 int nd_jump_link(struct path *path)
 {
+   int error = -ELOOP;
struct nameidata *nd = current->nameidata;
-   path_put(>path);
 
+   if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
+   goto err;
+
+   path_put(>path);
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
return 0;
+
+err:
+   path_put(path);
+   return error;
 }
 
 static inline void put_link(struct nameidata *nd)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 0d86e75c04a7..1573b8493d98 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -41,6 +41,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 
 /* Scoping flags for lookup. */
 #define LOOKUP_NO_SYMLINKS 0x01 /* No symlink crossing. */
+#define LOOKUP_NO_MAGICLINKS   0x02 /* No nd_jump_link() crossing. */
 
 extern int path_pts(struct path *path);
 
-- 
2.24.0



[PATCH v17 05/13] namei: LOOKUP_NO_SYMLINKS: block symlink resolution

2019-11-16 Thread Aleksa Sarai
/* Background. */
Userspace cannot easily resolve a path without resolving symlinks, and
would have to manually resolve each path component with O_PATH and
O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
up (resulting in possible security bugs). Linus has mentioned that Git
has a particular need for this kind of flag[1]. It also resolves a
fairly long-standing perceived deficiency in O_NOFOLLOw -- that it only
blocks the opening of trailing symlinks.

This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
variation on David Drysdale's O_BENEATH patchset[3], which in turn was
based on the Capsicum project[4]).

/* Userspace API. */
LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_SYMLINKS applies to all components of the path.

With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
path resolution will yield -ELOOP. If the trailing component is a
symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
will not error out and will instead provide a handle to the trailing
symlink -- without resolving it.

/* Testing. */
LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.

[1]: 
https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=wvhd8agsp7q...@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/20170429220414.gt29...@zeniv.linux.org.uk/
[3]: 
https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysd...@google.com/
[4]: 
https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysd...@google.com/

Cc: Christian Brauner 
Suggested-by: Al Viro 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 3 +++
 include/linux/namei.h | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 74574a69a614..303731935eb2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1052,6 +1052,9 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;
 
+   if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+   return ERR_PTR(-ELOOP);
+
if (!(nd->flags & LOOKUP_RCU)) {
touch_atime(>link);
cond_resched();
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 758e9b47db6f..0d86e75c04a7 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -39,6 +39,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_ROOT0x2000
 #define LOOKUP_ROOT_GRABBED0x0008
 
+/* Scoping flags for lookup. */
+#define LOOKUP_NO_SYMLINKS 0x01 /* No symlink crossing. */
+
 extern int path_pts(struct path *path);
 
 extern int user_path_at_empty(int, const char __user *, unsigned, struct path 
*, int *empty);
-- 
2.24.0



[PATCH v17 04/13] namei: allow set_root() to produce errors

2019-11-16 Thread Aleksa Sarai
For LOOKUP_BENEATH and LOOKUP_IN_ROOT it is necessary to ensure that
set_root() is never called, and thus (for hardening purposes) it should
return an error rather than permit a breakout from the root. In
addition, move all of the repetitive set_root() calls to nd_jump_root().

Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 35 ---
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1024a641f075..74574a69a614 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -798,7 +798,7 @@ static int complete_walk(struct nameidata *nd)
return status;
 }
 
-static void set_root(struct nameidata *nd)
+static int set_root(struct nameidata *nd)
 {
struct fs_struct *fs = current->fs;
 
@@ -814,6 +814,7 @@ static void set_root(struct nameidata *nd)
get_fs_root(fs, >root);
nd->flags |= LOOKUP_ROOT_GRABBED;
}
+   return 0;
 }
 
 static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -837,6 +838,11 @@ static inline void path_to_nameidata(const struct path 
*path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+   if (!nd->root.mnt) {
+   int error = set_root(nd);
+   if (error)
+   return error;
+   }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1080,10 +1086,9 @@ const char *get_link(struct nameidata *nd)
return res;
}
if (*res == '/') {
-   if (!nd->root.mnt)
-   set_root(nd);
-   if (unlikely(nd_jump_root(nd)))
-   return ERR_PTR(-ECHILD);
+   error = nd_jump_root(nd);
+   if (unlikely(error))
+   return ERR_PTR(error);
while (unlikely(*++res == '/'))
;
}
@@ -1698,8 +1703,13 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
if (type == LAST_DOTDOT) {
-   if (!nd->root.mnt)
-   set_root(nd);
+   int error = 0;
+
+   if (!nd->root.mnt) {
+   error = set_root(nd);
+   if (error)
+   return error;
+   }
if (nd->flags & LOOKUP_RCU) {
return follow_dotdot_rcu(nd);
} else
@@ -2162,6 +2172,7 @@ static int link_path_walk(const char *name, struct 
nameidata *nd)
 /* must be paired with terminate_walk() */
 static const char *path_init(struct nameidata *nd, unsigned flags)
 {
+   int error;
const char *s = nd->name->name;
 
if (!*s)
@@ -2194,11 +2205,13 @@ static const char *path_init(struct nameidata *nd, 
unsigned flags)
nd->path.dentry = NULL;
 
nd->m_seq = read_seqbegin(_lock);
+
+   /* Figure out the starting path and root (if needed). */
if (*s == '/') {
-   set_root(nd);
-   if (likely(!nd_jump_root(nd)))
-   return s;
-   return ERR_PTR(-ECHILD);
+   error = nd_jump_root(nd);
+   if (unlikely(error))
+   return ERR_PTR(error);
+   return s;
} else if (nd->dfd == AT_FDCWD) {
if (flags & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
-- 
2.24.0



[PATCH v17 03/13] namei: allow nd_jump_link() to produce errors

2019-11-16 Thread Aleksa Sarai
In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
ability for nd_jump_link() to return an error which the corresponding
get_link() caller must propogate back up to the VFS.

Suggested-by: Al Viro 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c |  3 ++-
 fs/proc/base.c |  3 +--
 fs/proc/namespaces.c   | 14 +-
 include/linux/namei.h  |  2 +-
 security/apparmor/apparmorfs.c |  6 --
 5 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5a47d9c09581..1024a641f075 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -859,7 +859,7 @@ static int nd_jump_root(struct nameidata *nd)
  * Helper to directly jump to a known parsed path from ->get_link,
  * caller must have taken a reference to path beforehand.
  */
-void nd_jump_link(struct path *path)
+int nd_jump_link(struct path *path)
 {
struct nameidata *nd = current->nameidata;
path_put(>path);
@@ -867,6 +867,7 @@ void nd_jump_link(struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
+   return 0;
 }
 
 static inline void put_link(struct nameidata *nd)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..ee97dd322f3e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1626,8 +1626,7 @@ static const char *proc_pid_get_link(struct dentry 
*dentry,
if (error)
goto out;
 
-   nd_jump_link();
-   return NULL;
+   error = nd_jump_link();
 out:
return ERR_PTR(error);
 }
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 08dd94df1a66..a8cca516f1a9 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -51,11 +51,15 @@ static const char *proc_ns_get_link(struct dentry *dentry,
if (!task)
return ERR_PTR(-EACCES);
 
-   if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
-   error = ns_get_path(_path, task, ns_ops);
-   if (!error)
-   nd_jump_link(_path);
-   }
+   if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
+   goto out;
+
+   error = ns_get_path(_path, task, ns_ops);
+   if (error)
+   goto out;
+
+   error = nd_jump_link(_path);
+out:
put_task_struct(task);
return ERR_PTR(error);
 }
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 397a08ade6a2..758e9b47db6f 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -68,7 +68,7 @@ extern int follow_up(struct path *);
 extern struct dentry *lock_rename(struct dentry *, struct dentry *);
 extern void unlock_rename(struct dentry *, struct dentry *);
 
-extern void nd_jump_link(struct path *path);
+extern int __must_check nd_jump_link(struct path *path);
 
 static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
 {
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 45d13b6462aa..0b7d6dce6291 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -2455,16 +2455,18 @@ static const char *policy_get_link(struct dentry 
*dentry,
 {
struct aa_ns *ns;
struct path path;
+   int error;
 
if (!dentry)
return ERR_PTR(-ECHILD);
+
ns = aa_get_current_ns();
path.mnt = mntget(aafs_mnt);
path.dentry = dget(ns_dir(ns));
-   nd_jump_link();
+   error = nd_jump_link();
aa_put_ns(ns);
 
-   return NULL;
+   return ERR_PTR(error);
 }
 
 static int policy_readlink(struct dentry *dentry, char __user *buffer,
-- 
2.24.0



[PATCH v17 02/13] nsfs: clean-up ns_get_path() signature to return int

2019-11-16 Thread Aleksa Sarai
ns_get_path() and ns_get_path_cb() only ever return either NULL or an
ERR_PTR. It is far more idiomatic to simply return an integer, and it
makes all of the callers of ns_get_path() more straightforward to read.

Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs")
Signed-off-by: Aleksa Sarai 
---
 fs/nsfs.c   | 29 ++---
 fs/proc/namespaces.c|  6 +++---
 include/linux/proc_ns.h |  4 ++--
 kernel/bpf/offload.c| 12 ++--
 kernel/events/core.c|  2 +-
 5 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index a0431642c6b5..f3d2833c5781 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -52,7 +52,7 @@ static void nsfs_evict(struct inode *inode)
ns->ops->put(ns);
 }
 
-static void *__ns_get_path(struct path *path, struct ns_common *ns)
+static int __ns_get_path(struct path *path, struct ns_common *ns)
 {
struct vfsmount *mnt = nsfs_mnt;
struct dentry *dentry;
@@ -71,13 +71,13 @@ static void *__ns_get_path(struct path *path, struct 
ns_common *ns)
 got_it:
path->mnt = mntget(mnt);
path->dentry = dentry;
-   return NULL;
+   return 0;
 slow:
rcu_read_unlock();
inode = new_inode_pseudo(mnt->mnt_sb);
if (!inode) {
ns->ops->put(ns);
-   return ERR_PTR(-ENOMEM);
+   return -ENOMEM;
}
inode->i_ino = ns->inum;
inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
@@ -89,7 +89,7 @@ static void *__ns_get_path(struct path *path, struct 
ns_common *ns)
dentry = d_alloc_anon(mnt->mnt_sb);
if (!dentry) {
iput(inode);
-   return ERR_PTR(-ENOMEM);
+   return -ENOMEM;
}
d_instantiate(dentry, inode);
dentry->d_fsdata = (void *)ns->ops;
@@ -98,23 +98,22 @@ static void *__ns_get_path(struct path *path, struct 
ns_common *ns)
d_delete(dentry);   /* make sure ->d_prune() does nothing */
dput(dentry);
cpu_relax();
-   return ERR_PTR(-EAGAIN);
+   return -EAGAIN;
}
goto got_it;
 }
 
-void *ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
+int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
 void *private_data)
 {
-   void *ret;
+   int ret;
 
do {
struct ns_common *ns = ns_get_cb(private_data);
if (!ns)
-   return ERR_PTR(-ENOENT);
-
+   return -ENOENT;
ret = __ns_get_path(path, ns);
-   } while (ret == ERR_PTR(-EAGAIN));
+   } while (ret == -EAGAIN);
 
return ret;
 }
@@ -131,7 +130,7 @@ static struct ns_common *ns_get_path_task(void 
*private_data)
return args->ns_ops->get(args->task);
 }
 
-void *ns_get_path(struct path *path, struct task_struct *task,
+int ns_get_path(struct path *path, struct task_struct *task,
  const struct proc_ns_operations *ns_ops)
 {
struct ns_get_path_task_args args = {
@@ -147,7 +146,7 @@ int open_related_ns(struct ns_common *ns,
 {
struct path path = {};
struct file *f;
-   void *err;
+   int err;
int fd;
 
fd = get_unused_fd_flags(O_CLOEXEC);
@@ -164,11 +163,11 @@ int open_related_ns(struct ns_common *ns,
}
 
err = __ns_get_path(, relative);
-   } while (err == ERR_PTR(-EAGAIN));
+   } while (err == -EAGAIN);
 
-   if (IS_ERR(err)) {
+   if (err) {
put_unused_fd(fd);
-   return PTR_ERR(err);
+   return err;
}
 
f = dentry_open(, O_RDONLY, current_cred());
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..08dd94df1a66 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -42,14 +42,14 @@ static const char *proc_ns_get_link(struct dentry *dentry,
const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
struct task_struct *task;
struct path ns_path;
-   void *error = ERR_PTR(-EACCES);
+   int error = -EACCES;
 
if (!dentry)
return ERR_PTR(-ECHILD);
 
task = get_proc_task(inode);
if (!task)
-   return error;
+   return ERR_PTR(-EACCES);
 
if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
error = ns_get_path(_path, task, ns_ops);
@@ -57,7 +57,7 @@ static const char *proc_ns_get_link(struct dentry *dentry,
nd_jump_link(_path);
}
put_task_struct(task);
-   return error;
+   return ERR_PTR(error);
 }
 
 static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int 
buflen)
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..aed366b4795c 100644
--- a/include/linux/proc_ns.h
+++ 

[PATCH v17 01/13] namei: only return -ECHILD from follow_dotdot_rcu()

2019-11-16 Thread Aleksa Sarai
It's over-zealous to return hard errors under RCU-walk here, given that
a REF-walk will be triggered for all other cases handling ".." under
RCU.

The original purpose of this check was to ensure that if a rename occurs
such that a directory is moved outside of the bind-mount which the
resolution started in, it would be detected and blocked to avoid being
able to mess with paths outside of the bind-mount. However, triggering a
new REF-walk is just as effective a solution.

Cc: "Eric W. Biederman" 
Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from 
their mnt_root")
Suggested-by: Al Viro 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..5a47d9c09581 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1359,7 +1359,7 @@ static int follow_dotdot_rcu(struct nameidata *nd)
nd->path.dentry = parent;
nd->seq = seq;
if (unlikely(!path_connected(>path)))
-   return -ENOENT;
+   return -ECHILD;
break;
} else {
struct mount *mnt = real_mount(nd->path.mnt);
-- 
2.24.0



[PATCH v17 00/13] open: introduce openat2(2) syscall

2019-11-16 Thread Aleksa Sarai
This patchset is being developed here:
  

Patch changelog:
 v17:
  * Add a path_is_under() check for LOOKUP_IS_SCOPED in complete_walk(), as a
last line of defence to ensure that namei bugs will not break the contract
of LOOKUP_BENEATH or LOOKUP_IN_ROOT.
  * Update based on feedback by Al Viro:
* Make nd_jump_link() free the passed path on error, so that callers don't
  need to worry about it in the error path.
* Remove needless m_retry and r_retry variables in handle_dots().
* Always return -ECHILD from follow_dotdot_rcu().
 v16: 
 v15: 
 v14: 
  
 v13: 
 v12: 
 v11: 
  
 v10: 
 v09: 
 v08: 
 v07: 
 v06: 
 v05: 
 v04: 
 v03: 
 v02: 
 v01: 

For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Furthermore, the need for some sort of control over VFS's path resolution (to
avoid malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a revival
of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
project[5]) with a few additions and changes made based on the previous
discussion within [6] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS, the
flag has been split up into separate flags. However, instead of being an
openat(2) flag it is provided through a new syscall openat2(2) which provides
several other improvements to the openat(2) interface (see the patch
description for more details). The following new LOOKUP_* flags are added:

  * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).

  * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.

It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.

In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.

  * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races

Re: [PATCH v4 0/3] Append new variables to vmcoreinfo (TCR_EL1.T1SZ for arm64 and MAX_PHYSMEM_BITS for all archs)

2019-11-16 Thread Bhupesh Sharma
Hi Akashi,

On Fri, Nov 15, 2019 at 7:29 AM AKASHI Takahiro
 wrote:
>
> Bhupesh,
>
> On Fri, Nov 15, 2019 at 01:24:17AM +0530, Bhupesh Sharma wrote:
> > Hi Akashi,
> >
> > On Wed, Nov 13, 2019 at 12:11 PM AKASHI Takahiro
> >  wrote:
> > >
> > > Hi Bhupesh,
> > >
> > > Do you have a corresponding patch for userspace tools,
> > > including crash util and/or makedumpfile?
> > > Otherwise, we can't verify that a generated core file is
> > > correctly handled.
> >
> > Sure. I am still working on the crash-utility related changes, but you
> > can find the makedumpfile changes I posted a couple of days ago here
> > (see [0]) and the github link for the makedumpfile changes can be seen
> > via [1].
> >
> > I will post the crash-util changes shortly as well.
> > Thanks for having a look at the same.
>
> Thank you.
> I have tested my kdump patch with a hacked version of crash
> where VA_BITS_ACTUAL is calculated from tcr_el1_t1sz in vmcoreinfo.

Thanks a lot for testing the changes. I will push the crash utility
changes for review shortly and also Cc you to the patches.
It would be great to have your Tested-by for this patch-set, if the
user-space works fine for you with these changes.

Regards,
Bhupesh

> -Takahiro Akashi
>
>
> > [0]. http://lists.infradead.org/pipermail/kexec/2019-November/023963.html
> > [1]. 
> > https://github.com/bhupesh-sharma/makedumpfile/tree/52-bit-va-support-via-vmcore-upstream-v4
> >
> > Regards,
> > Bhupesh
> >
> > >
> > > Thanks,
> > > -Takahiro Akashi
> > >
> > > On Mon, Nov 11, 2019 at 01:31:19PM +0530, Bhupesh Sharma wrote:
> > > > Changes since v3:
> > > > 
> > > > - v3 can be seen here:
> > > >   http://lists.infradead.org/pipermail/kexec/2019-March/022590.html
> > > > - Addressed comments from James and exported TCR_EL1.T1SZ in vmcoreinfo
> > > >   instead of PTRS_PER_PGD.
> > > > - Added a new patch (via [PATCH 3/3]), which fixes a simple typo in
> > > >   'Documentation/arm64/memory.rst'
> > > >
> > > > Changes since v2:
> > > > 
> > > > - v2 can be seen here:
> > > >   http://lists.infradead.org/pipermail/kexec/2019-March/022531.html
> > > > - Protected 'MAX_PHYSMEM_BITS' vmcoreinfo variable under 
> > > > CONFIG_SPARSEMEM
> > > >   ifdef sections, as suggested by Kazu.
> > > > - Updated vmcoreinfo documentation to add description about
> > > >   'MAX_PHYSMEM_BITS' variable (via [PATCH 3/3]).
> > > >
> > > > Changes since v1:
> > > > 
> > > > - v1 was sent out as a single patch which can be seen here:
> > > >   http://lists.infradead.org/pipermail/kexec/2019-February/022411.html
> > > >
> > > > - v2 breaks the single patch into two independent patches:
> > > >   [PATCH 1/2] appends 'PTRS_PER_PGD' to vmcoreinfo for arm64 arch, 
> > > > whereas
> > > >   [PATCH 2/2] appends 'MAX_PHYSMEM_BITS' to vmcoreinfo in core kernel 
> > > > code (all archs)
> > > >
> > > > This patchset primarily fixes the regression reported in user-space
> > > > utilities like 'makedumpfile' and 'crash-utility' on arm64 architecture
> > > > with the availability of 52-bit address space feature in underlying
> > > > kernel. These regressions have been reported both on CPUs which don't
> > > > support ARMv8.2 extensions (i.e. LVA, LPA) and are running newer kernels
> > > > and also on prototype platforms (like ARMv8 FVP simulator model) which
> > > > support ARMv8.2 extensions and are running newer kernels.
> > > >
> > > > The reason for these regressions is that right now user-space tools
> > > > have no direct access to these values (since these are not exported
> > > > from the kernel) and hence need to rely on a best-guess method of
> > > > determining value of 'vabits_actual' and 'MAX_PHYSMEM_BITS' supported
> > > > by underlying kernel.
> > > >
> > > > Exporting these values via vmcoreinfo will help user-land in such cases.
> > > > In addition, as per suggestion from makedumpfile maintainer (Kazu),
> > > > it makes more sense to append 'MAX_PHYSMEM_BITS' to
> > > > vmcoreinfo in the core code itself rather than in arm64 arch-specific
> > > > code, so that the user-space code for other archs can also benefit from
> > > > this addition to the vmcoreinfo and use it as a standard way of
> > > > determining 'SECTIONS_SHIFT' value in user-land.
> > > >
> > > > Cc: Boris Petkov 
> > > > Cc: Ingo Molnar 
> > > > Cc: Thomas Gleixner 
> > > > Cc: Jonathan Corbet 
> > > > Cc: James Morse 
> > > > Cc: Mark Rutland 
> > > > Cc: Will Deacon 
> > > > Cc: Steve Capper 
> > > > Cc: Catalin Marinas 
> > > > Cc: Ard Biesheuvel 
> > > > Cc: Michael Ellerman 
> > > > Cc: Paul Mackerras 
> > > > Cc: Benjamin Herrenschmidt 
> > > > Cc: Dave Anderson 
> > > > Cc: Kazuhito Hagio 
> > > > Cc: x...@kernel.org
> > > > Cc: linuxppc-dev@lists.ozlabs.org
> > > > Cc: linux-arm-ker...@lists.infradead.org
> > > > Cc: linux-ker...@vger.kernel.org
> > > > Cc: linux-...@vger.kernel.org
> > > > Cc: ke...@lists.infradead.org
> > > >
> > > > Bhupesh Sharma (3):
> > > >   crash_core, 

Re: [RFC PATCH 1/4] libnvdimm/namespace: Make namespace size validation arch dependent

2019-11-16 Thread Dan Williams
On Sat, Nov 16, 2019 at 4:15 AM Aneesh Kumar K.V
 wrote:
>
>
> Hi Dan,
>
> "Aneesh Kumar K.V"  writes:
>
> > Dan Williams  writes:
> >
> >> On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V
> >>  wrote:
> >> [..]
> >>> > True, for the pfn device and the device-dax mapping size, but I'm
> >>> > suggesting adding another instance of alignment control at the raw
> >>> > namespace level. That would need to be disconnected from the
> >>> > device-dax page mapping granularity.
> >>> >
> >>>
> >>> Can you explain what you mean by raw namespace level ? We don't have
> >>> multiple values against which we need to check the alignment of
> >>> namespace start and namespace size.
> >>>
> >>> If you can outline how and where you would like to enforce that check I
> >>> can start working on it.
> >>>
> >>
> >> What I mean is that the process of setting up a pfn namespace goes
> >> something like this in shell script form:
> >>
> >> 1/ echo $size > /sys/bus/nd/devices/$namespace/size
> >> 2/ echo $namespace > /sys/bus/nd/devices/$pfn/namespace
> >> 3/ echo $pfn_align > /sys/bus/nd/devices/$pfn/align
> >>
> >> What I'm suggesting is add an optional 0th step that does:
> >>
> >> echo $raw_align > /sys/bus/nd/devices/$namespace/align
> >>
> >> Where the raw align needs to be needs to be max($pfn_align,
> >> arch_mapping_granulariy).
> >
> >
> > I started looking at this and was wondering about userspace being aware
> > of the direct-map mapping size. We can figure that out by parsing kernel
> > log
> >
> > [0.00] Page orders: linear mapping = 24, virtual = 16, io = 16, 
> > vmemmap = 24
> >
> >
> > But I am not sure we want to do that. There is not set of raw_align
> > value to select. What we need to make sure is the below.
> >
> > 1) While creating a namespace we need to make sure that namespace size
> > is multiple of direct-map mapping size. If we ensure that
> > size is aligned, we should also get the namespace start to be aligned?
> >
> > 2) While initialzing a namespace by scanning label area, we need to make
> > sure every namespace in the region satisfy the above requirement. If not
> > we should mark the region disabled.
> >
> >
> >>
> >> So on powerpc where PAGE_SIZE < arch_mapping_granulariy, the following:
> >>
> >> cat /sys/bus/nd/devices/$namespace/supported_aligns
> >>
> >> ...would show the same output as:
> >>
> >> cat /sys/bus/nd/devices/$pfn/align
> >>
> >> ...but with any alignment choice less than arch_mapping_granulariy removed.
> >>
> >
> > I am not sure why we need to do that. For example: even if we have
> > direct-map mapping size as PAGE_SIZE (64K), we still should allow an user
> > mapping with hugepage size (16M)?
> >
>
>
> Considering the direct-map map size is not going to be user selectable,
> do you agree that we can skip the above step 0 configuration you
> suggested.
>
> The changes proposed in the patch series essentially does the rest.
>
> 1) It validate the size against the arch specific limit during
> namespace creation. (part of step 1)

This validation is a surprise failure to ndctl.

> 2) It also disable initializing a region if it find the size not
> correctly aligned as per the platform requirement.

There needs to be a way for the user to discover the correct alignment
that the kernel will accept.

> 3) Direct map  mapping size is different from supported_alignment for a
> namespace. The supported alignment controls what possible PAGE SIZE user want 
> the
> namespace to be mapped to user space.

No, the namespace alignment is different than the page mapping size.
The alignment is only interpreted as a mapping size at the device-dax
level, otherwise at the raw namespace level it's just an arbitrary
alignment.

> With the above do you think the current patch series is good?

I don't think we've quite converged on a solution.


Re: [PATCH v16 02/12] namei: allow nd_jump_link() to produce errors

2019-11-16 Thread Aleksa Sarai
On 2019-11-16, Al Viro  wrote:
> On Sat, Nov 16, 2019 at 11:27:52AM +1100, Aleksa Sarai wrote:
> > +   error = nd_jump_link();
> > +   if (error)
> > +   path_put();
> 
> > +   error = nd_jump_link(_path);
> > +   if (error)
> > +   path_put(_path);
> 
> > +   error = nd_jump_link();
> > +   if (error)
> > +   path_put();
> 
> 3 calls.  Exact same boilerplate in each to handle a failure case.
> Which spells "wrong calling conventions"; it's absolutely clear
> that we want that path_put() inside nd_jump_link().
> 
> The rule should be this: reference that used to be held in
> *path is consumed in any case.  On success it goes into
> nd->path, on error it's just dropped, but in any case, the
> caller has the same refcounting environment to deal with.
> 
> If you need the same boilerplate cleanup on failure again and again,
> the calling conventions are wrong and need to be fixed.

Will do.

> And I'm not sure that int is the right return type here, to be honest.
> void * might be better - return ERR_PTR() or NULL, so that the value
> could be used as return value of ->get_link() that calls that thing.

I don't agree, given that the few callers of ns_get_path() are
inconsistent with regards to whether they should use IS_ERR() or check
for NULL, not to mention that "void *error" reads to me as being very
odd given how common "int error" is throughout the kernel. There's also
the "error == ERR_PTR(-EAGAIN)" checks which also read as being quite
odd too.

But the main motivating factor for changing it was that the one use
where "void *" is useful (proc_ns_get_link) becomes needlessly ugly
because of the "nd_jump_link() can return errors" change:

error = ERR_PTR(nd_jump_link(_path));

Or probably (if you don't want to rely on ERR_PTR(0) == NULL):

int err = nd_jump_link(_path);
if (err)
error = ERR_PTR(err);

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v16 06/12] namei: LOOKUP_NO_XDEV: block mountpoint crossing

2019-11-16 Thread Aleksa Sarai
On 2019-11-16, Al Viro  wrote:
> On Sat, Nov 16, 2019 at 11:27:56AM +1100, Aleksa Sarai wrote:
> 
> > @@ -1383,6 +1398,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > return -ECHILD;
> > if (>mnt == nd->path.mnt)
> > break;
> > +   if (unlikely(nd->flags & LOOKUP_NO_XDEV))
> > +   return -EXDEV;
> > /* we know that mountpoint was pinned */
> > nd->path.dentry = mountpoint;
> > nd->path.mnt = >mnt;
> > @@ -1397,6 +1414,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > return -ECHILD;
> > if (!mounted)
> > break;
> > +   if (unlikely(nd->flags & LOOKUP_NO_XDEV))
> > +   return -EXDEV;
> > nd->path.mnt = >mnt;
> > nd->path.dentry = mounted->mnt.mnt_root;
> > inode = nd->path.dentry->d_inode;
> 
> I really don't think we should return hard errors from that function.
> Let the caller redo it in refwalk mode.

I suspected as much, though my reason for not changing it was that the
mount_lock check should ensure that the cached status of whether ".." is
a mountpoint crossing is correct. But I guess this is more about being
safe than sorry, rather than an actual bug?

> It's not the fast path, especially for this kind of errors.  Matter of
> fact, I'm not sure about -ENOENT returned in another failure case
> there - it's probably OK, but again, -ECHILD would be just as good.

I can switch the -ENOENT too if you like.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: Found the commit for: 5.3.7 64-bits kernel doesn't boot on G5 Quad

2019-11-16 Thread John Paul Adrian Glaubitz
Hi Romain!

Great detective work!

On 11/16/19 5:34 PM, Romain Dolbeau wrote:
> So it seems to me that 0034d395f89d9c092bb15adbabdca5283e258b41
> introduced the bug that crashes the PowerMac G5 (as/could anyone
> tried/try on some other PPC970-based system, like a JS20 ? to see if
> it's PowerMac-specific or not).

So, that would be this commit [1] then:

> authorAneesh Kumar K.V2019-04-17 
> 18:29:14 +0530
> committer Michael Ellerman   2019-04-21 23:12:39 
> +1000
> commit0034d395f89d9c092bb15adbabdca5283e258b41 (patch)
> tree  e5850612e6ada1285b19b21a21722fb2ea95b43e
> parenta35a3c6f60657812366fca86a9ce71df1b8f7aff (diff)
> download  linux-0034d395f89d9c092bb15adbabdca5283e258b41.tar.gz
> powerpc/mm/hash64: Map all the kernel regions in the same 0xc range
> This patch maps vmalloc, IO and vmemap regions in the 0xc address range
> instead of the current 0xd and 0xf range. This brings the mapping closer
> to radix translation mode.

> If anyone has an idea on how to fix this, that would be very welcome,
> as I'm way out of my depth in the PPC64 MMU code.

The best thing would be to notify the author of said commit about the regression
which would be Aneesh Kumar (CC'ed).

Adrian

> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0034d395f89d9c092bb15adbabdca5283e258b41

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


Re: [PATCH v16 09/12] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution

2019-11-16 Thread Aleksa Sarai
On 2019-11-16, Al Viro  wrote:
> On Sat, Nov 16, 2019 at 11:27:59AM +1100, Aleksa Sarai wrote:
> 
> > +   if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
> > +   bool m_retry = read_seqretry(_lock, nd->m_seq);
> > +   bool r_retry = read_seqretry(_lock, nd->r_seq);
> > +
> > +   /*
> > +* If there was a racing rename or mount along our
> > +* path, then we can't be sure that ".." hasn't jumped
> > +* above nd->root (and so userspace should retry or use
> > +* some fallback).
> > +*/
> > +   if (unlikely(m_retry || r_retry))
> > +   return -EAGAIN;
> > +   }
> > }
> > return 0;
> 
> Elaborate...  Do these boolean variables make any sense now, really?

You're quite right, they don't make sense any more. I'll drop them.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Found the commit for: 5.3.7 64-bits kernel doesn't boot on G5 Quad

2019-11-16 Thread Romain Dolbeau
Le dim. 10 nov. 2019 à 11:45, Romain Dolbeau  a écrit :
> Any suggestion, advice, or patch to try welcome :-)

>From my bisect, I figured that
0034d395f89d9c092bb15adbabdca5283e258b41 was the likely culprit, but
that the bug was masked by the printk() issues that were fixed later:
commit 2ac5a3bf7042a1c4abbcce1b6f0ec61e5d3786c2 mentions "Report on
Power: Kernel crashes very early during boot". Once that was merged,
the current bug appears instead of the early crash.

So what I did was:

1) checkout 0034d395f89d9c092bb15adbabdca5283e258b41
2) merge printk-for-5.2 & printk-for-5.2-fixes from pmladek/printk,
hoping to remove the printk issue ; there was only fairly trivial
conflicts

-> the resulting kernel displays the same bug as Debian's and vanilla
5.3 and HEAD, which is what I hoped for.

However, the merges did pick up quite a few commits, so I then:

3) did a revert of only 0034d395f89d9c092bb15adbabdca5283e258b41 ;
that did not cause conflicts (on HEAD, this causes a lot of conflicts
I don't know how to resolve).

-> the resulting kernel works fine ! :-)

So it seems to me that 0034d395f89d9c092bb15adbabdca5283e258b41
introduced the bug that crashes the PowerMac G5 (as/could anyone
tried/try on some other PPC970-based system, like a JS20 ? to see if
it's PowerMac-specific or not).

If anyone has an idea on how to fix this, that would be very welcome,
as I'm way out of my depth in the PPC64 MMU code.

Cordially,

-- 
Romain Dolbeau


[PATCH AUTOSEL 4.4 25/77] macintosh/windfarm_smu_sat: Fix debug output

2019-11-16 Thread Sasha Levin
From: Benjamin Herrenschmidt 

[ Upstream commit fc0c8b36d379a046525eacb9c3323ca635283757 ]

There's some antiquated debug output that's trying
to do a hand-made hexdump and turning into horrible
1-byte-per-line output these days.

Use print_hex_dump() instead

Signed-off-by: Benjamin Herrenschmidt 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 drivers/macintosh/windfarm_smu_sat.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/macintosh/windfarm_smu_sat.c 
b/drivers/macintosh/windfarm_smu_sat.c
index ad6223e883404..3d310dd60a0be 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -22,14 +22,6 @@
 
 #define VERSION "1.0"
 
-#define DEBUG
-
-#ifdef DEBUG
-#define DBG(args...)   printk(args)
-#else
-#define DBG(args...)   do { } while(0)
-#endif
-
 /* If the cache is older than 800ms we'll refetch it */
 #define MAX_AGEmsecs_to_jiffies(800)
 
@@ -106,13 +98,10 @@ struct smu_sdbp_header *smu_sat_get_sdb_partition(unsigned 
int sat_id, int id,
buf[i+2] = data[3];
buf[i+3] = data[2];
}
-#ifdef DEBUG
-   DBG(KERN_DEBUG "sat %d partition %x:", sat_id, id);
-   for (i = 0; i < len; ++i)
-   DBG(" %x", buf[i]);
-   DBG("\n");
-#endif
 
+   printk(KERN_DEBUG "sat %d partition %x:", sat_id, id);
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, buf, len, false);
if (size)
*size = len;
return (struct smu_sdbp_header *) buf;
@@ -132,13 +121,13 @@ static int wf_sat_read_cache(struct wf_sat *sat)
if (err < 0)
return err;
sat->last_read = jiffies;
+
 #ifdef LOTSA_DEBUG
{
int i;
-   DBG(KERN_DEBUG "wf_sat_get: data is");
-   for (i = 0; i < 16; ++i)
-   DBG(" %.2x", sat->cache[i]);
-   DBG("\n");
+   printk(KERN_DEBUG "wf_sat_get: data is");
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, sat->cache, 16, false);
}
 #endif
return 0;
-- 
2.20.1



[PATCH AUTOSEL 4.4 06/77] powerpc/eeh: Fix use of EEH_PE_KEEP on wrong field

2019-11-16 Thread Sasha Levin
From: Sam Bobroff 

[ Upstream commit 473af09b56dc4be68e4af33220ceca6be67aa60d ]

eeh_add_to_parent_pe() sometimes removes the EEH_PE_KEEP flag, but it
incorrectly removes it from pe->type, instead of pe->state.

However, rather than clearing it from the correct field, remove it.
Inspection of the code shows that it can't ever have had any effect
(even if it had been cleared from the correct field), because the
field is never tested after it is cleared by the statement in
question.

The clear statement was added by commit 807a827d4e74 ("powerpc/eeh:
Keep PE during hotplug"), but it didn't explain why it was necessary.

Signed-off-by: Sam Bobroff 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/eeh_pe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 304f07cfa2622..4d4c32d0e6cee 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -367,7 +367,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
while (parent) {
if (!(parent->type & EEH_PE_INVALID))
break;
-   parent->type &= ~(EEH_PE_INVALID | EEH_PE_KEEP);
+   parent->type &= ~EEH_PE_INVALID;
parent = parent->parent;
}
 
-- 
2.20.1



[PATCH AUTOSEL 4.4 05/77] powerpc: Fix signedness bug in update_flash_db()

2019-11-16 Thread Sasha Levin
From: Dan Carpenter 

[ Upstream commit 014704e6f54189a203cc14c7c0bb411b940241bc ]

The "count < sizeof(struct os_area_db)" comparison is type promoted to
size_t so negative values of "count" are treated as very high values
and we accidentally return success instead of a negative error code.

This doesn't really change runtime much but it fixes a static checker
warning.

Signed-off-by: Dan Carpenter 
Acked-by: Geoff Levand 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/ps3/os-area.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/ps3/os-area.c 
b/arch/powerpc/platforms/ps3/os-area.c
index 3db53e8aff927..9b2ef76578f06 100644
--- a/arch/powerpc/platforms/ps3/os-area.c
+++ b/arch/powerpc/platforms/ps3/os-area.c
@@ -664,7 +664,7 @@ static int update_flash_db(void)
db_set_64(db, _area_db_id_rtc_diff, saved_params.rtc_diff);
 
count = os_area_flash_write(db, sizeof(struct os_area_db), pos);
-   if (count < sizeof(struct os_area_db)) {
+   if (count < 0 || count < sizeof(struct os_area_db)) {
pr_debug("%s: os_area_flash_write failed %zd\n", __func__,
 count);
error = count < 0 ? count : -EIO;
-- 
2.20.1



[PATCH AUTOSEL 4.9 68/99] mm/memory_hotplug: make add_memory() take the device_hotplug_lock

2019-11-16 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 8df1d0e4a265f25dc1e7e7624ccdbcb4a6630c89 ]

add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.

In general, we should hold the device_hotplug_lock when adding memory to
synchronize against online/offline request (e.g.  from user space) - which
already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory
hot-add deadlock").  add_memory()/add_memory_resource() will create memory
block devices, so this really feels like the right thing to do.

Holding the device_hotplug_lock makes sure that a memory block device
can really only be accessed (e.g. via .online/.state) from user space,
once the memory has been fully added to the system.

The lock is not held yet in
drivers/xen/balloon.c
arch/powerpc/platforms/powernv/memtrace.c
drivers/s390/char/sclp_cmd.c
drivers/hv/hv_balloon.c
So, let's either use the locked variants or take the lock.

Don't export add_memory_resource(), as it once was exported to be used by
XEN, which is never built as a module.  If somebody requires it, we also
have to export a locked variant (as device_hotplug_lock is never
exported).

Link: http://lkml.kernel.org/r/20180925091457.28651-3-da...@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Rashmica Gupta 
Reviewed-by: Oscar Salvador 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Mathieu Malaterre 
Cc: Pavel Tatashin 
Cc: YASUAKI ISHIMATSU 
Cc: Balbir Singh 
Cc: Haiyang Zhang 
Cc: Heiko Carstens 
Cc: Jonathan Corbet 
Cc: Kate Stewart 
Cc: "K. Y. Srinivasan" 
Cc: Martin Schwidefsky 
Cc: Michael Neuling 
Cc: Philippe Ombredanne 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 .../platforms/pseries/hotplug-memory.c|  2 +-
 drivers/acpi/acpi_memhotplug.c|  2 +-
 drivers/base/memory.c |  9 ++--
 drivers/xen/balloon.c |  3 +++
 include/linux/memory_hotplug.h|  1 +
 mm/memory_hotplug.c   | 22 ---
 6 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c0a0947f43bbb..656bbbd731d03 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -616,7 +616,7 @@ static int dlpar_add_lmb(struct of_drconf_cell *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz);
if (rc) {
dlpar_remove_device_tree_lmb(lmb);
dlpar_release_drc(lmb->drc_index);
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309cb..2ccfbb61ca899 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c5cdd190b7816..9f96f1b43c15f 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -500,15 +500,20 @@ memory_probe_store(struct device *dev, struct 
device_attribute *attr,
if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
return -EINVAL;
 
+   ret = lock_device_hotplug_sysfs();
+   if (ret)
+   goto out;
+
nid = memory_add_physaddr_to_nid(phys_addr);
-   ret = add_memory(nid, phys_addr,
-MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+   ret = __add_memory(nid, phys_addr,
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
 
if (ret)
goto out;
 
ret = count;
 out:
+   unlock_device_hotplug();
return ret;
 }
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 6af117af97804..731cf54f75c65 100644
--- 

[PATCH AUTOSEL 4.9 50/99] powerpc/process: Fix flush_all_to_thread for SPE

2019-11-16 Thread Sasha Levin
From: Felipe Rechia 

[ Upstream commit e901378578c62202594cba0f6c076f3df365ec91 ]

Fix a bug introduced by the creation of flush_all_to_thread() for
processors that have SPE (Signal Processing Engine) and use it to
compute floating-point operations.

>From userspace perspective, the problem was seen in attempts of
computing floating-point operations which should generate exceptions.
For example:

  fork();
  float x = 0.0 / 0.0;
  isnan(x);   // forked process returns False (should be True)

The operation above also should always cause the SPEFSCR FINV bit to
be set. However, the SPE floating-point exceptions were turned off
after a fork().

Kernel versions prior to the bug used flush_spe_to_thread(), which
first saves SPEFSCR register values in tsk->thread and then calls
giveup_spe(tsk).

After commit 579e633e764e, the save_all() function was called first
to giveup_spe(), and then the SPEFSCR register values were saved in
tsk->thread. This would save the SPEFSCR register values after
disabling SPE for that thread, causing the bug described above.

Fixes 579e633e764e ("powerpc: create flush_all_to_thread()")
Signed-off-by: Felipe Rechia 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/process.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 47c6c0401b3a2..54c95e7c74cce 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -576,12 +576,11 @@ void flush_all_to_thread(struct task_struct *tsk)
if (tsk->thread.regs) {
preempt_disable();
BUG_ON(tsk != current);
-   save_all(tsk);
-
 #ifdef CONFIG_SPE
if (tsk->thread.regs->msr & MSR_SPE)
tsk->thread.spefscr = mfspr(SPRN_SPEFSCR);
 #endif
+   save_all(tsk);
 
preempt_enable();
}
-- 
2.20.1



[PATCH AUTOSEL 4.9 30/99] macintosh/windfarm_smu_sat: Fix debug output

2019-11-16 Thread Sasha Levin
From: Benjamin Herrenschmidt 

[ Upstream commit fc0c8b36d379a046525eacb9c3323ca635283757 ]

There's some antiquated debug output that's trying
to do a hand-made hexdump and turning into horrible
1-byte-per-line output these days.

Use print_hex_dump() instead

Signed-off-by: Benjamin Herrenschmidt 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 drivers/macintosh/windfarm_smu_sat.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/macintosh/windfarm_smu_sat.c 
b/drivers/macintosh/windfarm_smu_sat.c
index ad6223e883404..3d310dd60a0be 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -22,14 +22,6 @@
 
 #define VERSION "1.0"
 
-#define DEBUG
-
-#ifdef DEBUG
-#define DBG(args...)   printk(args)
-#else
-#define DBG(args...)   do { } while(0)
-#endif
-
 /* If the cache is older than 800ms we'll refetch it */
 #define MAX_AGEmsecs_to_jiffies(800)
 
@@ -106,13 +98,10 @@ struct smu_sdbp_header *smu_sat_get_sdb_partition(unsigned 
int sat_id, int id,
buf[i+2] = data[3];
buf[i+3] = data[2];
}
-#ifdef DEBUG
-   DBG(KERN_DEBUG "sat %d partition %x:", sat_id, id);
-   for (i = 0; i < len; ++i)
-   DBG(" %x", buf[i]);
-   DBG("\n");
-#endif
 
+   printk(KERN_DEBUG "sat %d partition %x:", sat_id, id);
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, buf, len, false);
if (size)
*size = len;
return (struct smu_sdbp_header *) buf;
@@ -132,13 +121,13 @@ static int wf_sat_read_cache(struct wf_sat *sat)
if (err < 0)
return err;
sat->last_read = jiffies;
+
 #ifdef LOTSA_DEBUG
{
int i;
-   DBG(KERN_DEBUG "wf_sat_get: data is");
-   for (i = 0; i < 16; ++i)
-   DBG(" %.2x", sat->cache[i]);
-   DBG("\n");
+   printk(KERN_DEBUG "wf_sat_get: data is");
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, sat->cache, 16, false);
}
 #endif
return 0;
-- 
2.20.1



[PATCH AUTOSEL 4.9 05/99] powerpc: Fix signedness bug in update_flash_db()

2019-11-16 Thread Sasha Levin
From: Dan Carpenter 

[ Upstream commit 014704e6f54189a203cc14c7c0bb411b940241bc ]

The "count < sizeof(struct os_area_db)" comparison is type promoted to
size_t so negative values of "count" are treated as very high values
and we accidentally return success instead of a negative error code.

This doesn't really change runtime much but it fixes a static checker
warning.

Signed-off-by: Dan Carpenter 
Acked-by: Geoff Levand 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/ps3/os-area.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/ps3/os-area.c 
b/arch/powerpc/platforms/ps3/os-area.c
index 3db53e8aff927..9b2ef76578f06 100644
--- a/arch/powerpc/platforms/ps3/os-area.c
+++ b/arch/powerpc/platforms/ps3/os-area.c
@@ -664,7 +664,7 @@ static int update_flash_db(void)
db_set_64(db, _area_db_id_rtc_diff, saved_params.rtc_diff);
 
count = os_area_flash_write(db, sizeof(struct os_area_db), pos);
-   if (count < sizeof(struct os_area_db)) {
+   if (count < 0 || count < sizeof(struct os_area_db)) {
pr_debug("%s: os_area_flash_write failed %zd\n", __func__,
 count);
error = count < 0 ? count : -EIO;
-- 
2.20.1



[PATCH AUTOSEL 4.9 06/99] powerpc/eeh: Fix use of EEH_PE_KEEP on wrong field

2019-11-16 Thread Sasha Levin
From: Sam Bobroff 

[ Upstream commit 473af09b56dc4be68e4af33220ceca6be67aa60d ]

eeh_add_to_parent_pe() sometimes removes the EEH_PE_KEEP flag, but it
incorrectly removes it from pe->type, instead of pe->state.

However, rather than clearing it from the correct field, remove it.
Inspection of the code shows that it can't ever have had any effect
(even if it had been cleared from the correct field), because the
field is never tested after it is cleared by the statement in
question.

The clear statement was added by commit 807a827d4e74 ("powerpc/eeh:
Keep PE during hotplug"), but it didn't explain why it was necessary.

Signed-off-by: Sam Bobroff 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/eeh_pe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 1abd8dd77ec13..eee2131a97e61 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -370,7 +370,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
while (parent) {
if (!(parent->type & EEH_PE_INVALID))
break;
-   parent->type &= ~(EEH_PE_INVALID | EEH_PE_KEEP);
+   parent->type &= ~EEH_PE_INVALID;
parent = parent->parent;
}
 
-- 
2.20.1



[PATCH AUTOSEL 4.14 104/150] mm/memory_hotplug: make add_memory() take the device_hotplug_lock

2019-11-16 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 8df1d0e4a265f25dc1e7e7624ccdbcb4a6630c89 ]

add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.

In general, we should hold the device_hotplug_lock when adding memory to
synchronize against online/offline request (e.g.  from user space) - which
already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory
hot-add deadlock").  add_memory()/add_memory_resource() will create memory
block devices, so this really feels like the right thing to do.

Holding the device_hotplug_lock makes sure that a memory block device
can really only be accessed (e.g. via .online/.state) from user space,
once the memory has been fully added to the system.

The lock is not held yet in
drivers/xen/balloon.c
arch/powerpc/platforms/powernv/memtrace.c
drivers/s390/char/sclp_cmd.c
drivers/hv/hv_balloon.c
So, let's either use the locked variants or take the lock.

Don't export add_memory_resource(), as it once was exported to be used by
XEN, which is never built as a module.  If somebody requires it, we also
have to export a locked variant (as device_hotplug_lock is never
exported).

Link: http://lkml.kernel.org/r/20180925091457.28651-3-da...@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Rashmica Gupta 
Reviewed-by: Oscar Salvador 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Mathieu Malaterre 
Cc: Pavel Tatashin 
Cc: YASUAKI ISHIMATSU 
Cc: Balbir Singh 
Cc: Haiyang Zhang 
Cc: Heiko Carstens 
Cc: Jonathan Corbet 
Cc: Kate Stewart 
Cc: "K. Y. Srinivasan" 
Cc: Martin Schwidefsky 
Cc: Michael Neuling 
Cc: Philippe Ombredanne 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 .../platforms/pseries/hotplug-memory.c|  2 +-
 drivers/acpi/acpi_memhotplug.c|  2 +-
 drivers/base/memory.c |  9 ++--
 drivers/xen/balloon.c |  3 +++
 include/linux/memory_hotplug.h|  1 +
 mm/memory_hotplug.c   | 22 ---
 6 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 93e09f108ca17..99a3cf51c5ba4 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -787,7 +787,7 @@ static int dlpar_add_lmb(struct of_drconf_cell *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz);
if (rc) {
dlpar_remove_device_tree_lmb(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309cb..2ccfbb61ca899 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c617e00f4361d..8e5818e735e2f 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -517,15 +517,20 @@ memory_probe_store(struct device *dev, struct 
device_attribute *attr,
if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
return -EINVAL;
 
+   ret = lock_device_hotplug_sysfs();
+   if (ret)
+   goto out;
+
nid = memory_add_physaddr_to_nid(phys_addr);
-   ret = add_memory(nid, phys_addr,
-MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+   ret = __add_memory(nid, phys_addr,
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
 
if (ret)
goto out;
 
ret = count;
 out:
+   unlock_device_hotplug();
return ret;
 }
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 7d521babc020b..71a6deeb4e714 100644
--- a/drivers/xen/balloon.c
+++ 

[PATCH AUTOSEL 4.14 092/150] selftests/powerpc/cache_shape: Fix out-of-tree build

2019-11-16 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 69f8117f17b332a68cd8f4bf8c2d0d3d5b84efc5 ]

Use TEST_GEN_PROGS and don't redefine all, this makes the out-of-tree
build work. We need to move the extra dependencies below the include
of lib.mk, because it adds the $(OUTPUT) prefix if it's defined.

We can also drop the clean rule, lib.mk does it for us.

Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/powerpc/cache_shape/Makefile | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/powerpc/cache_shape/Makefile 
b/tools/testing/selftests/powerpc/cache_shape/Makefile
index 1be547434a49c..7e0c175b82978 100644
--- a/tools/testing/selftests/powerpc/cache_shape/Makefile
+++ b/tools/testing/selftests/powerpc/cache_shape/Makefile
@@ -1,11 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := cache_shape
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c
+TEST_GEN_PROGS := cache_shape
 
 include ../../lib.mk
 
-clean:
-   rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c
-- 
2.20.1



[PATCH AUTOSEL 4.14 091/150] selftests/powerpc/switch_endian: Fix out-of-tree build

2019-11-16 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 266bac361d5677e61a6815bd29abeb3bdced2b07 ]

For the out-of-tree build to work we need to tell switch_endian_test
to look for check-reversed.S in $(OUTPUT).

Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/powerpc/switch_endian/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/powerpc/switch_endian/Makefile 
b/tools/testing/selftests/powerpc/switch_endian/Makefile
index 30b8ff8fb82e7..e4cedfe9753d7 100644
--- a/tools/testing/selftests/powerpc/switch_endian/Makefile
+++ b/tools/testing/selftests/powerpc/switch_endian/Makefile
@@ -7,6 +7,7 @@ EXTRA_CLEAN = $(OUTPUT)/*.o $(OUTPUT)/check-reversed.S
 
 include ../../lib.mk
 
+$(OUTPUT)/switch_endian_test: ASFLAGS += -I $(OUTPUT)
 $(OUTPUT)/switch_endian_test: $(OUTPUT)/check-reversed.S
 
 $(OUTPUT)/check-reversed.o: $(OUTPUT)/check.o
-- 
2.20.1



[PATCH AUTOSEL 4.14 090/150] selftests/powerpc/signal: Fix out-of-tree build

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit 27825349d7b238533a47e3d98b8bb0efd886b752 ]

We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.

It also means we don't need our own clean rule, lib.mk does it.

We also have to update the signal_tm rule to use $(OUTPUT).

Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/powerpc/signal/Makefile | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/powerpc/signal/Makefile 
b/tools/testing/selftests/powerpc/signal/Makefile
index a7cbd5082e271..4213978f3ee2c 100644
--- a/tools/testing/selftests/powerpc/signal/Makefile
+++ b/tools/testing/selftests/powerpc/signal/Makefile
@@ -1,14 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := signal signal_tm
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c signal.S
+TEST_GEN_PROGS := signal signal_tm
 
 CFLAGS += -maltivec
-signal_tm: CFLAGS += -mhtm
+$(OUTPUT)/signal_tm: CFLAGS += -mhtm
 
 include ../../lib.mk
 
-clean:
-   rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c signal.S
-- 
2.20.1



[PATCH AUTOSEL 4.14 089/150] powerpc/xmon: Relax frame size for clang

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit 9c87156cce5a63735d1218f0096a65c50a7a32aa ]

When building with clang (8 trunk, 7.0 release) the frame size limit is
hit:

 arch/powerpc/xmon/xmon.c:452:12: warning: stack frame size of 2576
 bytes in function 'xmon_core' [-Wframe-larger-than=]

Some investigation by Naveen indicates this is due to clang saving the
addresses to printf format strings on the stack.

While this issue is investigated, bump up the frame size limit for xmon
when building with clang.

Link: https://github.com/ClangBuiltLinux/linux/issues/252
Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/xmon/Makefile | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/xmon/Makefile b/arch/powerpc/xmon/Makefile
index 549e99e71112b..ac5ee067aa512 100644
--- a/arch/powerpc/xmon/Makefile
+++ b/arch/powerpc/xmon/Makefile
@@ -13,6 +13,12 @@ UBSAN_SANITIZE := n
 ORIG_CFLAGS := $(KBUILD_CFLAGS)
 KBUILD_CFLAGS = $(subst -mno-sched-epilog,,$(subst 
$(CC_FLAGS_FTRACE),,$(ORIG_CFLAGS)))
 
+ifdef CONFIG_CC_IS_CLANG
+# clang stores addresses on the stack causing the frame size to blow
+# out. See https://github.com/ClangBuiltLinux/linux/issues/252
+KBUILD_CFLAGS += -Wframe-larger-than=4096
+endif
+
 ccflags-$(CONFIG_PPC64) := $(NO_MINIMAL_TOC)
 
 obj-y  += xmon.o nonstdio.o spr_access.o
-- 
2.20.1



[PATCH AUTOSEL 4.14 076/150] powerpc/process: Fix flush_all_to_thread for SPE

2019-11-16 Thread Sasha Levin
From: Felipe Rechia 

[ Upstream commit e901378578c62202594cba0f6c076f3df365ec91 ]

Fix a bug introduced by the creation of flush_all_to_thread() for
processors that have SPE (Signal Processing Engine) and use it to
compute floating-point operations.

>From userspace perspective, the problem was seen in attempts of
computing floating-point operations which should generate exceptions.
For example:

  fork();
  float x = 0.0 / 0.0;
  isnan(x);   // forked process returns False (should be True)

The operation above also should always cause the SPEFSCR FINV bit to
be set. However, the SPE floating-point exceptions were turned off
after a fork().

Kernel versions prior to the bug used flush_spe_to_thread(), which
first saves SPEFSCR register values in tsk->thread and then calls
giveup_spe(tsk).

After commit 579e633e764e, the save_all() function was called first
to giveup_spe(), and then the SPEFSCR register values were saved in
tsk->thread. This would save the SPEFSCR register values after
disabling SPE for that thread, causing the bug described above.

Fixes 579e633e764e ("powerpc: create flush_all_to_thread()")
Signed-off-by: Felipe Rechia 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/process.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 5e5da2073fdff..ba0d4f9a99bac 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -567,12 +567,11 @@ void flush_all_to_thread(struct task_struct *tsk)
if (tsk->thread.regs) {
preempt_disable();
BUG_ON(tsk != current);
-   save_all(tsk);
-
 #ifdef CONFIG_SPE
if (tsk->thread.regs->msr & MSR_SPE)
tsk->thread.spefscr = mfspr(SPRN_SPEFSCR);
 #endif
+   save_all(tsk);
 
preempt_enable();
}
-- 
2.20.1



[PATCH AUTOSEL 4.14 059/150] powerpc/pseries: Export raw per-CPU VPA data via debugfs

2019-11-16 Thread Sasha Levin
From: Aravinda Prasad 

[ Upstream commit c6c26fb55e8e4b3fc376be5611685990a17de27a ]

This patch exports the raw per-CPU VPA data via debugfs.
A per-CPU file is created which exports the VPA data of
that CPU to help debug some of the VPA related issues or
to analyze the per-CPU VPA related statistics.

v3: Removed offline CPU check.

v2: Included offline CPU check and other review comments.

Signed-off-by: Aravinda Prasad 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/lpar.c | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index eb738ef577926..c0ae3847b8db5 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
@@ -1036,3 +1037,56 @@ static int __init reserve_vrma_context_id(void)
return 0;
 }
 machine_device_initcall(pseries, reserve_vrma_context_id);
+
+#ifdef CONFIG_DEBUG_FS
+/* debugfs file interface for vpa data */
+static ssize_t vpa_file_read(struct file *filp, char __user *buf, size_t len,
+ loff_t *pos)
+{
+   int cpu = (long)filp->private_data;
+   struct lppaca *lppaca = _of(cpu);
+
+   return simple_read_from_buffer(buf, len, pos, lppaca,
+   sizeof(struct lppaca));
+}
+
+static const struct file_operations vpa_fops = {
+   .open   = simple_open,
+   .read   = vpa_file_read,
+   .llseek = default_llseek,
+};
+
+static int __init vpa_debugfs_init(void)
+{
+   char name[16];
+   long i;
+   static struct dentry *vpa_dir;
+
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return 0;
+
+   vpa_dir = debugfs_create_dir("vpa", powerpc_debugfs_root);
+   if (!vpa_dir) {
+   pr_warn("%s: can't create vpa root dir\n", __func__);
+   return -ENOMEM;
+   }
+
+   /* set up the per-cpu vpa file*/
+   for_each_possible_cpu(i) {
+   struct dentry *d;
+
+   sprintf(name, "cpu-%ld", i);
+
+   d = debugfs_create_file(name, 0400, vpa_dir, (void *)i,
+   _fops);
+   if (!d) {
+   pr_warn("%s: can't create per-cpu vpa file\n",
+   __func__);
+   return -ENOMEM;
+   }
+   }
+
+   return 0;
+}
+machine_arch_initcall(pseries, vpa_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
-- 
2.20.1



[PATCH AUTOSEL 4.14 048/150] macintosh/windfarm_smu_sat: Fix debug output

2019-11-16 Thread Sasha Levin
From: Benjamin Herrenschmidt 

[ Upstream commit fc0c8b36d379a046525eacb9c3323ca635283757 ]

There's some antiquated debug output that's trying
to do a hand-made hexdump and turning into horrible
1-byte-per-line output these days.

Use print_hex_dump() instead

Signed-off-by: Benjamin Herrenschmidt 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 drivers/macintosh/windfarm_smu_sat.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/macintosh/windfarm_smu_sat.c 
b/drivers/macintosh/windfarm_smu_sat.c
index da7f4fc1a51d1..a0f61eb853c55 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -22,14 +22,6 @@
 
 #define VERSION "1.0"
 
-#define DEBUG
-
-#ifdef DEBUG
-#define DBG(args...)   printk(args)
-#else
-#define DBG(args...)   do { } while(0)
-#endif
-
 /* If the cache is older than 800ms we'll refetch it */
 #define MAX_AGEmsecs_to_jiffies(800)
 
@@ -106,13 +98,10 @@ struct smu_sdbp_header *smu_sat_get_sdb_partition(unsigned 
int sat_id, int id,
buf[i+2] = data[3];
buf[i+3] = data[2];
}
-#ifdef DEBUG
-   DBG(KERN_DEBUG "sat %d partition %x:", sat_id, id);
-   for (i = 0; i < len; ++i)
-   DBG(" %x", buf[i]);
-   DBG("\n");
-#endif
 
+   printk(KERN_DEBUG "sat %d partition %x:", sat_id, id);
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, buf, len, false);
if (size)
*size = len;
return (struct smu_sdbp_header *) buf;
@@ -132,13 +121,13 @@ static int wf_sat_read_cache(struct wf_sat *sat)
if (err < 0)
return err;
sat->last_read = jiffies;
+
 #ifdef LOTSA_DEBUG
{
int i;
-   DBG(KERN_DEBUG "wf_sat_get: data is");
-   for (i = 0; i < 16; ++i)
-   DBG(" %.2x", sat->cache[i]);
-   DBG("\n");
+   printk(KERN_DEBUG "wf_sat_get: data is");
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, sat->cache, 16, false);
}
 #endif
return 0;
-- 
2.20.1



[PATCH AUTOSEL 4.14 008/150] powerpc/eeh: Fix use of EEH_PE_KEEP on wrong field

2019-11-16 Thread Sasha Levin
From: Sam Bobroff 

[ Upstream commit 473af09b56dc4be68e4af33220ceca6be67aa60d ]

eeh_add_to_parent_pe() sometimes removes the EEH_PE_KEEP flag, but it
incorrectly removes it from pe->type, instead of pe->state.

However, rather than clearing it from the correct field, remove it.
Inspection of the code shows that it can't ever have had any effect
(even if it had been cleared from the correct field), because the
field is never tested after it is cleared by the statement in
question.

The clear statement was added by commit 807a827d4e74 ("powerpc/eeh:
Keep PE during hotplug"), but it didn't explain why it was necessary.

Signed-off-by: Sam Bobroff 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/eeh_pe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 8545a9523b9bc..7339ca4fdc19e 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -381,7 +381,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
while (parent) {
if (!(parent->type & EEH_PE_INVALID))
break;
-   parent->type &= ~(EEH_PE_INVALID | EEH_PE_KEEP);
+   parent->type &= ~EEH_PE_INVALID;
parent = parent->parent;
}
 
-- 
2.20.1



[PATCH AUTOSEL 4.14 007/150] powerpc/boot: Disable vector instructions

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit e8e132e6885962582784b6fa16a80d07ea739c0f ]

This will avoid auto-vectorisation when building with higher
optimisation levels.

We don't know if the machine can support VSX and even if it's present
it's probably not going to be enabled at this point in boot.

These flag were both added prior to GCC 4.6 which is the minimum
compiler version supported by upstream, thanks to Segher for the
details.

Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index e2a5a932c24a8..5807c9d8e56d5 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -24,8 +24,8 @@ compress-$(CONFIG_KERNEL_GZIP) := CONFIG_KERNEL_GZIP
 compress-$(CONFIG_KERNEL_XZ)   := CONFIG_KERNEL_XZ
 
 BOOTCFLAGS:= -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
--fno-strict-aliasing -Os -msoft-float -pipe \
--fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
+-fno-strict-aliasing -Os -msoft-float -mno-altivec -mno-vsx \
+-pipe -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
 -D$(compress-y)
 
 BOOTCC := $(CC)
-- 
2.20.1



[PATCH AUTOSEL 4.14 006/150] powerpc: Fix signedness bug in update_flash_db()

2019-11-16 Thread Sasha Levin
From: Dan Carpenter 

[ Upstream commit 014704e6f54189a203cc14c7c0bb411b940241bc ]

The "count < sizeof(struct os_area_db)" comparison is type promoted to
size_t so negative values of "count" are treated as very high values
and we accidentally return success instead of a negative error code.

This doesn't really change runtime much but it fixes a static checker
warning.

Signed-off-by: Dan Carpenter 
Acked-by: Geoff Levand 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/ps3/os-area.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/ps3/os-area.c 
b/arch/powerpc/platforms/ps3/os-area.c
index 3db53e8aff927..9b2ef76578f06 100644
--- a/arch/powerpc/platforms/ps3/os-area.c
+++ b/arch/powerpc/platforms/ps3/os-area.c
@@ -664,7 +664,7 @@ static int update_flash_db(void)
db_set_64(db, _area_db_id_rtc_diff, saved_params.rtc_diff);
 
count = os_area_flash_write(db, sizeof(struct os_area_db), pos);
-   if (count < sizeof(struct os_area_db)) {
+   if (count < 0 || count < sizeof(struct os_area_db)) {
pr_debug("%s: os_area_flash_write failed %zd\n", __func__,
 count);
error = count < 0 ? count : -EIO;
-- 
2.20.1



[PATCH AUTOSEL 4.19 165/237] powerpc/powernv: hold device_hotplug_lock when calling device_online()

2019-11-16 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit cec1680591d6d5b10ecc10f370210089416e98af ]

device_online() should be called with device_hotplug_lock() held.

Link: http://lkml.kernel.org/r/20180925091457.28651-5-da...@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rashmica Gupta 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rashmica Gupta 
Cc: Balbir Singh 
Cc: Michael Neuling 
Cc: Boris Ostrovsky 
Cc: Dan Williams 
Cc: Greg Kroah-Hartman 
Cc: Haiyang Zhang 
Cc: Heiko Carstens 
Cc: John Allen 
Cc: Jonathan Corbet 
Cc: Joonsoo Kim 
Cc: Juergen Gross 
Cc: Kate Stewart 
Cc: "K. Y. Srinivasan" 
Cc: Len Brown 
Cc: Martin Schwidefsky 
Cc: Mathieu Malaterre 
Cc: Michal Hocko 
Cc: Nathan Fontenot 
Cc: Oscar Salvador 
Cc: Philippe Ombredanne 
Cc: Rafael J. Wysocki 
Cc: "Rafael J. Wysocki" 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: Vlastimil Babka 
Cc: YASUAKI ISHIMATSU 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/powernv/memtrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 232bf5987f91d..dd3cc4632b9ae 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -244,9 +244,11 @@ static int memtrace_online(void)
 * we need to online the memory ourselves.
 */
if (!memhp_auto_online) {
+   lock_device_hotplug();
walk_memory_range(PFN_DOWN(ent->start),
  PFN_UP(ent->start + ent->size - 1),
  NULL, online_mem_block);
+   unlock_device_hotplug();
}
 
/*
-- 
2.20.1



[PATCH AUTOSEL 4.19 163/237] mm/memory_hotplug: make add_memory() take the device_hotplug_lock

2019-11-16 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 8df1d0e4a265f25dc1e7e7624ccdbcb4a6630c89 ]

add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.

In general, we should hold the device_hotplug_lock when adding memory to
synchronize against online/offline request (e.g.  from user space) - which
already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see 30467e0b3be ("mm, hotplug: fix concurrent memory
hot-add deadlock").  add_memory()/add_memory_resource() will create memory
block devices, so this really feels like the right thing to do.

Holding the device_hotplug_lock makes sure that a memory block device
can really only be accessed (e.g. via .online/.state) from user space,
once the memory has been fully added to the system.

The lock is not held yet in
drivers/xen/balloon.c
arch/powerpc/platforms/powernv/memtrace.c
drivers/s390/char/sclp_cmd.c
drivers/hv/hv_balloon.c
So, let's either use the locked variants or take the lock.

Don't export add_memory_resource(), as it once was exported to be used by
XEN, which is never built as a module.  If somebody requires it, we also
have to export a locked variant (as device_hotplug_lock is never
exported).

Link: http://lkml.kernel.org/r/20180925091457.28651-3-da...@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Rashmica Gupta 
Reviewed-by: Oscar Salvador 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Nathan Fontenot 
Cc: John Allen 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Mathieu Malaterre 
Cc: Pavel Tatashin 
Cc: YASUAKI ISHIMATSU 
Cc: Balbir Singh 
Cc: Haiyang Zhang 
Cc: Heiko Carstens 
Cc: Jonathan Corbet 
Cc: Kate Stewart 
Cc: "K. Y. Srinivasan" 
Cc: Martin Schwidefsky 
Cc: Michael Neuling 
Cc: Philippe Ombredanne 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 .../platforms/pseries/hotplug-memory.c|  2 +-
 drivers/acpi/acpi_memhotplug.c|  2 +-
 drivers/base/memory.c |  9 ++--
 drivers/xen/balloon.c |  3 +++
 include/linux/memory_hotplug.h|  1 +
 mm/memory_hotplug.c   | 22 ---
 6 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index f99cd31b6fd1a..7d61da2fe5226 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -705,7 +705,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz);
if (rc) {
dlpar_remove_device_tree_lmb(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309cb..2ccfbb61ca899 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 85ee64d0a44e9..07901cacfec63 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -519,15 +519,20 @@ memory_probe_store(struct device *dev, struct 
device_attribute *attr,
if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
return -EINVAL;
 
+   ret = lock_device_hotplug_sysfs();
+   if (ret)
+   goto out;
+
nid = memory_add_physaddr_to_nid(phys_addr);
-   ret = add_memory(nid, phys_addr,
-MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+   ret = __add_memory(nid, phys_addr,
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
 
if (ret)
goto out;
 
ret = count;
 out:
+   unlock_device_hotplug();
return ret;
 }
 
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index d4e8b717ce2b2..747a15acbce37 100644
--- a/drivers/xen/balloon.c
+++ 

[PATCH AUTOSEL 4.19 147/237] selftests/powerpc/cache_shape: Fix out-of-tree build

2019-11-16 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 69f8117f17b332a68cd8f4bf8c2d0d3d5b84efc5 ]

Use TEST_GEN_PROGS and don't redefine all, this makes the out-of-tree
build work. We need to move the extra dependencies below the include
of lib.mk, because it adds the $(OUTPUT) prefix if it's defined.

We can also drop the clean rule, lib.mk does it for us.

Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/powerpc/cache_shape/Makefile | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/powerpc/cache_shape/Makefile 
b/tools/testing/selftests/powerpc/cache_shape/Makefile
index ede4d3dae7505..689f6c8ebcd8d 100644
--- a/tools/testing/selftests/powerpc/cache_shape/Makefile
+++ b/tools/testing/selftests/powerpc/cache_shape/Makefile
@@ -1,12 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := cache_shape
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c
+TEST_GEN_PROGS := cache_shape
 
 top_srcdir = ../../../../..
 include ../../lib.mk
 
-clean:
-   rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c
-- 
2.20.1



[PATCH AUTOSEL 4.19 146/237] selftests/powerpc/switch_endian: Fix out-of-tree build

2019-11-16 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 266bac361d5677e61a6815bd29abeb3bdced2b07 ]

For the out-of-tree build to work we need to tell switch_endian_test
to look for check-reversed.S in $(OUTPUT).

Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/powerpc/switch_endian/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/powerpc/switch_endian/Makefile 
b/tools/testing/selftests/powerpc/switch_endian/Makefile
index fcd2dcb8972ba..bdc081afedb0f 100644
--- a/tools/testing/selftests/powerpc/switch_endian/Makefile
+++ b/tools/testing/selftests/powerpc/switch_endian/Makefile
@@ -8,6 +8,7 @@ EXTRA_CLEAN = $(OUTPUT)/*.o $(OUTPUT)/check-reversed.S
 top_srcdir = ../../../../..
 include ../../lib.mk
 
+$(OUTPUT)/switch_endian_test: ASFLAGS += -I $(OUTPUT)
 $(OUTPUT)/switch_endian_test: $(OUTPUT)/check-reversed.S
 
 $(OUTPUT)/check-reversed.o: $(OUTPUT)/check.o
-- 
2.20.1



[PATCH AUTOSEL 4.19 145/237] selftests/powerpc/signal: Fix out-of-tree build

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit 27825349d7b238533a47e3d98b8bb0efd886b752 ]

We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.

It also means we don't need our own clean rule, lib.mk does it.

We also have to update the signal_tm rule to use $(OUTPUT).

Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/powerpc/signal/Makefile | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/powerpc/signal/Makefile 
b/tools/testing/selftests/powerpc/signal/Makefile
index 1fca25c6ace06..209a958dca127 100644
--- a/tools/testing/selftests/powerpc/signal/Makefile
+++ b/tools/testing/selftests/powerpc/signal/Makefile
@@ -1,15 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := signal signal_tm
-
-all: $(TEST_PROGS)
-
-$(TEST_PROGS): ../harness.c ../utils.c signal.S
+TEST_GEN_PROGS := signal signal_tm
 
 CFLAGS += -maltivec
-signal_tm: CFLAGS += -mhtm
+$(OUTPUT)/signal_tm: CFLAGS += -mhtm
 
 top_srcdir = ../../../../..
 include ../../lib.mk
 
-clean:
-   rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c signal.S
-- 
2.20.1



[PATCH AUTOSEL 4.19 144/237] selftests/powerpc/ptrace: Fix out-of-tree build

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit c39b79082a38a4f8c801790edecbbb4d62ed2992 ]

We should use TEST_GEN_PROGS, not TEST_PROGS. That tells the selftests
makefile (lib.mk) that those tests are generated (built), and so it
adds the $(OUTPUT) prefix for us, making the out-of-tree build work
correctly.

It also means we don't need our own clean rule, lib.mk does it.

We also have to update the ptrace-pkey and core-pkey rules to use
$(OUTPUT).

Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/powerpc/ptrace/Makefile | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/Makefile 
b/tools/testing/selftests/powerpc/ptrace/Makefile
index 923d531265f8c..9f9423430059e 100644
--- a/tools/testing/selftests/powerpc/ptrace/Makefile
+++ b/tools/testing/selftests/powerpc/ptrace/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-TEST_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
+TEST_GEN_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
   ptrace-tar ptrace-tm-tar ptrace-tm-spd-tar ptrace-vsx 
ptrace-tm-vsx \
   ptrace-tm-spd-vsx ptrace-tm-spr ptrace-hwbreak ptrace-pkey 
core-pkey \
   perf-hwbreak
@@ -7,14 +7,9 @@ TEST_PROGS := ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr \
 top_srcdir = ../../../../..
 include ../../lib.mk
 
-all: $(TEST_PROGS)
-
 CFLAGS += -m64 -I../../../../../usr/include -I../tm -mhtm -fno-pie
 
-ptrace-pkey core-pkey: child.h
-ptrace-pkey core-pkey: LDLIBS += -pthread
-
-$(TEST_PROGS): ../harness.c ../utils.c ../lib/reg.S ptrace.h
+$(OUTPUT)/ptrace-pkey $(OUTPUT)/core-pkey: child.h
+$(OUTPUT)/ptrace-pkey $(OUTPUT)/core-pkey: LDLIBS += -pthread
 
-clean:
-   rm -f $(TEST_PROGS) *.o
+$(TEST_GEN_PROGS): ../harness.c ../utils.c ../lib/reg.S ptrace.h
-- 
2.20.1



[PATCH AUTOSEL 4.19 143/237] powerpc/xmon: Relax frame size for clang

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit 9c87156cce5a63735d1218f0096a65c50a7a32aa ]

When building with clang (8 trunk, 7.0 release) the frame size limit is
hit:

 arch/powerpc/xmon/xmon.c:452:12: warning: stack frame size of 2576
 bytes in function 'xmon_core' [-Wframe-larger-than=]

Some investigation by Naveen indicates this is due to clang saving the
addresses to printf format strings on the stack.

While this issue is investigated, bump up the frame size limit for xmon
when building with clang.

Link: https://github.com/ClangBuiltLinux/linux/issues/252
Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/xmon/Makefile | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/xmon/Makefile b/arch/powerpc/xmon/Makefile
index 9d7d8e6d705c4..9ba44e190e5e4 100644
--- a/arch/powerpc/xmon/Makefile
+++ b/arch/powerpc/xmon/Makefile
@@ -13,6 +13,12 @@ UBSAN_SANITIZE := n
 ORIG_CFLAGS := $(KBUILD_CFLAGS)
 KBUILD_CFLAGS = $(subst $(CC_FLAGS_FTRACE),,$(ORIG_CFLAGS))
 
+ifdef CONFIG_CC_IS_CLANG
+# clang stores addresses on the stack causing the frame size to blow
+# out. See https://github.com/ClangBuiltLinux/linux/issues/252
+KBUILD_CFLAGS += -Wframe-larger-than=4096
+endif
+
 ccflags-$(CONFIG_PPC64) := $(NO_MINIMAL_TOC)
 
 obj-y  += xmon.o nonstdio.o spr_access.o
-- 
2.20.1



[PATCH AUTOSEL 4.19 126/237] powerpc/process: Fix flush_all_to_thread for SPE

2019-11-16 Thread Sasha Levin
From: Felipe Rechia 

[ Upstream commit e901378578c62202594cba0f6c076f3df365ec91 ]

Fix a bug introduced by the creation of flush_all_to_thread() for
processors that have SPE (Signal Processing Engine) and use it to
compute floating-point operations.

>From userspace perspective, the problem was seen in attempts of
computing floating-point operations which should generate exceptions.
For example:

  fork();
  float x = 0.0 / 0.0;
  isnan(x);   // forked process returns False (should be True)

The operation above also should always cause the SPEFSCR FINV bit to
be set. However, the SPE floating-point exceptions were turned off
after a fork().

Kernel versions prior to the bug used flush_spe_to_thread(), which
first saves SPEFSCR register values in tsk->thread and then calls
giveup_spe(tsk).

After commit 579e633e764e, the save_all() function was called first
to giveup_spe(), and then the SPEFSCR register values were saved in
tsk->thread. This would save the SPEFSCR register values after
disabling SPE for that thread, causing the bug described above.

Fixes 579e633e764e ("powerpc: create flush_all_to_thread()")
Signed-off-by: Felipe Rechia 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/process.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 909c9407e392a..02b69a68139cc 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -575,12 +575,11 @@ void flush_all_to_thread(struct task_struct *tsk)
if (tsk->thread.regs) {
preempt_disable();
BUG_ON(tsk != current);
-   save_all(tsk);
-
 #ifdef CONFIG_SPE
if (tsk->thread.regs->msr & MSR_SPE)
tsk->thread.spefscr = mfspr(SPRN_SPEFSCR);
 #endif
+   save_all(tsk);
 
preempt_enable();
}
-- 
2.20.1



[PATCH AUTOSEL 4.19 095/237] powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmd

2019-11-16 Thread Sasha Levin
From: Nicholas Piggin 

[ Upstream commit dd76ff5af35350fd6d5bb5b069e73b6017f66893 ]

Signed-off-by: Nicholas Piggin 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/tlb-radix.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 62be0e5732b70..02bc3ae8522f4 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -1071,7 +1071,6 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, 
unsigned long addr)
goto local;
}
_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, 
true);
-   goto local;
} else {
 local:
_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, 
true);
-- 
2.20.1



[PATCH AUTOSEL 4.19 094/237] powerpc/mm/radix: Fix small page at boundary when splitting

2019-11-16 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 81d1b54dec95209ab5e5be2cf37182885f998753 ]

When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
linear mapping at the text/data boundary so we can map the kernel
text read only.

Currently we always use a small page at the text/data boundary, even
when that's not necessary:

  Mapped 0x-0x00e0 with 2.00 MiB pages
  Mapped 0x00e0-0x0100 with 64.0 KiB pages
  Mapped 0x0100-0x4000 with 2.00 MiB pages

This is because the check that the mapping crosses the __init_begin
boundary is too strict, it also returns true when we map exactly up to
the boundary.

So fix it to check that the mapping would actually map past
__init_begin, and with that we see:

  Mapped 0x-0x4000 with 2.00 MiB pages
  Mapped 0x4000-0x0001 with 1.00 GiB pages

Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/pgtable-radix.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index b387c7b917b7e..69caeb5bccb21 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -295,14 +295,14 @@ static int __meminit create_physical_mapping(unsigned 
long start,
 
if (split_text_mapping && (mapping_size == PUD_SIZE) &&
(addr < __pa_symbol(__init_begin)) &&
-   (addr + mapping_size) >= __pa_symbol(__init_begin)) {
+   (addr + mapping_size) > __pa_symbol(__init_begin)) {
max_mapping_size = PMD_SIZE;
goto retry;
}
 
if (split_text_mapping && (mapping_size == PMD_SIZE) &&
(addr < __pa_symbol(__init_begin)) &&
-   (addr + mapping_size) >= __pa_symbol(__init_begin)) {
+   (addr + mapping_size) > __pa_symbol(__init_begin)) {
mapping_size = PAGE_SIZE;
psize = mmu_virtual_psize;
}
-- 
2.20.1



[PATCH AUTOSEL 4.19 093/237] powerpc/mm/radix: Fix overuse of small pages in splitting logic

2019-11-16 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 3b5657ed5b4e27ccf593a41ff3c5aa27dae8df18 ]

When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
linear mapping at the text/data boundary so we can map the kernel text
read only.

But the current logic uses small pages for the entire text section,
regardless of whether a larger page size would fit. eg. with the
boundary at 16M we could use 2M pages, but instead we use 64K pages up
to the 16M boundary:

  Mapped 0x-0x0100 with 64.0 KiB pages
  Mapped 0x0100-0x4000 with 2.00 MiB pages
  Mapped 0x4000-0x0001 with 1.00 GiB pages

This is because the test is checking if addr is < __init_begin
and addr + mapping_size is >= _stext. But that is true for all pages
between _stext and __init_begin.

Instead what we want to check is if we are crossing the text/data
boundary, which is at __init_begin. With that fixed we see:

  Mapped 0x-0x00e0 with 2.00 MiB pages
  Mapped 0x00e0-0x0100 with 64.0 KiB pages
  Mapped 0x0100-0x4000 with 2.00 MiB pages
  Mapped 0x4000-0x0001 with 1.00 GiB pages

ie. we're correctly using 2MB pages below __init_begin, but we still
drop down to 64K pages unnecessarily at the boundary.

Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/pgtable-radix.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 24a2eadc8c21a..b387c7b917b7e 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -295,14 +295,14 @@ static int __meminit create_physical_mapping(unsigned 
long start,
 
if (split_text_mapping && (mapping_size == PUD_SIZE) &&
(addr < __pa_symbol(__init_begin)) &&
-   (addr + mapping_size) >= __pa_symbol(_stext)) {
+   (addr + mapping_size) >= __pa_symbol(__init_begin)) {
max_mapping_size = PMD_SIZE;
goto retry;
}
 
if (split_text_mapping && (mapping_size == PMD_SIZE) &&
(addr < __pa_symbol(__init_begin)) &&
-   (addr + mapping_size) >= __pa_symbol(_stext)) {
+   (addr + mapping_size) >= __pa_symbol(__init_begin)) {
mapping_size = PAGE_SIZE;
psize = mmu_virtual_psize;
}
-- 
2.20.1



[PATCH AUTOSEL 4.19 092/237] powerpc/mm/radix: Fix off-by-one in split mapping logic

2019-11-16 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 5c6499b7041b43807dfaeda28aa87fc0e62558f7 ]

When we have CONFIG_STRICT_KERNEL_RWX enabled, we try to split the
kernel linear (1:1) mapping so that the kernel text is in a separate
page to kernel data, so we can mark the former read-only.

We could achieve that just by always using 64K pages for the linear
mapping, but we try to be smarter. Instead we use huge pages when
possible, and only switch to smaller pages when necessary.

However we have an off-by-one bug in that logic, which causes us to
calculate the wrong boundary between text and data.

For example with the end of the kernel text at 16M we see:

  radix-mmu: Mapped 0x-0x0120 with 64.0 KiB pages
  radix-mmu: Mapped 0x0120-0x4000 with 2.00 MiB pages
  radix-mmu: Mapped 0x4000-0x0001 with 1.00 GiB pages

ie. we mapped from 0 to 18M with 64K pages, even though the boundary
between text and data is at 16M.

With the fix we see we're correctly hitting the 16M boundary:

  radix-mmu: Mapped 0x-0x0100 with 64.0 KiB pages
  radix-mmu: Mapped 0x0100-0x4000 with 2.00 MiB pages
  radix-mmu: Mapped 0x4000-0x0001 with 1.00 GiB pages

Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/pgtable-radix.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 3ea4c1f107d7e..24a2eadc8c21a 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -294,14 +294,14 @@ static int __meminit create_physical_mapping(unsigned 
long start,
}
 
if (split_text_mapping && (mapping_size == PUD_SIZE) &&
-   (addr <= __pa_symbol(__init_begin)) &&
+   (addr < __pa_symbol(__init_begin)) &&
(addr + mapping_size) >= __pa_symbol(_stext)) {
max_mapping_size = PMD_SIZE;
goto retry;
}
 
if (split_text_mapping && (mapping_size == PMD_SIZE) &&
-   (addr <= __pa_symbol(__init_begin)) &&
+   (addr < __pa_symbol(__init_begin)) &&
(addr + mapping_size) >= __pa_symbol(_stext)) {
mapping_size = PAGE_SIZE;
psize = mmu_virtual_psize;
-- 
2.20.1



[PATCH AUTOSEL 4.19 091/237] powerpc/pseries: Export raw per-CPU VPA data via debugfs

2019-11-16 Thread Sasha Levin
From: Aravinda Prasad 

[ Upstream commit c6c26fb55e8e4b3fc376be5611685990a17de27a ]

This patch exports the raw per-CPU VPA data via debugfs.
A per-CPU file is created which exports the VPA data of
that CPU to help debug some of the VPA related issues or
to analyze the per-CPU VPA related statistics.

v3: Removed offline CPU check.

v2: Included offline CPU check and other review comments.

Signed-off-by: Aravinda Prasad 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/lpar.c | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index ea602f7f97ce1..49e3a88b6a0c1 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
@@ -1032,3 +1033,56 @@ static int __init reserve_vrma_context_id(void)
return 0;
 }
 machine_device_initcall(pseries, reserve_vrma_context_id);
+
+#ifdef CONFIG_DEBUG_FS
+/* debugfs file interface for vpa data */
+static ssize_t vpa_file_read(struct file *filp, char __user *buf, size_t len,
+ loff_t *pos)
+{
+   int cpu = (long)filp->private_data;
+   struct lppaca *lppaca = _of(cpu);
+
+   return simple_read_from_buffer(buf, len, pos, lppaca,
+   sizeof(struct lppaca));
+}
+
+static const struct file_operations vpa_fops = {
+   .open   = simple_open,
+   .read   = vpa_file_read,
+   .llseek = default_llseek,
+};
+
+static int __init vpa_debugfs_init(void)
+{
+   char name[16];
+   long i;
+   static struct dentry *vpa_dir;
+
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return 0;
+
+   vpa_dir = debugfs_create_dir("vpa", powerpc_debugfs_root);
+   if (!vpa_dir) {
+   pr_warn("%s: can't create vpa root dir\n", __func__);
+   return -ENOMEM;
+   }
+
+   /* set up the per-cpu vpa file*/
+   for_each_possible_cpu(i) {
+   struct dentry *d;
+
+   sprintf(name, "cpu-%ld", i);
+
+   d = debugfs_create_file(name, 0400, vpa_dir, (void *)i,
+   _fops);
+   if (!d) {
+   pr_warn("%s: can't create per-cpu vpa file\n",
+   __func__);
+   return -ENOMEM;
+   }
+   }
+
+   return 0;
+}
+machine_arch_initcall(pseries, vpa_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
-- 
2.20.1



[PATCH AUTOSEL 4.19 075/237] macintosh/windfarm_smu_sat: Fix debug output

2019-11-16 Thread Sasha Levin
From: Benjamin Herrenschmidt 

[ Upstream commit fc0c8b36d379a046525eacb9c3323ca635283757 ]

There's some antiquated debug output that's trying
to do a hand-made hexdump and turning into horrible
1-byte-per-line output these days.

Use print_hex_dump() instead

Signed-off-by: Benjamin Herrenschmidt 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 drivers/macintosh/windfarm_smu_sat.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/macintosh/windfarm_smu_sat.c 
b/drivers/macintosh/windfarm_smu_sat.c
index da7f4fc1a51d1..a0f61eb853c55 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -22,14 +22,6 @@
 
 #define VERSION "1.0"
 
-#define DEBUG
-
-#ifdef DEBUG
-#define DBG(args...)   printk(args)
-#else
-#define DBG(args...)   do { } while(0)
-#endif
-
 /* If the cache is older than 800ms we'll refetch it */
 #define MAX_AGEmsecs_to_jiffies(800)
 
@@ -106,13 +98,10 @@ struct smu_sdbp_header *smu_sat_get_sdb_partition(unsigned 
int sat_id, int id,
buf[i+2] = data[3];
buf[i+3] = data[2];
}
-#ifdef DEBUG
-   DBG(KERN_DEBUG "sat %d partition %x:", sat_id, id);
-   for (i = 0; i < len; ++i)
-   DBG(" %x", buf[i]);
-   DBG("\n");
-#endif
 
+   printk(KERN_DEBUG "sat %d partition %x:", sat_id, id);
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, buf, len, false);
if (size)
*size = len;
return (struct smu_sdbp_header *) buf;
@@ -132,13 +121,13 @@ static int wf_sat_read_cache(struct wf_sat *sat)
if (err < 0)
return err;
sat->last_read = jiffies;
+
 #ifdef LOTSA_DEBUG
{
int i;
-   DBG(KERN_DEBUG "wf_sat_get: data is");
-   for (i = 0; i < 16; ++i)
-   DBG(" %.2x", sat->cache[i]);
-   DBG("\n");
+   printk(KERN_DEBUG "wf_sat_get: data is");
+   print_hex_dump(KERN_DEBUG, "  ", DUMP_PREFIX_OFFSET,
+  16, 1, sat->cache, 16, false);
}
 #endif
return 0;
-- 
2.20.1



[PATCH AUTOSEL 4.19 014/237] powerpc/eeh: Fix use of EEH_PE_KEEP on wrong field

2019-11-16 Thread Sasha Levin
From: Sam Bobroff 

[ Upstream commit 473af09b56dc4be68e4af33220ceca6be67aa60d ]

eeh_add_to_parent_pe() sometimes removes the EEH_PE_KEEP flag, but it
incorrectly removes it from pe->type, instead of pe->state.

However, rather than clearing it from the correct field, remove it.
Inspection of the code shows that it can't ever have had any effect
(even if it had been cleared from the correct field), because the
field is never tested after it is cleared by the statement in
question.

The clear statement was added by commit 807a827d4e74 ("powerpc/eeh:
Keep PE during hotplug"), but it didn't explain why it was necessary.

Signed-off-by: Sam Bobroff 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/eeh_pe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 1b238ecc553e2..210d239a93950 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -379,7 +379,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
while (parent) {
if (!(parent->type & EEH_PE_INVALID))
break;
-   parent->type &= ~(EEH_PE_INVALID | EEH_PE_KEEP);
+   parent->type &= ~EEH_PE_INVALID;
parent = parent->parent;
}
 
-- 
2.20.1



[PATCH AUTOSEL 4.19 013/237] powerpc/eeh: Fix null deref for devices removed during EEH

2019-11-16 Thread Sasha Levin
From: Sam Bobroff 

[ Upstream commit bcbe3730531239abd45ab6c6af4a18078b37dd47 ]

If a device is removed during EEH processing (either by a driver's
handler or as part of recovery), it can lead to a null dereference
in eeh_pe_report_edev().

To handle this, skip devices that have been removed.

Signed-off-by: Sam Bobroff 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/eeh_driver.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 110eba400de7c..af1f3d5f9a0f7 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -281,6 +281,10 @@ static void eeh_pe_report_edev(struct eeh_dev *edev, 
eeh_report_fn fn,
struct pci_driver *driver;
enum pci_ers_result new_result;
 
+   if (!edev->pdev) {
+   eeh_edev_info(edev, "no device");
+   return;
+   }
device_lock(>pdev->dev);
if (eeh_edev_actionable(edev)) {
driver = eeh_pcid_get(edev->pdev);
-- 
2.20.1



[PATCH AUTOSEL 4.19 012/237] powerpc/boot: Disable vector instructions

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit e8e132e6885962582784b6fa16a80d07ea739c0f ]

This will avoid auto-vectorisation when building with higher
optimisation levels.

We don't know if the machine can support VSX and even if it's present
it's probably not going to be enabled at this point in boot.

These flag were both added prior to GCC 4.6 which is the minimum
compiler version supported by upstream, thanks to Segher for the
details.

Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 25e3184f11f78..7d5ddf53750ce 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -32,8 +32,8 @@ else
 endif
 
 BOOTCFLAGS:= -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
--fno-strict-aliasing -Os -msoft-float -pipe \
--fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
+-fno-strict-aliasing -Os -msoft-float -mno-altivec -mno-vsx \
+-pipe -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
 -D$(compress-y)
 
 ifdef CONFIG_PPC64_BOOT_WRAPPER
-- 
2.20.1



[PATCH AUTOSEL 4.19 011/237] powerpc/boot: Fix opal console in boot wrapper

2019-11-16 Thread Sasha Levin
From: Joel Stanley 

[ Upstream commit 1a855eaccf353f7ed1d51a3d4b3af727ccbd81ca ]

As of commit 10c77dba40ff ("powerpc/boot: Fix build failure in 32-bit
boot wrapper") the opal code is hidden behind CONFIG_PPC64_BOOT_WRAPPER,
but the boot wrapper avoids include/linux, so it does not get the normal
Kconfig flags.

We can drop the guard entirely as in commit f8e8e69cea49 ("powerpc/boot:
Only build OPAL code when necessary") the makefile only includes opal.c
in the build if CONFIG_PPC64_BOOT_WRAPPER is set.

Fixes: 10c77dba40ff ("powerpc/boot: Fix build failure in 32-bit boot wrapper")
Signed-off-by: Joel Stanley 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/opal.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/boot/opal.c b/arch/powerpc/boot/opal.c
index 0272570d02de1..dfb199ef5b949 100644
--- a/arch/powerpc/boot/opal.c
+++ b/arch/powerpc/boot/opal.c
@@ -13,8 +13,6 @@
 #include 
 #include "../include/asm/opal-api.h"
 
-#ifdef CONFIG_PPC64_BOOT_WRAPPER
-
 /* Global OPAL struct used by opal-call.S */
 struct opal {
u64 base;
@@ -101,9 +99,3 @@ int opal_console_init(void *devp, struct serial_console_data 
*scdp)
 
return 0;
 }
-#else
-int opal_console_init(void *devp, struct serial_console_data *scdp)
-{
-   return -1;
-}
-#endif /* __powerpc64__ */
-- 
2.20.1



[PATCH AUTOSEL 4.19 010/237] powerpc: Fix signedness bug in update_flash_db()

2019-11-16 Thread Sasha Levin
From: Dan Carpenter 

[ Upstream commit 014704e6f54189a203cc14c7c0bb411b940241bc ]

The "count < sizeof(struct os_area_db)" comparison is type promoted to
size_t so negative values of "count" are treated as very high values
and we accidentally return success instead of a negative error code.

This doesn't really change runtime much but it fixes a static checker
warning.

Signed-off-by: Dan Carpenter 
Acked-by: Geoff Levand 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/ps3/os-area.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/ps3/os-area.c 
b/arch/powerpc/platforms/ps3/os-area.c
index cdbfc5cfd6f38..f5387ad822798 100644
--- a/arch/powerpc/platforms/ps3/os-area.c
+++ b/arch/powerpc/platforms/ps3/os-area.c
@@ -664,7 +664,7 @@ static int update_flash_db(void)
db_set_64(db, _area_db_id_rtc_diff, saved_params.rtc_diff);
 
count = os_area_flash_write(db, sizeof(struct os_area_db), pos);
-   if (count < sizeof(struct os_area_db)) {
+   if (count < 0 || count < sizeof(struct os_area_db)) {
pr_debug("%s: os_area_flash_write failed %zd\n", __func__,
 count);
error = count < 0 ? count : -EIO;
-- 
2.20.1



Re: [PATCH v3 2/4] powerpc/fadump: reorganize /sys/kernel/fadump_* sysfs files

2019-11-16 Thread Sourabh Jain



On 11/9/19 6:29 PM, Michal Suchánek wrote:
> On Sat, Nov 09, 2019 at 05:53:37PM +0530, Sourabh Jain wrote:
>> As the number of FADump sysfs files increases it is hard to manage all of
>> them inside /sys/kernel directory. It's better to have all the FADump
>> related sysfs files in a dedicated directory /sys/kernel/fadump. But in
>> order to maintain the backward compatibility the /sys/kernel/fadump_*
>> sysfs files are replicated inside /sys/kernel/fadump/ and eventually get
>> removed in future.
>>
>> As the FADump sysfs files are now part of dedicated directory there is no
>> need to prefix their name with fadump_, hence sysfs file names are also
>> updated. For example fadump_enabled sysfs file is now referred as enabled.
>>
>> Also consolidate ABI documentation for all the FADump sysfs files in a
>> single file Documentation/ABI/testing/sysfs-kernel-fadump.
>>
>> Signed-off-by: Sourabh Jain 
>> ---
>>  Documentation/ABI/testing/sysfs-kernel-fadump | 41 +++
>>  arch/powerpc/kernel/fadump.c  | 38 +
>>  arch/powerpc/platforms/powernv/opal-core.c| 10 +++--
>>  3 files changed, 86 insertions(+), 3 deletions(-)
>>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-fadump
>>
>> diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump 
>> b/Documentation/ABI/testing/sysfs-kernel-fadump
>> new file mode 100644
>> index ..a77f1a5ba389
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-kernel-fadump
>> @@ -0,0 +1,41 @@
>> +What:   /sys/kernel/fadump/*
>> +Date:   Nov 2019
>> +Contact:linuxppc-dev@lists.ozlabs.org
>> +Description:
>> +The /sys/kernel/fadump/* is a collection of FADump sysfs
>> +file provide information about the configuration status
>> +of Firmware Assisted Dump (FADump).
>> +
>> +What:   /sys/kernel/fadump/enabled
>> +Date:   Nov 2019
>> +Contact:linuxppc-dev@lists.ozlabs.org
>> +Description:read only
>> +Primarily used to identify whether the FADump is enabled in
>> +the kernel or not.
>> +User:   Kdump service
>> +
>> +What:   /sys/kernel/fadump/registered
>> +Date:   Nov 2019
>> +Contact:linuxppc-dev@lists.ozlabs.org
>> +Description:read/write
>> +Helps to control the dump collect feature from userspace.
>> +Setting 1 to this file enables the system to collect the
>> +dump and 0 to disable it.
>> +User:   Kdump service
>> +
>> +What:   /sys/kernel/fadump/release_mem
>> +Date:   Nov 2019
>> +Contact:linuxppc-dev@lists.ozlabs.org
>> +Description:write only
>> +This is a special sysfs file and only available when
>> +the system is booted to capture the vmcore using FADump.
>> +It is used to release the memory reserved by FADump to
>> +save the crash dump.
>> +
>> +What:   /sys/kernel/fadump/release_opalcore
>> +Date:   Nov 2019
>> +Contact:linuxppc-dev@lists.ozlabs.org
>> +Description:write only
>> +The sysfs file is available when the system is booted to
>> +collect the dump on OPAL based machine. It used to release
>> +the memory used to collect the opalcore.
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index ed59855430b9..a9591def0c84 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -1418,6 +1418,9 @@ static int fadump_region_show(struct seq_file *m, void 
>> *private)
>>  return 0;
>>  }
>>  
>> +struct kobject *fadump_kobj;
>> +EXPORT_SYMBOL_GPL(fadump_kobj);
>> +
>>  static struct kobj_attribute fadump_release_attr = 
>> __ATTR(fadump_release_mem,
>>  0200, NULL,
>>  fadump_release_memory_store);
>> @@ -1428,6 +1431,16 @@ static struct kobj_attribute fadump_register_attr = 
>> __ATTR(fadump_registered,
>>  0644, fadump_register_show,
>>  fadump_register_store);
>>  
>> +static struct kobj_attribute release_attr = __ATTR(release_mem,
>> +0200, NULL,
>> +fadump_release_memory_store);
>> +static struct kobj_attribute enable_attr = __ATTR(enabled,
>> +0444, fadump_enabled_show,
>> +NULL);
>> +static struct kobj_attribute register_attr = __ATTR(registered,
>> +0644, fadump_register_show,
>> +fadump_register_store);
>> +
>>  DEFINE_SHOW_ATTRIBUTE(fadump_region);
>>  
>>  static void fadump_init_files(void)
>> @@ -1435,6 +1448,11 @@ 

Re: [RFC PATCH 1/4] libnvdimm/namespace: Make namespace size validation arch dependent

2019-11-16 Thread Aneesh Kumar K.V


Hi Dan,

"Aneesh Kumar K.V"  writes:

> Dan Williams  writes:
>
>> On Wed, Oct 30, 2019 at 10:35 PM Aneesh Kumar K.V
>>  wrote:
>> [..]
>>> > True, for the pfn device and the device-dax mapping size, but I'm
>>> > suggesting adding another instance of alignment control at the raw
>>> > namespace level. That would need to be disconnected from the
>>> > device-dax page mapping granularity.
>>> >
>>>
>>> Can you explain what you mean by raw namespace level ? We don't have
>>> multiple values against which we need to check the alignment of
>>> namespace start and namespace size.
>>>
>>> If you can outline how and where you would like to enforce that check I
>>> can start working on it.
>>>
>>
>> What I mean is that the process of setting up a pfn namespace goes
>> something like this in shell script form:
>>
>> 1/ echo $size > /sys/bus/nd/devices/$namespace/size
>> 2/ echo $namespace > /sys/bus/nd/devices/$pfn/namespace
>> 3/ echo $pfn_align > /sys/bus/nd/devices/$pfn/align
>>
>> What I'm suggesting is add an optional 0th step that does:
>>
>> echo $raw_align > /sys/bus/nd/devices/$namespace/align
>>
>> Where the raw align needs to be needs to be max($pfn_align,
>> arch_mapping_granulariy).
>
>
> I started looking at this and was wondering about userspace being aware
> of the direct-map mapping size. We can figure that out by parsing kernel
> log
>
> [0.00] Page orders: linear mapping = 24, virtual = 16, io = 16, 
> vmemmap = 24
>
>
> But I am not sure we want to do that. There is not set of raw_align
> value to select. What we need to make sure is the below.
>
> 1) While creating a namespace we need to make sure that namespace size
> is multiple of direct-map mapping size. If we ensure that
> size is aligned, we should also get the namespace start to be aligned?
>
> 2) While initialzing a namespace by scanning label area, we need to make
> sure every namespace in the region satisfy the above requirement. If not
> we should mark the region disabled. 
>
>
>>
>> So on powerpc where PAGE_SIZE < arch_mapping_granulariy, the following:
>>
>> cat /sys/bus/nd/devices/$namespace/supported_aligns
>>
>> ...would show the same output as:
>>
>> cat /sys/bus/nd/devices/$pfn/align
>>
>> ...but with any alignment choice less than arch_mapping_granulariy removed.
>>
>
> I am not sure why we need to do that. For example: even if we have
> direct-map mapping size as PAGE_SIZE (64K), we still should allow an user
> mapping with hugepage size (16M)?
>


Considering the direct-map map size is not going to be user selectable,
do you agree that we can skip the above step 0 configuration you
suggested.

The changes proposed in the patch series essentially does the rest.

1) It validate the size against the arch specific limit during
namespace creation. (part of step 1)
2) It also disable initializing a region if it find the size not
correctly aligned as per the platform requirement.
3) Direct map  mapping size is different from supported_alignment for a
namespace. The supported alignment controls what possible PAGE SIZE user want 
the
namespace to be mapped to user space.


With the above do you think the current patch series is good?

-aneesh