[PATCH v13 0/9] namei: openat2(2) path resolution restrictions

2019-09-30 Thread Aleksa Sarai
This patchset is being developed here:
  

It depends on the copy_struct_from_user() helpers being developed here:
  
and posted here:
  

Patch changelog:
 v13:
  * Fix race with the magic-link mode semantics by recomputing the mode during
->get_link() and storing it with nd_jump_link(). A selftest was added for
this attack scenario as well. [Jann Horn]
  * Fix gap in RESOLVE_NO_XDEV with magic-links -- now magic-link resolution is
only permitted if the link doesn't jump vfsmounts.
  * Remove path_is_under() checks for ".." resolution (due to the possibility
of O(m*n) lookup behaviour). Instead, return -EAGAIN if a racing rename or
mount occurs. Userspace is then encouraged to retry or fall back to another
mechanism (if it still fails after several tries, an attack is likely --
though spurious failures will also occur, because &{rename,mount}_lock are
both global). [Linus Torvalds]
  * Move copy_struct_from_user() to a separate series so it can be merged
separately. [Christian Brauner]
  * Small test improvements (mainly making the TAP output more readable and
adding a few new minor test cases). Now the openat2(2) self-tests have ~271
overall test cases.
  * Expand on changes to path-lookup in the kernel docs.
  * Kernel-doc fixes. [Randy Dunlap]
 v12: 
 v11: 
  
 v10: 
 v09: 
 v08: 
 v07: 
 v06: 
 v05: 
 v04: 
 v03: 
 v02: 
 v01: 

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides several other improvements to the openat(2) interface (see the
patch description for more details). The following new LOOKUP_* flags are
added:

  * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is also
blocked (though magic-link jumps on the same vfsmount are permitted).

  * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
the only reference to these links in Documentation/, but I'm
happy to change the name.

It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.

  * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional care to protect against various races
that would allow escape using "..".

Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because
magic-links can trivially beam you around the filesystem (breaking the
protection). In future, similar safety checks might be done as in
LOOKUP_IN_ROOT, but that requires more
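To make the intended usage concrete, here is a minimal userspace sketch.
The syscall number below is taken from the syscall tables in patch 7/9,
but the struct layout and the RESOLVE_* values are illustrative
assumptions, not a stable ABI:

/*
 * Illustrative only: field names follow the patch descriptions
 * (@how->flags, @how->upgrade_mask, @how->resolve); exact types,
 * padding, and flag values are assumptions.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_openat2
#define __NR_openat2 437	/* e.g. the arm value in patch 7/9 */
#endif

struct open_how {
	uint32_t flags;		/* O_* flags; unknown bits are rejected */
	uint16_t mode;		/* for O_CREAT / O_TMPFILE */
	uint16_t upgrade_mask;	/* restricts later re-opening of the fd */
	uint16_t resolve;	/* RESOLVE_* scoping flags */
};

#define RESOLVE_NO_XDEV		0x01	/* illustrative values */
#define RESOLVE_NO_MAGICLINKS	0x02
#define RESOLVE_NO_SYMLINKS	0x04
#define RESOLVE_BENEATH		0x08

static int openat2(int dfd, const char *path, struct open_how *how)
{
	return syscall(__NR_openat2, dfd, path, how, sizeof(*how));
}

/* Example: resolve strictly beneath dirfd, refusing mount crossings. */
static int open_beneath(int dirfd, const char *path)
{
	struct open_how how = {
		.flags   = O_RDONLY,
		.resolve = RESOLVE_BENEATH | RESOLVE_NO_XDEV,
	};

	return openat2(dirfd, path, &how);
}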

[PATCH v13 4/9] namei: O_BENEATH-style path resolution flags

2019-09-30 Thread Aleksa Sarai
Add the following flags to allow various restrictions on path resolution
(these affect the *entire* resolution, rather than just the final path
component -- as is the case with LOOKUP_FOLLOW).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web servers,
archive extraction tools, network file servers, and so on.

These flags are exposed to userspace through openat2(2) in a later
patchset.

* LOOKUP_NO_XDEV: Disallow mount-point crossing (both *down* into one,
  or *up* from one). Both bind-mounts and cross-filesystem mounts are
  blocked by this flag. The naming is based on "find -xdev" as well as
  -EXDEV (though find(1) doesn't walk upwards, the semantics seem
  obvious).

* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" (or rather,
  magic-link) jumping. This is a very specific restriction, and it
  exists because /proc/$pid/fd/... "symlinks" allow for access outside
  nd->root and pose risk to container runtimes that don't want to be
  tricked into accessing a host path (but do want to allow
  no-funny-business symlink resolution).

* LOOKUP_NO_SYMLINKS: Disallows resolution through symlinks of any kind
  (including magic-links).

* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all magic-link jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of magic-link jumping is done because it is not clear
  whether semantically they should be allowed -- while some magic-links
  are safe, there are many that can cause escapes (and once a
  resolution is outside of the root, LOOKUP_BENEATH will no longer detect
  it). Future patches may re-enable magic-link jumping when such jumps
  would remain inside the root.

The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violate their requirement, while the others all return -EXDEV.
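To make the errno mapping concrete, here is a standalone sketch of where
each check conceptually bites during a walk. The flag values are made up
for illustration (the real ones live in include/linux/namei.h), and this
is not the literal fs/namei.c code:

#include <errno.h>

#define LOOKUP_NO_SYMLINKS	0x01	/* illustrative values */
#define LOOKUP_NO_MAGICLINKS	0x02
#define LOOKUP_NO_XDEV		0x04
#define LOOKUP_BENEATH		0x08

/* Conceptual per-component check during path resolution. */
static int scoped_step(unsigned int flags, int is_symlink,
		       int is_magiclink, int crosses_mount, int is_dotdot)
{
	/* Magic-links count as symlinks for LOOKUP_NO_SYMLINKS too. */
	if (is_magiclink &&
	    (flags & (LOOKUP_NO_MAGICLINKS | LOOKUP_NO_SYMLINKS)))
		return -ELOOP;
	if (is_symlink && (flags & LOOKUP_NO_SYMLINKS))
		return -ELOOP;
	if (crosses_mount && (flags & LOOKUP_NO_XDEV))
		return -EXDEV;
	/* At this point in the series, LOOKUP_BENEATH bans ".." outright. */
	if (is_dotdot && (flags & LOOKUP_BENEATH))
		return -EXDEV;
	return 0;
}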

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Christian Brauner 
Suggested-by: David Drysdale 
Suggested-by: Al Viro 
Suggested-by: Andy Lutomirski 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 135 ++
 include/linux/namei.h |   9 +++
 2 files changed, 120 insertions(+), 24 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bfeac55b23b7..b80efc0ae0f3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -504,7 +504,10 @@ struct nameidata {
struct filename *name;
struct nameidata *saved;
struct inode *link_inode;
-   umode_t last_magiclink_mode;
+   struct {
+   umode_t mode;
+   bool same_mnt;
+   } last_magiclink;
unsigned root_seq;
int dfd;
 } __randomize_layout;
@@ -642,6 +645,13 @@ static bool legitimize_links(struct nameidata *nd)
 
 static bool legitimize_root(struct nameidata *nd)
 {
+   /*
+* If nd->root was zeroed with scoped-lookup flags, we need to restart
+* the whole lookup from scratch (get_fs_root() is dangerous for these
+* lookups because the root is nd->dfd, not the fs root).
+*/
+   if (!nd->root.mnt && (nd->flags & LOOKUP_DIRFD_SCOPE_FLAGS))
+   return false;
if (!nd->root.mnt || (nd->flags & LOOKUP_ROOT))
return true;
nd->flags |= LOOKUP_ROOT_GRABBED;
@@ -799,10 +809,18 @@ static int 

[PATCH v13 7/9] open: openat2(2) syscall

2019-09-30 Thread Aleksa Sarai
The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2). However, there are a few reasons why this is
not the best course of action:

 * The new LOOKUP_* flags are intended to be security features, and
   openat(2) will silently ignore all unknown flags. This means that
   users would need to avoid foot-gunning themselves constantly when
   using this interface if it were part of openat(2). This can be fixed
   by having userspace libraries handle this for users[1], but should be
   avoided if possible.

 * Resolution scoping feels like a different operation to the existing
   O_* flags. And since openat(2) has limited flag space, it seems to be
   quite wasteful to clutter it with 5 flags that are all
   resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
   its entire purpose is to error out if you encounter a trailing
   symlink -- not to scope resolution.

 * Other systems would be able to reimplement this syscall allowing for
   cross-OS standardisation rather than being hidden amongst O_* flags
   which may result in it not being used by all the parties that might
   want to use it (file servers, web servers, container runtimes, etc).

 * It gives us the opportunity to iterate on the O_PATH interface. In
   particular, the new @how->upgrade_mask field for fd re-opening is
   only possible because we have a clean slate without needing to re-use
   the ACC_MODE flag design nor the existing openat(2) @mode semantics.

To this end, we introduce the openat2(2) syscall. It provides all of the
features of openat(2) through the @how->flags argument, but also
provides a new @how->resolve argument which exposes RESOLVE_* flags
that map to our new LOOKUP_* flags. It also eliminates the long-standing
ugliness of variadic-open(2) by embedding it in a struct.

In order to allow userspace to lock down its usage of file descriptor
re-opening, openat2(2) lets users disallow certain re-opening modes
through @how->upgrade_mask. At the moment, there is no UPGRADE_NOEXEC.
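As a hedged illustration, reusing the userspace sketch from the cover
letter and assuming a hypothetical UPGRADE_NOWRITE flag (this patch only
states which upgrade flag does *not* exist yet), locking down re-opening
might look like:

#define UPGRADE_NOWRITE	0x01	/* hypothetical name and value */

/* An O_PATH descriptor that can later be re-opened for reading, but
 * never upgraded to a writable file -- not even via /proc/self/fd. */
static int open_handle_nowrite(const char *path)
{
	struct open_how how = {
		.flags        = O_PATH,
		.upgrade_mask = UPGRADE_NOWRITE,
	};

	return openat2(AT_FDCWD, path, &how);
}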

[1]: https://github.com/openSUSE/libpathrs

Suggested-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 arch/alpha/kernel/syscalls/syscall.tbl  |  1 +
 arch/arm/tools/syscall.tbl  |  1 +
 arch/arm64/include/asm/unistd.h |  2 +-
 arch/arm64/include/asm/unistd32.h   |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl   |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl   |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl|  1 +
 arch/s390/kernel/syscalls/syscall.tbl   |  1 +
 arch/sh/kernel/syscalls/syscall.tbl |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl  |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl  |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl |  1 +
 fs/open.c   | 94 -
 include/linux/fcntl.h   | 19 -
 include/linux/fs.h  |  4 +-
 include/linux/syscalls.h| 14 ++-
 include/uapi/asm-generic/unistd.h   |  5 +-
 include/uapi/linux/fcntl.h  | 42 +
 24 files changed, 168 insertions(+), 30 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..9f374f7d9514 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
543	common	fspick			sys_fspick
544	common	pidfd_open		sys_pidfd_open
# 545 reserved for clone3
+547	common	openat2			sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
433	common	fspick			sys_fspick
434	common	pidfd_open		sys_pidfd_open
435	common	clone3			sys_clone3
+437	common	openat2			sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls	(__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   436
+#define __NR_compat_syscalls   438
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 

[PATCH v13 6/9] namei: permit ".." resolution with LOOKUP_{IN_ROOT,BENEATH}

2019-09-30 Thread Aleksa Sarai
This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely[*].

The need for this patch (and the original no-".." restriction) is
explained by observing there is a fairly easy-to-exploit race condition
with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH
if ".." is allowed) where a rename(2) of a path can be used to "skip
over" nd->root and thus escape to the filesystem above nd->root.

  thread1 [attacker]:
for (;;)
  renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
for (;;)
  openat2(dirb, "b/c/../../etc/shadow",
  { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
and will return -EAGAIN for userspace to decide to either retry or abort
the lookup. It should be noted that ".." is the weak point of chroot(2)
-- walking *into* a subdirectory tautologically cannot result in you
walking *outside* nd->root (except through a bind-mount or magic-link).
There is also no other way for a directory's parent to change (which is
the primary worry with ".." resolution here) other than a rename or
MS_MOVE.

This is a first-pass implementation, where -EAGAIN will be returned if
any rename or mount occurs anywhere on the host (in any namespace). This
will result in spurious errors, but there isn't a satisfactory
alternative (other than denying ".." altogether).
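The retry pattern this design expects from userspace is a small bounded
loop. A sketch, reusing the openat2() wrapper from the cover letter
(RESOLVE_IN_ROOT's value is again illustrative):

#include <errno.h>

#define RESOLVE_IN_ROOT	0x10	/* illustrative value */

static int openat2_in_root(int dirfd, const char *path)
{
	struct open_how how = {
		.flags   = O_PATH,
		.resolve = RESOLVE_IN_ROOT,
	};
	int tries, fd;

	for (tries = 0; tries < 16; tries++) {
		fd = openat2(dirfd, path, &how);
		if (fd >= 0 || errno != EAGAIN)
			return fd;
		/* -EAGAIN: a rename or mount raced with the walk. */
	}
	/* Still racing after many tries -- likely an active attack. */
	return -1;
}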

One other possible alternative (which previous versions of this patch
used) would be to check with path_is_under() if there was a racing
rename or mount (after re-taking the relevant seqlocks). While this does
work, it results in possible O(n*m) behaviour if there are many renames
or mounts occurring *anywhere on the system*.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check (as
with the alternative solution for "..") for magic-links after they
are resolved. However this seems unlikely to be a feature that
people *really* need -- it can be added later if it turns out a lot
of people want it.

Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 43 +--
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index efed62c6136e..9c35768fcf4f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
-   unsigned seq, m_seq;
+   unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
@@ -1766,22 +1766,35 @@ static inline int handle_dots(struct nameidata *nd, int type)
if (type == LAST_DOTDOT) {
int error = 0;
 
-   /*
-* Scoped-lookup flags resolving ".." is not currently safe --
-* races can cause our parent to have moved outside of the root
-* and us to skip over it.
-*/
-   if (unlikely(nd->flags & LOOKUP_DIRFD_SCOPE_FLAGS))
-   return -EXDEV;
if (!nd->root.mnt) {
error = set_root(nd);
if (error)
return error;
}
-   if (nd->flags & LOOKUP_RCU) {
-   return follow_dotdot_rcu(nd);
-   } else
-   return follow_dotdot(nd);
+   if (nd->flags & LOOKUP_RCU)
+   error = follow_dotdot_rcu(nd);
+   else
+   error = follow_dotdot(nd);
+   if (error)
+   return error;
+
+   if (unlikely(nd->flags & LOOKUP_DIRFD_SCOPE_FLAGS)) {
+   bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
+   bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
+
+   /*
+* If there was a racing rename or mount along our
+* path, then we can't be sure that ".." hasn't jumped
+* above nd->root (and so 

[PATCH v13 9/9] Documentation: update path-lookup to mention trailing magic-links

2019-09-30 Thread Aleksa Sarai
We've introduced new (somewhat subtle) behaviour regarding trailing
magic-links, so it's best to make sure everyone can follow along with
the reasoning behind trailing_magiclink().

Signed-off-by: Aleksa Sarai 
---
 Documentation/filesystems/path-lookup.rst | 80 ++-
 1 file changed, 63 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..c30145b3d9ba 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -405,6 +405,10 @@ is requested.  Keeping a reference in the ``nameidata`` ensures that
 only one root is in effect for the entire path walk, even if it races
 with a ``chroot()`` system call.
 
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
 The root is needed when either of two conditions holds: (1) either the
 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
 component is being handled, since "``..``" from the root must always stay
@@ -1149,22 +1153,61 @@ so ``NULL`` is returned to indicate that the symlink can be released and
 the stack frame discarded.
 
 The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
 
  $ ls -l /proc/self/fd/1
 lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
 
 Every open file descriptor in any process is represented in ``/proc`` by
-something that looks like a symlink.  It is really a reference to the
-target file, not just the name of it.  When you ``readlink`` these
-objects you get a name that might refer to the same file - unless it
-has been unlinked or mounted over.  When ``walk_component()`` follows
-one of these, the ``->follow_link()`` method in "procfs" doesn't return
-a string name, but instead calls ``nd_jump_link()`` which updates the
-``nameidata`` in place to point to that target.  ``->follow_link()`` then
-returns ``NULL``.  Again there is no final component and ``get_link()``
-reports this by leaving the ``last_type`` field of ``nameidata`` as
-``LAST_BIND``.
+a magic-link.  It is really a reference to the target file, not just the
+name of it (hence making them "magical" compared to ordinary symlinks).
+When you ``readlink`` these objects you get a name that might refer to
+the same file - unless it has been unlinked or mounted over.  When
+``walk_component()`` follows one of these, the ``->follow_link()`` method
+in "procfs" doesn't return a string name, but instead calls
+``nd_jump_link()`` which updates the ``nameidata`` in place to point to
+that target.  ``->follow_link()`` then returns ``NULL``. Again there is
+no final component and ``get_link()`` reports this by leaving the
+``last_type`` field of ``nameidata`` as ``LAST_BIND``.
+
+In order to avoid potential re-opening attacks (especially in the context
+of containers), it is necessary to restrict the ability for a trailing
+magic-link to be opened. The restrictions are as follows (and are
+implemented in ``trailing_magiclink()``):
+
+* If the ``open()`` is an "ordinary open" (without ``O_PATH``), the
+  access-mode of the ``open()`` call must be permitted by one of the
+  octets in the magic-link's file mode (elsewhere in Linux, ordinary
+  symlinks have a file mode of ``0777`` but this doesn't apply to
+  magic-links). Each "ordinary" file in ``/proc/self/fd/$n`` has the user
+  octet of its file mode set to correspond to the access-mode it was
+  opened with.
+
+  This restriction means that you cannot re-open an ``O_RDONLY`` file
+  descriptor through ``/proc/self/fd/$n`` with ``O_RDWR``.
+
+With a "half-open" (with ``O_PATH``), there is no ``-EACCES``-enforced
+restrictions on ``open()``, but there are rules about the mode shown in
+``/proc/self/fd/$n``:
+
+* If the target of the ``open()`` is not a magic-link, then the group
+  octet of the file mode is set to permit all access modes.
+
+* Otherwise, the mode of the new ``O_PATH`` descriptor is set to
+  effectively the same mode as the magic-link (though the permissions are
+  set in the group octet of the mode). This means that an ``O_PATH`` of a
+  magic-link gives you no more re-open permissions than the magic-link
+  itself.
+
+With these ``O_PATH`` restrictions, it is still possible to re-open an
+``O_PATH`` file descriptor but you cannot use ``O_PATH`` to work around
+the above restrictions on "ordinary opens" of magic-links.
+
+In order to avoid certain race conditions (where a file descriptor
+associated with a magic-link is swapped, causing the ``link_inode`` of
+``nameidata`` to become stale during magic-link traversal),
+``nd_jump_link()`` stores the mode of the magic-link during traversal in
+``last_magiclink``.
 
 Following the symlink in the 
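A hedged userspace demonstration of the "ordinary open" rule above; the
expected failure assumes a kernel with this series applied:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	int ro, rw;

	ro = open("/etc/hostname", O_RDONLY);
	if (ro < 0)
		return 1;
	snprintf(path, sizeof(path), "/proc/self/fd/%d", ro);

	/* The magic-link's mode only permits read access, so this should
	 * now fail with -EACCES rather than yield a writable fd. */
	rw = open(path, O_RDWR);
	printf("O_RDWR re-open: fd=%d errno=%d\n", rw, rw < 0 ? errno : 0);

	close(ro);
	return 0;
}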

Re: [PATCH v13 7/9] open: openat2(2) syscall

2019-09-30 Thread kbuild test robot
Hi Aleksa,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.4-rc1 next-20190930]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Aleksa-Sarai/namei-openat2-2-path-resolution-restrictions/20191001-025628
config: i386-defconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   fs/open.c: In function '__do_sys_openat2':
>> fs/open.c:1173:8: error: implicit declaration of function 
>> 'copy_struct_from_user'; did you mean 'copy_siginfo_from_user'? 
>> [-Werror=implicit-function-declaration]
 err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
   ^
   copy_siginfo_from_user
   cc1: some warnings being treated as errors

vim +1173 fs/open.c

  1163  
  1164  SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
  1165  const struct open_how __user *, how, size_t, usize)
  1166  {
  1167  int err;
  1168  struct open_how tmp;
  1169  
  1170  if (unlikely(usize < OPEN_HOW_SIZE_VER0))
  1171  return -EINVAL;
  1172  
> 1173  err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
  1174  if (err)
  1175  return err;
  1176  
  1177  if (force_o_largefile())
  1178  tmp.flags |= O_LARGEFILE;
  1179  
  1180  return do_sys_open(dfd, filename, &tmp);
  1181  }
  1182  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation



[PATCH RESEND v3 01/26] PCI: Add define for the number of standard PCI BARs

2019-09-27 Thread Denis Efremov
Code that iterates over all standard PCI BARs typically uses
PCI_STD_RESOURCE_END. However, it requires the "unusual" loop condition
"i <= PCI_STD_RESOURCE_END" rather than something more standard like
"i < PCI_STD_NUM_BARS".

This patch adds the definition PCI_STD_NUM_BARS, which is equivalent to
"PCI_STD_RESOURCE_END + 1". To iterate through all possible BARs, loop
conditions are changed to use the *number* of BARs "i < PCI_STD_NUM_BARS"
instead of the index of the last valid BAR "i <= PCI_STD_RESOURCE_END"
or PCI_ROM_RESOURCE. The magic constant (6) is also replaced with the
new define PCI_STD_NUM_BARS.
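For illustration, the before/after loop shapes (PCI_STD_NUM_BARS is the
value this patch adds to include/uapi/linux/pci_regs.h):

#define PCI_STD_RESOURCE_END	5
#define PCI_STD_NUM_BARS	6	/* == PCI_STD_RESOURCE_END + 1 */

void walk_bars(void)
{
	int i;

	/* Old, error-prone form (inclusive upper bound): */
	for (i = 0; i <= PCI_STD_RESOURCE_END; i++)
		/* ... touch BAR i ... */;

	/* New form used throughout this series (exclusive bound): */
	for (i = 0; i < PCI_STD_NUM_BARS; i++)
		/* ... touch BAR i ... */;
}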

Signed-off-by: Denis Efremov 
---
 drivers/pci/pci-sysfs.c   |  4 ++--
 drivers/pci/pci.c | 13 +++--
 drivers/pci/proc.c|  4 ++--
 drivers/pci/quirks.c  |  4 ++--
 include/linux/pci.h   |  2 +-
 include/uapi/linux/pci_regs.h |  1 +
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 965c72104150..3e26b8e03bd5 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1257,7 +1257,7 @@ static void pci_remove_resource_files(struct pci_dev *pdev)
 {
int i;
 
-   for (i = 0; i < PCI_ROM_RESOURCE; i++) {
+   for (i = 0; i < PCI_STD_NUM_BARS; i++) {
struct bin_attribute *res_attr;
 
res_attr = pdev->res_attr[i];
@@ -1328,7 +1328,7 @@ static int pci_create_resource_files(struct pci_dev *pdev)
int retval;
 
/* Expose the PCI resources from this device as files */
-   for (i = 0; i < PCI_ROM_RESOURCE; i++) {
+   for (i = 0; i < PCI_STD_NUM_BARS; i++) {
 
/* skip empty resources */
if (!pci_resource_len(pdev, i))
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 1b27b5af3d55..7d543986026b 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -674,7 +674,7 @@ struct resource *pci_find_resource(struct pci_dev *dev, struct resource *res)
 {
int i;
 
-   for (i = 0; i < PCI_ROM_RESOURCE; i++) {
+   for (i = 0; i < PCI_STD_NUM_BARS; i++) {
struct resource *r = &dev->resource[i];
 
if (r->start && resource_contains(r, res))
@@ -3768,7 +3768,7 @@ void pci_release_selected_regions(struct pci_dev *pdev, int bars)
 {
int i;
 
-   for (i = 0; i < 6; i++)
+   for (i = 0; i < PCI_STD_NUM_BARS; i++)
if (bars & (1 << i))
pci_release_region(pdev, i);
 }
@@ -3779,7 +3779,7 @@ static int __pci_request_selected_regions(struct pci_dev *pdev, int bars,
 {
int i;
 
-   for (i = 0; i < 6; i++)
+   for (i = 0; i < PCI_STD_NUM_BARS; i++)
if (bars & (1 << i))
if (__pci_request_region(pdev, i, res_name, excl))
goto err_out;
@@ -3827,7 +3827,7 @@ EXPORT_SYMBOL(pci_request_selected_regions_exclusive);
 
 void pci_release_regions(struct pci_dev *pdev)
 {
-   pci_release_selected_regions(pdev, (1 << 6) - 1);
+   pci_release_selected_regions(pdev, (1 << PCI_STD_NUM_BARS) - 1);
 }
 EXPORT_SYMBOL(pci_release_regions);
 
@@ -3846,7 +3846,8 @@ EXPORT_SYMBOL(pci_release_regions);
  */
 int pci_request_regions(struct pci_dev *pdev, const char *res_name)
 {
-   return pci_request_selected_regions(pdev, ((1 << 6) - 1), res_name);
+   return pci_request_selected_regions(pdev,
+   ((1 << PCI_STD_NUM_BARS) - 1), res_name);
 }
 EXPORT_SYMBOL(pci_request_regions);
 
@@ -3868,7 +3869,7 @@ EXPORT_SYMBOL(pci_request_regions);
 int pci_request_regions_exclusive(struct pci_dev *pdev, const char *res_name)
 {
return pci_request_selected_regions_exclusive(pdev,
-   ((1 << 6) - 1), res_name);
+   ((1 << PCI_STD_NUM_BARS) - 1), res_name);
 }
 EXPORT_SYMBOL(pci_request_regions_exclusive);
 
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index fe7fe678965b..cb61ec2c24e8 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -248,13 +248,13 @@ static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
}
 
/* Make sure the caller is mapping a real resource for this device */
-   for (i = 0; i < PCI_ROM_RESOURCE; i++) {
+   for (i = 0; i < PCI_STD_NUM_BARS; i++) {
if (dev->resource[i].flags & res_bit &&
pci_mmap_fits(dev, i, vma,  PCI_MMAP_PROCFS))
break;
}
 
-   if (i >= PCI_ROM_RESOURCE)
+   if (i >= PCI_STD_NUM_BARS)
return -ENODEV;
 
if (fpriv->mmap_state == pci_mmap_mem &&
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44c4ae1abd00..998454b0ae8d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -475,7 +475,7 @@ static void quirk_extend_bar_to_page(struct pci_dev *dev)
 {
int i;
 
-   for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
+   for (i = 0; 

[PATCH RESEND v3 00/26] Add definition for the number of standard PCI BARs

2019-09-27 Thread Denis Efremov
Code that iterates over all standard PCI BARs typically uses
PCI_STD_RESOURCE_END, but this is error-prone because it requires
"i <= PCI_STD_RESOURCE_END" rather than something like
"i < PCI_STD_NUM_BARS". We could add such a definition and use it the same
way PCI_SRIOV_NUM_BARS is used. The patchset also replaces the constant (6)
with the new define PCI_STD_NUM_BARS where appropriate and removes local
declarations for the number of PCI BARs.

Changes in v3:
  - Updated commits description.
  - Refactored "< PCI_ROM_RESOURCE" with "< PCI_STD_NUM_BARS" in loops.
  - Refactored "<= BAR_5" with "< PCI_STD_NUM_BARS" in loops.
  - Removed local define GASKET_NUM_BARS.
  - Removed local define PCI_NUM_BAR_RESOURCES.

Changes in v2:
  - Reversed checks in pci_iomap_range,pci_iomap_wc_range.
  - Refactored loops in vfio_pci to keep PCI_STD_RESOURCES.
  - Added 2 new patches to replace the magic constant with new define.
  - Split the net patch from v1 into separate stmmac and dwc-xlgmac patches.

Denis Efremov (26):
  PCI: Add define for the number of standard PCI BARs
  PCI: hv: Use PCI_STD_NUM_BARS
  PCI: dwc: Use PCI_STD_NUM_BARS
  PCI: endpoint: Use PCI_STD_NUM_BARS
  misc: pci_endpoint_test: Use PCI_STD_NUM_BARS
  s390/pci: Use PCI_STD_NUM_BARS
  x86/PCI: Loop using PCI_STD_NUM_BARS
  alpha/PCI: Use PCI_STD_NUM_BARS
  ia64: Use PCI_STD_NUM_BARS
  stmmac: pci: Loop using PCI_STD_NUM_BARS
  net: dwc-xlgmac: Loop using PCI_STD_NUM_BARS
  ixgb: use PCI_STD_NUM_BARS
  e1000: Use PCI_STD_NUM_BARS
  rapidio/tsi721: Loop using PCI_STD_NUM_BARS
  efifb: Loop using PCI_STD_NUM_BARS
  fbmem: use PCI_STD_NUM_BARS
  vfio_pci: Loop using PCI_STD_NUM_BARS
  scsi: pm80xx: Use PCI_STD_NUM_BARS
  ata: sata_nv: Use PCI_STD_NUM_BARS
  staging: gasket: Use PCI_STD_NUM_BARS
  serial: 8250_pci: Use PCI_STD_NUM_BARS
  pata_atp867x: Use PCI_STD_NUM_BARS
  memstick: use PCI_STD_NUM_BARS
  USB: core: Use PCI_STD_NUM_BARS
  usb: pci-quirks: Use PCI_STD_NUM_BARS
  devres: use PCI_STD_NUM_BARS

 arch/alpha/kernel/pci-sysfs.c |  8 ++---
 arch/ia64/sn/pci/pcibr/pcibr_dma.c|  4 +--
 arch/s390/include/asm/pci.h   |  5 +--
 arch/s390/include/asm/pci_clp.h   |  6 ++--
 arch/s390/pci/pci.c   | 16 +-
 arch/s390/pci/pci_clp.c   |  6 ++--
 arch/x86/pci/common.c |  2 +-
 arch/x86/pci/intel_mid_pci.c  |  2 +-
 drivers/ata/pata_atp867x.c|  2 +-
 drivers/ata/sata_nv.c |  2 +-
 drivers/memstick/host/jmb38x_ms.c |  2 +-
 drivers/misc/pci_endpoint_test.c  |  8 ++---
 drivers/net/ethernet/intel/e1000/e1000.h  |  1 -
 drivers/net/ethernet/intel/e1000/e1000_main.c |  2 +-
 drivers/net/ethernet/intel/ixgb/ixgb.h|  1 -
 drivers/net/ethernet/intel/ixgb/ixgb_main.c   |  2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_pci.c  |  4 +--
 .../net/ethernet/synopsys/dwc-xlgmac-pci.c|  2 +-
 drivers/pci/controller/dwc/pci-dra7xx.c   |  2 +-
 .../pci/controller/dwc/pci-layerscape-ep.c|  2 +-
 drivers/pci/controller/dwc/pcie-artpec6.c |  2 +-
 .../pci/controller/dwc/pcie-designware-plat.c |  2 +-
 drivers/pci/controller/dwc/pcie-designware.h  |  2 +-
 drivers/pci/controller/pci-hyperv.c   | 10 +++---
 drivers/pci/endpoint/functions/pci-epf-test.c | 10 +++---
 drivers/pci/pci-sysfs.c   |  4 +--
 drivers/pci/pci.c | 13 
 drivers/pci/proc.c|  4 +--
 drivers/pci/quirks.c  |  4 +--
 drivers/rapidio/devices/tsi721.c  |  2 +-
 drivers/scsi/pm8001/pm8001_hwi.c  |  2 +-
 drivers/scsi/pm8001/pm8001_init.c |  2 +-
 drivers/staging/gasket/gasket_constants.h |  3 --
 drivers/staging/gasket/gasket_core.c  | 12 +++
 drivers/staging/gasket/gasket_core.h  |  4 +--
 drivers/tty/serial/8250/8250_pci.c|  8 ++---
 drivers/usb/core/hcd-pci.c|  2 +-
 drivers/usb/host/pci-quirks.c |  2 +-
 drivers/vfio/pci/vfio_pci.c   | 11 ---
 drivers/vfio/pci/vfio_pci_config.c| 32 ++-
 drivers/vfio/pci/vfio_pci_private.h   |  4 +--
 drivers/video/fbdev/core/fbmem.c  |  4 +--
 drivers/video/fbdev/efifb.c   |  2 +-
 include/linux/pci-epc.h   |  2 +-
 include/linux/pci.h   |  2 +-
 include/uapi/linux/pci_regs.h |  1 +
 lib/devres.c  |  2 +-
 47 files changed, 112 insertions(+), 115 deletions(-)

-- 
2.21.0



[PATCH 23/29] parisc: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/parisc/kernel/vmlinux.lds.S | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/parisc/kernel/vmlinux.lds.S b/arch/parisc/kernel/vmlinux.lds.S
index 12b3d7d5e9e4..1dc2f71e62b1 100644
--- a/arch/parisc/kernel/vmlinux.lds.S
+++ b/arch/parisc/kernel/vmlinux.lds.S
@@ -19,6 +19,7 @@
*(.data..vm0.pte)
 
 #define CC_USING_PATCHABLE_FUNCTION_ENTRY
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  8
 
 #include 
 
@@ -129,9 +130,6 @@ SECTIONS
 
RO_DATA(8)
 
-   /* RO because of BUILDTIME_EXTABLE_SORT */
-   EXCEPTION_TABLE(8)
-
/* unwind info */
.PARISC.unwind : {
__start___unwind = .;
-- 
2.17.1



[PATCH 21/29] ia64: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/ia64/kernel/vmlinux.lds.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index 0d86fc8e88d5..18a732597112 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -6,6 +6,7 @@
 #include 
 
 #define EMITS_PT_NOTE
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  16
 
 #include 
 
@@ -70,7 +71,6 @@ SECTIONS {
/*
 * Read-only data
 */
-   EXCEPTION_TABLE(16)
 
/* MCA table */
. = ALIGN(16);
-- 
2.17.1



[PATCH 11/29] vmlinux.lds.h: Replace RODATA with RO_DATA

2019-09-26 Thread Kees Cook
There's no reason to keep the RODATA macro: just replace the callers
with the expected RO_DATA macro.

Signed-off-by: Kees Cook 
---
 arch/alpha/kernel/vmlinux.lds.S  | 2 +-
 arch/ia64/kernel/vmlinux.lds.S   | 2 +-
 arch/microblaze/kernel/vmlinux.lds.S | 2 +-
 arch/mips/kernel/vmlinux.lds.S   | 2 +-
 arch/um/include/asm/common.lds.S | 2 +-
 arch/xtensa/kernel/vmlinux.lds.S | 2 +-
 include/asm-generic/vmlinux.lds.h| 4 +---
 7 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index bf28043485f6..af411817dd7d 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -34,7 +34,7 @@ SECTIONS
swapper_pg_dir = SWAPPER_PGD;
_etext = .; /* End of text section */
 
-   RODATA
+   RO_DATA(4096)
EXCEPTION_TABLE(16)
 
/* Will be freed after init */
diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index ad3578924589..0d86fc8e88d5 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -104,7 +104,7 @@ SECTIONS {
code_continues2 : {
} :text
 
-   RODATA
+   RO_DATA(4096)
 
.opd : AT(ADDR(.opd) - LOAD_OFFSET) {
__start_opd = .;
diff --git a/arch/microblaze/kernel/vmlinux.lds.S b/arch/microblaze/kernel/vmlinux.lds.S
index d008e50bb212..2299694748ea 100644
--- a/arch/microblaze/kernel/vmlinux.lds.S
+++ b/arch/microblaze/kernel/vmlinux.lds.S
@@ -51,7 +51,7 @@ SECTIONS {
}
 
. = ALIGN(16);
-   RODATA
+   RO_DATA(4096)
EXCEPTION_TABLE(16)
 
/*
diff --git a/arch/mips/kernel/vmlinux.lds.S b/arch/mips/kernel/vmlinux.lds.S
index 91e566defc16..a5f00ec73ea6 100644
--- a/arch/mips/kernel/vmlinux.lds.S
+++ b/arch/mips/kernel/vmlinux.lds.S
@@ -82,7 +82,7 @@ SECTIONS
}
 
_sdata = .; /* Start of data section */
-   RODATA
+   RO_DATA(4096)
 
/* writeable */
.data : {   /* Data */
diff --git a/arch/um/include/asm/common.lds.S b/arch/um/include/asm/common.lds.S
index a24b284f5135..eca6c452a41b 100644
--- a/arch/um/include/asm/common.lds.S
+++ b/arch/um/include/asm/common.lds.S
@@ -9,7 +9,7 @@
   _sdata = .;
   PROVIDE (sdata = .);
 
-  RODATA
+  RO_DATA(4096)
 
   .unprotected : { *(.unprotected) }
   . = ALIGN(4096);
diff --git a/arch/xtensa/kernel/vmlinux.lds.S b/arch/xtensa/kernel/vmlinux.lds.S
index a0a843745695..b97e5798b9cf 100644
--- a/arch/xtensa/kernel/vmlinux.lds.S
+++ b/arch/xtensa/kernel/vmlinux.lds.S
@@ -124,7 +124,7 @@ SECTIONS
 
   . = ALIGN(16);
 
-  RODATA
+  RO_DATA(4096)
 
   /*  Relocation table */
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 3a4c1cb971da..9520dede6c7a 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -513,9 +513,7 @@
. = ALIGN((align)); \
__end_rodata = .;
 
-/* RODATA & RO_DATA provided for backward compatibility.
- * All archs are supposed to use RO_DATA() */
-#define RODATA  RO_DATA_SECTION(4096)
+/* All archs are supposed to use RO_DATA() */
 #define RO_DATA(align)  RO_DATA_SECTION(align)
 
 /*
-- 
2.17.1



[PATCH 15/29] x86: Actually use _etext for end of text segment

2019-09-26 Thread Kees Cook
Various calculations are using the end of the exception table (which
does not need to be executable) as the end of the text segment. Instead,
in preparation for moving the exception table into RO_DATA, move _etext
after the exception table and update the calculations.

Signed-off-by: Kees Cook 
---
 arch/x86/include/asm/sections.h | 1 -
 arch/x86/kernel/vmlinux.lds.S   | 7 +++
 arch/x86/mm/init_64.c   | 6 +++---
 arch/x86/mm/pti.c   | 2 +-
 4 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 71b32f2570ab..036c360910c5 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -6,7 +6,6 @@
 #include 
 
 extern char __brk_base[], __brk_limit[];
-extern struct exception_table_entry __stop___ex_table[];
 extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 41362e90142d..a1a758e25b2b 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -143,15 +143,14 @@ SECTIONS
*(.text.__x86.indirect_thunk)
__indirect_thunk_end = .;
 #endif
-
-   /* End of text section */
-   _etext = .;
} :text = 0x9090
 
EXCEPTION_TABLE(16)
 
-   /* .text should occupy whole number of pages */
+   /* End of text section, which should occupy whole number of pages */
+   _etext = .;
. = ALIGN(PAGE_SIZE);
+
X86_ALIGN_RODATA_BEGIN
RO_DATA(PAGE_SIZE)
X86_ALIGN_RODATA_END
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a6b5c653727b..26299e9ce6da 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1263,7 +1263,7 @@ int kernel_set_to_readonly;
 void set_kernel_text_rw(void)
 {
unsigned long start = PFN_ALIGN(_text);
-   unsigned long end = PFN_ALIGN(__stop___ex_table);
+   unsigned long end = PFN_ALIGN(_etext);
 
if (!kernel_set_to_readonly)
return;
@@ -1282,7 +1282,7 @@ void set_kernel_text_rw(void)
 void set_kernel_text_ro(void)
 {
unsigned long start = PFN_ALIGN(_text);
-   unsigned long end = PFN_ALIGN(__stop___ex_table);
+   unsigned long end = PFN_ALIGN(_etext);
 
if (!kernel_set_to_readonly)
return;
@@ -1301,7 +1301,7 @@ void mark_rodata_ro(void)
unsigned long start = PFN_ALIGN(_text);
unsigned long rodata_start = PFN_ALIGN(__start_rodata);
unsigned long end = (unsigned long) &__end_rodata_hpage_align;
-   unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
+   unsigned long text_end = PFN_ALIGN(&_etext);
unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
unsigned long all_end;
 
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..bd3404fd9d80 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -572,7 +572,7 @@ static void pti_clone_kernel_text(void)
 */
unsigned long start = PFN_ALIGN(_text);
unsigned long end_clone  = (unsigned long)__end_rodata_aligned;
-   unsigned long end_global = PFN_ALIGN((unsigned long)__stop___ex_table);
+   unsigned long end_global = PFN_ALIGN((unsigned long)_etext);
 
if (!pti_kernel_image_global_ok())
return;
-- 
2.17.1



[PATCH 22/29] microblaze: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/microblaze/kernel/vmlinux.lds.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/microblaze/kernel/vmlinux.lds.S b/arch/microblaze/kernel/vmlinux.lds.S
index b8efb08204a1..abe5ff0f3773 100644
--- a/arch/microblaze/kernel/vmlinux.lds.S
+++ b/arch/microblaze/kernel/vmlinux.lds.S
@@ -11,6 +11,8 @@
 OUTPUT_ARCH(microblaze)
 ENTRY(microblaze_start)
 
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  16
+
 #include 
 #include 
 #include 
@@ -52,7 +54,6 @@ SECTIONS {
 
. = ALIGN(16);
RO_DATA(4096)
-   EXCEPTION_TABLE(16)
 
/*
 * sdata2 section can go anywhere, but must be word aligned
-- 
2.17.1



[PATCH 16/29] x86: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The exception table was needlessly marked executable. In preparation
for execute-only memory, this moves the table into the RO_DATA segment
via a new macro that can be used by any architectures that want to make
a similar consolidation.

Signed-off-by: Kees Cook 
---
 arch/x86/kernel/vmlinux.lds.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a1a758e25b2b..a5c8571e4967 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -22,6 +22,7 @@
 #endif
 
 #define EMITS_PT_NOTE
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  16
 
 #include 
 #include 
@@ -145,8 +146,6 @@ SECTIONS
 #endif
} :text = 0x9090
 
-   EXCEPTION_TABLE(16)
-
/* End of text section, which should occupy whole number of pages */
_etext = .;
. = ALIGN(PAGE_SIZE);
-- 
2.17.1



[PATCH 03/29] powerpc: Rename PT_LOAD identifier "kernel" to "text"

2019-09-26 Thread Kees Cook
In preparation for moving NOTES into RO_DATA, this renames the linker
script internal identifier for the PT_LOAD Program Header from "kernel"
to "text" to match other architectures.

Signed-off-by: Kees Cook 
---
 arch/powerpc/kernel/vmlinux.lds.S | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index a3c8492b2b19..e184a63aa5b0 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -18,7 +18,7 @@
 ENTRY(_stext)
 
 PHDRS {
-   kernel PT_LOAD FLAGS(7); /* RWX */
+   text PT_LOAD FLAGS(7); /* RWX */
note PT_NOTE FLAGS(0);
 }
 
@@ -63,7 +63,7 @@ SECTIONS
 #else /* !CONFIG_PPC64 */
HEAD_TEXT
 #endif
-   } :kernel
+   } :text
 
__head_end = .;
 
@@ -112,7 +112,7 @@ SECTIONS
__got2_end = .;
 #endif /* CONFIG_PPC32 */
 
-   } :kernel
+   } :text
 
. = ALIGN(ETEXT_ALIGN_SIZE);
_etext = .;
@@ -163,9 +163,9 @@ SECTIONS
 #endif
EXCEPTION_TABLE(0)
 
-   NOTES :kernel :note
+   NOTES :text :note
/* Restore program header away from PT_NOTE. */
-   .dummy : { *(.dummy) } :kernel
+   .dummy : { *(.dummy) } :text
 
 /*
  * Init sections discarded at runtime
@@ -180,7 +180,7 @@ SECTIONS
 #ifdef CONFIG_PPC64
*(.tramp.ftrace.init);
 #endif
-   } :kernel
+   } :text
 
/* .exit.text is discarded at runtime, not link time,
 * to deal with references from __bug_table
-- 
2.17.1



[PATCH 00/29] vmlinux.lds.h: Refactor EXCEPTION_TABLE and NOTES

2019-09-26 Thread Kees Cook
This series works to move the linker sections for NOTES and
EXCEPTION_TABLE into the RO_DATA area, where they belong on most
(all?) architectures. The problem being addressed was the discovery
by Rick Edgecombe that the exception table was accidentally marked
executable while he was developing his execute-only-memory series. When
permissions were flipped from readable-and-executable to only-executable,
the exception table became unreadable, causing things to explode rather
badly. :)
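For background on why RO_DATA is the right home: the exception table is
a sorted array that the fault handler only binary-searches at runtime --
nothing executes it, and nothing writes it after boot-time sorting. A
simplified, self-contained sketch of that lookup (the in-tree version
lives in lib/extable.c and on x86 uses relative offsets):

#include <stddef.h>

/* Simplified extable entry; x86 actually stores relative offsets. */
struct exception_table_entry {
	unsigned long insn;	/* address of the faulting instruction */
	unsigned long fixup;	/* address to resume execution at */
};

static const struct exception_table_entry *
search_extable_sketch(const struct exception_table_entry *tbl,
		      int num, unsigned long addr)
{
	int lo = 0, hi = num - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == addr)
			return &tbl[mid];	/* fixup found */
		if (tbl[mid].insn < addr)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return NULL;	/* no fixup: the fault is genuinely fatal */
}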

Roughly speaking, the steps are:

- regularize the linker names for PT_NOTE and PT_LOAD program headers
  (to "note" and "text" respectively)
- regularize restoration of linker section to program header assignment
  (when PT_NOTE exists)
- move NOTES into RO_DATA
- finish macro naming conversions for RO_DATA and RW_DATA
- move EXCEPTION_TABLE into RO_DATA on architectures where this is clear
- clean up some x86-specific reporting of kernel memory resources
- switch x86 linker fill byte from 0x90 (NOP) to 0xcc (INT3), just because
  I finally realized what that trailing ": 0x9090" meant -- and we should
  trap, not slide, if execution lands in section padding

Since these changes are treewide, I'd love to get architecture-maintainer
Acks and either have this live in x86 -tip or in my own tree, however
people think it should go.

Thanks!

-Kees

Kees Cook (29):
  powerpc: Rename "notes" PT_NOTE to "note"
  powerpc: Remove PT_NOTE workaround
  powerpc: Rename PT_LOAD identifier "kernel" to "text"
  alpha: Rename PT_LOAD identifier "kernel" to "text"
  ia64: Rename PT_LOAD identifier "code" to "text"
  s390: Move RO_DATA into "text" PT_LOAD Program Header
  x86: Restore "text" Program Header with dummy section
  vmlinux.lds.h: Provide EMIT_PT_NOTE to indicate export of .notes
  vmlinux.lds.h: Move Program Header restoration into NOTES macro
  vmlinux.lds.h: Move NOTES into RO_DATA
  vmlinux.lds.h: Replace RODATA with RO_DATA
  vmlinux.lds.h: Replace RO_DATA_SECTION with RO_DATA
  vmlinux.lds.h: Replace RW_DATA_SECTION with RW_DATA
  vmlinux.lds.h: Allow EXCEPTION_TABLE to live in RO_DATA
  x86: Actually use _etext for end of text segment
  x86: Move EXCEPTION_TABLE to RO_DATA segment
  alpha: Move EXCEPTION_TABLE to RO_DATA segment
  arm64: Move EXCEPTION_TABLE to RO_DATA segment
  c6x: Move EXCEPTION_TABLE to RO_DATA segment
  h8300: Move EXCEPTION_TABLE to RO_DATA segment
  ia64: Move EXCEPTION_TABLE to RO_DATA segment
  microblaze: Move EXCEPTION_TABLE to RO_DATA segment
  parisc: Move EXCEPTION_TABLE to RO_DATA segment
  powerpc: Move EXCEPTION_TABLE to RO_DATA segment
  xtensa: Move EXCEPTION_TABLE to RO_DATA segment
  x86/mm: Remove redundant &s on addresses
  x86/mm: Report which part of kernel image is freed
  x86/mm: Report actual image regions in /proc/iomem
  x86: Use INT3 instead of NOP for linker fill bytes

 arch/alpha/kernel/vmlinux.lds.S  | 18 +-
 arch/arc/kernel/vmlinux.lds.S|  6 ++--
 arch/arm/kernel/vmlinux-xip.lds.S|  4 +--
 arch/arm/kernel/vmlinux.lds.S|  4 +--
 arch/arm64/kernel/vmlinux.lds.S  |  9 ++---
 arch/c6x/kernel/vmlinux.lds.S|  8 ++---
 arch/csky/kernel/vmlinux.lds.S   |  5 ++-
 arch/h8300/kernel/vmlinux.lds.S  |  9 ++---
 arch/hexagon/kernel/vmlinux.lds.S|  5 ++-
 arch/ia64/kernel/vmlinux.lds.S   | 20 +--
 arch/m68k/kernel/vmlinux-nommu.lds   |  4 +--
 arch/m68k/kernel/vmlinux-std.lds |  2 +-
 arch/m68k/kernel/vmlinux-sun3.lds|  2 +-
 arch/microblaze/kernel/vmlinux.lds.S |  8 ++---
 arch/mips/kernel/vmlinux.lds.S   | 15 
 arch/nds32/kernel/vmlinux.lds.S  |  5 ++-
 arch/nios2/kernel/vmlinux.lds.S  |  5 ++-
 arch/openrisc/kernel/vmlinux.lds.S   |  7 ++--
 arch/parisc/kernel/vmlinux.lds.S | 11 +++---
 arch/powerpc/kernel/vmlinux.lds.S| 37 ---
 arch/riscv/kernel/vmlinux.lds.S  |  5 ++-
 arch/s390/kernel/vmlinux.lds.S   | 12 +++
 arch/sh/kernel/vmlinux.lds.S |  3 +-
 arch/sparc/kernel/vmlinux.lds.S  |  3 +-
 arch/um/include/asm/common.lds.S |  3 +-
 arch/unicore32/kernel/vmlinux.lds.S  |  5 ++-
 arch/x86/include/asm/processor.h |  2 +-
 arch/x86/include/asm/sections.h  |  1 -
 arch/x86/kernel/setup.c  | 12 ++-
 arch/x86/kernel/vmlinux.lds.S| 16 -
 arch/x86/mm/init.c   |  8 ++---
 arch/x86/mm/init_64.c| 16 +
 arch/x86/mm/pti.c|  2 +-
 arch/xtensa/kernel/vmlinux.lds.S |  8 ++---
 include/asm-generic/vmlinux.lds.h| 53 
 35 files changed, 159 insertions(+), 174 deletions(-)

-- 
2.17.1



[PATCH 12/29] vmlinux.lds.h: Replace RO_DATA_SECTION with RO_DATA

2019-09-26 Thread Kees Cook
This finishes renaming RO_DATA_SECTION to RO_DATA. (Calling this a
"section" is a lie, since it's multiple sections and section flags cannot
be applied to the macro.)

Signed-off-by: Kees Cook 
---
 arch/arc/kernel/vmlinux.lds.S   | 2 +-
 arch/c6x/kernel/vmlinux.lds.S   | 2 +-
 arch/csky/kernel/vmlinux.lds.S  | 2 +-
 arch/h8300/kernel/vmlinux.lds.S | 2 +-
 arch/hexagon/kernel/vmlinux.lds.S   | 2 +-
 arch/m68k/kernel/vmlinux-nommu.lds  | 2 +-
 arch/nds32/kernel/vmlinux.lds.S | 2 +-
 arch/nios2/kernel/vmlinux.lds.S | 2 +-
 arch/openrisc/kernel/vmlinux.lds.S  | 4 ++--
 arch/parisc/kernel/vmlinux.lds.S| 4 ++--
 arch/riscv/kernel/vmlinux.lds.S | 2 +-
 arch/s390/kernel/vmlinux.lds.S  | 2 +-
 arch/unicore32/kernel/vmlinux.lds.S | 2 +-
 include/asm-generic/vmlinux.lds.h   | 7 ++-
 14 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/arc/kernel/vmlinux.lds.S b/arch/arc/kernel/vmlinux.lds.S
index 1d6eef4b6976..7d1d27066deb 100644
--- a/arch/arc/kernel/vmlinux.lds.S
+++ b/arch/arc/kernel/vmlinux.lds.S
@@ -95,7 +95,7 @@ SECTIONS
_etext = .;
 
_sdata = .;
-   RO_DATA_SECTION(PAGE_SIZE)
+   RO_DATA(PAGE_SIZE)
 
/*
 * 1. this is .data essentially
diff --git a/arch/c6x/kernel/vmlinux.lds.S b/arch/c6x/kernel/vmlinux.lds.S
index d6e3802536b3..a3547f9d415b 100644
--- a/arch/c6x/kernel/vmlinux.lds.S
+++ b/arch/c6x/kernel/vmlinux.lds.S
@@ -82,7 +82,7 @@ SECTIONS
 
EXCEPTION_TABLE(16)
 
-   RO_DATA_SECTION(PAGE_SIZE)
+   RO_DATA(PAGE_SIZE)
.const :
{
*(.const .const.* .gnu.linkonce.r.*)
diff --git a/arch/csky/kernel/vmlinux.lds.S b/arch/csky/kernel/vmlinux.lds.S
index 75dd31412242..8598bd7a7bcd 100644
--- a/arch/csky/kernel/vmlinux.lds.S
+++ b/arch/csky/kernel/vmlinux.lds.S
@@ -49,7 +49,7 @@ SECTIONS
 
 
_sdata = .;
-   RO_DATA_SECTION(PAGE_SIZE)
+   RO_DATA(PAGE_SIZE)
RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
_edata = .;
 
diff --git a/arch/h8300/kernel/vmlinux.lds.S b/arch/h8300/kernel/vmlinux.lds.S
index 88776e785245..d3247d33b115 100644
--- a/arch/h8300/kernel/vmlinux.lds.S
+++ b/arch/h8300/kernel/vmlinux.lds.S
@@ -38,7 +38,7 @@ SECTIONS
_etext = . ;
}
EXCEPTION_TABLE(16)
-   RO_DATA_SECTION(4)
+   RO_DATA(4)
ROMEND = .;
 #if defined(CONFIG_ROMKERNEL)
. = RAMTOP;
diff --git a/arch/hexagon/kernel/vmlinux.lds.S b/arch/hexagon/kernel/vmlinux.lds.S
index 6a6e8fc422ee..0145251fa317 100644
--- a/arch/hexagon/kernel/vmlinux.lds.S
+++ b/arch/hexagon/kernel/vmlinux.lds.S
@@ -50,7 +50,7 @@ SECTIONS
 
_sdata = .;
RW_DATA_SECTION(32,PAGE_SIZE,_THREAD_SIZE)
-   RO_DATA_SECTION(PAGE_SIZE)
+   RO_DATA(PAGE_SIZE)
_edata = .;
 
EXCEPTION_TABLE(16)
diff --git a/arch/m68k/kernel/vmlinux-nommu.lds b/arch/m68k/kernel/vmlinux-nommu.lds
index cf6edda38971..de80f8b8ae78 100644
--- a/arch/m68k/kernel/vmlinux-nommu.lds
+++ b/arch/m68k/kernel/vmlinux-nommu.lds
@@ -60,7 +60,7 @@ SECTIONS {
 #endif
 
_sdata = .;
-   RO_DATA_SECTION(PAGE_SIZE)
+   RO_DATA(PAGE_SIZE)
RW_DATA_SECTION(16, PAGE_SIZE, THREAD_SIZE)
_edata = .;
 
diff --git a/arch/nds32/kernel/vmlinux.lds.S b/arch/nds32/kernel/vmlinux.lds.S
index c4f1c5a604c3..10ff570ba95b 100644
--- a/arch/nds32/kernel/vmlinux.lds.S
+++ b/arch/nds32/kernel/vmlinux.lds.S
@@ -53,7 +53,7 @@ SECTIONS
_etext = .; /* End of text and rodata section */
 
_sdata = .;
-   RO_DATA_SECTION(PAGE_SIZE)
+   RO_DATA(PAGE_SIZE)
RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
_edata  =  .;
 
diff --git a/arch/nios2/kernel/vmlinux.lds.S b/arch/nios2/kernel/vmlinux.lds.S
index 20e4078b3477..318804a2c7a1 100644
--- a/arch/nios2/kernel/vmlinux.lds.S
+++ b/arch/nios2/kernel/vmlinux.lds.S
@@ -49,7 +49,7 @@ SECTIONS
__init_end = .;
 
_sdata = .;
-   RO_DATA_SECTION(PAGE_SIZE)
+   RO_DATA(PAGE_SIZE)
RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
_edata = .;
 
diff --git a/arch/openrisc/kernel/vmlinux.lds.S b/arch/openrisc/kernel/vmlinux.lds.S
index 142c51c994f5..f73e0d3ea09f 100644
--- a/arch/openrisc/kernel/vmlinux.lds.S
+++ b/arch/openrisc/kernel/vmlinux.lds.S
@@ -67,8 +67,8 @@ SECTIONS
 
_sdata = .;
 
-   /* Page alignment required for RO_DATA_SECTION */
-   RO_DATA_SECTION(PAGE_SIZE)
+   /* Page alignment required for RO_DATA */
+   RO_DATA(PAGE_SIZE)
_e_kernel_ro = .;
 
/* Whatever comes after _e_kernel_ro had better be page-aligend, too */
diff --git a/arch/parisc/kernel/vmlinux.lds.S b/arch/parisc/kernel/vmlinux.lds.S
index 168d12b2ebb8..e1c563c7dca1 100644
--- a/arch/parisc/kernel/vmlinux.lds.S
+++ b/arch/parisc/kernel/vmlinux.lds.S
@@ -109,7 +109,7 @@ SECTIONS
_sdata = .;
 
/* Architecturally we need to keep __gp below 

[PATCH 07/29] x86: Restore "text" Program Header with dummy section

2019-09-26 Thread Kees Cook
Instead of depending on markings in the section following NOTES to
restore the associated Program Header, use a dummy section, as done
in other architectures. This is preparation for moving NOTES into the
RO_DATA macro.

Signed-off-by: Kees Cook 
---
 arch/x86/kernel/vmlinux.lds.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index e2feacf921a0..788e78978030 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -147,8 +147,9 @@ SECTIONS
} :text = 0x9090
 
NOTES :text :note
+   .dummy : { *(.dummy) } :text
 
-   EXCEPTION_TABLE(16) :text = 0x9090
+   EXCEPTION_TABLE(16)
 
/* .text should occupy whole number of pages */
. = ALIGN(PAGE_SIZE);
-- 
2.17.1



[PATCH 10/29] vmlinux.lds.h: Move NOTES into RO_DATA

2019-09-26 Thread Kees Cook
The .notes section should be non-executable read-only data. As such, it
can live in the RO_DATA macro instead of being per-architecture defined.

Signed-off-by: Kees Cook 
---
 arch/alpha/kernel/vmlinux.lds.S  | 2 --
 arch/arc/kernel/vmlinux.lds.S| 2 --
 arch/arm/kernel/vmlinux-xip.lds.S| 2 --
 arch/arm/kernel/vmlinux.lds.S| 2 --
 arch/arm64/kernel/vmlinux.lds.S  | 1 -
 arch/c6x/kernel/vmlinux.lds.S| 1 -
 arch/csky/kernel/vmlinux.lds.S   | 1 -
 arch/h8300/kernel/vmlinux.lds.S  | 1 -
 arch/hexagon/kernel/vmlinux.lds.S| 1 -
 arch/ia64/kernel/vmlinux.lds.S   | 2 --
 arch/microblaze/kernel/vmlinux.lds.S | 1 -
 arch/mips/kernel/vmlinux.lds.S   | 2 --
 arch/nds32/kernel/vmlinux.lds.S  | 1 -
 arch/nios2/kernel/vmlinux.lds.S  | 1 -
 arch/openrisc/kernel/vmlinux.lds.S   | 1 -
 arch/parisc/kernel/vmlinux.lds.S | 1 -
 arch/powerpc/kernel/vmlinux.lds.S| 2 --
 arch/riscv/kernel/vmlinux.lds.S  | 1 -
 arch/s390/kernel/vmlinux.lds.S   | 2 --
 arch/sh/kernel/vmlinux.lds.S | 1 -
 arch/sparc/kernel/vmlinux.lds.S  | 1 -
 arch/um/include/asm/common.lds.S | 1 -
 arch/unicore32/kernel/vmlinux.lds.S  | 1 -
 arch/x86/kernel/vmlinux.lds.S| 2 --
 arch/xtensa/kernel/vmlinux.lds.S | 1 -
 include/asm-generic/vmlinux.lds.h| 9 +
 26 files changed, 5 insertions(+), 38 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index cdfdc91ce64c..bf28043485f6 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -34,8 +34,6 @@ SECTIONS
swapper_pg_dir = SWAPPER_PGD;
_etext = .; /* End of text section */
 
-   NOTES
-
RODATA
EXCEPTION_TABLE(16)
 
diff --git a/arch/arc/kernel/vmlinux.lds.S b/arch/arc/kernel/vmlinux.lds.S
index 6c693a9d29b6..1d6eef4b6976 100644
--- a/arch/arc/kernel/vmlinux.lds.S
+++ b/arch/arc/kernel/vmlinux.lds.S
@@ -118,8 +118,6 @@ SECTIONS
/DISCARD/ : {   *(.eh_frame) }
 #endif
 
-   NOTES
-
. = ALIGN(PAGE_SIZE);
_end = . ;
 
diff --git a/arch/arm/kernel/vmlinux-xip.lds.S b/arch/arm/kernel/vmlinux-xip.lds.S
index 8c74037ade22..d2a9651c24ad 100644
--- a/arch/arm/kernel/vmlinux-xip.lds.S
+++ b/arch/arm/kernel/vmlinux-xip.lds.S
@@ -70,8 +70,6 @@ SECTIONS
ARM_UNWIND_SECTIONS
 #endif
 
-   NOTES
-
_etext = .; /* End of text and rodata section */
 
ARM_VECTORS
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 23150c0f0f4d..068db6860867 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -81,8 +81,6 @@ SECTIONS
ARM_UNWIND_SECTIONS
 #endif
 
-   NOTES
-
 #ifdef CONFIG_STRICT_KERNEL_RWX
	. = ALIGN(1<<SECTION_SHIFT);

[PATCH 18/29] arm64: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/arm64/kernel/vmlinux.lds.S | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 81d94e371c95..c6ba2eee0ee8 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -5,6 +5,8 @@
  * Written by Martin Mares 
  */
 
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  8
+
 #include 
 #include 
 #include 
@@ -135,8 +137,8 @@ SECTIONS
. = ALIGN(SEGMENT_ALIGN);
_etext = .; /* End of text section */
 
-   RO_DATA(PAGE_SIZE)  /* everything from this point to */
-   EXCEPTION_TABLE(8)  /* __init_begin will be marked RO NX */
+   /* everything from this point to __init_begin will be marked RO NX */
+   RO_DATA(PAGE_SIZE)
 
. = ALIGN(PAGE_SIZE);
idmap_pg_dir = .;
-- 
2.17.1



[PATCH 09/29] vmlinux.lds.h: Move Program Header restoration into NOTES macro

2019-09-26 Thread Kees Cook
In preparation for moving NOTES into RO_DATA, the Program Header
assignment restoration needs to be part of the NOTES macro itself.

Signed-off-by: Kees Cook 
---
 arch/alpha/kernel/vmlinux.lds.S   |  5 +
 arch/ia64/kernel/vmlinux.lds.S|  4 +---
 arch/mips/kernel/vmlinux.lds.S|  3 +--
 arch/powerpc/kernel/vmlinux.lds.S |  4 +---
 arch/s390/kernel/vmlinux.lds.S|  4 +---
 arch/x86/kernel/vmlinux.lds.S |  3 +--
 include/asm-generic/vmlinux.lds.h | 13 +++--
 7 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index 363a60ba7c31..cdfdc91ce64c 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -34,10 +34,7 @@ SECTIONS
swapper_pg_dir = SWAPPER_PGD;
_etext = .; /* End of text section */
 
-   NOTES :text :note
-   .dummy : {
-   *(.dummy)
-   } :text
+   NOTES
 
RODATA
EXCEPTION_TABLE(16)
diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index e034a6a4a444..fdcc992ab360 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -70,9 +70,7 @@ SECTIONS {
/*
 * Read-only data
 */
-   NOTES :text :note   /* put .notes in text and mark in PT_NOTE  */
-   code_continues : {
-   } :text/* switch back to regular program...  */
+   NOTES
 
EXCEPTION_TABLE(16)
 
diff --git a/arch/mips/kernel/vmlinux.lds.S b/arch/mips/kernel/vmlinux.lds.S
index 1c95612eb800..6a22f531d815 100644
--- a/arch/mips/kernel/vmlinux.lds.S
+++ b/arch/mips/kernel/vmlinux.lds.S
@@ -81,8 +81,7 @@ SECTIONS
__stop___dbe_table = .;
}
 
-   NOTES NOTES_HEADERS
-   .dummy : { *(.dummy) } :text
+   NOTES
 
_sdata = .; /* Start of data section */
RODATA
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 7e26e20c8324..4f19d814d592 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -164,9 +164,7 @@ SECTIONS
 #endif
EXCEPTION_TABLE(0)
 
-   NOTES :text :note
-   /* Restore program header away from PT_NOTE. */
-   .dummy : { *(.dummy) } :text
+   NOTES
 
 /*
  * Init sections discarded at runtime
diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
index 646d939346df..f88eedeb915a 100644
--- a/arch/s390/kernel/vmlinux.lds.S
+++ b/arch/s390/kernel/vmlinux.lds.S
@@ -52,9 +52,7 @@ SECTIONS
_etext = .; /* End of text section */
} :text = 0x0700
 
-   NOTES :text :note
-
-   .dummy : { *(.dummy) } :text
+   NOTES
 
RO_DATA_SECTION(PAGE_SIZE)
 
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 2e18bf5c1aed..8be25b09c2b7 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -148,8 +148,7 @@ SECTIONS
_etext = .;
} :text = 0x9090
 
-   NOTES :text :note
-   .dummy : { *(.dummy) } :text
+   NOTES
 
EXCEPTION_TABLE(16)
 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 2cc3ff9ac8c7..6a0a657dfdb4 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -56,10 +56,18 @@
 
 /*
  * Only some architectures want to have the .notes segment visible in
- * a separate PT_NOTE ELF Program Header.
+ * a separate PT_NOTE ELF Program Header. When this happens, it needs
+ * to be visible in both the kernel text's PT_LOAD and the PT_NOTE
+ * Program Headers. In this case, though, the PT_LOAD needs to be made
+ * the default again so that all the following sections don't also end
+ * up in the PT_NOTE Program Header.
  */
 #ifdef EMITS_PT_NOTE
 #define NOTES_HEADERS  :text :note
+#define NOTES_HEADERS_RESTORE  __restore_ph : { *(.__restore_ph) } :text
+#else
+#define NOTES_HEADERS
+#define NOTES_HEADERS_RESTORE
 #endif
 
 /* Align . to a 8 byte boundary equals to maximum function alignment. */
@@ -792,7 +800,8 @@
__start_notes = .;  \
KEEP(*(.note.*))\
__stop_notes = .;   \
-   }
+   } NOTES_HEADERS \
+   NOTES_HEADERS_RESTORE
 
 #define INIT_SETUP(initsetup_align)\
. = ALIGN(initsetup_align); \
-- 
2.17.1



[PATCH 25/29] xtensa: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/xtensa/kernel/vmlinux.lds.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/xtensa/kernel/vmlinux.lds.S b/arch/xtensa/kernel/vmlinux.lds.S
index bdbd7c4056c1..7341964722ae 100644
--- a/arch/xtensa/kernel/vmlinux.lds.S
+++ b/arch/xtensa/kernel/vmlinux.lds.S
@@ -14,6 +14,8 @@
  * Joe Taylor 
  */
 
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  16
+
 #include 
 #include 
 #include 
@@ -130,7 +132,6 @@ SECTIONS
 
   .fixup   : { *(.fixup) }
 
-  EXCEPTION_TABLE(16)
   /* Data section */
 
   _sdata = .;
-- 
2.17.1



[PATCH 29/29] x86: Use INT3 instead of NOP for linker fill bytes

2019-09-26 Thread Kees Cook
Instead of using 0x90 (NOP) to fill bytes between functions, which makes
it easier to sloppily target functions in function pointer overwrite
attacks, fill with 0xCC (INT3) to force a trap. Also drops the space
between "=" and the value to better match the binutils documentation
https://sourceware.org/binutils/docs/ld/Output-Section-Fill.html#Output-Section-Fill

Example "objdump -d" before:

...
ffffffff810001e0 <start_cpu0>:
ffffffff810001e0:       48 8b 25 e1 b1 51 01    mov    0x151b1e1(%rip),%rsp        # ffffffff8251b3c8 <initial_stack>
ffffffff810001e7:       e9 d5 fe ff ff          jmpq   ffffffff810000c1 <secondary_startup_64+0x91>

ffffffff810001ec:       90                      nop
ffffffff810001ed:       90                      nop
ffffffff810001ee:       90                      nop
ffffffff810001ef:       90                      nop

ffffffff810001f0 <__startup_64>:
...

After:

...
ffffffff810001e0 <start_cpu0>:
ffffffff810001e0:       48 8b 25 41 79 53 01    mov    0x1537941(%rip),%rsp        # ffffffff82537b28 <initial_stack>
ffffffff810001e7:       e9 d5 fe ff ff          jmpq   ffffffff810000c1 <secondary_startup_64+0x91>

ffffffff810001ec:       cc                      int3
ffffffff810001ed:       cc                      int3
ffffffff810001ee:       cc                      int3
ffffffff810001ef:       cc                      int3

ffffffff810001f0 <__startup_64>:
...

Signed-off-by: Kees Cook 
---
 arch/x86/kernel/vmlinux.lds.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a5c8571e4967..a37817fafb22 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -144,7 +144,7 @@ SECTIONS
*(.text.__x86.indirect_thunk)
__indirect_thunk_end = .;
 #endif
-   } :text = 0x9090
+	} :text =0xcccc
 
/* End of text section, which should occupy whole number of pages */
_etext = .;
-- 
2.17.1
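
For reference, "=fillexp" at the end of an output-section statement is
what sets the pattern for any bytes the linker has to invent inside the
section (for example alignment padding between functions). A reduced
sketch of the construct, with illustrative section names:

	SECTIONS {
		.text : {
			*(.text.hot)
			*(.text)
			. = ALIGN(16);	/* padding bytes created here...   */
		} :text =0xcccc		/* ...take the 0xcc (INT3) pattern */
	}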



[PATCH 17/29] alpha: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/alpha/kernel/vmlinux.lds.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index edc45f45523b..72303827bcb4 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 #define EMITS_PT_NOTE
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  16
 
 #include 
 #include 
@@ -35,7 +36,6 @@ SECTIONS
_etext = .; /* End of text section */
 
RO_DATA(4096)
-   EXCEPTION_TABLE(16)
 
/* Will be freed after init */
__init_begin = ALIGN(PAGE_SIZE);
-- 
2.17.1



[PATCH 14/29] vmlinux.lds.h: Allow EXCEPTION_TABLE to live in RO_DATA

2019-09-26 Thread Kees Cook
Many architectures have an EXCEPTION_TABLE that needs only to be
read-only. As such, it should live in RO_DATA. This creates a macro to
identify this case for the architectures that can move EXCEPTION_TABLE
into RO_DATA.

Signed-off-by: Kees Cook 
---
 include/asm-generic/vmlinux.lds.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index d57a28786bb8..35a6cba39d9f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -69,6 +69,17 @@
 #define NOTES_HEADERS_RESTORE
 #endif
 
+/*
+ * Some architectures have non-executable read-only exception tables.
+ * They can be added to the RO_DATA segment by specifying their desired
+ * alignment.
+ */
+#ifdef RO_DATA_EXCEPTION_TABLE_ALIGN
+#define RO_DATA_EXCEPTION_TABLE					\
+	EXCEPTION_TABLE(RO_DATA_EXCEPTION_TABLE_ALIGN)
+#else
+#define RO_DATA_EXCEPTION_TABLE
+#endif
+
 /* Align . to a 8 byte boundary equals to maximum function alignment. */
 #define ALIGN_FUNCTION()  . = ALIGN(8)
 
@@ -508,6 +519,7 @@
__stop___modver = .;\
}   \
\
+   RO_DATA_EXCEPTION_TABLE \
NOTES   \
\
. = ALIGN((align)); \
-- 
2.17.1
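
The per-architecture patches that follow (17-25) all use the same opt-in
pattern: define the desired alignment before including the generic
script. A condensed sketch of such a conversion:

	/* arch/<arch>/kernel/vmlinux.lds.S (sketch) */
	#define RO_DATA_EXCEPTION_TABLE_ALIGN	16

	#include <asm-generic/vmlinux.lds.h>

	SECTIONS
	{
		/* ... */
		RO_DATA(PAGE_SIZE)	/* now also emits EXCEPTION_TABLE(16) */
		/* ... */
	}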



[PATCH 08/29] vmlinux.lds.h: Provide EMIT_PT_NOTE to indicate export of .notes

2019-09-26 Thread Kees Cook
In preparation for moving NOTES into RO_DATA, this provides a mechanism
for architectures that want to emit a PT_NOTE Program Header to do so.

Signed-off-by: Kees Cook 
---
 arch/alpha/kernel/vmlinux.lds.S   |  3 +++
 arch/ia64/kernel/vmlinux.lds.S|  2 ++
 arch/mips/kernel/vmlinux.lds.S| 12 ++--
 arch/powerpc/kernel/vmlinux.lds.S |  1 +
 arch/s390/kernel/vmlinux.lds.S|  2 ++
 arch/x86/kernel/vmlinux.lds.S |  2 ++
 include/asm-generic/vmlinux.lds.h |  8 
 7 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index 781090cacc96..363a60ba7c31 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -1,4 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+
+#define EMITS_PT_NOTE
+
 #include 
 #include 
 #include 
diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index c1067992fcd1..e034a6a4a444 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -5,6 +5,8 @@
 #include 
 #include 
 
+#define EMITS_PT_NOTE
+
 #include 
 
 OUTPUT_FORMAT("elf64-ia64-little")
diff --git a/arch/mips/kernel/vmlinux.lds.S b/arch/mips/kernel/vmlinux.lds.S
index 33ee0d18fb0a..1c95612eb800 100644
--- a/arch/mips/kernel/vmlinux.lds.S
+++ b/arch/mips/kernel/vmlinux.lds.S
@@ -10,6 +10,11 @@
  */
 #define BSS_FIRST_SECTIONS *(.bss..swapper_pg_dir)
 
+/* Cavium Octeon should not have a separate PT_NOTE Program Header. */
+#ifndef CONFIG_CAVIUM_OCTEON_SOC
+#define EMITS_PT_NOTE
+#endif
+
 #include 
 
 #undef mips
@@ -76,12 +81,7 @@ SECTIONS
__stop___dbe_table = .;
}
 
-#ifdef CONFIG_CAVIUM_OCTEON_SOC
-#define NOTES_HEADER
-#else /* CONFIG_CAVIUM_OCTEON_SOC */
-#define NOTES_HEADER :note
-#endif /* CONFIG_CAVIUM_OCTEON_SOC */
-   NOTES :text NOTES_HEADER
+   NOTES NOTES_HEADERS
.dummy : { *(.dummy) } :text
 
_sdata = .; /* Start of data section */
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index e184a63aa5b0..7e26e20c8324 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -6,6 +6,7 @@
 #endif
 
 #define BSS_FIRST_SECTIONS *(.bss.prominit)
+#define EMITS_PT_NOTE
 
 #include 
 #include 
diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
index 13294fef473e..646d939346df 100644
--- a/arch/s390/kernel/vmlinux.lds.S
+++ b/arch/s390/kernel/vmlinux.lds.S
@@ -15,6 +15,8 @@
 /* Handle ro_after_init data on our own. */
 #define RO_AFTER_INIT_DATA
 
+#define EMITS_PT_NOTE
+
 #include 
 #include 
 
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 788e78978030..2e18bf5c1aed 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -21,6 +21,8 @@
 #define LOAD_OFFSET __START_KERNEL_map
 #endif
 
+#define EMITS_PT_NOTE
+
 #include 
 #include 
 #include 
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index cd28f63bfbc7..2cc3ff9ac8c7 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -54,6 +54,14 @@
 #define LOAD_OFFSET 0
 #endif
 
+/*
+ * Only some architectures want to have the .notes segment visible in
+ * a separate PT_NOTE ELF Program Header.
+ */
+#ifdef EMITS_PT_NOTE
+#define NOTES_HEADERS  :text :note
+#endif
+
 /* Align . to a 8 byte boundary equals to maximum function alignment. */
 #define ALIGN_FUNCTION()  . = ALIGN(8)
 
-- 
2.17.1
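
As the per-architecture hunks above show, the opt-in is a single define
that must appear before the generic header is included. A condensed
sketch:

	/* arch/<arch>/kernel/vmlinux.lds.S (sketch) */
	#define EMITS_PT_NOTE

	#include <asm-generic/vmlinux.lds.h>

	/*
	 * With the define in place, NOTES_HEADERS expands to ":text :note",
	 * so the .notes section is recorded in both the text PT_LOAD and
	 * the PT_NOTE Program Headers.
	 */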



[PATCH 19/29] c6x: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/c6x/kernel/vmlinux.lds.S | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/c6x/kernel/vmlinux.lds.S b/arch/c6x/kernel/vmlinux.lds.S
index a3547f9d415b..9a09aab63ab3 100644
--- a/arch/c6x/kernel/vmlinux.lds.S
+++ b/arch/c6x/kernel/vmlinux.lds.S
@@ -5,6 +5,9 @@
  *  Copyright (C) 2010, 2011 Texas Instruments Incorporated
  *  Mark Salter 
  */
+
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  16
+
 #include 
 #include 
 #include 
@@ -80,8 +83,6 @@ SECTIONS
*(.gnu.warning)
}
 
-   EXCEPTION_TABLE(16)
-
RO_DATA(PAGE_SIZE)
.const :
{
-- 
2.17.1



[PATCH 24/29] powerpc: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/powerpc/kernel/vmlinux.lds.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 4e7cec088c8b..2ed44e5824d5 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -7,6 +7,7 @@
 
 #define BSS_FIRST_SECTIONS *(.bss.prominit)
 #define EMITS_PT_NOTE
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  0
 
 #include 
 #include 
@@ -162,7 +163,6 @@ SECTIONS
__stop__btb_flush_fixup = .;
}
 #endif
-   EXCEPTION_TABLE(0)
 
 /*
  * Init sections discarded at runtime
-- 
2.17.1



[PATCH 01/29] powerpc: Rename "notes" PT_NOTE to "note"

2019-09-26 Thread Kees Cook
The Program Header identifiers are internal to the linker scripts. In
preparation for moving the NOTES segment declaration into RO_DATA,
standardize the identifier for the PT_NOTE entry to "note" as used by
all other architectures that emit PT_NOTE.

Signed-off-by: Kees Cook 
---
 arch/powerpc/kernel/vmlinux.lds.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 060a1acd7c6d..81e672654789 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -19,7 +19,7 @@ ENTRY(_stext)
 
 PHDRS {
kernel PT_LOAD FLAGS(7); /* RWX */
-   notes PT_NOTE FLAGS(0);
+   note PT_NOTE FLAGS(0);
dummy PT_NOTE FLAGS(0);
 
/* binutils < 2.18 has a bug that makes it misbehave when taking an
@@ -177,7 +177,7 @@ SECTIONS
 #endif
EXCEPTION_TABLE(0)
 
-   NOTES :kernel :notes
+   NOTES :kernel :note
 
/* The dummy segment contents for the bug workaround mentioned above
   near PHDRS.  */
-- 
2.17.1



[PATCH 26/29] x86/mm: Remove redundant "&" on addresses

2019-09-26 Thread Kees Cook
The "&" on addresses are redundant and are better removed to match all
the other similar functions.

Signed-off-by: Kees Cook 
---
 arch/x86/mm/init_64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 26299e9ce6da..e67ddca8b7a8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1300,9 +1300,9 @@ void mark_rodata_ro(void)
 {
unsigned long start = PFN_ALIGN(_text);
unsigned long rodata_start = PFN_ALIGN(__start_rodata);
-   unsigned long end = (unsigned long) &__end_rodata_hpage_align;
-   unsigned long text_end = PFN_ALIGN(&_etext);
-   unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
+   unsigned long end = (unsigned long)__end_rodata_hpage_align;
+   unsigned long text_end = PFN_ALIGN(_etext);
+   unsigned long rodata_end = PFN_ALIGN(__end_rodata);
unsigned long all_end;
 
printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
-- 
2.17.1
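
The "&" is redundant because these boundary symbols are declared as
extern char arrays (see <asm/sections.h>), so the bare name already
decays to the address. A small sketch of the equivalence:

	extern char __end_rodata[];	/* declared in <asm/sections.h> */

	static void example(void)
	{
		unsigned long a = (unsigned long)__end_rodata;	/* array decays */
		unsigned long b = (unsigned long)&__end_rodata;	/* same address value */

		WARN_ON(a != b);	/* never fires: a == b */
	}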



[PATCH 04/29] alpha: Rename PT_LOAD identifier "kernel" to "text"

2019-09-26 Thread Kees Cook
In preparation for moving NOTES into RO_DATA, this renames the linker
script internal identifier for the PT_LOAD Program Header from "kernel"
to "text" to match other architectures.

Signed-off-by: Kees Cook 
---
 arch/alpha/kernel/vmlinux.lds.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index c4b5ceceab52..781090cacc96 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -8,7 +8,7 @@
 OUTPUT_FORMAT("elf64-alpha")
 OUTPUT_ARCH(alpha)
 ENTRY(__start)
-PHDRS { kernel PT_LOAD; note PT_NOTE; }
+PHDRS { text PT_LOAD; note PT_NOTE; }
 jiffies = jiffies_64;
 SECTIONS
 {
@@ -27,14 +27,14 @@ SECTIONS
LOCK_TEXT
*(.fixup)
*(.gnu.warning)
-   } :kernel
+   } :text
swapper_pg_dir = SWAPPER_PGD;
_etext = .; /* End of text section */
 
-   NOTES :kernel :note
+   NOTES :text :note
.dummy : {
*(.dummy)
-   } :kernel
+   } :text
 
RODATA
EXCEPTION_TABLE(16)
-- 
2.17.1



[PATCH 28/29] x86/mm: Report actual image regions in /proc/iomem

2019-09-26 Thread Kees Cook
The resource reservations made for the kernel image did not reflect
the gaps between text, rodata, and data. This adds the rodata resource
and updates the start/end calculations to match the respective calls to
free_kernel_image_pages().

Before (booted with "nokaslr" for easier comparison):

00100000-bffd9fff : System RAM
  01000000-01e011d0 : Kernel code
  01e011d1-025619bf : Kernel data
  02a95000-035fffff : Kernel bss

After:

00100000-bffd9fff : System RAM
  01000000-01e011d0 : Kernel code
  02000000-023d4fff : Kernel rodata
  02400000-025619ff : Kernel data
  02a95000-035fffff : Kernel bss

Signed-off-by: Kees Cook 
---
 arch/x86/kernel/setup.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index bbe35bf879f5..0bc71ee3a9c3 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -143,6 +143,13 @@ struct boot_params boot_params;
 /*
  * Machine setup..
  */
+static struct resource rodata_resource = {
+   .name   = "Kernel rodata",
+   .start  = 0,
+   .end= 0,
+   .flags  = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
+};
+
 static struct resource data_resource = {
.name   = "Kernel data",
.start  = 0,
@@ -951,7 +958,9 @@ void __init setup_arch(char **cmdline_p)
 
code_resource.start = __pa_symbol(_text);
code_resource.end = __pa_symbol(_etext)-1;
-   data_resource.start = __pa_symbol(_etext);
+   rodata_resource.start = __pa_symbol(__start_rodata);
+   rodata_resource.end = __pa_symbol(__end_rodata)-1;
+   data_resource.start = __pa_symbol(_sdata);
data_resource.end = __pa_symbol(_edata)-1;
bss_resource.start = __pa_symbol(__bss_start);
bss_resource.end = __pa_symbol(__bss_stop)-1;
@@ -1040,6 +1049,7 @@ void __init setup_arch(char **cmdline_p)
 
/* after parse_early_param, so could debug it */
	insert_resource(&iomem_resource, &code_resource);
+	insert_resource(&iomem_resource, &rodata_resource);
	insert_resource(&iomem_resource, &data_resource);
	insert_resource(&iomem_resource, &bss_resource);
 
-- 
2.17.1



[PATCH 27/29] x86/mm: Report which part of kernel image is freed

2019-09-26 Thread Kees Cook
The memory freeing report wasn't very useful for figuring out which
parts of the kernel image were being freed. This adds the details for
clearer reporting.

Before:

[2.150450] Freeing unused kernel image memory: 1348K
[2.154574] Write protecting the kernel read-only data: 20480k
[2.157641] Freeing unused kernel image memory: 2040K
[2.158827] Freeing unused kernel image memory: 172K

After:

[2.329678] Freeing unused kernel image (initmem) memory: 1348K
[2.331953] Write protecting the kernel read-only data: 20480k
[2.335361] Freeing unused kernel image (text/rodata gap) memory: 2040K
[2.336927] Freeing unused kernel image (rodata/data gap) memory: 172K

Signed-off-by: Kees Cook 
---
 arch/x86/include/asm/processor.h | 2 +-
 arch/x86/mm/init.c   | 8 
 arch/x86/mm/init_64.c| 6 --
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 6e0a3b43d027..790f250d39a8 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -958,7 +958,7 @@ static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
 
 extern unsigned long arch_align_stack(unsigned long sp);
 void free_init_pages(const char *what, unsigned long begin, unsigned long end);
-extern void free_kernel_image_pages(void *begin, void *end);
+extern void free_kernel_image_pages(const char *what, void *begin, void *end);
 
 void default_idle(void);
 #ifdef CONFIG_XEN
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd10d91a6115..e7bb483557c9 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -829,14 +829,13 @@ void free_init_pages(const char *what, unsigned long begin, unsigned long end)
  * used for the kernel image only.  free_init_pages() will do the
  * right thing for either kind of address.
  */
-void free_kernel_image_pages(void *begin, void *end)
+void free_kernel_image_pages(const char *what, void *begin, void *end)
 {
unsigned long begin_ul = (unsigned long)begin;
unsigned long end_ul = (unsigned long)end;
unsigned long len_pages = (end_ul - begin_ul) >> PAGE_SHIFT;
 
-
-   free_init_pages("unused kernel image", begin_ul, end_ul);
+   free_init_pages(what, begin_ul, end_ul);
 
/*
 * PTI maps some of the kernel into userspace.  For performance,
@@ -865,7 +864,8 @@ void __ref free_initmem(void)
 
mem_encrypt_free_decrypted_mem();
 
-   free_kernel_image_pages(&__init_begin, &__init_end);
+   free_kernel_image_pages("unused kernel image (initmem)",
+   &__init_begin, &__init_end);
 }
 
 #ifdef CONFIG_BLK_DEV_INITRD
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e67ddca8b7a8..dcb9bc961b39 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1334,8 +1334,10 @@ void mark_rodata_ro(void)
set_memory_ro(start, (end-start) >> PAGE_SHIFT);
 #endif
 
-   free_kernel_image_pages((void *)text_end, (void *)rodata_start);
-   free_kernel_image_pages((void *)rodata_end, (void *)_sdata);
+   free_kernel_image_pages("unused kernel image (text/rodata gap)",
+   (void *)text_end, (void *)rodata_start);
+   free_kernel_image_pages("unused kernel image (rodata/data gap)",
+   (void *)rodata_end, (void *)_sdata);
 
debug_checkwx();
 }
-- 
2.17.1



[PATCH 13/29] vmlinux.lds.h: Replace RW_DATA_SECTION with RW_DATA

2019-09-26 Thread Kees Cook
This renames RW_DATA_SECTION to RW_DATA. (Calling this a "section" is
a lie, since it's multiple sections and section flags cannot be applied
to the macro.)

Signed-off-by: Kees Cook 
---
 arch/alpha/kernel/vmlinux.lds.S  | 2 +-
 arch/arc/kernel/vmlinux.lds.S| 2 +-
 arch/arm/kernel/vmlinux-xip.lds.S| 2 +-
 arch/arm/kernel/vmlinux.lds.S| 2 +-
 arch/arm64/kernel/vmlinux.lds.S  | 2 +-
 arch/csky/kernel/vmlinux.lds.S   | 2 +-
 arch/h8300/kernel/vmlinux.lds.S  | 2 +-
 arch/hexagon/kernel/vmlinux.lds.S| 2 +-
 arch/m68k/kernel/vmlinux-nommu.lds   | 2 +-
 arch/m68k/kernel/vmlinux-std.lds | 2 +-
 arch/m68k/kernel/vmlinux-sun3.lds| 2 +-
 arch/microblaze/kernel/vmlinux.lds.S | 2 +-
 arch/nds32/kernel/vmlinux.lds.S  | 2 +-
 arch/nios2/kernel/vmlinux.lds.S  | 2 +-
 arch/openrisc/kernel/vmlinux.lds.S   | 2 +-
 arch/parisc/kernel/vmlinux.lds.S | 2 +-
 arch/riscv/kernel/vmlinux.lds.S  | 2 +-
 arch/s390/kernel/vmlinux.lds.S   | 2 +-
 arch/sh/kernel/vmlinux.lds.S | 2 +-
 arch/sparc/kernel/vmlinux.lds.S  | 2 +-
 arch/unicore32/kernel/vmlinux.lds.S  | 2 +-
 arch/xtensa/kernel/vmlinux.lds.S | 2 +-
 include/asm-generic/vmlinux.lds.h| 4 ++--
 23 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/alpha/kernel/vmlinux.lds.S b/arch/alpha/kernel/vmlinux.lds.S
index af411817dd7d..edc45f45523b 100644
--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -50,7 +50,7 @@ SECTIONS
 
_sdata = .; /* Start of rw data section */
_data = .;
-   RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
+   RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
 
.got : {
*(.got)
diff --git a/arch/arc/kernel/vmlinux.lds.S b/arch/arc/kernel/vmlinux.lds.S
index 7d1d27066deb..54139a6f469b 100644
--- a/arch/arc/kernel/vmlinux.lds.S
+++ b/arch/arc/kernel/vmlinux.lds.S
@@ -101,7 +101,7 @@ SECTIONS
 * 1. this is .data essentially
 * 2. THREAD_SIZE for init.task, must be kernel-stk sz aligned
 */
-   RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
+   RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
 
_edata = .;
 
diff --git a/arch/arm/kernel/vmlinux-xip.lds.S b/arch/arm/kernel/vmlinux-xip.lds.S
index d2a9651c24ad..21b8b271c80d 100644
--- a/arch/arm/kernel/vmlinux-xip.lds.S
+++ b/arch/arm/kernel/vmlinux-xip.lds.S
@@ -112,7 +112,7 @@ SECTIONS
 
. = ALIGN(THREAD_SIZE);
_sdata = .;
-   RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
+   RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
.data.ro_after_init : AT(ADDR(.data.ro_after_init) - LOAD_OFFSET) {
*(.data..ro_after_init)
}
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 068db6860867..319ccb10846a 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -141,7 +141,7 @@ SECTIONS
__init_end = .;
 
_sdata = .;
-   RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
+   RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
_edata = .;
 
BSS_SECTION(0, 0, 0)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 5cf9424485d5..81d94e371c95 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -205,7 +205,7 @@ SECTIONS
 
_data = .;
_sdata = .;
-   RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
+   RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
 
/*
 * Data written with the MMU off but read with the MMU on requires
diff --git a/arch/csky/kernel/vmlinux.lds.S b/arch/csky/kernel/vmlinux.lds.S
index 8598bd7a7bcd..2ff37beaf2bf 100644
--- a/arch/csky/kernel/vmlinux.lds.S
+++ b/arch/csky/kernel/vmlinux.lds.S
@@ -50,7 +50,7 @@ SECTIONS
 
_sdata = .;
RO_DATA(PAGE_SIZE)
-   RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
+   RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
_edata = .;
 
EXCEPTION_TABLE(L1_CACHE_BYTES)
diff --git a/arch/h8300/kernel/vmlinux.lds.S b/arch/h8300/kernel/vmlinux.lds.S
index d3247d33b115..2ac7bdcd2fe0 100644
--- a/arch/h8300/kernel/vmlinux.lds.S
+++ b/arch/h8300/kernel/vmlinux.lds.S
@@ -47,7 +47,7 @@ SECTIONS
 #endif
_sdata = . ;
__data_start = . ;
-   RW_DATA_SECTION(0, PAGE_SIZE, THREAD_SIZE)
+   RW_DATA(0, PAGE_SIZE, THREAD_SIZE)
 #if defined(CONFIG_ROMKERNEL)
 #undef ADDR
 #endif
diff --git a/arch/hexagon/kernel/vmlinux.lds.S b/arch/hexagon/kernel/vmlinux.lds.S
index 0145251fa317..0ca2471ddb9f 100644
--- a/arch/hexagon/kernel/vmlinux.lds.S
+++ b/arch/hexagon/kernel/vmlinux.lds.S
@@ -49,7 +49,7 @@ SECTIONS
INIT_DATA_SECTION(PAGE_SIZE)
 
_sdata = .;
-   RW_DATA_SECTION(32,PAGE_SIZE,_THREAD_SIZE)
+   RW_DATA(32,PAGE_SIZE,_THREAD_SIZE)
RO_DATA(PAGE_SIZE)
_edata = .;
 

[PATCH 20/29] h8300: Move EXCEPTION_TABLE to RO_DATA segment

2019-09-26 Thread Kees Cook
The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/h8300/kernel/vmlinux.lds.S | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/h8300/kernel/vmlinux.lds.S b/arch/h8300/kernel/vmlinux.lds.S
index 2ac7bdcd2fe0..aba90398b774 100644
--- a/arch/h8300/kernel/vmlinux.lds.S
+++ b/arch/h8300/kernel/vmlinux.lds.S
@@ -1,4 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+
+#define RO_DATA_EXCEPTION_TABLE_ALIGN  16
+
 #include 
 #include 
 #include 
@@ -37,7 +40,6 @@ SECTIONS
 #endif
_etext = . ;
}
-   EXCEPTION_TABLE(16)
RO_DATA(4)
ROMEND = .;
 #if defined(CONFIG_ROMKERNEL)
-- 
2.17.1



[PATCH 05/29] ia64: Rename PT_LOAD identifier "code" to "text"

2019-09-26 Thread Kees Cook
In preparation for moving NOTES into RO_DATA, this renames the linker
script internal identifier for the PT_LOAD Program Header from "code"
to "text" to match other architectures.

Signed-off-by: Kees Cook 
---
 arch/ia64/kernel/vmlinux.lds.S | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index 0da58cf8e213..c1067992fcd1 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -13,7 +13,7 @@ ENTRY(phys_start)
 jiffies = jiffies_64;
 
 PHDRS {
-   code   PT_LOAD;
+   text   PT_LOAD;
percpu PT_LOAD;
data   PT_LOAD;
note   PT_NOTE;
@@ -36,7 +36,7 @@ SECTIONS {
phys_start = _start - LOAD_OFFSET;
 
code : {
-   } :code
+   } :text
. = KERNEL_START;
 
_text = .;
@@ -68,9 +68,9 @@ SECTIONS {
/*
 * Read-only data
 */
-   NOTES :code :note   /* put .notes in text and mark in PT_NOTE  */
+   NOTES :text :note   /* put .notes in text and mark in PT_NOTE  */
code_continues : {
-   } : code   /* switch back to regular program...  */
+   } :text/* switch back to regular program...  */
 
EXCEPTION_TABLE(16)
 
@@ -102,9 +102,9 @@ SECTIONS {
__start_unwind = .;
*(.IA_64.unwind*)
__end_unwind = .;
-   } :code :unwind
+   } :text :unwind
code_continues2 : {
-   } : code
+   } :text
 
RODATA
 
@@ -224,7 +224,7 @@ SECTIONS {
_end = .;
 
code : {
-   } :code
+   } :text
 
STABS_DEBUG
DWARF_DEBUG
-- 
2.17.1



[PATCH 06/29] s390: Move RO_DATA into "text" PT_LOAD Program Header

2019-09-26 Thread Kees Cook
In preparation for moving NOTES into RO_DATA, this moves RO_DATA back
into the "text" PT_LOAD Program Header, as done with other
architectures. The "data" PT_LOAD now starts with the writable data
section.

Signed-off-by: Kees Cook 
---
 arch/s390/kernel/vmlinux.lds.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
index 7e0eb4020917..13294fef473e 100644
--- a/arch/s390/kernel/vmlinux.lds.S
+++ b/arch/s390/kernel/vmlinux.lds.S
@@ -52,7 +52,7 @@ SECTIONS
 
NOTES :text :note
 
-   .dummy : { *(.dummy) } :data
+   .dummy : { *(.dummy) } :text
 
RO_DATA_SECTION(PAGE_SIZE)
 
@@ -64,7 +64,7 @@ SECTIONS
.data..ro_after_init : {
 *(.data..ro_after_init)
JUMP_TABLE_DATA
-   }
+   } :data
EXCEPTION_TABLE(16)
. = ALIGN(PAGE_SIZE);
__end_ro_after_init = .;
-- 
2.17.1



[PATCH 02/29] powerpc: Remove PT_NOTE workaround

2019-09-26 Thread Kees Cook
The kernel requires gcc 4.6 now, so this PT_NOTE workaround can be
removed in preparation for moving NOTES into RO_DATA.

Signed-off-by: Kees Cook 
---
 arch/powerpc/kernel/vmlinux.lds.S | 24 ++--
 1 file changed, 2 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 81e672654789..a3c8492b2b19 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -20,20 +20,6 @@ ENTRY(_stext)
 PHDRS {
kernel PT_LOAD FLAGS(7); /* RWX */
note PT_NOTE FLAGS(0);
-   dummy PT_NOTE FLAGS(0);
-
-   /* binutils < 2.18 has a bug that makes it misbehave when taking an
-  ELF file with all segments at load address 0 as input.  This
-  happens when running "strip" on vmlinux, because of the AT() magic
-  in this linker script.  People using GCC >= 4.2 won't run into
-  this problem, because the "build-id" support will put some data
-  into the "notes" segment (at a non-zero load address).
-
-  To work around this, we force some data into both the "dummy"
-  segment and the kernel segment, so the dummy segment will get a
-  non-zero load address.  It's not enough to always create the
-  "notes" segment, since if nothing gets assigned to it, its load
-  address will be zero.  */
 }
 
 #ifdef CONFIG_PPC64
@@ -178,14 +164,8 @@ SECTIONS
EXCEPTION_TABLE(0)
 
NOTES :kernel :note
-
-   /* The dummy segment contents for the bug workaround mentioned above
-  near PHDRS.  */
-   .dummy : AT(ADDR(.dummy) - LOAD_OFFSET) {
-   LONG(0)
-   LONG(0)
-   LONG(0)
-   } :kernel :dummy
+   /* Restore program header away from PT_NOTE. */
+   .dummy : { *(.dummy) } :kernel
 
 /*
  * Init sections discarded at runtime
-- 
2.17.1



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-26 Thread Peter Zijlstra
On Thu, Sep 26, 2019 at 11:05:59AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 11:45:26PM +0200, Peter Zijlstra wrote:
> > [7.149889] [Firmware Bug]: device: 'pci0000:7f': no node assigned on NUMA capable HW
> > [7.882888] [Firmware Bug]: device: 'pci0000:ff': no node assigned on NUMA capable HW
> 
> Going by the limited number of intel numa boxes I have, it looks like:
> 
>   socket = (~busid) >> (8-n)

Bah, I got my notes mixed up, it should be: busid >> (8-n)

> where 'n' is the number of bits required to encode the largest socket
> id, ie 1 for 2-socket and 2 for 4 socket.
> 
> For 8 socket systems we start using pci domains, and things get more
> 'interesting' :/
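
Expressed as code, the corrected heuristic (busid >> (8-n)) would look
something like the sketch below. To be clear, this is only an
illustration of the observation, not a kernel API; guess_socket() is a
hypothetical name, and as noted the mapping breaks down once 8-socket
systems bring PCI domains into play.

	#include <linux/log2.h>

	/* Estimate the socket a PCI bus id belongs to (2- and 4-socket boxes). */
	static unsigned int guess_socket(unsigned int busid, unsigned int nr_sockets)
	{
		unsigned int n = order_base_2(nr_sockets);	/* bits for the socket id */

		return busid >> (8 - n);
	}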


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-26 Thread Peter Zijlstra
On Wed, Sep 25, 2019 at 11:45:26PM +0200, Peter Zijlstra wrote:
> [7.149889] [Firmware Bug]: device: 'pci0000:7f': no node assigned on NUMA capable HW
> [7.882888] [Firmware Bug]: device: 'pci0000:ff': no node assigned on NUMA capable HW

Going by the limited number of intel numa boxes I have, it looks like:

  socket = (~busid) >> (8-n)

where 'n' is the number of bits required to encode the largest socket
id, ie 1 for 2-socket and 2 for 4 socket.

For 8 socket systems we start using pci domains, and things get more
'interesting' :/


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2019 at 06:31:54PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 03:25:44PM +0200, Michal Hocko wrote:
> > I am sorry but I still do not understand why you consider this
> > whack-a-mole better than simply living with the fact that NUMA_NO_NODE is a
> > reality and that using the full cpu mask is a reasonable answer to that.
> 
> Because it doesn't make physical sense. A device _cannot_ be local to
> all CPUs in a NUMA system.

The below patch still gives a fair amount of noise on my fairly old and
cruft IVB-EP, but it gets rid of most of the simple stuff.

[2.890739] [Firmware Bug]: device: 'platform': no node assigned on NUMA capable HW
[2.901855] [Firmware Bug]: device: 'vtcon0': no node assigned on NUMA capable HW
[2.911804] [Firmware Bug]: device: 'id': no node assigned on NUMA capable HW
[3.800832] [Firmware Bug]: device: 'fbcon': no node assigned on NUMA capable HW
[4.824808] [Firmware Bug]: device: 'LNXSYSTM:00': no node assigned on NUMA capable HW
[5.112739] [Firmware Bug]: device: 'pci0000:00': no node assigned on NUMA capable HW
[6.703425] [Firmware Bug]: device: 'pci0000:80': no node assigned on NUMA capable HW
[7.049515] [Firmware Bug]: device: 'ACPI0004:00': no node assigned on NUMA capable HW
[7.078823] [Firmware Bug]: device: 'ACPI0004:01': no node assigned on NUMA capable HW
[7.149889] [Firmware Bug]: device: 'pci0000:7f': no node assigned on NUMA capable HW
[7.158798] [Firmware Bug]: device: '0000:7f': no node assigned on NUMA capable HW
[7.183796] [Firmware Bug]: device: '0000:7f:08.0': no node assigned on NUMA capable HW
[7.199796] [Firmware Bug]: device: '0000:7f:09.0': no node assigned on NUMA capable HW
[7.215792] [Firmware Bug]: device: '0000:7f:0a.0': no node assigned on NUMA capable HW
[7.231791] [Firmware Bug]: device: '0000:7f:0a.1': no node assigned on NUMA capable HW
[7.247793] [Firmware Bug]: device: '0000:7f:0a.2': no node assigned on NUMA capable HW
[7.262794] [Firmware Bug]: device: '0000:7f:0a.3': no node assigned on NUMA capable HW
[7.278789] [Firmware Bug]: device: '0000:7f:0b.0': no node assigned on NUMA capable HW
[7.294787] [Firmware Bug]: device: '0000:7f:0b.3': no node assigned on NUMA capable HW
[7.310794] [Firmware Bug]: device: '0000:7f:0c.0': no node assigned on NUMA capable HW
[7.325796] [Firmware Bug]: device: '0000:7f:0c.1': no node assigned on NUMA capable HW
[7.341790] [Firmware Bug]: device: '0000:7f:0c.2': no node assigned on NUMA capable HW
[7.357789] [Firmware Bug]: device: '0000:7f:0c.3': no node assigned on NUMA capable HW
[7.373789] [Firmware Bug]: device: '0000:7f:0c.4': no node assigned on NUMA capable HW
[7.388789] [Firmware Bug]: device: '0000:7f:0d.0': no node assigned on NUMA capable HW
[7.404791] [Firmware Bug]: device: '0000:7f:0d.1': no node assigned on NUMA capable HW
[7.420789] [Firmware Bug]: device: '0000:7f:0d.2': no node assigned on NUMA capable HW
[7.436790] [Firmware Bug]: device: '0000:7f:0d.3': no node assigned on NUMA capable HW
[7.451789] [Firmware Bug]: device: '0000:7f:0d.4': no node assigned on NUMA capable HW
[7.467799] [Firmware Bug]: device: '0000:7f:0e.0': no node assigned on NUMA capable HW
[7.483797] [Firmware Bug]: device: '0000:7f:0e.1': no node assigned on NUMA capable HW
[7.499830] [Firmware Bug]: device: '0000:7f:0f.0': no node assigned on NUMA capable HW
[7.515825] [Firmware Bug]: device: '0000:7f:0f.1': no node assigned on NUMA capable HW
[7.530823] [Firmware Bug]: device: '0000:7f:0f.2': no node assigned on NUMA capable HW
[7.546824] [Firmware Bug]: device: '0000:7f:0f.3': no node assigned on NUMA capable HW
[7.562823] [Firmware Bug]: device: '0000:7f:0f.4': no node assigned on NUMA capable HW
[7.578822] [Firmware Bug]: device: '0000:7f:0f.5': no node assigned on NUMA capable HW
[7.594830] [Firmware Bug]: device: '0000:7f:10.0': no node assigned on NUMA capable HW
[7.609834] [Firmware Bug]: device: '0000:7f:10.1': no node assigned on NUMA capable HW
[7.625825] [Firmware Bug]: device: '0000:7f:10.2': no node assigned on NUMA capable HW
[7.641824] [Firmware Bug]: device: '0000:7f:10.3': no node assigned on NUMA capable HW
[7.657825] [Firmware Bug]: device: '0000:7f:10.4': no node assigned on NUMA capable HW
[7.673824] [Firmware Bug]: device: '0000:7f:10.5': no node assigned on NUMA capable HW
[7.689792] [Firmware Bug]: device: '0000:7f:10.6': no node assigned on NUMA capable HW
[7.704825] [Firmware Bug]: device: '0000:7f:10.7': no node assigned on NUMA capable HW
[7.720791] [Firmware Bug]: device: '0000:7f:13.0': no node assigned on NUMA capable HW
[7.736793] [Firmware Bug]: device: '0000:7f:13.1': no node assigned on NUMA capable HW
[7.752791] [Firmware Bug]: device: '0000:7f:13.4': no node assigned on NUMA capable HW
[7.767780] [Firmware Bug]: 

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Michal Hocko
On Wed 25-09-19 12:40:40, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 03:19:39PM +0200, Michal Hocko wrote:
> 
> > > The below would get rid of the PMU and workqueue warnings with no
> > > side-effects (the device isn't used for anything except sysfs).
> > 
> > Hardcoding to 0 is simply wrong, if the node0 is cpuless for example...
> 
> It doesn't matter that 0 is _never_ used. These are fake devices,
> and all we care about is getting rid of that error.

That is a very subtle and hard-to-review assumption. Even if this holds
now, a future change might easily break it, AFAIU. It also assumes that
you catch all such special devices.

I am sorry but I still do not understand why you consider this
whack-a-mole better than simply living with the fact that NUMA_NO_NODE is a
reality and that using the full cpu mask is a reasonable answer to that.
Anyway, I feel we are in a loop here, so I will leave the final decision
to you.

> If it makes you feel better we can make it -2 and have dev_to_node()
> WARN if it ever sees one.

That would help

-- 
Michal Hocko
SUSE Labs
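
For context, the semantic being argued over can be sketched in a few
lines; the helper name here is hypothetical and this is not the exact
posted patch:

	#include <linux/cpumask.h>
	#include <linux/numa.h>

	/* NUMA_NO_NODE means "no preference", so report every online CPU
	 * instead of indexing node_to_cpumask_map[] with -1. */
	static const struct cpumask *node_mask_or_online(int node)
	{
		if (node == NUMA_NO_NODE)
			return cpu_online_mask;

		return cpumask_of_node(node);
	}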


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> From the discussion above, it seems making the node_to_cpumask_map()
> NUMA_NO_NODE aware is the most feasible way to move forward.

That's still wrong.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 03:19:39PM +0200, Michal Hocko wrote:

> > The below would get rid of the PMU and workqueue warnings with no
> > side-effects (the device isn't used for anything except sysfs).
> 
> Hardcoding to 0 is simply wrong, if the node0 is cpuless for example...

It doesn't matter that 0 is _never_ used. These are fake devices,
and all we care about is getting rid of that error.

If it makes you feel better we can make it -2 and have dev_to_node()
WARN if it ever sees one.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Yunsheng Lin
On 2019/9/24 21:19, Michal Hocko wrote:
> On Tue 24-09-19 14:59:36, Peter Zijlstra wrote:
>> On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
>>>> On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
>>>
>>>>> We can push back and say we don't respect the specification because it
>>>>> is batshit insane ;-)
>>>>
>>>> Here is my fingers crossed.
>>>>
>>>> [...]
>>>>
>>>>> Now granted; there's a number of virtual devices that really don't have
>>>>> a node affinity, but then, those are not hurt by forcing them onto a
>>>>> random node, they really don't do anything. Like:
>>>>
>>>> Do you really consider a random node a better fix than simply living
>>>> with a more robust NUMA_NO_NODE which tells the actual state? Page
>>>> allocator would effectively use the local node in that case. Any code
>>>> using the cpumask will know that any of the online cpus are usable.
>>>
>>> For the pmu devices? Yes, those 'devices' aren't actually used for
>>> anything other than sysfs entries.
>>>
>>> Nothing else uses the struct device.
>>
>> The below would get rid of the PMU and workqueue warnings with no
>> side-effects (the device isn't used for anything except sysfs).
> 
> Hardcoding to 0 is simply wrong, if the node0 is cpuless for example...

Hi, Peter & Michal

From the discussion above, it seems making the node_to_cpumask_map()
NUMA_NO_NODE aware is the most feasible way to move forward.

Any suggestion?

> 



Re: [PATCH v2 00/21] Refine memblock API

2019-09-24 Thread Adam Ford
On Mon, Jan 21, 2019 at 2:05 AM Mike Rapoport  wrote:
>
> Hi,
>
> Current memblock API is quite extensive and, which is more annoying,
> duplicated. Except the low-level functions that allow searching for a free
> memory region and marking it as reserved, memblock provides three (well,
> two and a half) sets of functions to allocate memory. There are several
> overlapping functions that return a physical address and there are
> functions that return virtual address. Those that return the virtual
> address may also clear the allocated memory. And, on top of all that, some
> allocators panic and some return NULL in case of error.
>
> This set tries to reduce the mess, and trim down the amount of memblock
> allocation methods.
>
> Patches 1-10 consolidate the functions that return physical address of
> the allocated memory
>
> Patches 11-13 are some trivial cleanups
>
> Patches 14-19 add checks for the return value of memblock_alloc*() and
> panics in case of errors. The patches 14-18 include some minor refactoring
> to have better readability of the resulting code and patch 19 is a
> mechanical addition of
>
> if (!ptr)
> panic();
>
> after memblock_alloc*() calls.
>
> And, finally, patches 20 and 21 remove panic() calls memblock and _nopanic
> variants from memblock.
>
> v2 changes:
> * replace some more %lu with %zu
> * remove panics where they are not needed in s390 and in printk
> * collect Acked-by and Reviewed-by.
>
>
> Christophe Leroy (1):
>   powerpc: use memblock functions returning virtual address
>
> Mike Rapoport (20):
>   openrisc: prefer memblock APIs returning virtual address
>   memblock: replace memblock_alloc_base(ANYWHERE) with memblock_phys_alloc
>   memblock: drop memblock_alloc_base_nid()
>   memblock: emphasize that memblock_alloc_range() returns a physical address
>   memblock: memblock_phys_alloc_try_nid(): don't panic
>   memblock: memblock_phys_alloc(): don't panic
>   memblock: drop __memblock_alloc_base()
>   memblock: drop memblock_alloc_base()
>   memblock: refactor internal allocation functions
>   memblock: make memblock_find_in_range_node() and choose_memblock_flags() 
> static
>   arch: use memblock_alloc() instead of memblock_alloc_from(size, align, 0)
>   arch: don't memset(0) memory returned by memblock_alloc()
>   ia64: add checks for the return value of memblock_alloc*()
>   sparc: add checks for the return value of memblock_alloc*()
>   mm/percpu: add checks for the return value of memblock_alloc*()
>   init/main: add checks for the return value of memblock_alloc*()
>   swiotlb: add checks for the return value of memblock_alloc*()
>   treewide: add checks for the return value of memblock_alloc*()
>   memblock: memblock_alloc_try_nid: don't panic
>   memblock: drop memblock_alloc_*_nopanic() variants
>
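
For reference, the mechanical check described in patches 14-19 of the
quoted series takes roughly this shape at each call site; the message
below is illustrative rather than an exact quote of the patches:

	ptr = memblock_alloc(size, SMP_CACHE_BYTES);
	if (!ptr)
		panic("%s: Failed to allocate %zu bytes\n", __func__, size);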
I know it's rather late, but this patch broke the Etnaviv 3D graphics
in my i.MX6Q.

When I try to use the 3D, it returns some errors and the dmesg log
shows some memory allocation errors too:
[3.682347] etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
[3.688669] etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
[3.695099] etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
[3.700800] etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
[3.723013] etnaviv-gpu 130000.gpu: command buffer outside valid memory window
[3.731308] etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
[3.752437] etnaviv-gpu 134000.gpu: command buffer outside valid memory window
[3.760583] etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
[3.766766] etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
[3.776131] [drm] Initialized etnaviv 1.2.0 20151214 for etnaviv on minor 0

# glmark2-es2-drm
Error creating gpu
Error: eglCreateWindowSurface failed with error: 0x3009
Error: eglCreateWindowSurface failed with error: 0x3009
Error: CanvasGeneric: Invalid EGL state
Error: main: Could not initialize canvas


Before this patch:

[3.691995] etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
[3.698356] etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
[3.704792] etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
[3.710488] etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
[3.733649] etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
[3.756115] etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
[3.762250] etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
[3.771432] [drm] Initialized etnaviv 1.2.0 20151214 for etnaviv on minor 0

and the 3D demos work without this.

I don't know enough about the i.MX6 or the 3D accelerator to know how
to fix it.
I am hoping someone in the know might have some suggestions.

>  arch/alpha/kernel/core_cia.c  |   5 +-
>  arch/alpha/kernel/core_marvel.c   |   6 +
>  arch/alpha/kernel/pci-noop.c  |  13 +-
>  arch/alpha/kernel/pci.c   |  11 +-
>  arch/alpha/kernel/pci_iommu.c |  16 +-
>  arch/alpha/kernel/setup.c   

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Tue 24-09-19 14:59:36, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> > > On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> > 
> > > > We can push back and say we don't respect the specification because it
> > > > is batshit insane ;-)
> > > 
> > > Here is my fingers crossed.
> > > 
> > > [...]
> > > 
> > > > Now granted; there's a number of virtual devices that really don't have
> > > > a node affinity, but then, those are not hurt by forcing them onto a
> > > > random node, they really don't do anything. Like:
> > > 
> > > Do you really consider a random node a better fix than simply living
> > > with a more robust NUMA_NO_NODE which tells the actual state? Page
> > > allocator would effectively use the local node in that case. Any code
> > > using the cpumask will know that any of the online cpus are usable.
> > 
> > For the pmu devices? Yes, those 'devices' aren't actually used for
> > anything other than sysfs entries.
> > 
> > Nothing else uses the struct device.
> 
> The below would get rid of the PMU and workqueue warnings with no
> side-effects (the device isn't used for anything except sysfs).

Hardcoding to 0 is simply wrong, if the node0 is cpuless for example...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> > On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> 
> > > We can push back and say we don't respect the specification because it
> > > is batshit insane ;-)
> > 
> > Here is my fingers crossed.
> > 
> > [...]
> > 
> > > Now granted; there's a number of virtual devices that really don't have
> > > a node affinity, but then, those are not hurt by forcing them onto a
> > > random node, they really don't do anything. Like:
> > 
> > Do you really consider a random node a better fix than simply living
> > with a more robust NUMA_NO_NODE which tells the actual state? Page
> > allocator would effectivelly use the local node in that case. Any code
> > using the cpumask will know that any of the online cpus are usable.
> 
> For the pmu devices? Yes, those 'devices' aren't actually used for
> anything other than sysfs entries.
> 
> Nothing else uses the struct device.

The below would get rid of the PMU and workqueue warnings with no
side-effects (the device isn't used for anything except sysfs).

I'm stuck in the device code for BDIs, I can't find a sane place to set
the node before it gets added, due to it using device_create_vargs().

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4f08b17d6426..2a64dcc3d70f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9965,6 +9965,7 @@ static int pmu_dev_alloc(struct pmu *pmu)
if (!pmu->dev)
goto out;
 
+   set_dev_node(pmu->dev, 0);
pmu->dev->groups = pmu->attr_groups;
device_initialize(pmu->dev);
ret = dev_set_name(pmu->dev, "%s", pmu->name);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bc2e09a8ea61..efafc4590bbe 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5613,6 +5613,7 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
	wq_dev->dev.bus = &wq_subsys;
wq_dev->dev.release = wq_device_release;
	dev_set_name(&wq_dev->dev, "%s", wq->name);
+	set_dev_node(&wq_dev->dev, 0);
 
/*
 * unbound_attrs are created separately.  Suppress uevent until


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:

> > We can push back and say we don't respect the specification because it
> > is batshit insane ;-)
> 
> Here is my fingers crossed.
> 
> [...]
> 
> > Now granted; there's a number of virtual devices that really don't have
> > a node affinity, but then, those are not hurt by forcing them onto a
> > random node, they really don't do anything. Like:
> 
> Do you really consider a random node a better fix than simply living
> with a more robust NUMA_NO_NODE which tells the actual state? Page
> allocator would effectively use the local node in that case. Any code
> using the cpumask will know that any of the online cpus are usable.

For the pmu devices? Yes, those 'devices' aren't actually used for
anything other than sysfs entries.

Nothing else uses the struct device.

> Compare that to a wild guess that might be easily wrong and have subtle
> side effects which are really hard to debug. You will only see a higher
> utilization on a specific node. Good luck with a bug report like that.

We'd have the FW_BUG in the dmesg, which should be a big fat clue.




Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 01:54:01PM +0200, Michal Hocko wrote:
> > On Tue 24-09-19 13:23:49, Peter Zijlstra wrote:
> > > On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> > [...]
> > > > To be honest I really fail to see why to object to a simple semantic
> > > > that NUMA_NO_NODE imply all usable cpus. Could you explain that please?
> > > 
> > > Because it feels wrong. The device needs to be _somewhere_. It simply
> > > cannot be node-less.
> > 
> > What if it doesn't have any numa preference for what ever reason? There
> > is no other way to express that than NUMA_NO_NODE.
> 
> Like I said; how does that physically work? The device needs to be
> somewhere. It _must_ have a preference.
> 
> > Anyway, I am not going to argue more about this because it seems more of
> > a discussion about "HW shouldn't be doing that although the specification
> > allows that" which cannot really have any outcome except of "feels
> > correct/wrong".
> 
> We can push back and say we don't respect the specification because it
> is batshit insane ;-)

Here is my fingers crossed.

[...]

> Now granted; there's a number of virtual devices that really don't have
> a node affinity, but then, those are not hurt by forcing them onto a
> random node, they really don't do anything. Like:

Do you really consider a random node a better fix than simply living
with a more robust NUMA_NO_NODE which tells the actual state? Page
allocator would effectively use the local node in that case. Any code
using the cpumask will know that any of the online cpus are usable.

Compare that to a wild guess that might be easily wrong and have subtle
side effects which are really hard to debug. You will only see a higher
utilization on a specific node. Good luck with a bug report like that.

Anyway, I really do not feel strongly about that. If you really consider
it a bad idea then I can live with that. This just felt easier and
reasonably consistent to address. Implementing the guessing and fighting
vendors who really do not feel like providing a real affinity sounds
harder and more error prone.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Yunsheng Lin
On 2019/9/24 19:58, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 07:44:28PM +0800, Yunsheng Lin wrote:
>> From [1], there is a lot of devices with node id of NUMA_NO_NODE with the
>> FW_BUG.
>>
>> [1] 
>> https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d...@huawei.com/
> 
> So aside of all the software devices which we can (and really should)
> fix; these:
> 
> [26.470076] pci0000:00: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [26.815436] pci0000:7b: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [27.004447] pci0000:7a: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [27.236797] pci0000:78: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [27.505833] pci0000:7c: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.056452] pci0000:74: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.470018] pci0000:80: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.726411] pci0000:bb: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.916656] pci0000:ba: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [29.152839] pci0000:b8: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [29.425808] pci0000:bc: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [29.718593] pci0000:b4: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 
> look like actual problems. How can PCI devices not have a node assigned?

The above PCI devices do not have a node assigned because I downgraded
the BIOS to an older version that has not implemented the "Proximity Domain"
feature specified by ACPI. Since that feature is optional, the BIOS vendor
denied that this is a BIOS bug.
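
For reference, a minimal sketch of where that node id comes from on an ACPI
system, assuming an acpi_handle for the device (acpi_get_node() returns
NUMA_NO_NODE when the optional _PXM object is absent):

    int node = acpi_get_node(handle);       /* NUMA_NO_NODE without _PXM */

    if (node != NUMA_NO_NODE)
        set_dev_node(dev, node);    /* otherwise dev->numa_node stays -1 */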




Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Yunsheng Lin
On 2019/9/24 19:28, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 07:07:36PM +0800, Yunsheng Lin wrote:
>> On 2019/9/24 17:25, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
 On 2019/9/24 4:34, Peter Zijlstra wrote:
>>>
> I'm saying the ACPI standard is wrong. Explain to me how it is
> physically possible to have a device without NUMA affinity in a NUMA
> system?
>
>  1) The fundamental interconnect is not uniform.
>  2) The device needs to actually be somewhere.
>

 From what I can see, NUMA_NO_NODE may make sense in the cases below:

 1) Theoretically, there could be a device that can access all the memory
 uniformly and can be accessed by all cpus uniformly even in a NUMA system.
 Suppose we have two nodes, and the device just sits in the middle of the
 interconnect between the two nodes.

 Even if we define a third node solely for the device, we may need to look at
 the node distance to decide whether the device can be accessed uniformly.

 Or we can decide that the device can be accessed uniformly by setting
 its node to NUMA_NO_NODE.
>>>
>>> This is indeed a theoretical case; it doesn't scale. The moment you're
>>> adding multiple sockets or even board interconnects this all goes out
>>> the window.
>>>
>>> And in this case, forcing the device to either node is fine.
>>
>> Not really.
>> For packet sending and receiving, the buffer memory may be allocated
>> dynamically. The node of the tx buffer memory is mainly based on the cpu
>> that is sending, the node of the rx buffer memory is mainly based on
>> the cpu the interrupt handler of the device is running on, and the
>> device's interrupt affinity is mainly based on the node id of the device.
>>
>> We can bind the processes that are using the device to both nodes
>> in order to utilize memory on both nodes for packet sending.
>>
>> But for packet receiving, node 1 may not be used because the node
>> id of the device is forced to node 0, and the default is to bind
>> the interrupt to the cpus of the same node.
>>
>> If node_to_cpumask_map() returns all usable cpus when the device's node
>> id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.
> 
> s/binded/bound/
> 
> Sure; the data can be allocated wherever, but the control structures are
> not dynamically allocated every time. They are persistent, and they will
> be local to some node.
> 
> Anyway, are you saying this stupid corner case is actually relevant?
> Because how does it scale out? What if you have 8 sockets, with each
> socket having 2 nodes and 1 such magic device? Then returning all CPUs
> is just plain wrong.

Yes, the hardware may not scale out, but what about the virtual device?

> 
 2) For many virtual devices, such as tun or loopback netdevice, they
 are also accessed uniformly by all cpus.
>>>
>>> Not true; the virtual device will sit in memory local to some node.
>>>
>>> And as with physical devices, you probably want at least one (virtual)
>>> queue per node.
>>
>> There may be similar handling as above for virtual device too.
> 
> And it'd be similarly broken.

From [1], there are a lot of devices with a node id of NUMA_NO_NODE and the
FW_BUG warning.

[1] 
https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d...@huawei.com/





Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 07:07:36PM +0800, Yunsheng Lin wrote:
> On 2019/9/24 17:25, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
> >> On 2019/9/24 4:34, Peter Zijlstra wrote:
> > 
> >>> I'm saying the ACPI standard is wrong. Explain to me how it is
> >>> physically possible to have a device without NUMA affinity in a NUMA
> >>> system?
> >>>
> >>>  1) The fundamental interconnect is not uniform.
> >>>  2) The device needs to actually be somewhere.
> >>>
> >>
> >> From what I can see, NUMA_NO_NODE may make sense in the cases below:
> >>
> >> 1) Theoretically, there could be a device that can access all the memory
> >> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
> >> Suppose we have two nodes, and the device just sits in the middle of the
> >> interconnect between the two nodes.
> >>
> >> Even if we define a third node solely for the device, we may need to look at
> >> the node distance to decide whether the device can be accessed uniformly.
> >>
> >> Or we can decide that the device can be accessed uniformly by setting
> >> its node to NUMA_NO_NODE.
> > 
> > This is indeed a theoretical case; it doesn't scale. The moment you're
> > adding multiple sockets or even board interconnects this all goes out
> > the window.
> > 
> > And in this case, forcing the device to either node is fine.
> 
> Not really.
> For packet sending and receiving, the buffer memory may be allocated
> dynamically. The node of the tx buffer memory is mainly based on the cpu
> that is sending, the node of the rx buffer memory is mainly based on
> the cpu the interrupt handler of the device is running on, and the
> device's interrupt affinity is mainly based on the node id of the device.
> 
> We can bind the processes that are using the device to both nodes
> in order to utilize memory on both nodes for packet sending.
> 
> But for packet receiving, node 1 may not be used because the node
> id of the device is forced to node 0, and the default is to bind
> the interrupt to the cpus of the same node.
> 
> If node_to_cpumask_map() returns all usable cpus when the device's node
> id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.

s/binded/bound/

Sure; the data can be allocated wherever, but the control structures are
not dynamically allocated every time. They are persistent, and they will
be local to some node.

Anyway, are you saying this stupid corner case is actually relevant?
Because how does it scale out? What if you have 8 sockets, with each
socket having 2 nodes and 1 such magic device? Then returning all CPUs
is just plain wrong.

> >> 2) For many virtual devices, such as tun or loopback netdevice, they
> >> are also accessed uniformly by all cpus.
> > 
> > Not true; the virtual device will sit in memory local to some node.
> > 
> > And as with physical devices, you probably want at least one (virtual)
> > queue per node.
> 
> There may be similar handling as above for virtual device too.

And it'd be similarly broken.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 11:17:14, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> > > On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > > > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Even the
> > > > > ACPI standard considers this optional. Yunsheng Lin has referred
> > > > > to
> > > > > the specific part of the standard in one of the earlier discussions.
> > > > > Trying to guess the node affinity is worse than providing all CPUs 
> > > > > IMHO.
> > > > 
> > > > I'm saying the ACPI standard is wrong.
> > > 
> > > Even if you were right on this the reality is that a HW is likely to
> > > follow that standard and we cannot rule out NUMA_NO_NODE being
> > > specified. As of now we would access beyond the defined array and that
> > > is clearly a bug.
> > 
> > Right, because the device node is wrong, so we fix _that_!
> > 
> > > Let's assume that this is really a bug for a moment. What are you going
> > > to do about that? BUG_ON? I do not really see any solution besides to 
> > > either
> > > provide something sensible or BUG_ON. If you are worried about a
> > > conditional then this should be pretty easy to solve by starting the
> > > array at -1 index and associate it with the online cpu mask.
> > 
> > The same thing I proposed earlier; force the device node to 0 (or any
> > other convenient random valid value) and issue a FW_BUG message to the
> > console.
> 
> Why would you "fix" anything and how do you know that node 0 is the
> right choice? I have seen setups with node 0 without any memory and
> similar unexpected things.

We don't know 0 is right; but we know 'unkown' is wrong, so we FW_BUG
and pick _something_.

> To be honest I really fail to see why to object to a simple semantic
> that NUMA_NO_NODE imply all usable cpus. Could you explain that please?

Because it feels wrong. The device needs to be _somewhere_. It simply
cannot be node-less.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Yunsheng Lin
On 2019/9/24 17:25, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
>> On 2019/9/24 4:34, Peter Zijlstra wrote:
> 
>>> I'm saying the ACPI standard is wrong. Explain to me how it is
>>> physically possible to have a device without NUMA affinity in a NUMA
>>> system?
>>>
>>>  1) The fundamental interconnect is not uniform.
>>>  2) The device needs to actually be somewhere.
>>>
>>
>> From what I can see, NUMA_NO_NODE may make sense in the cases below:
>>
>> 1) Theoretically, there could be a device that can access all the memory
>> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
>> Suppose we have two nodes, and the device just sits in the middle of the
>> interconnect between the two nodes.
>>
>> Even if we define a third node solely for the device, we may need to look at
>> the node distance to decide whether the device can be accessed uniformly.
>>
>> Or we can decide that the device can be accessed uniformly by setting
>> its node to NUMA_NO_NODE.
> 
> This is indeed a theoretical case; it doesn't scale. The moment you're
> adding multiple sockets or even board interconnects this all goes out
> the window.
> 
> And in this case, forcing the device to either node is fine.

Not really.
For packet sending and receiving, the buffer memory may be allocated
dynamically. The node of the tx buffer memory is mainly based on the cpu
that is sending, the node of the rx buffer memory is mainly based on
the cpu the interrupt handler of the device is running on, and the
device's interrupt affinity is mainly based on the node id of the device.

We can bind the processes that are using the device to both nodes
in order to utilize memory on both nodes for packet sending.

But for packet receiving, node 1 may not be used because the node
id of the device is forced to node 0, and the default is to bind
the interrupt to the cpus of the same node.

If node_to_cpumask_map() returns all usable cpus when the device's node
id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.
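
A minimal sketch of the pattern in question, assuming a PCI driver and the
stock irq_set_affinity_hint() helper:

    int node = dev_to_node(&pdev->dev);     /* NUMA_NO_NODE, or a forced 0 */
    const struct cpumask *mask = cpumask_of_node(node);

    /* if node was forced to 0, cpus on node 1 never service the rx irqs */
    irq_set_affinity_hint(irq, mask);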

> 
>> 2) For many virtual devices, such as tun or loopback netdevice, they
>> are also accessed uniformly by all cpus.
> 
> Not true; the virtual device will sit in memory local to some node.
> 
> And as with physical devices, you probably want at least one (virtual)
> queue per node.

There may be similar handling as above for virtual device too.




Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Tue 24-09-19 11:17:14, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> > On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> > [...]
> > > > Even the
> > > > ACPI standard considers this optional. Yunsheng Lin has referred to
> > > > the specific part of the standard in one of the earlier discussions.
> > > > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> > > 
> > > I'm saying the ACPI standard is wrong.
> > 
> > Even if you were right on this the reality is that a HW is likely to
> > follow that standard and we cannot rule out NUMA_NO_NODE being
> > specified. As of now we would access beyond the defined array and that
> > is clearly a bug.
> 
> Right, because the device node is wrong, so we fix _that_!
> 
> > Let's assume that this is really a bug for a moment. What are you going
> > to do about that? BUG_ON? I do not really see any solution besides to either
> > provide something sensible or BUG_ON. If you are worried about a
> > conditional then this should be pretty easy to solve by starting the
> > array at -1 index and associate it with the online cpu mask.
> 
> The same thing I proposed earlier; force the device node to 0 (or any
> other convenient random valid value) and issue a FW_BUG message to the
> console.

Why would you "fix" anything and how do you know that node 0 is the
right choice? I have seen setups with node 0 without any memory and
similar unexpected things.

To be honest I really fail to see why to object to a simple semantic
that NUMA_NO_NODE imply all usable cpus. Could you explain that please?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
> On 2019/9/24 4:34, Peter Zijlstra wrote:

> > I'm saying the ACPI standard is wrong. Explain to me how it is
> > physically possible to have a device without NUMA affinity in a NUMA
> > system?
> > 
> >  1) The fundamental interconnect is not uniform.
> >  2) The device needs to actually be somewhere.
> > 
> 
> From what I can see, NUMA_NO_NODE may make sense in the cases below:
> 
> 1) Theoretically, there could be a device that can access all the memory
> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
> Suppose we have two nodes, and the device just sits in the middle of the
> interconnect between the two nodes.
> 
> Even if we define a third node solely for the device, we may need to look at
> the node distance to decide whether the device can be accessed uniformly.
> 
> Or we can decide that the device can be accessed uniformly by setting
> its node to NUMA_NO_NODE.

This is indeed a theoretical case; it doesn't scale. The moment you're
adding multiple sockets or even board interconnects this all goes out
the window.

And in this case, forcing the device to either node is fine.

> 2) For many virtual devices, such as tun or loopback netdevice, they
> are also accessed uniformly by all cpus.

Not true; the virtual device will sit in memory local to some node.

And as with physical devices, you probably want at least one (virtual)
queue per node.



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> [...]
> > > Even the
> > > ACPI standard considers this optional. Yunsheng Lin has referred to
> > > the specific part of the standard in one of the earlier discussions.
> > > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> > 
> > I'm saying the ACPI standard is wrong.
> 
> Even if you were right on this the reality is that a HW is likely to
> follow that standard and we cannot rule out NUMA_NO_NODE being
> specified. As of now we would access beyond the defined array and that
> is clearly a bug.

Right, because the device node is wrong, so we fix _that_!

> Let's assume that this is really a bug for a moment. What are you going
> to do about that? BUG_ON? I do not really see any solution besides to either
> provide something sensible or BUG_ON. If you are worried about a
> conditional then this should be pretty easy to solve by starting the
> array at -1 index and associate it with the online cpu mask.

The same thing I proposed earlier; force the device node to 0 (or any
other convenient random valid value) and issue a FW_BUG message to the
console.
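
A minimal sketch of that proposal, assuming a struct device *dev (FW_BUG is
the standard "[Firmware Bug]:" printk prefix):

    if (dev_to_node(dev) == NUMA_NO_NODE) {
        dev_err(dev, FW_BUG "no NUMA node provided, forcing node 0\n");
        set_dev_node(dev, 0);   /* any valid node would do */
    }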



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
[...]
> > Even the
> > ACPI standard considers this optional. Yunsheng Lin has referred to
> > the specific part of the standard in one of the earlier discussions.
> > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> 
> I'm saying the ACPI standard is wrong.

Even if you were right on this the reality is that a HW is likely to
follow that standard and we cannot rule out NUMA_NO_NODE being
specified. As of now we would access beyond the defined array and that
is clearly a bug.

Let's assume that this is really a bug for a moment. What are you going
to do about that? BUG_ON? I do not really see any solution besides to either
provide something sensible or BUG_ON. If you are worried about a
conditional then this should be pretty easy to solve by starting the
array at -1 index and associate it with the online cpu mask.
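
A rough sketch of the "-1 index" idea, assuming a statically sized map whose
extra slot is kept in sync with the online mask from hotplug callbacks:

    static cpumask_t __numa_masks[MAX_NUMNODES + 1];
    /* slot 0 backs index -1, i.e. NUMA_NO_NODE */
    static cpumask_t *node_to_cpumask_map = &__numa_masks[1];
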
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Yunsheng Lin
On 2019/9/24 4:34, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
>> On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:
> 
>> To the NUMA_NO_NODE itself. Your earlier email noted:
>> : > +
>> : >  if ((unsigned)node >= nr_node_ids) {
>> : >  printk(KERN_WARNING
>> : >  "cpumask_of_node(%d): (unsigned)node >= 
>> nr_node_ids(%u)\n",
>> : 
>> : I still think this makes absolutely no sense whatsoever.
>>
>> Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
>> check?
> 
> The NUMA_NO_NODE thing. It's a physical impossibility. And if the
> device description doesn't give us a node, then the description is
> incomplete and wrong and we should bloody well complain about it.
> 
>> Because as to NUMA_NO_NODE I believe this makes sense because this is
>> the only way that a device is not bound to any numa node.
> 
> Which is a physical impossibility.
> 
>> Even the
>> ACPI standard considers this optional. Yunsheng Lin has referred to
>> the specific part of the standard in one of the earlier discussions.
>> Trying to guess the node affinity is worse than providing all CPUs IMHO.
> 
> I'm saying the ACPI standard is wrong. Explain to me how it is
> physically possible to have a device without NUMA affinity in a NUMA
> system?
> 
>  1) The fundamental interconnect is not uniform.
>  2) The device needs to actually be somewhere.
> 

From what I can see, NUMA_NO_NODE may make sense in the cases below:

1) Theoretically, there could be a device that can access all the memory
uniformly and can be accessed by all cpus uniformly even in a NUMA system.
Suppose we have two nodes, and the device just sits in the middle of the
interconnect between the two nodes.

Even if we define a third node solely for the device, we may need to look at
the node distance to decide whether the device can be accessed uniformly.

Or we can decide that the device can be accessed uniformly by setting
its node to NUMA_NO_NODE.


2) For many virtual devices, such as tun or loopback netdevice, they
are also accessed uniformly by all cpus.



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:

> To the NUMA_NO_NODE itself. Your earlier email noted:
> : > +
> : >   if ((unsigned)node >= nr_node_ids) {
> : >   printk(KERN_WARNING
> : >   "cpumask_of_node(%d): (unsigned)node >= 
> nr_node_ids(%u)\n",
> : 
> : I still think this makes absolutely no sense whatsoever.
> 
> Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
> check?

The NUMA_NO_NODE thing. It's a physical impossibility. And if the
device description doesn't give us a node, then the description is
incomplete and wrong and we should bloody well complain about it.

> Because as to NUMA_NO_NODE I believe this makes sense because this is
> the only way that a device is not bound to any numa node.

Which is a physical impossibility.

> Even the
> ACPI standard considers this optional. Yunsheng Lin has referred to
> the specific part of the standard in one of the earlier discussions.
> Trying to guess the node affinity is worse than providing all CPUs IMHO.

I'm saying the ACPI standard is wrong. Explain to me how it is
physically possible to have a device without NUMA affinity in a NUMA
system?

 1) The fundamental interconnect is not uniform.
 2) The device needs to actually be somewhere.

From these it seems to follow that access to the device is subject to
NUMA.



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Michal Hocko
On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 05:28:56PM +0200, Michal Hocko wrote:
> > On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:
> 
> > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > index 4123100e..9859acb 100644
> > > > --- a/arch/x86/mm/numa.c
> > > > +++ b/arch/x86/mm/numa.c
> > > > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> > > >   */
> > > >  const struct cpumask *cpumask_of_node(int node)
> > > >  {
> > > > +   if (node == NUMA_NO_NODE)
> > > > +   return cpu_online_mask;
> > > 
> > > This mandates the caller holds cpus_read_lock() or something, I'm pretty
> > > sure that if I put:
> > > 
> > >   lockdep_assert_cpus_held();
> > 
> > Is this documented somewhere?
> 
> No idea... common sense :-)

I thought that and cpu hotplug were forbidden to be used in the same
sentence :p

> > Also how does that differ from a normal
> > case when a proper node is used? The cpumask will always be dynamic in
> > the cpu hotplug presence, right?
> 
> As per normal yes, and I'm fairly sure there's a ton of bugs. Any
> 'online' state is subject to change except when you're holding
> sufficient locks to stop it.
> 
> Disabling preemption also stabilizes it, because cpu unplug relies on
> stop-machine.

OK, I guess it is fair to document that callers should be careful when
using this if they absolutely need any stability. But I strongly suspect
they simply do not care all that much. They mostly do care to have
something that gives them an idea which CPUs are close to the device and
that can tolerate some race.

In other words this is more of an optimization than a correctness issue.
 
> > > here, it comes apart real quick. Without holding the cpu hotplug lock,
> > > the online mask is gibberish.
> > 
> > Can the returned cpu mask go away?
> 
> No, the cpu_online_mask itself has static storage, the contents OTOH can
> change at will. Very little practical difference :-)
 
OK, thanks for the confirmation. I was worried that I've overlooked
something.

To the NUMA_NO_NODE itself. Your earlier email noted:
: > +
: > if ((unsigned)node >= nr_node_ids) {
: > printk(KERN_WARNING
: > "cpumask_of_node(%d): (unsigned)node >= 
nr_node_ids(%u)\n",
: 
: I still think this makes absolutely no sense whatsoever.

Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
check?

Because as to NUMA_NO_NODE I believe this makes sense because this is
the only way that a device is not bound to any numa node. Even the
ACPI standard considers this optional. Yunsheng Lin has referred to
the specific part of the standard in one of the earlier discussions.
Trying to guess the node affinity is worse than providing all CPUs IMHO.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2019 at 05:28:56PM +0200, Michal Hocko wrote:
> On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:

> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 4123100e..9859acb 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> > >   */
> > >  const struct cpumask *cpumask_of_node(int node)
> > >  {
> > > + if (node == NUMA_NO_NODE)
> > > + return cpu_online_mask;
> > 
> > This mandates the caller holds cpus_read_lock() or something, I'm pretty
> > sure that if I put:
> > 
> > lockdep_assert_cpus_held();
> 
> Is this documented somewhere?

No idea... common sense :-)

> Also how does that differ from a normal
> case when a proper node is used? The cpumask will always be dynamic in
> the cpu hotplug presence, right?

As per normal yes, and I'm fairly sure there's a ton of bugs. Any
'online' state is subject to change except when you're holding
sufficient locks to stop it.

Disabling preemption also stabilizes it, because cpu unplug relies on
stop-machine.
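
A minimal sketch of what "sufficient locks" means here, assuming a
device-local query:

    const struct cpumask *mask;

    cpus_read_lock();       /* pin the hotplug state */
    mask = cpumask_of_node(dev_to_node(dev));
    /* ... the contents of mask are stable until the unlock ... */
    cpus_read_unlock();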

> > here, it comes apart real quick. Without holding the cpu hotplug lock,
> > the online mask is gibberish.
> 
> Can the returned cpu mask go away?

No, the cpu_online_mask itself has static storage, the contents OTOH can
change at will. Very little practical difference :-)




Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Michal Hocko
On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:
> On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
> > When passing the return value of dev_to_node() to cpumask_of_node()
> > without checking if the device's node id is NUMA_NO_NODE, there is
> > global-out-of-bounds detected by KASAN.
> > 
> > From the discussion [1], NUMA_NO_NODE really means no node affinity,
> > which also means all cpus should be usable. So the cpumask_of_node()
> > should always return all cpus online when user passes the node id as
> > NUMA_NO_NODE, just like similar semantic that page allocator handles
> > NUMA_NO_NODE.
> > 
> > But we cannot really copy the page allocator logic. Simply because the
> > page allocator doesn't enforce the near node affinity. It just picks it
> > up as a preferred node but then it is free to fall back to any other numa
> > node. This is not the case here and node_to_cpumask_map will only restrict
> > to the particular node's cpus which would have really non deterministic
> > behavior depending on where the code is executed. So in fact we really
> > want to return cpu_online_mask for NUMA_NO_NODE.
> > 
> > Also there is a debugging version of node_to_cpumask_map() for x86 and
> > arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this
> > patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map().
> > 
> > [1] https://lore.kernel.org/patchwork/patch/1125789/
> 
> That is bloody unusable, don't do that. Use:
> 
>   https://lkml.kernel.org/r/$MSGID
> 
> if anything. Then I can find it in my local mbox without having to
> resort to touching a mouse and shitty browser software.
> 
> (also patchwork is absolute crap for reading email threads)
> 
> Anyway, I found it -- I think, I refused to click the link. I replied
> there.
> 
> > Signed-off-by: Yunsheng Lin 
> > Suggested-by: Michal Hocko 
> > Acked-by: Michal Hocko 
> 
> 
> 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 4123100e..9859acb 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> >   */
> >  const struct cpumask *cpumask_of_node(int node)
> >  {
> > +   if (node == NUMA_NO_NODE)
> > +   return cpu_online_mask;
> 
> This mandates the caller holds cpus_read_lock() or something, I'm pretty
> sure that if I put:
> 
>   lockdep_assert_cpus_held();

Is this documented somewhere? Also how does that differ from a normal
case when a proper node is used? The cpumask will always be dynamic in
the cpu hotplug presence, right?

> here, it comes apart real quick. Without holding the cpu hotplug lock,
> the online mask is gibberish.

Can the returned cpu mask go away?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Peter Zijlstra
On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
> When passing the return value of dev_to_node() to cpumask_of_node()
> without checking if the device's node id is NUMA_NO_NODE, there is
> global-out-of-bounds detected by KASAN.
> 
> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> which also means all cpus should be usable. So the cpumask_of_node()
> should always return all cpus online when user passes the node id as
> NUMA_NO_NODE, just like similar semantic that page allocator handles
> NUMA_NO_NODE.
> 
> But we cannot really copy the page allocator logic. Simply because the
> page allocator doesn't enforce the near node affinity. It just picks it
> up as a preferred node but then it is free to fall back to any other numa
> node. This is not the case here and node_to_cpumask_map will only restrict
> to the particular node's cpus which would have really non deterministic
> behavior depending on where the code is executed. So in fact we really
> want to return cpu_online_mask for NUMA_NO_NODE.
> 
> Also there is a debugging version of node_to_cpumask_map() for x86 and
> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this
> patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map().
> 
> [1] https://lore.kernel.org/patchwork/patch/1125789/

That is bloody unusable, don't do that. Use:

  https://lkml.kernel.org/r/$MSGID

if anything. Then I can find it in my local mbox without having to
resort to touching a mouse and shitty browser software.

(also patchwork is absolute crap for reading email threads)

Anyway, I found it -- I think, I refused to click the link. I replied
there.

> Signed-off-by: Yunsheng Lin 
> Suggested-by: Michal Hocko 
> Acked-by: Michal Hocko 



> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 4123100e..9859acb 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
>   */
>  const struct cpumask *cpumask_of_node(int node)
>  {
> + if (node == NUMA_NO_NODE)
> + return cpu_online_mask;

This mandates the caller holds cpus_read_lock() or something, I'm pretty
sure that if I put:

lockdep_assert_cpus_held();

here, it comes apart real quick. Without holding the cpu hotplug lock,
the online mask is gibberish.

> +
>   if ((unsigned)node >= nr_node_ids) {
>   printk(KERN_WARNING
>   "cpumask_of_node(%d): (unsigned)node >= 
> nr_node_ids(%u)\n",

I still think this makes absolutely no sense whatsoever.


Re: [PATCH v12 05/12] namei: obey trailing magic-link DAC permissions

2019-09-18 Thread Aleksa Sarai
On 2019-09-18, Aleksa Sarai  wrote:
> On 2019-09-17, Jann Horn  wrote:
> > On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai  wrote:
> > > The ability for userspace to "re-open" file descriptors through
> > > /proc/self/fd has been a very useful tool for all sorts of usecases
> > > (container runtimes are one common example). However, the current
> > > interface for doing this has resulted in some pretty subtle security
> > > holes. Userspace can re-open a file descriptor with more permissions
> > > than the original, which can result in cases such as /proc/$pid/exe
> > > being re-opened O_RDWR at a later date even though (by definition)
> > > /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> > > the results can get even more confusing.
> > [...]
> > > Instead we have to restrict it in such a way that it doesn't break
> > > (good) users but does block potential attackers. The solution applied in
> > > this patch is to restrict *re-opening* (not resolution through)
> > > magic-links by requiring that the mode of the link be obeyed. Normal
> > > symlinks have modes of a+rwx but magic-links have other modes. These
> > > magic-link modes were historically ignored during path resolution, but
> > > they've now been re-purposed for more useful ends.
> > 
> > Thanks for dealing with this issue!
> > 
> > [...]
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 209c51a5226c..54d57dad0f91 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
> > >
> > > nd->path = *path;
> > > nd->inode = nd->path.dentry->d_inode;
> > > -   nd->flags |= LOOKUP_JUMPED;
> > > +   nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> > >  }
> > [...]
> > > +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> > > + fmode_t *opath_mask)
> > > +{
> > > +   struct inode *inode = nd->link_inode;
> > > +   fmode_t upgrade_mask = 0;
> > > +
> > > +   /* Was the trailing_symlink() a magic-link? */
> > > +   if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> > > +   return 0;
> > > +
> > > +   /*
> > > +* Figure out the upgrade-mask of the link_inode. Since these 
> > > aren't
> > > +* strictly POSIX semantics we don't do an acl_permission_check() 
> > > here,
> > > +* so we only care that at least one bit is set for each 
> > > upgrade-mode.
> > > +*/
> > > +   if (inode->i_mode & S_IRUGO)
> > > +   upgrade_mask |= FMODE_PATH_READ;
> > > +   if (inode->i_mode & S_IWUGO)
> > > +   upgrade_mask |= FMODE_PATH_WRITE;
> > > +   /* Restrict the O_PATH upgrade-mask of the caller. */
> > > +   if (opath_mask)
> > > +   *opath_mask &= upgrade_mask;
> > > +   return may_open_magiclink(upgrade_mask, acc_mode);
> > >  }
> > 
> > This looks racy because entries in the file descriptor table can be
> > switched out as long as task->files->file_lock isn't held. Unless I'm
> > missing something, something like the following (untested) would
> > bypass this restriction:
> 
> You're absolutely right -- good catch!
> 
> > Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
> > path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
> > link_mode through from an out-argument of .proc_get_link()? Then
> > proc_fd_link() could grab the proper mode in a race-free manner. And
> > nd_jump_link() could stash the mode in the nameidata.
> 
> This indeed does appear to be the simplest solution -- I'm currently
> testing a variation of the patch you proposed (with a few extra bits to
> deal with nd_jump_link and proc_get_link being used elsewhere).
> 
> I'll include this change (assuming it fixes the flaw you found) in the
> v13 series I'll send around next week. Thanks, Jann!

In case you're interested -- I've also included a selftest based on this
attack in my series (though it uses CLONE_FILES so that we could also
test O_EMPTYPATH, which wasn't affected because it didn't go through
procfs and thus couldn't hit the "outdated inode->i_mode" problem).

The attack script succeeds around 20% of the time on the original
patchset, and with the updated patchset it doesn't succeed in several
hundred thousand attempts (which I've repeated a few times).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v12 05/12] namei: obey trailing magic-link DAC permissions

2019-09-18 Thread Aleksa Sarai
On 2019-09-17, Jann Horn  wrote:
> On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai  wrote:
> > The ability for userspace to "re-open" file descriptors through
> > /proc/self/fd has been a very useful tool for all sorts of usecases
> > (container runtimes are one common example). However, the current
> > interface for doing this has resulted in some pretty subtle security
> > holes. Userspace can re-open a file descriptor with more permissions
> > than the original, which can result in cases such as /proc/$pid/exe
> > being re-opened O_RDWR at a later date even though (by definition)
> > /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> > the results can get even more confusing.
> [...]
> > Instead we have to restrict it in such a way that it doesn't break
> > (good) users but does block potential attackers. The solution applied in
> > this patch is to restrict *re-opening* (not resolution through)
> > magic-links by requiring that the mode of the link be obeyed. Normal
> > symlinks have modes of a+rwx but magic-links have other modes. These
> > magic-link modes were historically ignored during path resolution, but
> > they've now been re-purposed for more useful ends.
> 
> Thanks for dealing with this issue!
> 
> [...]
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 209c51a5226c..54d57dad0f91 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
> >
> > nd->path = *path;
> > nd->inode = nd->path.dentry->d_inode;
> > -   nd->flags |= LOOKUP_JUMPED;
> > +   nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
> >  }
> [...]
> > +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> > + fmode_t *opath_mask)
> > +{
> > +   struct inode *inode = nd->link_inode;
> > +   fmode_t upgrade_mask = 0;
> > +
> > +   /* Was the trailing_symlink() a magic-link? */
> > +   if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> > +   return 0;
> > +
> > +   /*
> > +* Figure out the upgrade-mask of the link_inode. Since these aren't
> > +* strictly POSIX semantics we don't do an acl_permission_check() 
> > here,
> > +* so we only care that at least one bit is set for each 
> > upgrade-mode.
> > +*/
> > +   if (inode->i_mode & S_IRUGO)
> > +   upgrade_mask |= FMODE_PATH_READ;
> > +   if (inode->i_mode & S_IWUGO)
> > +   upgrade_mask |= FMODE_PATH_WRITE;
> > +   /* Restrict the O_PATH upgrade-mask of the caller. */
> > +   if (opath_mask)
> > +   *opath_mask &= upgrade_mask;
> > +   return may_open_magiclink(upgrade_mask, acc_mode);
> >  }
> 
> This looks racy because entries in the file descriptor table can be
> switched out as long as task->files->file_lock isn't held. Unless I'm
> missing something, something like the following (untested) would
> bypass this restriction:

You're absolutely right -- good catch!

> Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
> path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
> link_mode through from an out-argument of .proc_get_link()? Then
> proc_fd_link() could grab the proper mode in a race-free manner. And
> nd_jump_link() could stash the mode in the nameidata.

This indeed does appear to be the simplest solution -- I'm currently
testing a variation of the patch you proposed (with a few extra bits to
deal with nd_jump_link and proc_get_link being used elsewhere).

I'll include this change (assuming it fixes the flaw you found) in the
v13 series I'll send around next week. Thanks, Jann!

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v12 05/12] namei: obey trailing magic-link DAC permissions

2019-09-17 Thread Jann Horn
On Wed, Sep 4, 2019 at 10:21 PM Aleksa Sarai  wrote:
> The ability for userspace to "re-open" file descriptors through
> /proc/self/fd has been a very useful tool for all sorts of usecases
> (container runtimes are one common example). However, the current
> interface for doing this has resulted in some pretty subtle security
> holes. Userspace can re-open a file descriptor with more permissions
> than the original, which can result in cases such as /proc/$pid/exe
> being re-opened O_RDWR at a later date even though (by definition)
> /proc/$pid/exe cannot be opened for writing. When combined with O_PATH
> the results can get even more confusing.
[...]
> Instead we have to restrict it in such a way that it doesn't break
> (good) users but does block potential attackers. The solution applied in
> this patch is to restrict *re-opening* (not resolution through)
> magic-links by requiring that the mode of the link be obeyed. Normal
> symlinks have modes of a+rwx but magic-links have other modes. These
> magic-link modes were historically ignored during path resolution, but
> they've now been re-purposed for more useful ends.

Thanks for dealing with this issue!

[...]
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..54d57dad0f91 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -872,7 +872,7 @@ void nd_jump_link(struct path *path)
>
> nd->path = *path;
> nd->inode = nd->path.dentry->d_inode;
> -   nd->flags |= LOOKUP_JUMPED;
> +   nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
>  }
[...]
> +static int trailing_magiclink(struct nameidata *nd, int acc_mode,
> + fmode_t *opath_mask)
> +{
> +   struct inode *inode = nd->link_inode;
> +   fmode_t upgrade_mask = 0;
> +
> +   /* Was the trailing_symlink() a magic-link? */
> +   if (!(nd->flags & LOOKUP_MAGICLINK_JUMPED))
> +   return 0;
> +
> +   /*
> +* Figure out the upgrade-mask of the link_inode. Since these aren't
> +* strictly POSIX semantics we don't do an acl_permission_check() 
> here,
> +* so we only care that at least one bit is set for each upgrade-mode.
> +*/
> +   if (inode->i_mode & S_IRUGO)
> +   upgrade_mask |= FMODE_PATH_READ;
> +   if (inode->i_mode & S_IWUGO)
> +   upgrade_mask |= FMODE_PATH_WRITE;
> +   /* Restrict the O_PATH upgrade-mask of the caller. */
> +   if (opath_mask)
> +   *opath_mask &= upgrade_mask;
> +   return may_open_magiclink(upgrade_mask, acc_mode);
>  }

This looks racy because entries in the file descriptor table can be
switched out as long as task->files->file_lock isn't held. Unless I'm
missing something, something like the following (untested) would
bypass this restriction:

int readonly_fd = ...; /* some read-only fd we want to reopen as writable */
int writable_fd = open("/dev/null", O_RDWR);
int flippy_fd = dup(writable_fd);
char flippy_fd_path[100];
sprintf(flippy_fd_path, "/proc/%d/fd/%d", getpid(), flippy_fd);
if (fork() == 0) {
  while (1) {
int reopened_fd = open(flippy_fd_path, O_RDWR);
if (reopened_fd == -1) continue;
char reopened_fd_path[100];
sprintf(reopened_fd_path, "/proc/self/fd/%d", reopened_fd);
char reopened_fd_target[1000];
int target_len = readlink(reopened_fd_path, reopened_fd_target,
sizeof(reopened_fd_target)-1);
reopened_fd_target[target_len] = 0;
if (strcmp(reopened_fd_target, "/dev/null"))
  printf("managed to reopen as writable\n");
close(reopened_fd);
  }
} else {
  while (1) {
dup2(readonly_fd, flippy_fd);
dup2(writable_fd, flippy_fd);
  }
}

Perhaps you could change nd_jump_link() to "void nd_jump_link(struct
path *path, umode_t link_mode)", and let proc_pid_get_link() pass the
link_mode through from an out-argument of .proc_get_link()? Then
proc_fd_link() could grab the proper mode in a race-free manner. And
nd_jump_link() could stash the mode in the nameidata.

A sketch of how I imagine that:
===
diff --git a/fs/namei.c b/fs/namei.c
index 6b936038319b..14c6790203c7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -506,6 +506,7 @@ struct nameidata {
struct inode*link_inode;
unsignedroot_seq;
int dfd;
+   umode_t last_link_mode;
 } __randomize_layout;

 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -890,7 +891,7 @@ static int nd_jump_root(struct nameidata *nd)
  * Helper to directly jump to a known parsed path from ->get_link,
  * caller must have taken a reference to path beforehand.
  */
-void nd_jump_link(struct path *path)
+void nd_jump_link(struct path *path, umode_t link_mode)
 {
struct nameidata *nd = current->nameidata;
path_put(&nd->path);
@@ -898,6 +899,7 @@ void nd_jump_link(struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED | LOOKUP_MAGICLINK_JUMPED;
+   nd->last_link_mode = link_mode;

[PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-17 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking if the device's node id is NUMA_NO_NODE, a
global out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the user passes the node id
as NUMA_NO_NODE, mirroring the semantics with which the page allocator
handles NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce strict node affinity. It just picks it
up as a preferred node but then it is free to fall back to any other numa
node. This is not the case here: node_to_cpumask_map() will restrict to
the particular node's cpus, which would have really non-deterministic
behavior depending on where the code is executed. So in fact we really
want to return cpu_online_mask for NUMA_NO_NODE.
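
As a rough illustration (not part of the patch itself), the caller pattern
that trips KASAN, assuming a driver whose device got no node from firmware:

    /* dev_to_node() returns NUMA_NO_NODE (-1) for the affected devices */
    const struct cpumask *mask = cpumask_of_node(dev_to_node(dev));
    /* without the check added below, this reads node_to_cpumask_map[-1] */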

Also, there is a debugging version of node_to_cpumask_map() for x86 and
arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined; this
patch changes it to handle NUMA_NO_NODE like the normal node_to_cpumask_map().

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
---
V6: Drop the cpu_all_mask -> cpu_online_mask change, as it seems a
little controversial and may need deeper investigation; also rebased
on the latest linux-next.
V5: Drop unsigned "fix" change for x86/arm64, and change comment log
according to Michal's comment.
V4: Have all these changes in a single patch.
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for NUMA_NO_NODE case, and change the commit log to better justify
the change.
V2: make the node id checking change to other arches too.
---
 arch/arm64/include/asm/numa.h| 3 +++
 arch/arm64/mm/numa.c | 3 +++
 arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
 arch/s390/include/asm/topology.h | 3 +++
 arch/x86/include/asm/topology.h  | 3 +++
 arch/x86/mm/numa.c   | 3 +++
 6 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01..c8a4b31 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 4decf16..5ae7eea 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map);
  */
 const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
if (WARN_ON(node >= nr_node_ids))
return cpu_none_mask;
 
diff --git a/arch/mips/include/asm/mach-loongson64/topology.h 
b/arch/mips/include/asm/mach-loongson64/topology.h
index 7ff819a..e78daa6 100644
--- a/arch/mips/include/asm/mach-loongson64/topology.h
+++ b/arch/mips/include/asm/mach-loongson64/topology.h
@@ -5,7 +5,9 @@
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu)   (cpu_logical_map(cpu) >> 2)
-#define cpumask_of_node(node)  (&__node_data[(node)]->cpumask)
+#define cpumask_of_node(node)  ((node) == NUMA_NO_NODE ?   \
+cpu_online_mask :  \
+&__node_data[(node)]->cpumask)
 
 struct pci_bus;
 extern int pcibus_to_node(struct pci_bus *);
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index cca406f..1bd2e73 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -78,6 +78,9 @@ static inline int cpu_to_node(int cpu)
 #define cpumask_of_node cpumask_of_node
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
return &node_to_cpumask_map[node];
 }
 
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d23..7fa82e1 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -69,6 +69,9 @@ extern const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4123100e..9859acb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
  */
 const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
    if ((unsigned)node >= nr_node_ids) {

Re: [PATCH v5] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-17 Thread Yunsheng Lin
On 2019/9/17 18:08, Michal Hocko wrote:
> On Tue 17-09-19 17:53:57, Yunsheng Lin wrote:
>> On 2019/9/17 17:36, Michal Hocko wrote:
>>> On Tue 17-09-19 14:20:11, Yunsheng Lin wrote:
 On 2019/9/17 13:28, Michael Ellerman wrote:
> Yunsheng Lin  writes:
>>> [...]
>> But we cannot really copy the page allocator logic. Simply because the
>> page allocator doesn't enforce the near node affinity. It just picks it
>> up as a preferred node but then it is free to fall back to any other numa
>> node. This is not the case here and node_to_cpumask_map will only 
>> restrict
>> to the particular node's cpus which would have really non deterministic
>> behavior depending on where the code is executed. So in fact we really
>> want to return cpu_online_mask for NUMA_NO_NODE.
>>
>> Some arches were already NUMA_NO_NODE aware, but they return 
>> cpu_all_mask,
>> which should be identical with cpu_online_mask when those arches do not
>> support cpu hotplug, this patch also changes them to return 
>> cpu_online_mask
>> in order to be consistent and use NUMA_NO_NODE instead of "-1".
>
> Except some of those arches *do* support CPU hotplug, powerpc and sparc
> at least. So switching from cpu_all_mask to cpu_online_mask is a
> meaningful change.

 Yes, thanks for pointing out.

>
> That doesn't mean it's wrong, but you need to explain why it's the right
> change.

 How about adding the below to the commit log:
 Even if some of the arches do support CPU hotplug, it does not make sense
 to return the cpu that has been hotplugged.

 Any suggestion?
>>>
>>> Again, for the third time, I believe. Make it a separate patch please.
>>> There is absolutely no reason to conflate those two things.
>>
>> Ok, thanks.
>> Will make the cpu_all_mask -> cpu_online_mask change a separate patch.
> 
> Thanks. This really needs per arch maintainer to check closely.
> 
>> Also, do you think it is better to resend this as individual patches for
>> each arch or have all these changes in a single patch? I am not sure which
>> is the common practice for a multi-arch change like this.
> 
> It really depends on arch maintainers. Both approaches have some pros
> and cons. A single patch is more compact and parts are not going to
> get lost in the noise. It might generate some conflicts with parallel
> changes, but I suspect the conflict risk is quite low in this code
> considering recent activity. A follow-up arch-specific patch would have to be
> routed via Andrew as well.
> 
> If Andrew is ok routing it through his tree and none of the arch
> maintainers is opposed then I would go with a single patch.

Ok, I will try a single patch for the NUMA_NO_NODE-aware change first.
The "cpu_all_mask -> cpu_online_mask" change seems a little controversial
and may need deeper investigation.


> 



Re: [PATCH v5] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-17 Thread Michal Hocko
On Tue 17-09-19 17:53:57, Yunsheng Lin wrote:
> On 2019/9/17 17:36, Michal Hocko wrote:
> > On Tue 17-09-19 14:20:11, Yunsheng Lin wrote:
> >> On 2019/9/17 13:28, Michael Ellerman wrote:
> >>> Yunsheng Lin  writes:
> > [...]
>  But we cannot really copy the page allocator logic. Simply because the
>  page allocator doesn't enforce the near node affinity. It just picks it
>  up as a preferred node but then it is free to fall back to any other numa
>  node. This is not the case here and node_to_cpumask_map will only 
>  restrict
>  to the particular node's cpus which would have really non deterministic
>  behavior depending on where the code is executed. So in fact we really
>  want to return cpu_online_mask for NUMA_NO_NODE.
> 
>  Some arches were already NUMA_NO_NODE aware, but they return 
>  cpu_all_mask,
>  which should be identical with cpu_online_mask when those arches do not
>  support cpu hotplug, this patch also changes them to return 
>  cpu_online_mask
>  in order to be consistent and use NUMA_NO_NODE instead of "-1".
> >>>
> >>> Except some of those arches *do* support CPU hotplug, powerpc and sparc
> >>> at least. So switching from cpu_all_mask to cpu_online_mask is a
> >>> meaningful change.
> >>
> >> Yes, thanks for pointing out.
> >>
> >>>
> >>> That doesn't mean it's wrong, but you need to explain why it's the right
> >>> change.
> >>
> >> How about adding the below to the commit log:
> >> Even if some of the arches do support CPU hotplug, it does not make sense
> >> to return the cpu that has been hotplugged.
> >>
> >> Any suggestion?
> > 
> > Again, for the third time, I believe. Make it a separate patch please.
> > There is absolutely no reason to conflate those two things.
> 
> Ok, thanks.
> Will make the cpu_all_mask -> cpu_online_mask change a separate patch.

Thanks. This really needs per arch maintainer to check closely.

> Also, do you think it is better to resend this as individual patches for
> each arch or have all these changes in a single patch? I am not sure which
> is the common practice for a multi-arch change like this.

It really depends on arch maintainers. Both approaches have some pros
and cons. A single patch is more compact and parts are not going to
get lost in the noise. It might generate some conflicts with parallel
changes, but I suspect the conflict risk is quite low in this code
considering recent activity. A follow-up arch-specific patch would have
to be routed via Andrew as well.

If Andrew is ok routing it through his tree and none of the arch
maintainers is opposed then I would go with a single patch.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-17 Thread Yunsheng Lin
On 2019/9/17 17:36, Michal Hocko wrote:
> On Tue 17-09-19 14:20:11, Yunsheng Lin wrote:
>> On 2019/9/17 13:28, Michael Ellerman wrote:
>>> Yunsheng Lin  writes:
> [...]
 But we cannot really copy the page allocator logic, simply because the
 page allocator doesn't enforce the near node affinity: it just picks it
 up as a preferred node but is then free to fall back to any other numa
 node. This is not the case here, and node_to_cpumask_map will restrict
 only to the particular node's cpus, which would have really
 non-deterministic behavior depending on where the code is executed. So
 in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

 Some arches were already NUMA_NO_NODE aware, but they return cpu_all_mask,
 which should be identical with cpu_online_mask when those arches do not
 support cpu hotplug; this patch also changes them to return cpu_online_mask
 in order to be consistent, and uses NUMA_NO_NODE instead of "-1".
>>>
>>> Except some of those arches *do* support CPU hotplug, powerpc and sparc
>>> at least. So switching from cpu_all_mask to cpu_online_mask is a
>>> meaningful change.
>>
>> Yes, thanks for pointing out.
>>
>>>
>>> That doesn't mean it's wrong, but you need to explain why it's the right
>>> change.
>>
>> How about adding the below to the commit log:
>> Even if some of the arches do support CPU hotplug, it does not make sense
>> to return cpus that have been hot-unplugged and are offline.
>>
>> Any suggestion?
> 
> Again, for the third time, I believe. Make it a separate patch please.
> There is absolutely no reason to conflate those two things.

Ok, thanks.
Will make the cpu_all_mask -> cpu_online_mask change a separate patch.

Also, do you think it is better to resend this as individual patches for
each arch, or to have all these changes in a single patch? I am not sure
which is the common practice for multi-arch changes like this.

> 



Re: [PATCH v5] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-17 Thread Michal Hocko
On Tue 17-09-19 14:20:11, Yunsheng Lin wrote:
> On 2019/9/17 13:28, Michael Ellerman wrote:
> > Yunsheng Lin  writes:
[...]
> >> But we cannot really copy the page allocator logic, simply because the
> >> page allocator doesn't enforce the near node affinity: it just picks it
> >> up as a preferred node but is then free to fall back to any other numa
> >> node. This is not the case here, and node_to_cpumask_map will restrict
> >> only to the particular node's cpus, which would have really
> >> non-deterministic behavior depending on where the code is executed. So
> >> in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
> >>
> >> Some arches were already NUMA_NO_NODE aware, but they return cpu_all_mask,
> >> which should be identical with cpu_online_mask when those arches do not
> >> support cpu hotplug; this patch also changes them to return cpu_online_mask
> >> in order to be consistent, and uses NUMA_NO_NODE instead of "-1".
> > 
> > Except some of those arches *do* support CPU hotplug, powerpc and sparc
> > at least. So switching from cpu_all_mask to cpu_online_mask is a
> > meaningful change.
> 
> Yes, thanks for pointing out.
> 
> > 
> > That doesn't mean it's wrong, but you need to explain why it's the right
> > change.
> 
> How about adding the below to the commit log:
> Even if some of the arches do support CPU hotplug, it does not make sense
> to return cpus that have been hot-unplugged and are offline.
>
> Any suggestion?

Again, for the third time, I believe. Make it a separate patch please.
There is absolutely no reason to conflate those two things.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-17 Thread Yunsheng Lin
On 2019/9/17 13:28, Michael Ellerman wrote:
> Yunsheng Lin  writes:
>> When passing the return value of dev_to_node() to cpumask_of_node()
>> without checking whether the device's node id is NUMA_NO_NODE, a
>> global out-of-bounds access is detected by KASAN.
>>
>> From the discussion [1], NUMA_NO_NODE really means no node affinity,
>> which also means all cpus should be usable. So cpumask_of_node()
>> should always return all online cpus when the caller passes
>> NUMA_NO_NODE as the node id, just like the page allocator's handling
>> of NUMA_NO_NODE.
>>
>> But we cannot really copy the page allocator logic, simply because the
>> page allocator doesn't enforce the near node affinity: it just picks it
>> up as a preferred node but is then free to fall back to any other numa
>> node. This is not the case here, and node_to_cpumask_map will restrict
>> only to the particular node's cpus, which would have really
>> non-deterministic behavior depending on where the code is executed. So
>> in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>>
>> Some arches were already NUMA_NO_NODE aware, but they return cpu_all_mask,
>> which should be identical with cpu_online_mask when those arches do not
>> support cpu hotplug; this patch also changes them to return cpu_online_mask
>> in order to be consistent, and uses NUMA_NO_NODE instead of "-1".
> 
> Except some of those arches *do* support CPU hotplug, powerpc and sparc
> at least. So switching from cpu_all_mask to cpu_online_mask is a
> meaningful change.

Yes, thanks for pointing out.

> 
> That doesn't mean it's wrong, but you need to explain why it's the right
> change.

How about adding the below to the commit log:
Even if some of the arches do support CPU hotplug, it does not make sense
to return cpus that have been hot-unplugged and are offline.

Any suggestion?

> 
> 
>> Also there is a debugging version of node_to_cpumask_map() for x86 and
>> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined; this
>> patch changes it to handle NUMA_NO_NODE like the normal node_to_cpumask_map().
>>
>> [1] https://lore.kernel.org/patchwork/patch/1125789/
>> Signed-off-by: Yunsheng Lin 
>> Suggested-by: Michal Hocko 
>> Acked-by: Michal Hocko 
>> ---
>> V5: Drop unsigned "fix" change for x86/arm64, and change the commit log
>> according to Michal's comment.
>> V4: Have all these changes in a single patch.
> 
> This makes it much harder to get the patch merged, you basically have to
> get Andrew Morton to merge it now. Sending individual patches for each
> arch means each arch maintainer can merge them separately.

I am new to arch changes here, and am not sure which is the best way to get
a multi-arch change merged.

Do you think it is better to resend this as individual patches for each arch
after the merge window?

Thanks for reviewing.

> 
> cheers
> 
>> V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
>> for NUMA_NO_NODE case, and change the commit log to better justify
>> the change.
>> V2: make the node id checking change to other arches too.
>> ---
>>  arch/alpha/include/asm/topology.h| 2 +-
>>  arch/arm64/include/asm/numa.h| 3 +++
>>  arch/arm64/mm/numa.c | 3 +++
>>  arch/mips/include/asm/mach-ip27/topology.h   | 4 ++--
>>  arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
>>  arch/powerpc/include/asm/topology.h  | 6 +++---
>>  arch/s390/include/asm/topology.h | 3 +++
>>  arch/sparc/include/asm/topology_64.h | 6 +++---
>>  arch/x86/include/asm/topology.h  | 3 +++
>>  arch/x86/mm/numa.c   | 3 +++
>>  10 files changed, 27 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/alpha/include/asm/topology.h 
>> b/arch/alpha/include/asm/topology.h
>> index 5a77a40..836c9e2 100644
>> --- a/arch/alpha/include/asm/topology.h
>> +++ b/arch/alpha/include/asm/topology.h
>> @@ -31,7 +31,7 @@ static const struct cpumask *cpumask_of_node(int node)
>>  int cpu;
>>  
>>  if (node == NUMA_NO_NODE)
>> -return cpu_all_mask;
>> +return cpu_online_mask;
>>  
>>  cpumask_clear(&node_to_cpumask_map[node]);
>>  
>> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
>> index 626ad01..c8a4b31 100644
>> --- a/arch/arm64/include/asm/numa.h
>> +++ b/arch/arm64/include/asm/numa.h
>> @@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node);
>>  /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
>>  static inline const struct cpumask *cpumask_of_node(int node)
>>  {
>> +if (node == NUMA_NO_NODE)
>> +return cpu_online_mask;
>> +
>>  return node_to_cpumask_map[node];
>>  }
>>  #endif
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index 4f241cc..f57202d 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map);
>>   */
>>  const struct 

Re: [PATCH v5] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-16 Thread Michael Ellerman
Yunsheng Lin  writes:
> When passing the return value of dev_to_node() to cpumask_of_node()
> without checking whether the device's node id is NUMA_NO_NODE, a
> global out-of-bounds access is detected by KASAN.
>
> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> which also means all cpus should be usable. So cpumask_of_node()
> should always return all online cpus when the caller passes
> NUMA_NO_NODE as the node id, just like the page allocator's handling
> of NUMA_NO_NODE.
>
> But we cannot really copy the page allocator logic, simply because the
> page allocator doesn't enforce the near node affinity: it just picks it
> up as a preferred node but is then free to fall back to any other numa
> node. This is not the case here, and node_to_cpumask_map will restrict
> only to the particular node's cpus, which would have really
> non-deterministic behavior depending on where the code is executed. So
> in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>
> Some arches were already NUMA_NO_NODE aware, but they return cpu_all_mask,
> which should be identical with cpu_online_mask when those arches do not
> support cpu hotplug; this patch also changes them to return cpu_online_mask
> in order to be consistent, and uses NUMA_NO_NODE instead of "-1".

Except some of those arches *do* support CPU hotplug, powerpc and sparc
at least. So switching from cpu_all_mask to cpu_online_mask is a
meaningful change.

That doesn't mean it's wrong, but you need to explain why it's the right
change.


> Also there is a debugging version of node_to_cpumask_map() for x86 and
> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined; this
> patch changes it to handle NUMA_NO_NODE like the normal node_to_cpumask_map().
>
> [1] https://lore.kernel.org/patchwork/patch/1125789/
> Signed-off-by: Yunsheng Lin 
> Suggested-by: Michal Hocko 
> Acked-by: Michal Hocko 
> ---
> V5: Drop unsigned "fix" change for x86/arm64, and change the commit log
> according to Michal's comment.
> V4: Have all these changes in a single patch.

This makes it much harder to get the patch merged, you basically have to
get Andrew Morton to merge it now. Sending individual patches for each
arch means each arch maintainer can merge them separately.

cheers

> V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
> for NUMA_NO_NODE case, and change the commit log to better justify
> the change.
> V2: make the node id checking change to other arches too.
> ---
>  arch/alpha/include/asm/topology.h| 2 +-
>  arch/arm64/include/asm/numa.h| 3 +++
>  arch/arm64/mm/numa.c | 3 +++
>  arch/mips/include/asm/mach-ip27/topology.h   | 4 ++--
>  arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
>  arch/powerpc/include/asm/topology.h  | 6 +++---
>  arch/s390/include/asm/topology.h | 3 +++
>  arch/sparc/include/asm/topology_64.h | 6 +++---
>  arch/x86/include/asm/topology.h  | 3 +++
>  arch/x86/mm/numa.c   | 3 +++
>  10 files changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/arch/alpha/include/asm/topology.h 
> b/arch/alpha/include/asm/topology.h
> index 5a77a40..836c9e2 100644
> --- a/arch/alpha/include/asm/topology.h
> +++ b/arch/alpha/include/asm/topology.h
> @@ -31,7 +31,7 @@ static const struct cpumask *cpumask_of_node(int node)
>   int cpu;
>  
>   if (node == NUMA_NO_NODE)
> - return cpu_all_mask;
> + return cpu_online_mask;
>  
>   cpumask_clear(&node_to_cpumask_map[node]);
>  
> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
> index 626ad01..c8a4b31 100644
> --- a/arch/arm64/include/asm/numa.h
> +++ b/arch/arm64/include/asm/numa.h
> @@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node);
>  /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
>  static inline const struct cpumask *cpumask_of_node(int node)
>  {
> + if (node == NUMA_NO_NODE)
> + return cpu_online_mask;
> +
>   return node_to_cpumask_map[node];
>  }
>  #endif
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index 4f241cc..f57202d 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map);
>   */
>  const struct cpumask *cpumask_of_node(int node)
>  {
> + if (node == NUMA_NO_NODE)
> + return cpu_online_mask;
> +
>   if (WARN_ON(node >= nr_node_ids))
>   return cpu_none_mask;
>  
> diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
> b/arch/mips/include/asm/mach-ip27/topology.h
> index 965f079..04505e6 100644
> --- a/arch/mips/include/asm/mach-ip27/topology.h
> +++ b/arch/mips/include/asm/mach-ip27/topology.h
> @@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
>  extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
>  
>  #define cpu_to_node(cpu) (sn_cpu_info[(cpu)].p_nodeid)
> 

[PATCH v3 08/26] alpha/PCI: Use PCI_STD_NUM_BARS

2019-09-16 Thread Denis Efremov
Use the PCI_STD_NUM_BARS define instead of PCI_ROM_RESOURCE for the number
of standard PCI BARs.

Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Signed-off-by: Denis Efremov 
---
 arch/alpha/kernel/pci-sysfs.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/alpha/kernel/pci-sysfs.c b/arch/alpha/kernel/pci-sysfs.c
index f94c732fedeb..0021580d79ad 100644
--- a/arch/alpha/kernel/pci-sysfs.c
+++ b/arch/alpha/kernel/pci-sysfs.c
@@ -71,10 +71,10 @@ static int pci_mmap_resource(struct kobject *kobj,
struct pci_bus_region bar;
int i;
 
-   for (i = 0; i < PCI_ROM_RESOURCE; i++)
+   for (i = 0; i < PCI_STD_NUM_BARS; i++)
if (res == &pdev->resource[i])
break;
-   if (i >= PCI_ROM_RESOURCE)
+   if (i >= PCI_STD_NUM_BARS)
return -ENODEV;
 
if (res->flags & IORESOURCE_MEM && iomem_is_exclusive(res->start))
@@ -115,7 +115,7 @@ void pci_remove_resource_files(struct pci_dev *pdev)
 {
int i;
 
-   for (i = 0; i < PCI_ROM_RESOURCE; i++) {
+   for (i = 0; i < PCI_STD_NUM_BARS; i++) {
struct bin_attribute *res_attr;
 
res_attr = pdev->res_attr[i];
@@ -232,7 +232,7 @@ int pci_create_resource_files(struct pci_dev *pdev)
int retval;
 
/* Expose the PCI resources from this device as files */
-   for (i = 0; i < PCI_ROM_RESOURCE; i++) {
+   for (i = 0; i < PCI_STD_NUM_BARS; i++) {
 
/* skip empty resources */
if (!pci_resource_len(pdev, i))
-- 
2.21.0
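
For readers outside the PCI subsystem, a short sketch of the idiom this
patch standardizes (the function below is an assumed example, not code from
this series): PCI_ROM_RESOURCE and PCI_STD_NUM_BARS both evaluate to 6, so
the loop bound is unchanged; the new name states the intent (the count of
standard BARs) instead of reusing the index of the ROM resource.

	/* Sketch only: walk the six standard BARs of a pci_dev. */
	static void example_walk_bars(struct pci_dev *pdev)
	{
		int i;

		for (i = 0; i < PCI_STD_NUM_BARS; i++) {
			if (!pci_resource_len(pdev, i))
				continue;	/* skip empty BARs */
			dev_info(&pdev->dev, "BAR %d: %pR\n", i,
				 &pdev->resource[i]);
		}
	}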



[PATCH v5] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-16 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking whether the device's node id is NUMA_NO_NODE, a
global out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, just like the page allocator's handling
of NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce the near node affinity: it just picks it
up as a preferred node but is then free to fall back to any other numa
node. This is not the case here, and node_to_cpumask_map will restrict
only to the particular node's cpus, which would have really
non-deterministic behavior depending on where the code is executed. So
in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

Some arches were already NUMA_NO_NODE aware, but they return cpu_all_mask,
which should be identical with cpu_online_mask when those arches do not
support cpu hotplug; this patch also changes them to return cpu_online_mask
in order to be consistent, and uses NUMA_NO_NODE instead of "-1".

Also there is a debugging version of node_to_cpumask_map() for x86 and
arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined; this
patch changes it to handle NUMA_NO_NODE like the normal node_to_cpumask_map().

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
---
V5: Drop unsigned "fix" change for x86/arm64, and change the commit log
according to Michal's comment.
V4: Have all these changes in a single patch.
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for NUMA_NO_NODE case, and change the commit log to better justify
the change.
V2: make the node id checking change to other arches too.
---
 arch/alpha/include/asm/topology.h| 2 +-
 arch/arm64/include/asm/numa.h| 3 +++
 arch/arm64/mm/numa.c | 3 +++
 arch/mips/include/asm/mach-ip27/topology.h   | 4 ++--
 arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
 arch/powerpc/include/asm/topology.h  | 6 +++---
 arch/s390/include/asm/topology.h | 3 +++
 arch/sparc/include/asm/topology_64.h | 6 +++---
 arch/x86/include/asm/topology.h  | 3 +++
 arch/x86/mm/numa.c   | 3 +++
 10 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/alpha/include/asm/topology.h 
b/arch/alpha/include/asm/topology.h
index 5a77a40..836c9e2 100644
--- a/arch/alpha/include/asm/topology.h
+++ b/arch/alpha/include/asm/topology.h
@@ -31,7 +31,7 @@ static const struct cpumask *cpumask_of_node(int node)
int cpu;
 
if (node == NUMA_NO_NODE)
-   return cpu_all_mask;
+   return cpu_online_mask;
 
cpumask_clear(&node_to_cpumask_map[node]);
 
diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01..c8a4b31 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 4f241cc..f57202d 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map);
  */
 const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
if (WARN_ON(node >= nr_node_ids))
return cpu_none_mask;
 
diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
b/arch/mips/include/asm/mach-ip27/topology.h
index 965f079..04505e6 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
 extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
 
 #define cpu_to_node(cpu)   (sn_cpu_info[(cpu)].p_nodeid)
-#define cpumask_of_node(node)  ((node) == -1 ? \
-cpu_all_mask : \
+#define cpumask_of_node(node)  ((node) == NUMA_NO_NODE ?   \
+cpu_online_mask :  \
 &hub_data(node)->h_cpus)
 struct pci_bus;
 extern int pcibus_to_node(struct pci_bus *);
diff --git a/arch/mips/include/asm/mach-loongson64/topology.h 
b/arch/mips/include/asm/mach-loongson64/topology.h
index 7ff819a..e78daa6 100644
--- a/arch/mips/include/asm/mach-loongson64/topology.h
+++ 
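
To make the caller-visible effect concrete, a hypothetical driver-side
fragment (editor's sketch; "dev" and "task" are assumed names, not code
from the patch):

	/*
	 * dev_to_node() returns NUMA_NO_NODE (-1) when the device has
	 * no node affinity.
	 */
	const struct cpumask *mask = cpumask_of_node(dev_to_node(dev));

	/* Before the fix: out-of-bounds read of node_to_cpumask_map[-1]. */
	/* After the fix:  mask == cpu_online_mask, i.e. "run anywhere".  */
	set_cpus_allowed_ptr(task, mask);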

Re: [PATCH v4] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-16 Thread Yunsheng Lin
On 2019/9/16 20:23, Michal Hocko wrote:
> On Mon 16-09-19 20:07:22, Yunsheng Lin wrote:
> [...]
 @@ -861,9 +861,12 @@ void numa_remove_cpu(int cpu)
   */
  const struct cpumask *cpumask_of_node(int node)
  {
 -  if (node >= nr_node_ids) {
 +  if (node == NUMA_NO_NODE)
 +  return cpu_online_mask;
 +
 +  if ((unsigned int)node >= nr_node_ids) {
printk(KERN_WARNING
 -  "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
 +  "cpumask_of_node(%d): node >= nr_node_ids(%u)\n",
node, nr_node_ids);
dump_stack();
return cpu_none_mask;
>>>
>>> Why do we need this?
>>
>> As the commit log says, the above cpumask_of_node() is for debugging;
>> it should catch the other "node < 0" cases besides NUMA_NO_NODE.
> 
> OK, I would just make it a separate patch.

Ok, thanks.

> 



Re: [PATCH v4] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-16 Thread Michal Hocko
On Mon 16-09-19 20:07:22, Yunsheng Lin wrote:
[...]
> >> @@ -861,9 +861,12 @@ void numa_remove_cpu(int cpu)
> >>   */
> >>  const struct cpumask *cpumask_of_node(int node)
> >>  {
> >> -  if (node >= nr_node_ids) {
> >> +  if (node == NUMA_NO_NODE)
> >> +  return cpu_online_mask;
> >> +
> >> +  if ((unsigned int)node >= nr_node_ids) {
> >>printk(KERN_WARNING
> >> -  "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
> >> +  "cpumask_of_node(%d): node >= nr_node_ids(%u)\n",
> >>node, nr_node_ids);
> >>dump_stack();
> >>return cpu_none_mask;
> > 
> > Why do we need this?
> 
> As the commit log says, the above cpumask_of_node() is for debugging;
> it should catch the other "node < 0" cases besides NUMA_NO_NODE.

OK, I would just make it a separate patch.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-16 Thread Yunsheng Lin
On 2019/9/16 16:43, Michal Hocko wrote:
> On Sun 15-09-19 16:20:56, Yunsheng Lin wrote:
>> When passing the return value of dev_to_node() to cpumask_of_node()
>> without checking whether the device's node id is NUMA_NO_NODE, a
>> global out-of-bounds access is detected by KASAN.
>>
>> From the discussion [1], NUMA_NO_NODE really means no node affinity,
>> which also means all cpus should be usable. So cpumask_of_node()
>> should always return all online cpus when the caller passes
>> NUMA_NO_NODE as the node id, just like the page allocator's handling
>> of NUMA_NO_NODE.
>>
>> But we cannot really copy the page allocator logic, simply because the
>> page allocator doesn't enforce the near node affinity: it just picks it
>> up as a preferred node but is then free to fall back to any other numa
>> node. This is not the case here, and node_to_cpumask_map will restrict
>> only to the particular node's cpus, which would have really
>> non-deterministic behavior depending on where the code is executed. So
>> in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>>
>> Some arches were already NUMA_NO_NODE aware, so only change them to return
>> cpu_online_mask and use NUMA_NO_NODE instead of "-1".
>>
>> Also there is a debugging version of node_to_cpumask_map() for x86 and
>> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined; this
>> patch changes it to handle NUMA_NO_NODE like the normal
>> node_to_cpumask_map(), and "fixes" a sign "bug", since it is for debugging
>> and should catch all the error cases.
>>
>> [1] https://lore.kernel.org/patchwork/patch/1125789/
>> Signed-off-by: Yunsheng Lin 
>> Suggested-by: Michal Hocko 
> 
> The change makes sense to me. I wish this particular thing weren't
> duplicated so heavily - maybe we can unify all of them and use common
> code? In a separate patch, most likely...
> 
> I would also not change cpu_all_mask -> cpu_online_mask in this patch.
> That is worth a patch on its own with some explanation. I haven't
> checked but I would suspect that alpha simply doesn't support cpu
> hotplug so the two things are the same. But this needs some explanation.

In commit 44c36aed43b5 ("alpha: cpumask_of_node() should handle -1 as a node")
and commit d797396f3387 ("MIPS: cpumask_of_node() should handle -1 as a node")
the following is mentioned:
"pcibus_to_node can return -1 if we cannot determine which node a pci bus
is on. If passed -1, cpumask_of_node will negatively index the lookup array
and pull in random data"

As for the cpu hotplug process (take_cpu_down() -> __cpu_disable()):
alpha does not define the __cpu_disable() function, so it seems alpha does
not support HOTPLUG_CPU.
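
(For reference, a minimal illustration of the mask difference under
discussion -- an assumed sketch, not from the thread:)

	/* Suppose CPU 2 is taken down:
	 *   echo 0 > /sys/devices/system/cpu/cpu2/online
	 */
	cpumask_test_cpu(2, cpu_all_mask);    /* still 1: mask is static  */
	cpumask_test_cpu(2, cpu_online_mask); /* now 0: CPU 2 is offline  */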

> 
> Other than that the patch looks good to me. Feel free to add
> Acked-by: Michal Hocko 
> 
> [...]
>> diff --git a/arch/alpha/include/asm/topology.h 
>> b/arch/alpha/include/asm/topology.h
>> index 5a77a40..836c9e2 100644
>> --- a/arch/alpha/include/asm/topology.h
>> +++ b/arch/alpha/include/asm/topology.h
>> @@ -31,7 +31,7 @@ static const struct cpumask *cpumask_of_node(int node)
>>  int cpu;
>>  
>>  if (node == NUMA_NO_NODE)
>> -return cpu_all_mask;
>> +return cpu_online_mask;
>>  
>>  cpumask_clear(&node_to_cpumask_map[node]);
>>  
> [...]
> 
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index e6dad60..c676ffb 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -861,9 +861,12 @@ void numa_remove_cpu(int cpu)
>>   */
>>  const struct cpumask *cpumask_of_node(int node)
>>  {
>> -if (node >= nr_node_ids) {
>> +if (node == NUMA_NO_NODE)
>> +return cpu_online_mask;
>> +
>> +if ((unsigned int)node >= nr_node_ids) {
>>  printk(KERN_WARNING
>> -"cpumask_of_node(%d): node > nr_node_ids(%u)\n",
>> +"cpumask_of_node(%d): node >= nr_node_ids(%u)\n",
>>  node, nr_node_ids);
>>  dump_stack();
>>  return cpu_none_mask;
> 
> Why do we need this?

As the commit log says, the above cpumask_of_node() is for debugging;
it should catch the other "node < 0" cases besides NUMA_NO_NODE.

Thanks for reviewing.
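
The point about the cast can be seen in a two-line sketch (assumed values,
for illustration): casting to unsigned folds every negative node id, other
than the explicitly handled NUMA_NO_NODE, into the "too large" range, so
one comparison rejects both error classes.

	int node = -2;				/* bogus id, not NUMA_NO_NODE */

	if (node == NUMA_NO_NODE)
		return cpu_online_mask;		/* the one valid negative id  */
	if ((unsigned int)node >= nr_node_ids)	/* (unsigned int)-2 is huge   */
		return cpu_none_mask;		/* catches -2 as well as 1000 */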



Re: [PATCH v4] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-16 Thread Michal Hocko
On Sun 15-09-19 16:20:56, Yunsheng Lin wrote:
> When passing the return value of dev_to_node() to cpumask_of_node()
> without checking whether the device's node id is NUMA_NO_NODE, a
> global out-of-bounds access is detected by KASAN.
>
> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> which also means all cpus should be usable. So cpumask_of_node()
> should always return all online cpus when the caller passes
> NUMA_NO_NODE as the node id, just like the page allocator's handling
> of NUMA_NO_NODE.
>
> But we cannot really copy the page allocator logic, simply because the
> page allocator doesn't enforce the near node affinity: it just picks it
> up as a preferred node but is then free to fall back to any other numa
> node. This is not the case here, and node_to_cpumask_map will restrict
> only to the particular node's cpus, which would have really
> non-deterministic behavior depending on where the code is executed. So
> in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>
> Some arches were already NUMA_NO_NODE aware, so only change them to return
> cpu_online_mask and use NUMA_NO_NODE instead of "-1".
>
> Also there is a debugging version of node_to_cpumask_map() for x86 and
> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined; this
> patch changes it to handle NUMA_NO_NODE like the normal
> node_to_cpumask_map(), and "fixes" a sign "bug", since it is for debugging
> and should catch all the error cases.
> 
> [1] https://lore.kernel.org/patchwork/patch/1125789/
> Signed-off-by: Yunsheng Lin 
> Suggested-by: Michal Hocko 

The change makes sense to me. I wish this particular thing weren't
duplicated so heavily - maybe we can unify all of them and use common
code? In a separate patch, most likely...

I would also not change cpu_all_mask -> cpu_online_mask in this patch.
That is worth a patch on its own with some explanation. I haven't
checked but I would suspect that alpha simply doesn't support cpu
hotplug so the two things are the same. But this needs some explanation.

Other than that the patch looks good to me. Feel free to add
Acked-by: Michal Hocko 

[...]
> diff --git a/arch/alpha/include/asm/topology.h 
> b/arch/alpha/include/asm/topology.h
> index 5a77a40..836c9e2 100644
> --- a/arch/alpha/include/asm/topology.h
> +++ b/arch/alpha/include/asm/topology.h
> @@ -31,7 +31,7 @@ static const struct cpumask *cpumask_of_node(int node)
>   int cpu;
>  
>   if (node == NUMA_NO_NODE)
> - return cpu_all_mask;
> + return cpu_online_mask;
>  
>   cpumask_clear(&node_to_cpumask_map[node]);
>  
[...]

> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index e6dad60..c676ffb 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -861,9 +861,12 @@ void numa_remove_cpu(int cpu)
>   */
>  const struct cpumask *cpumask_of_node(int node)
>  {
> - if (node >= nr_node_ids) {
> + if (node == NUMA_NO_NODE)
> + return cpu_online_mask;
> +
> + if ((unsigned int)node >= nr_node_ids) {
>   printk(KERN_WARNING
> - "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
> + "cpumask_of_node(%d): node >= nr_node_ids(%u)\n",
>   node, nr_node_ids);
>   dump_stack();
>   return cpu_none_mask;

Why do we need this?
-- 
Michal Hocko
SUSE Labs
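
A sketch of the unification suggested above; nothing like this was posted
in the thread, and both the helper name and the arch_node_to_cpumask()
hook are assumptions. The idea is one shared definition instead of ten
per-arch copies of the same check:

	/* Hypothetical common helper, e.g. in include/linux/topology.h. */
	static inline const struct cpumask *cpumask_of_node_generic(int node)
	{
		if (node == NUMA_NO_NODE)
			return cpu_online_mask;
		if (WARN_ON((unsigned int)node >= nr_node_ids))
			return cpu_none_mask;
		return arch_node_to_cpumask(node);	/* assumed per-arch hook */
	}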


[PATCH v4] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-15 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking whether the device's node id is NUMA_NO_NODE, a
global out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, just like the page allocator's handling
of NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce the near node affinity: it just picks it
up as a preferred node but is then free to fall back to any other numa
node. This is not the case here, and node_to_cpumask_map will restrict
only to the particular node's cpus, which would have really
non-deterministic behavior depending on where the code is executed. So
in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

Some arches were already NUMA_NO_NODE aware, so only change them to return
cpu_online_mask and use NUMA_NO_NODE instead of "-1".

Also there is a debugging version of node_to_cpumask_map() for x86 and
arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined; this
patch changes it to handle NUMA_NO_NODE like the normal
node_to_cpumask_map(), and "fixes" a sign "bug", since it is for debugging
and should catch all the error cases.

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V4: Have all these changes in a single patch.
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for NUMA_NO_NODE case, and change the commit log to better justify
the change.
V2: make the node id checking change to other arches too.
---
 arch/alpha/include/asm/topology.h| 2 +-
 arch/arm64/include/asm/numa.h| 3 +++
 arch/arm64/mm/numa.c | 5 -
 arch/mips/include/asm/mach-ip27/topology.h   | 4 ++--
 arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
 arch/powerpc/include/asm/topology.h  | 6 +++---
 arch/s390/include/asm/topology.h | 3 +++
 arch/sparc/include/asm/topology_64.h | 6 +++---
 arch/x86/include/asm/topology.h  | 3 +++
 arch/x86/mm/numa.c   | 7 +--
 10 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/arch/alpha/include/asm/topology.h 
b/arch/alpha/include/asm/topology.h
index 5a77a40..836c9e2 100644
--- a/arch/alpha/include/asm/topology.h
+++ b/arch/alpha/include/asm/topology.h
@@ -31,7 +31,7 @@ static const struct cpumask *cpumask_of_node(int node)
int cpu;
 
if (node == NUMA_NO_NODE)
-   return cpu_all_mask;
+   return cpu_online_mask;
 
cpumask_clear(&node_to_cpumask_map[node]);
 
diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01..c8a4b31 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 4f241cc..bef4bdd 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -46,7 +46,10 @@ EXPORT_SYMBOL(node_to_cpumask_map);
  */
 const struct cpumask *cpumask_of_node(int node)
 {
-   if (WARN_ON(node >= nr_node_ids))
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
+   if (WARN_ON((unsigned int)node >= nr_node_ids))
return cpu_none_mask;
 
if (WARN_ON(node_to_cpumask_map[node] == NULL))
diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
b/arch/mips/include/asm/mach-ip27/topology.h
index 965f079..04505e6 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
 extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
 
 #define cpu_to_node(cpu)   (sn_cpu_info[(cpu)].p_nodeid)
-#define cpumask_of_node(node)  ((node) == -1 ? \
-cpu_all_mask : \
+#define cpumask_of_node(node)  ((node) == NUMA_NO_NODE ?   \
+cpu_online_mask :  \
 &hub_data(node)->h_cpus)
 struct pci_bus;
 extern int pcibus_to_node(struct pci_bus *);
diff --git a/arch/mips/include/asm/mach-loongson64/topology.h 
b/arch/mips/include/asm/mach-loongson64/topology.h
index 7ff819a..e78daa6 100644
--- a/arch/mips/include/asm/mach-loongson64/topology.h
+++ b/arch/mips/include/asm/mach-loongson64/topology.h
@@ -5,7 +5,9 @@
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu) 

Re: [PATCH v3 7/8] mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for mips

2019-09-15 Thread Yunsheng Lin
On 2019/9/15 14:46, Mike Rapoport wrote:
> On Sun, Sep 15, 2019 at 02:13:51PM +0800, Yunsheng Lin wrote:
>> On 2019/9/15 13:49, Mike Rapoport wrote:
>>> Hi,
>>>
>>> On Thu, Sep 12, 2019 at 06:15:33PM +0800, Yunsheng Lin wrote:
 When passing the return value of dev_to_node() to cpumask_of_node()
 without checking whether the node id is NUMA_NO_NODE, a global
 out-of-bounds access is detected by KASAN.

 From the discussion [1], NUMA_NO_NODE really means no node affinity,
 which also means all cpus should be usable. So cpumask_of_node()
 should always return all online cpus when the caller passes
 NUMA_NO_NODE as the node id, just like the page allocator's handling
 of NUMA_NO_NODE.

 But we cannot really copy the page allocator logic, simply because the
 page allocator doesn't enforce the near node affinity: it just picks it
 up as a preferred node but is then free to fall back to any other numa
 node. This is not the case here, and node_to_cpumask_map will restrict
 only to the particular node's cpus, which would have really
 non-deterministic behavior depending on where the code is executed. So
 in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

 Since this arch was already NUMA_NO_NODE aware, this patch only changes
 it to return cpu_online_mask and use NUMA_NO_NODE instead of "-1".

 [1] https://lore.kernel.org/patchwork/patch/1125789/
 Signed-off-by: Yunsheng Lin 
 Suggested-by: Michal Hocko 
 ---
 V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
 for NUMA_NO_NODE case, and change the commit log to better justify
 the change.
 ---
  arch/mips/include/asm/mach-ip27/topology.h | 4 ++--
>>>
>>> Nit: the subject says "mips:", but this patch only touches sgi-ip27 and
>>> loongson is updated as a separate patch. I don't see why both patches
>>> cannot be merged. Moreover, the whole set can be made as a single patch,
>>> IMHO.
>>
>> Thanks for reviewing.
>>
>> As this patchset touches a few files, which may have different maintainers,
>> I am not sure whether separate patches for each arch or a single patch
>> will make the merging process easier.
> 
> The set makes the same logical change to several definitions of
> cpumask_of_node(). It's appropriate to have all these changes in a single
> patch.

Ok, thanks.
Will have all these changes in a single patch.


>  
>> It can be made a single patch if that makes the merging process easier.
>>
>>>
  1 file changed, 2 insertions(+), 2 deletions(-)

 diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
 b/arch/mips/include/asm/mach-ip27/topology.h
 index 965f079..04505e6 100644
 --- a/arch/mips/include/asm/mach-ip27/topology.h
 +++ b/arch/mips/include/asm/mach-ip27/topology.h
 @@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
  extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
  
  #define cpu_to_node(cpu)  (sn_cpu_info[(cpu)].p_nodeid)
 -#define cpumask_of_node(node) ((node) == -1 ?	\
 -   cpu_all_mask : \
 +#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ?	\
 +   cpu_online_mask :  \
  &hub_data(node)->h_cpus)
  struct pci_bus;
  extern int pcibus_to_node(struct pci_bus *);
 -- 
 2.8.1

>>>
>>
> 



Re: [PATCH v3 7/8] mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for mips

2019-09-15 Thread Mike Rapoport
On Sun, Sep 15, 2019 at 02:13:51PM +0800, Yunsheng Lin wrote:
> On 2019/9/15 13:49, Mike Rapoport wrote:
> > Hi,
> > 
> > On Thu, Sep 12, 2019 at 06:15:33PM +0800, Yunsheng Lin wrote:
> >> When passing the return value of dev_to_node() to cpumask_of_node()
> >> without checking the node id if the node id is NUMA_NO_NODE, there is
> >> global-out-of-bounds detected by KASAN.
> >>
> >> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> >> which also means all cpus should be usable. So the cpumask_of_node()
> >> should always return all cpus online when user passes the node id
> >> as NUMA_NO_NODE, just like similar semantic that page allocator handles
> >> NUMA_NO_NODE.
> >>
> >> But we cannot really copy the page allocator logic. Simply because the
> >> page allocator doesn't enforce the near node affinity. It just picks it
> >> up as a preferred node but then it is free to fallback to any other numa
> >> node. This is not the case here and node_to_cpumask_map will only restrict
> >> to the particular node's cpus which would have really non deterministic
> >> behavior depending on where the code is executed. So in fact we really
> >> want to return cpu_online_mask for NUMA_NO_NODE.
> >>
> >> Since this arch was already NUMA_NO_NODE aware, this patch only changes
> >> it to return cpu_online_mask and use NUMA_NO_NODE instead of "-1".
> >>
> >> [1] https://lore.kernel.org/patchwork/patch/1125789/
> >> Signed-off-by: Yunsheng Lin 
> >> Suggested-by: Michal Hocko 
> >> ---
> >> V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
> >> for NUMA_NO_NODE case, and change the commit log to better justify
> >> the change.
> >> ---
> >>  arch/mips/include/asm/mach-ip27/topology.h | 4 ++--
> > 
> > Nit: the subject says "mips:", but this patch only touches sgi-ip27 and
> > loongson is updated as a separate patch. I don't see why both patches
> > cannot be merged. Moreover, the whole set can be made as a single patch,
> > IMHO.
> 
> Thanks for reviewing.
> 
> As this patchset touches a few files, which may have different maintainers,
> I am not sure whether separate patches for each arch or a single patch
> will make the merging process easier.

The set makes the same logical change to several definitions of
cpumask_of_node(). It's appropriate to have all these changes in a single
patch.
 
> It can be made a single patch if that makes the merging process easier.
> 
> > 
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
> >> b/arch/mips/include/asm/mach-ip27/topology.h
> >> index 965f079..04505e6 100644
> >> --- a/arch/mips/include/asm/mach-ip27/topology.h
> >> +++ b/arch/mips/include/asm/mach-ip27/topology.h
> >> @@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
> >>  extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
> >>  
> >>  #define cpu_to_node(cpu)  (sn_cpu_info[(cpu)].p_nodeid)
> >> -#define cpumask_of_node(node) ((node) == -1 ?	\
> >> -   cpu_all_mask : \
> >> +#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ?	\
> >> +   cpu_online_mask :  \
> >> &hub_data(node)->h_cpus)
> >>  struct pci_bus;
> >>  extern int pcibus_to_node(struct pci_bus *);
> >> -- 
> >> 2.8.1
> >>
> > 
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v3 7/8] mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for mips

2019-09-15 Thread Yunsheng Lin
On 2019/9/15 13:49, Mike Rapoport wrote:
> Hi,
> 
> On Thu, Sep 12, 2019 at 06:15:33PM +0800, Yunsheng Lin wrote:
>> When passing the return value of dev_to_node() to cpumask_of_node()
>> without checking whether the node id is NUMA_NO_NODE, a global
>> out-of-bounds access is detected by KASAN.
>>
>> From the discussion [1], NUMA_NO_NODE really means no node affinity,
>> which also means all cpus should be usable. So cpumask_of_node()
>> should always return all online cpus when the caller passes
>> NUMA_NO_NODE as the node id, just like the page allocator's handling
>> of NUMA_NO_NODE.
>>
>> But we cannot really copy the page allocator logic, simply because the
>> page allocator doesn't enforce the near node affinity: it just picks it
>> up as a preferred node but is then free to fall back to any other numa
>> node. This is not the case here, and node_to_cpumask_map will restrict
>> only to the particular node's cpus, which would have really
>> non-deterministic behavior depending on where the code is executed. So
>> in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>>
>> Since this arch was already NUMA_NO_NODE aware, this patch only changes
>> it to return cpu_online_mask and use NUMA_NO_NODE instead of "-1".
>>
>> [1] https://lore.kernel.org/patchwork/patch/1125789/
>> Signed-off-by: Yunsheng Lin 
>> Suggested-by: Michal Hocko 
>> ---
>> V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
>> for NUMA_NO_NODE case, and change the commit log to better justify
>> the change.
>> ---
>>  arch/mips/include/asm/mach-ip27/topology.h | 4 ++--
> 
> Nit: the subject says "mips:", but this patch only touches sgi-ip27 and
> loongson is updated as a separate patch. I don't see why both patches
> cannot be merged. Moreover, the whole set can be made as a single patch,
> IMHO.

Thanks for reviewing.

As this patchset touches a few files, which may have different maintainers,
I am not sure whether separate patches for each arch or a single patch will
make the merging process easier.

It can be made a single patch if that makes the merging process easier.

> 
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
>> b/arch/mips/include/asm/mach-ip27/topology.h
>> index 965f079..04505e6 100644
>> --- a/arch/mips/include/asm/mach-ip27/topology.h
>> +++ b/arch/mips/include/asm/mach-ip27/topology.h
>> @@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
>>  extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
>>  
>>  #define cpu_to_node(cpu)	(sn_cpu_info[(cpu)].p_nodeid)
>> -#define cpumask_of_node(node)   ((node) == -1 ?	\
>> - cpu_all_mask : \
>> +#define cpumask_of_node(node)   ((node) == NUMA_NO_NODE ?	\
>> + cpu_online_mask :  \
>>   &hub_data(node)->h_cpus)
>>  struct pci_bus;
>>  extern int pcibus_to_node(struct pci_bus *);
>> -- 
>> 2.8.1
>>
> 



Re: [PATCH v3 7/8] mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for mips

2019-09-14 Thread Mike Rapoport
Hi,

On Thu, Sep 12, 2019 at 06:15:33PM +0800, Yunsheng Lin wrote:
> When passing the return value of dev_to_node() to cpumask_of_node()
> without checking whether the node id is NUMA_NO_NODE, a global
> out-of-bounds access is detected by KASAN.
>
> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> which also means all cpus should be usable. So cpumask_of_node()
> should always return all online cpus when the caller passes
> NUMA_NO_NODE as the node id, just like the page allocator's handling
> of NUMA_NO_NODE.
>
> But we cannot really copy the page allocator logic, simply because the
> page allocator doesn't enforce the near node affinity: it just picks it
> up as a preferred node but is then free to fall back to any other numa
> node. This is not the case here, and node_to_cpumask_map will restrict
> only to the particular node's cpus, which would have really
> non-deterministic behavior depending on where the code is executed. So
> in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>
> Since this arch was already NUMA_NO_NODE aware, this patch only changes
> it to return cpu_online_mask and use NUMA_NO_NODE instead of "-1".
> 
> [1] https://lore.kernel.org/patchwork/patch/1125789/
> Signed-off-by: Yunsheng Lin 
> Suggested-by: Michal Hocko 
> ---
> V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
> for NUMA_NO_NODE case, and change the commit log to better justify
> the change.
> ---
>  arch/mips/include/asm/mach-ip27/topology.h | 4 ++--

Nit: the subject says "mips:", but this patch only touches sgi-ip27 and
loongson is updated as a separate patch. I don't see why both patches
cannot be merged. Moreover, the whole set can be made as a single patch,
IMHO.

>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
> b/arch/mips/include/asm/mach-ip27/topology.h
> index 965f079..04505e6 100644
> --- a/arch/mips/include/asm/mach-ip27/topology.h
> +++ b/arch/mips/include/asm/mach-ip27/topology.h
> @@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
>  extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
>  
>  #define cpu_to_node(cpu) (sn_cpu_info[(cpu)].p_nodeid)
> -#define cpumask_of_node(node)	((node) == -1 ?	\
> -  cpu_all_mask : \
> +#define cpumask_of_node(node)	((node) == NUMA_NO_NODE ?	\
> +  cpu_online_mask :  \
> &hub_data(node)->h_cpus)
>  struct pci_bus;
>  extern int pcibus_to_node(struct pci_bus *);
> -- 
> 2.8.1
> 

-- 
Sincerely yours,
Mike.



[PATCH v3 0/8] make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-12 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, a global
out-of-bounds access is detected by KASAN:

[   42.970381] ==================================================================
[   42.977595] BUG: KASAN: global-out-of-bounds in __bitmap_weight+0x48/0xb0
[   42.984370] Read of size 8 at addr 20008cdf8790 by task kworker/0:1/13
[   42.991230]
[   42.992712] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G   O  5.2.0-rc4-g8bde06a-dirty #3
[   43.001830] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B050.01 08/08/2019
[   43.011298] Workqueue: events work_for_cpu_fn
[   43.015643] Call trace:
[   43.018078]  dump_backtrace+0x0/0x1e8
[   43.021727]  show_stack+0x14/0x20
[   43.025031]  dump_stack+0xc4/0xfc
[   43.028335]  print_address_description+0x178/0x270
[   43.033113]  __kasan_report+0x164/0x1b8
[   43.036936]  kasan_report+0xc/0x18
[   43.040325]  __asan_load8+0x84/0xa8
[   43.043801]  __bitmap_weight+0x48/0xb0
[   43.047552]  hclge_init_ae_dev+0x988/0x1e78 [hclge]
[   43.052418]  hnae3_register_ae_dev+0xcc/0x278 [hnae3]
[   43.057467]  hns3_probe+0xe0/0x120 [hns3]
[   43.061464]  local_pci_probe+0x74/0xf0
[   43.065200]  work_for_cpu_fn+0x2c/0x48
[   43.068937]  process_one_work+0x3c0/0x878
[   43.072934]  worker_thread+0x400/0x670
[   43.076670]  kthread+0x1b0/0x1b8
[   43.079885]  ret_from_fork+0x10/0x18
[   43.083446]
[   43.084925] The buggy address belongs to the variable:
[   43.090052]  numa_distance+0x30/0x40
[   43.093613]
[   43.095091] Memory state around the buggy address:
[   43.099870]  20008cdf8680: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 fa fa
[   43.107078]  20008cdf8700: fa fa fa fa 04 fa fa fa fa fa fa fa 00 fa fa fa
[   43.114286] >20008cdf8780: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
[   43.121494]  ^
[   43.125230]  20008cdf8800: 01 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa
[   43.132439]  20008cdf8880: fa fa fa fa fa fa fa fa 00 00 fa fa fa fa fa fa
[   43.139646] ==================================================================

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, just like the page allocator's handling
of NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce the near node affinity: it just picks it
up as a preferred node but is then free to fall back to any other numa
node. This is not the case here, and node_to_cpumask_map will restrict
only to the particular node's cpus, which would have really
non-deterministic behavior depending on where the code is executed. So
in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

Note:
1. Only arm64 has been compile-tested and tested on a real board.
2. x86 has been compile-tested with defconfig.
3. The other arches have not been compile-tested or tested on real boards.

Change log:
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for the NUMA_NO_NODE case, and change the commit log to better justify
the change; drop the sh arch change since it always returns cpu_online_mask.

Yunsheng Lin (8):
  arm64: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for arm64
  x86: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for x86
  alpha: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for alpha
  powerpc: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for
powerpc
  s390: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for s390
  sparc64: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for
sparc64
  mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for mips
  mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for
loongson64

 arch/alpha/include/asm/topology.h| 2 +-
 arch/arm64/include/asm/numa.h| 3 +++
 arch/arm64/mm/numa.c | 5 -
 arch/mips/include/asm/mach-ip27/topology.h   | 4 ++--
 arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
 arch/powerpc/include/asm/topology.h  | 4 ++--
 arch/s390/include/asm/topology.h | 3 +++
 arch/sparc/include/asm/topology_64.h | 4 ++--
 arch/x86/include/asm/topology.h  | 3 +++
 arch/x86/mm/numa.c   | 7 +--
 10 files changed, 28 insertions(+), 11 deletions(-)

-- 
2.8.1
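
Connecting the splat to the fix: the failing path in the report boils down
to something like the fragment below (reconstructed for illustration; the
exact statements in hclge_init_ae_dev() are an assumption). With node == -1
the per-node lookup reads just outside its global array, which KASAN
attributes to the neighbouring symbol numa_distance.

	int node = dev_to_node(&pdev->dev);	/* NUMA_NO_NODE on this system */
	/* cpumask_weight() -> __bitmap_weight(), reading the entry at
	 * node_to_cpumask_map[-1] before the fix. */
	unsigned int n = cpumask_weight(cpumask_of_node(node));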



[PATCH v3 8/8] mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for loongson64

2019-09-12 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, a global
out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, just like the page allocator's handling
of NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce the near node affinity: it just picks it
up as a preferred node but is then free to fall back to any other numa
node. This is not the case here, and node_to_cpumask_map will restrict
only to the particular node's cpus, which would have really
non-deterministic behavior depending on where the code is executed. So
in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for NUMA_NO_NODE case, and change the commit log to better justify
the change.
---
 arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/mips/include/asm/mach-loongson64/topology.h 
b/arch/mips/include/asm/mach-loongson64/topology.h
index 7ff819a..2207e2e 100644
--- a/arch/mips/include/asm/mach-loongson64/topology.h
+++ b/arch/mips/include/asm/mach-loongson64/topology.h
@@ -5,7 +5,9 @@
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu)   (cpu_logical_map(cpu) >> 2)
-#define cpumask_of_node(node)  (&__node_data[(node)]->cpumask)
+#define cpumask_of_node(node)  ((node) == NUMA_NO_NODE ?   \
+   cpu_online_mask :   \
+   (&__node_data[(node)]->cpumask)
 
 struct pci_bus;
 extern int pcibus_to_node(struct pci_bus *);
-- 
2.8.1



[PATCH v3 4/8] powerpc: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for powerpc

2019-09-12 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, a global
out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, just like the page allocator's handling
of NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce the near node affinity: it just picks it
up as a preferred node but is then free to fall back to any other numa
node. This is not the case here, and node_to_cpumask_map will restrict
only to the particular node's cpus, which would have really
non-deterministic behavior depending on where the code is executed. So
in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

Since this arch was already NUMA_NO_NODE aware, this patch only changes
it to return cpu_online_mask and use NUMA_NO_NODE instead of "-1".

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for NUMA_NO_NODE case, and change the commit log to better justify
the change.
---
 arch/powerpc/include/asm/topology.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 2f7e1ea..107f5cd 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -17,8 +17,8 @@ struct device_node;
 
 #include <asm/mmzone.h>
 
-#define cpumask_of_node(node) ((node) == -1 ?  \
-  cpu_all_mask :   \
+#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ?	\
+  cpu_online_mask :\
   node_to_cpumask_map[node])
 
 struct pci_bus;
-- 
2.8.1



[PATCH v3 7/8] mips: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for mips

2019-09-12 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, a global
out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, just like the page allocator's handling
of NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce the near node affinity: it just picks it
up as a preferred node but is then free to fall back to any other numa
node. This is not the case here, and node_to_cpumask_map will restrict
only to the particular node's cpus, which would have really
non-deterministic behavior depending on where the code is executed. So
in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

Since this arch was already NUMA_NO_NODE aware, this patch only changes
it to return cpu_online_mask and use NUMA_NO_NODE instead of "-1".

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for NUMA_NO_NODE case, and change the commit log to better justify
the change.
---
 arch/mips/include/asm/mach-ip27/topology.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/mips/include/asm/mach-ip27/topology.h 
b/arch/mips/include/asm/mach-ip27/topology.h
index 965f079..04505e6 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -15,8 +15,8 @@ struct cpuinfo_ip27 {
 extern struct cpuinfo_ip27 sn_cpu_info[NR_CPUS];
 
 #define cpu_to_node(cpu)   (sn_cpu_info[(cpu)].p_nodeid)
-#define cpumask_of_node(node)  ((node) == -1 ? \
-cpu_all_mask : \
+#define cpumask_of_node(node)  ((node) == NUMA_NO_NODE ?   \
+cpu_online_mask :  \
 &hub_data(node)->h_cpus)
 struct pci_bus;
 extern int pcibus_to_node(struct pci_bus *);
-- 
2.8.1



[PATCH v3 5/8] s390: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for s390

2019-09-12 Thread Yunsheng Lin
When passing the return value of dev_to_node() to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, a global
out-of-bounds access is detected by KASAN.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, just like the page allocator's handling
of NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce the near node affinity: it just picks it
up as a preferred node but is then free to fall back to any other numa
node. This is not the case here, and node_to_cpumask_map will restrict
only to the particular node's cpus, which would have really
non-deterministic behavior depending on where the code is executed. So
in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, return cpu_online_mask for the
NUMA_NO_NODE case, and reword the commit log to better justify the
change.
---
 arch/s390/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index cca406f..1bd2e73 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -78,6 +78,9 @@ static inline int cpu_to_node(int cpu)
 #define cpumask_of_node cpumask_of_node
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
	return &node_to_cpumask_map[node];
 }
 
-- 
2.8.1



[PATCH v3 6/8] sparc64: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for sparc64

2019-09-12 Thread Yunsheng Lin
When the return value of dev_to_node() is passed to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, KASAN detects a
global out-of-bounds access.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node() should
always return all online cpus when the caller passes NUMA_NO_NODE as
the node id, similar to how the page allocator handles NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce node affinity: it only takes the node as
a preference and is free to fall back to any other NUMA node. That is
not the case here; node_to_cpumask_map would restrict callers to the
particular node's cpus, which would give nondeterministic behavior
depending on where the code is executed. So in fact we really want to
return cpu_online_mask for NUMA_NO_NODE.

Since this arch was already NUMA_NO_NODE aware, this patch only changes
it to return cpu_online_mask and use NUMA_NO_NODE instead of "-1".

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, return cpu_online_mask for the
NUMA_NO_NODE case, and reword the commit log to better justify the
change.
---
 arch/sparc/include/asm/topology_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 34c628a..34f9240 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -11,8 +11,8 @@ static inline int cpu_to_node(int cpu)
return numa_cpu_lookup_table[cpu];
 }
 
-#define cpumask_of_node(node) ((node) == -1 ?  \
-  cpu_all_mask :   \
+#define cpumask_of_node(node) ((node) == NUMA_NO_NODE ?  \
+  cpu_online_mask :\
   &numa_cpumask_lookup_table[node])
 
 struct pci_bus;
-- 
2.8.1



[PATCH v3 2/8] x86: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for x86

2019-09-12 Thread Yunsheng Lin
When the return value of dev_to_node() is passed to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, KASAN detects a
global out-of-bounds access.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node() should
always return all online cpus when the caller passes NUMA_NO_NODE as
the node id, similar to how the page allocator handles NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce node affinity: it only takes the node as
a preference and is free to fall back to any other NUMA node. That is
not the case here; node_to_cpumask_map would restrict callers to the
particular node's cpus, which would give nondeterministic behavior
depending on where the code is executed. So in fact we really want to
return cpu_online_mask for NUMA_NO_NODE.

There is also a debugging version of node_to_cpumask_map(), used only
when CONFIG_DEBUG_PER_CPU_MAPS is defined; this patch changes it to
handle NUMA_NO_NODE like the normal node_to_cpumask_map(). It also
"fixes" a sign "bug", since the function is for debugging and should
catch all error cases.
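
As a hedged illustration (not taken from the patch) of why the unsigned
cast works: it folds the negative-node check into the range check, so
any bogus negative id other than NUMA_NO_NODE is rejected as well:

	int node = -2;				/* hypothetical bad node id */

	if ((unsigned int)node >= nr_node_ids)	/* -2 wraps to UINT_MAX - 1 */
		return cpu_none_mask;		/* caught by the range check */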

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, return cpu_online_mask for the
NUMA_NO_NODE case, and reword the commit log to better justify the
change.
---
 arch/x86/include/asm/topology.h | 3 +++
 arch/x86/mm/numa.c              | 7 +++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d23..7fa82e1 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -69,6 +69,9 @@ extern const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..c676ffb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -861,9 +861,12 @@ void numa_remove_cpu(int cpu)
  */
 const struct cpumask *cpumask_of_node(int node)
 {
-   if (node >= nr_node_ids) {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
+   if ((unsigned int)node >= nr_node_ids) {
printk(KERN_WARNING
-   "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
+   "cpumask_of_node(%d): node >= nr_node_ids(%u)\n",
node, nr_node_ids);
dump_stack();
return cpu_none_mask;
-- 
2.8.1



[PATCH v3 1/8] arm64: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for arm64

2019-09-12 Thread Yunsheng Lin
When the return value of dev_to_node() is passed to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, KASAN detects a
global out-of-bounds access.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node() should
always return all online cpus when the caller passes NUMA_NO_NODE as
the node id, similar to how the page allocator handles NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce node affinity: it only takes the node as
a preference and is free to fall back to any other NUMA node. That is
not the case here; node_to_cpumask_map would restrict callers to the
particular node's cpus, which would give nondeterministic behavior
depending on where the code is executed. So in fact we really want to
return cpu_online_mask for NUMA_NO_NODE.

There is also a debugging version of node_to_cpumask_map(), used only
when CONFIG_DEBUG_PER_CPU_MAPS is defined; this patch changes it to
handle NUMA_NO_NODE like the normal node_to_cpumask_map(). It also
"fixes" a sign "bug", since the function is for debugging and should
catch all error cases.

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, return cpu_online_mask for the
NUMA_NO_NODE case, and reword the commit log to better justify the
change.
---
 arch/arm64/include/asm/numa.h | 3 +++
 arch/arm64/mm/numa.c          | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01..c8a4b31 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 4f241cc..bef4bdd 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -46,7 +46,10 @@ EXPORT_SYMBOL(node_to_cpumask_map);
  */
 const struct cpumask *cpumask_of_node(int node)
 {
-   if (WARN_ON(node >= nr_node_ids))
+   if (node == NUMA_NO_NODE)
+   return cpu_online_mask;
+
+   if (WARN_ON((unsigned int)node >= nr_node_ids))
return cpu_none_mask;
 
if (WARN_ON(node_to_cpumask_map[node] == NULL))
-- 
2.8.1



[PATCH v3 3/8] alpha: numa: make node_to_cpumask_map() NUMA_NO_NODE aware for alpha

2019-09-12 Thread Yunsheng Lin
When the return value of dev_to_node() is passed to cpumask_of_node()
without checking whether the node id is NUMA_NO_NODE, KASAN detects a
global out-of-bounds access.

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node() should
always return all online cpus when the caller passes NUMA_NO_NODE as
the node id, similar to how the page allocator handles NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because the
page allocator doesn't enforce node affinity: it only takes the node as
a preference and is free to fall back to any other NUMA node. That is
not the case here; node_to_cpumask_map would restrict callers to the
particular node's cpus, which would give nondeterministic behavior
depending on where the code is executed. So in fact we really want to
return cpu_online_mask for NUMA_NO_NODE.

Since this arch was already NUMA_NO_NODE aware, this patch only changes
it to return cpu_online_mask.

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
---
V3: Change to only handle NUMA_NO_NODE, return cpu_online_mask for the
NUMA_NO_NODE case, and reword the commit log to better justify the
change.
---
 arch/alpha/include/asm/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/alpha/include/asm/topology.h b/arch/alpha/include/asm/topology.h
index 5a77a40..836c9e2 100644
--- a/arch/alpha/include/asm/topology.h
+++ b/arch/alpha/include/asm/topology.h
@@ -31,7 +31,7 @@ static const struct cpumask *cpumask_of_node(int node)
int cpu;
 
if (node == NUMA_NO_NODE)
-   return cpu_all_mask;
+   return cpu_online_mask;
 
	cpumask_clear(&node_to_cpumask_map[node]);
 
-- 
2.8.1



Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-11 Thread Aleksa Sarai
On 2019-09-05, Peter Zijlstra  wrote:
> On Thu, Sep 05, 2019 at 11:43:05AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > > On 2019-09-05, Peter Zijlstra  wrote:
> > > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > > +/**
> > > > > + * copy_struct_to_user: copy a struct to user space
> > > > > + * @dst:   Destination address, in user space.
> > > > > + * @usize: Size of @dst struct.
> > > > > + * @src:   Source address, in kernel space.
> > > > > + * @ksize: Size of @src struct.
> > > > > + *
> > > > > + * Copies a struct from kernel space to user space, in a way that 
> > > > > guarantees
> > > > > + * backwards-compatibility for struct syscall arguments (as long as 
> > > > > future
> > > > > + * struct extensions are made such that all new fields are 
> > > > > *appended* to the
> > > > > + * old struct, and zeroed-out new fields have the same meaning as 
> > > > > the old
> > > > > + * struct).
> > > > > + *
> > > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by 
> > > > > user space.
> > > > > + * The recommended usage is something like the following:
> > > > > + *
> > > > > + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, 
> > > > > usize)
> > > > > + *   {
> > > > > + *  int err;
> > > > > + *  struct foo karg = {};
> > > > > + *
> > > > > + *  // do something with karg
> > > > > + *
> > > > > + *  err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > > + *  if (err)
> > > > > + *return err;
> > > > > + *
> > > > > + *  // ...
> > > > > + *   }
> > > > > + *
> > > > > + * There are three cases to consider:
> > > > > + *  * If @usize == @ksize, then it's copied verbatim.
> > > > > + *  * If @usize < @ksize, then kernel space is "returning" a newer 
> > > > > struct to an
> > > > > + *older user space. In order to avoid user space getting 
> > > > > incomplete
> > > > > + *information (new fields might be important), all trailing 
> > > > > bytes in @src
> > > > > + *(@ksize - @usize) must be zerored
> > > > 
> > > > s/zerored/zero/, right?
> > > 
> > > It should've been "zeroed".
> > 
> > That reads wrong to me; that way it reads like this function must take
> > that action and zero out the 'rest'; which is just wrong.
> > 
> > This function must verify those bytes are zero, not make them zero.
> > 
> > > > >  , otherwise -EFBIG is 
> > > > > returned.
> > > > 
> > > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> > > 
> > > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > > struct passed to user-space. I would personally have preferred EMSGSIZE
> > > instead of EFBIG, but felt using the existing error codes would be less
> > > confusing.
> > 
> > Sadly a recent commit:
> > 
> >   1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify 
> > sched_read_attr() ABI logic and code")
> > 
> > Made the situation even 'worse'.
> 
> And thinking more about things; I'm not convinced the above patch is
> actually right.
> 
> Do we really want to simply truncate all the attributes of the task?
> 
> And should we not at least set sched_flags when there are non-default
> clamp values applied?
> 
> See; that is I think the primary bug that had chrt failing; we tried to
> publish the default clamp values as !0.

I just saw this patch in -rc8 -- should I even attempt to port
sched_getattr(2) to copy_struct_to_user()? I agree that publishing a
default non-zero value is a mistake -- once you do that, old user space
will either get confused or lose information.
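
For reference, my rough reading of the @usize < @ksize semantics
discussed above, as a sketch (not the final implementation) -- the
helper verifies that the trailing bytes are zero rather than zeroing
them:

	if (usize < ksize) {
		/* Refuse to silently drop non-zero (newer) fields. */
		if (memchr_inv(src + usize, 0, ksize - usize))
			return -EFBIG;
	}
	if (copy_to_user(dst, src, min(usize, ksize)))
		return -EFAULT;
	return 0;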

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-10 Thread Ingo Molnar


* Linus Torvalds  wrote:

> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski  wrote:
> >
> > Linus, you rejected resolveat() because you wanted a *nice* API
> 
> No. I rejected resolveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
> 
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
> 
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.

I suspect there is a "add a clean new flags namespace" analogy to the 
classic "add a clean new standard" XKCD:

https://xkcd.com/927/

Thanks,

Ingo

