Re: [PATCH 17/17] arch: rename copy_thread_tls() back to copy_thread()

2020-06-26 Thread Greentime Hu
Christian Brauner wrote on Tue, Jun 23, 2020 at 7:58 AM:
>
> Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
> back to simply copy_thread(). It's a simpler name, and doesn't imply that only
> tls is copied here. This finishes an outstanding chunk of internal process
> creation work since we've added clone3().
>
> Cc: Richard Henderson 
> Cc: Ivan Kokshaysky 
> Cc: Matt Turner 
> Cc: Vineet Gupta 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Salter 
> Cc: Aurelien Jacquiot 
> Cc: Guo Ren 
> Cc: Yoshinori Sato 
> Cc: Brian Cain 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: Geert Uytterhoeven 
> Cc: Michal Simek 
> Cc: Thomas Bogendoerfer 
> Cc: Nick Hu 
> Cc: Greentime Hu 
> Cc: Vincent Chen 
> Cc: Ley Foon Tan 
> Cc: Jonas Bonn 
> Cc: Stefan Kristiansson 
> Cc: Stafford Horne 
> Cc: "James E.J. Bottomley" 
> Cc: Helge Deller 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Paul Walmsley 
> Cc: Palmer Dabbelt 
> Cc: Albert Ou 
> Cc: Christian Borntraeger 
> Cc: Rich Felker 
> Cc: "David S. Miller" 
> Cc: Jeff Dike 
> Cc: Richard Weinberger 
> Cc: Anton Ivanov 
> Cc: Guan Xuetao 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: x...@kernel.org
> Cc: "H. Peter Anvin" 
> Cc: Chris Zankel 
> Cc: Max Filippov 
> Cc: Kees Cook 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: Al Viro 
> Cc: linux-al...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c6x-...@linux-c6x.org
> Cc: linux-c...@vger.kernel.org
> Cc: uclinux-h8-de...@lists.sourceforge.jp
> Cc: linux-hexa...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-m...@vger.kernel.org
> Cc: openr...@lists.librecores.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@lists.infradead.org
> Cc: linux-xte...@linux-xtensa.org
> Signed-off-by: Christian Brauner 
> ---
>  arch/alpha/kernel/process.c  | 2 +-
>  arch/arc/kernel/process.c| 2 +-
>  arch/arm/kernel/process.c| 2 +-
>  arch/arm64/kernel/process.c  | 2 +-
>  arch/c6x/kernel/process.c| 2 +-
>  arch/csky/kernel/process.c   | 2 +-
>  arch/h8300/kernel/process.c  | 2 +-
>  arch/hexagon/kernel/process.c| 2 +-
>  arch/ia64/kernel/process.c   | 4 ++--
>  arch/m68k/kernel/process.c   | 2 +-
>  arch/microblaze/kernel/process.c | 2 +-
>  arch/mips/kernel/process.c   | 2 +-
>  arch/nds32/kernel/process.c  | 2 +-
>  arch/nios2/kernel/process.c  | 2 +-
>  arch/openrisc/kernel/process.c   | 4 ++--
>  arch/parisc/kernel/process.c | 2 +-
>  arch/powerpc/kernel/process.c| 2 +-
>  arch/riscv/kernel/process.c  | 2 +-
>  arch/s390/kernel/process.c   | 2 +-
>  arch/sh/kernel/process_32.c  | 2 +-
>  arch/sparc/kernel/process.c  | 6 +++---
>  arch/sparc/kernel/process_32.c   | 2 +-
>  arch/sparc/kernel/process_64.c   | 2 +-
>  arch/um/kernel/process.c | 2 +-
>  arch/unicore32/kernel/process.c  | 2 +-
>  arch/x86/kernel/process.c| 2 +-
>  arch/x86/kernel/unwind_frame.c   | 2 +-
>  arch/xtensa/kernel/process.c | 2 +-
>  include/linux/sched/task.h   | 2 +-
>  kernel/fork.c| 2 +-
>  30 files changed, 34 insertions(+), 34 deletions(-)
>
> [...]
> diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
> index 7dbb1bf64165..1020e2c6dcd8 100644
> --- a/arch/nds32/kernel/process.c
> +++ b/arch/nds32/kernel/process.c
> @@ -149,7 +149,7 @@ void flush_thread(void)
>  DEFINE_PER_CPU(struct task_struct *, __entry_task);
>
>  asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
> -int copy_thread_tls(unsigned long clone_flags, unsigned long stack_start,
> +int copy_thread(unsigned long clone_flags, unsigned long stack_start,
> unsigned long stk_sz, struct task_struct *p,
> unsigned long tls)
>  {

Hi Christian, Thank you.
Acked-by: Greentime Hu 
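
For reference, the prototype every architecture implements once this patch
lands, assembled from the hunks above (a sketch; parameter names follow the
nds32 hunk and vary slightly per architecture):

    int copy_thread(unsigned long clone_flags, unsigned long stack_start,
                    unsigned long stk_sz, struct task_struct *p,
                    unsigned long tls);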


Re: [PATCH 16/17] arch: remove HAVE_COPY_THREAD_TLS

2020-06-26 Thread Greentime Hu
Christian Brauner wrote on Tue, Jun 23, 2020 at 7:46 AM:
>
> All architectures support copy_thread_tls() now, so remove the legacy
> copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
> uses the same process creation calling convention based on
> copy_thread_tls() and struct kernel_clone_args. This will make it easier to
> maintain the core process creation code under kernel/, simplifies the
> callpaths and makes them identical for all architectures.
>
> Cc: Richard Henderson 
> Cc: Ivan Kokshaysky 
> Cc: Matt Turner 
> Cc: Vineet Gupta 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Salter 
> Cc: Aurelien Jacquiot 
> Cc: Guo Ren 
> Cc: Yoshinori Sato 
> Cc: Brian Cain 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: Geert Uytterhoeven 
> Cc: Michal Simek 
> Cc: Thomas Bogendoerfer 
> Cc: Nick Hu 
> Cc: Greentime Hu 
> Cc: Vincent Chen 
> Cc: Ley Foon Tan 
> Cc: Jonas Bonn 
> Cc: Stefan Kristiansson 
> Cc: Stafford Horne 
> Cc: "James E.J. Bottomley" 
> Cc: Helge Deller 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Paul Walmsley 
> Cc: Palmer Dabbelt 
> Cc: Albert Ou 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> Cc: Rich Felker 
> Cc: "David S. Miller" 
> Cc: Jeff Dike 
> Cc: Richard Weinberger 
> Cc: Anton Ivanov 
> Cc: Guan Xuetao 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: x...@kernel.org
> Cc: Peter Zijlstra 
> Cc: Kees Cook 
> Cc: Mike Rapoport 
> Cc: "Matthew Wilcox
> Cc: Al Viro 
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-al...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c6x-...@linux-c6x.org
> Cc: linux-c...@vger.kernel.org
> Cc: uclinux-h8-de...@lists.sourceforge.jp
> Cc: linux-hexa...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-m...@vger.kernel.org
> Cc: openr...@lists.librecores.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@lists.infradead.org
> Cc: linux-xte...@linux-xtensa.org
> Signed-off-by: Christian Brauner 
> [...]
> diff --git a/arch/nds32/Kconfig b/arch/nds32/Kconfig
> index 7b6eaca81cce..e30298e99e1b 100644
> --- a/arch/nds32/Kconfig
> +++ b/arch/nds32/Kconfig
> @@ -48,7 +48,6 @@ config NDS32
> select HAVE_FUNCTION_GRAPH_TRACER
> select HAVE_FTRACE_MCOUNT_RECORD
> select HAVE_DYNAMIC_FTRACE
> -   select HAVE_COPY_THREAD_TLS
> help
>   Andes(nds32) Linux support.

Hi Christian, Thank you.
Acked-by: Greentime Hu 
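
For context, the single calling convention the commit message refers to,
roughly as it appears in kernel/fork.c at this point in the series (a
simplified sketch; surrounding error handling omitted):

    /* copy_process(): every architecture is now called the same way,
     * with the TLS argument taken from struct kernel_clone_args */
    retval = copy_thread_tls(clone_flags, args->stack, args->stack_size,
                             p, args->tls);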


Re: [PATCH v5 06/12] arch: xtensa: add linker section for KUnit test suites

2020-06-26 Thread Kees Cook
On Fri, Jun 26, 2020 at 02:09:11PM -0700, Brendan Higgins wrote:
> Add a linker section to xtensa where KUnit can put references to its
> test suites. This patch is an early step in transitioning to dispatching
> all KUnit tests from a centralized executor rather than having each as
> its own separate late_initcall.
> 
> Signed-off-by: Brendan Higgins 
> ---
>  arch/xtensa/kernel/vmlinux.lds.S | 4 

If you ever find yourself modifying multiple arch linker scripts for a
series, something has gone wrong. ;)

-- 
Kees Cook


Re: [PATCH v5 01/12] vmlinux.lds.h: add linker section for KUnit test suites

2020-06-26 Thread Kees Cook
On Fri, Jun 26, 2020 at 02:09:06PM -0700, Brendan Higgins wrote:
> Add a linker section where KUnit can put references to its test suites.
> This patch is the first step in transitioning to dispatching all KUnit
> tests from a centralized executor rather than having each as its own
> separate late_initcall.
> 
> Co-developed-by: Iurii Zaikin 
> Signed-off-by: Iurii Zaikin 
> Signed-off-by: Brendan Higgins 
> Reviewed-by: Stephen Boyd 
> ---
>  include/asm-generic/vmlinux.lds.h | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index db600ef218d7d..4f9b036fc9616 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -881,6 +881,13 @@
>   KEEP(*(.con_initcall.init)) \
>   __con_initcall_end = .;
>  
> +/* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */

Nit on naming:

> +#define KUNIT_TEST_SUITES\

I would call this KUNIT_TABLE to maintain the same names as other things
of this nature.

> + . = ALIGN(8);   \
> + __kunit_suites_start = .;   \
> + KEEP(*(.kunit_test_suites)) \
> + __kunit_suites_end = .;
> +
>  #ifdef CONFIG_BLK_DEV_INITRD
>  #define INIT_RAM_FS  \
>   . = ALIGN(4);   \
> @@ -1056,6 +1063,7 @@
>   INIT_CALLS  \
>   CON_INITCALL\
>   INIT_RAM_FS \
> + KUNIT_TEST_SUITES   \
>   }

Nack: this must be in INIT_DATA, not in INIT_DATA_SECTION. Not all
architectures use the INIT_DATA_SECTION macro (e.g. arm64), but everything
uses INIT_DATA.
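
Putting the two review comments together, the suggested shape would be
roughly the following (a sketch of the reviewer's proposal, not the final
patch):

    /* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */
    #define KUNIT_TABLE()                                          \
                    . = ALIGN(8);                                  \
                    __kunit_suites_start = .;                      \
                    KEEP(*(.kunit_test_suites))                    \
                    __kunit_suites_end = .;

with KUNIT_TABLE() then expanded from INIT_DATA rather than from the
INIT_DATA_SECTION macro, so that arm64 and other architectures that build
their own init sections pick it up as well.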


-- 
Kees Cook


[PATCH 11/11] ppc64/kexec_file: add appropriate regions for memory reserve map

2020-06-26 Thread Hari Bathini
While the initrd, elfcorehdr and backup regions are already added to
the reserve map, there are a few more regions that need to be added to
the memory reserve map. Add them here. And now that all the changes to
load the panic kernel are in place, claim kdump support.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kexec/file_load_64.c |   61 ++---
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 58fc2d8..813453d 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -185,6 +185,38 @@ static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
 }
 
 /**
+ * get_reserved_memory_ranges - Get reserved memory ranges. This list includes
+ *  memory regions that should be added to the
+ *  memory reserve map to ensure the region is
+ *  protected from any mischief.
+ * @mem_ranges: Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_reserved_memory_ranges(struct crash_mem **mem_ranges)
+{
+   int ret;
+
+   ret = add_rtas_mem_range(mem_ranges, false);
+   if (ret)
+   goto out;
+
+   ret = add_opal_mem_range(mem_ranges, false);
+   if (ret)
+   goto out;
+
+   ret = add_tce_mem_ranges(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_reserved_ranges(mem_ranges);
+out:
+   if (ret)
+   pr_err("Failed to setup reserved memory ranges\n");
+   return ret;
+}
+
+/**
  * __locate_mem_hole_ppc64 - Tests if the memory hole between buf_min & buf_max
  *   is large enough for the buffer. If true, sets
  *   kbuf->mem to the buffer.
@@ -1182,8 +1214,8 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
unsigned long initrd_load_addr,
unsigned long initrd_len, const char *cmdline)
 {
-   struct crash_mem *umem = NULL;
-   int chosen_node, ret;
+   struct crash_mem *umem = NULL, *rmem = NULL;
+   int i, chosen_node, ret;
 
/* Remove memory reservation for the current device tree. */
ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
@@ -1229,6 +1261,24 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
}
}
 
+   /* Update memory reserve map */
+   ret = get_reserved_memory_ranges(&rmem);
+   if (ret)
+   goto out;
+
+   for (i = 0; i < rmem->nr_ranges; i++) {
+   u64 base, size;
+
+   base = rmem->ranges[i].start;
+   size = rmem->ranges[i].end - base + 1;
+   ret = fdt_add_mem_rsv(fdt, base, size);
+   if (ret) {
+   pr_err("Error updating memory reserve map: %s\n",
+  fdt_strerror(ret));
+   goto out;
+   }
+   }
+
ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len,
cmdline, &chosen_node);
if (ret)
@@ -1239,6 +1289,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
pr_err("Failed to update device-tree with linux,booted-from-kexec\n");
 out:
kfree(umem);
+   kfree(rmem);
return ret;
 }
 
@@ -1378,10 +1429,10 @@ int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
 
/* Get exclude memory ranges needed for setting up kdump segments */
ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
-   if (ret)
+   if (ret) {
pr_err("Failed to setup exclude memory ranges for 
buffer lookup\n");
-   /* Return this until all changes for panic kernel are in */
-   return -EOPNOTSUPP;
+   return ret;
+   }
}
 
return kexec_image_probe_default(image, buf, buf_len);



[PATCH 09/11] ppc64/kexec_file: setup backup region for kdump kernel

2020-06-26 Thread Hari Bathini
Though the kdump kernel boots from the loaded address, the first 64K
bytes of it are copied down to real 0. So, set up a backup region to
copy the first 64K bytes of the crashed kernel, in purgatory, before
booting into the kdump kernel. Also, update the reserve map with the
backup region and the crashed kernel's memory, to prevent the kdump
kernel from accidentally using that memory.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/crashdump-ppc64.h |5 +
 arch/powerpc/include/asm/kexec.h   |7 ++
 arch/powerpc/kexec/elf_64.c|9 +++
 arch/powerpc/kexec/file_load_64.c  |   96 
 arch/powerpc/purgatory/Makefile|   28 
 arch/powerpc/purgatory/purgatory_64.c  |   35 ++
 arch/powerpc/purgatory/trampoline_64.S |   23 +--
 7 files changed, 195 insertions(+), 8 deletions(-)
 create mode 100644 arch/powerpc/purgatory/purgatory_64.c

diff --git a/arch/powerpc/include/asm/crashdump-ppc64.h b/arch/powerpc/include/asm/crashdump-ppc64.h
index 3596c25..504a579 100644
--- a/arch/powerpc/include/asm/crashdump-ppc64.h
+++ b/arch/powerpc/include/asm/crashdump-ppc64.h
@@ -2,6 +2,11 @@
 #ifndef _ARCH_POWERPC_KEXEC_CRASHDUMP_PPC64_H
 #define _ARCH_POWERPC_KEXEC_CRASHDUMP_PPC64_H
 
+/* Backup region - first 64K bytes of System RAM. */
+#define BACKUP_SRC_START   0
+#define BACKUP_SRC_END 0xffff
+#define BACKUP_SRC_SIZE(BACKUP_SRC_END - BACKUP_SRC_START + 1)
+
 /* min & max addresses for kdump load segments */
 #define KDUMP_BUF_MIN  (crashk_res.start)
 #define KDUMP_BUF_MAX  ((crashk_res.end < ppc64_rma_size) ? \
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index e78cd0a..037cf2b 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -109,6 +109,9 @@ extern const struct kexec_file_ops kexec_elf64_ops;
 struct kimage_arch {
struct crash_mem *exclude_ranges;
 
+   unsigned long backup_start;
+   void *backup_buf;
+
 #ifdef CONFIG_IMA_KEXEC
phys_addr_t ima_buffer_addr;
size_t ima_buffer_size;
@@ -124,6 +127,10 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 
 #ifdef CONFIG_PPC64
+struct kexec_buf;
+
+int load_crashdump_segments_ppc64(struct kimage *image,
+ struct kexec_buf *kbuf);
 int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
  const void *fdt, unsigned long kernel_load_addr,
  unsigned long fdt_load_addr);
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index c695f94..4838b42 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -67,6 +67,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 
pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
+   /* Setup additional segments needed for panic kernel */
+   if (image->type == KEXEC_TYPE_CRASH) {
+   ret = load_crashdump_segments_ppc64(image, &kbuf);
+   if (ret) {
+   pr_err("Failed to load kdump kernel segments\n");
+   goto out;
+   }
+   }
+
if (initrd != NULL) {
kbuf.buffer = initrd;
kbuf.bufsz = kbuf.memsz = initrd_len;
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 8e66c28..d7d3841 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -863,6 +864,70 @@ static int kexec_do_relocs_ppc64(unsigned long my_r2, const Elf_Sym *sym,
 }
 
 /**
+ * load_backup_segment - Initialize backup segment of crashing kernel.
+ * @image:   Kexec image.
+ * @kbuf:Buffer contents and memory parameters.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int load_backup_segment(struct kimage *image, struct kexec_buf *kbuf)
+{
+   void *buf;
+   int ret;
+
+   /* Setup a segment for backup region */
+   buf = vzalloc(BACKUP_SRC_SIZE);
+   if (!buf)
+   return -ENOMEM;
+
+   /*
+* A source buffer has no meaning for backup region as data will
+* be copied from backup source, after crash, in the purgatory.
+* But as load segment code doesn't recognize such segments,
+* setup a dummy source buffer to keep it happy for now.
+*/
+   kbuf->buffer = buf;
+   kbuf->mem = KEXEC_BUF_MEM_UNKNOWN;
+   kbuf->bufsz = kbuf->memsz = BACKUP_SRC_SIZE;
+   kbuf->top_down = false;
+
+   ret = kexec_add_buffer(kbuf);
+   if (ret) {
+   vfree(buf);
+   return ret;
+   }
+
+   image->arch.backup_buf = buf;
+   image->arch.backup_start = kbuf->mem;
+   return 0;
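
(The rest of this message is truncated in the archive.) For context, a
minimal sketch of the purgatory-side copy the commit message describes;
the __memcpy helper and the exact structure are assumptions, not taken
from the truncated diff:

    extern unsigned long backup_start;      /* patched in by setup code */

    void purgatory(void)
    {
            /* save the first 64K of the crashed kernel before the kdump
             * kernel is entered and real 0 is clobbered */
            if (backup_start)
                    __memcpy((void *)backup_start, (void *)BACKUP_SRC_START,
                             BACKUP_SRC_SIZE);
    }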

[PATCH 08/11] ppc64/kexec_file: setup the stack for purgatory

2020-06-26 Thread Hari Bathini
To avoid any weird errors, the purgatory should run with its own
stack. Set one up by adding a stack buffer to the .data section of
the purgatory.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/kexec.h   |4 
 arch/powerpc/kexec/file_load_64.c  |   14 +-
 arch/powerpc/purgatory/trampoline_64.S |   15 +++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index bf47a01..e78cd0a 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -45,6 +45,10 @@
 #define KEXEC_ARCH KEXEC_ARCH_PPC
 #endif
 
+#ifdef CONFIG_KEXEC_FILE
+#define KEXEC_PURGATORY_STACK_SIZE 16384   /* 16KB stack size */
+#endif
+
 #define KEXEC_STATE_NONE 0
 #define KEXEC_STATE_IRQS_OFF 1
 #define KEXEC_STATE_REAL_MODE 2
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 4430336..8e66c28 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -878,7 +878,8 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
  const void *fdt, unsigned long kernel_load_addr,
  unsigned long fdt_load_addr)
 {
-   uint64_t toc_ptr;
+   uint64_t toc_ptr, stack_top;
+   void *stack_buf;
int ret;
 
ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
@@ -901,6 +902,17 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
goto out;
}
 
+   /* Setup the stack top */
+   stack_buf = kexec_purgatory_get_symbol_addr(image, "stack_buf");
+   if (!stack_buf)
+   goto out;
+
+   stack_top = (u64)stack_buf + KEXEC_PURGATORY_STACK_SIZE;
+   ret = kexec_purgatory_get_set_symbol(image, "stack", _top,
+sizeof(stack_top), false);
+   if (ret)
+   goto out;
+
/* Setup the TOC pointer */
toc_ptr = get_toc_ptr(image->purgatory_info.ehdr);
ret = kexec_purgatory_get_set_symbol(image, "my_toc", _ptr,
diff --git a/arch/powerpc/purgatory/trampoline_64.S b/arch/powerpc/purgatory/trampoline_64.S
index 7b4a5f7..80615b4 100644
--- a/arch/powerpc/purgatory/trampoline_64.S
+++ b/arch/powerpc/purgatory/trampoline_64.S
@@ -9,6 +9,7 @@
  * Copyright (C) 2013, Anton Blanchard, IBM Corporation
  */
 
+#include 
 #include 
 
.machine ppc64
@@ -53,6 +54,8 @@ master:
 
ld  %r2,(my_toc - 0b)(%r18) /* setup toc */
 
+   ld  %r1,(stack - 0b)(%r18)  /* setup stack */
+
/* load device-tree address */
ld  %r3, (dt_offset - 0b)(%r18)
mr  %r16,%r3/* save dt address in reg16 */
@@ -111,6 +114,12 @@ my_toc:
.8byte  0x0
.size my_toc, . - my_toc
 
+   .balign 8
+   .globl stack
+stack:
+   .8byte  0x0
+   .size stack, . - stack
+
.data
.balign 8
 .globl purgatory_sha256_digest
@@ -123,3 +132,9 @@ purgatory_sha256_digest:
 purgatory_sha_regions:
.skip   8 * 2 * 16
.size purgatory_sha_regions, . - purgatory_sha_regions
+
+   .balign 8
+.globl stack_buf
+stack_buf:
+   .skip   KEXEC_PURGATORY_STACK_SIZE
+   .size stack_buf, . - stack_buf



[PATCH 07/11] ppc64/kexec_file: add support to relocate purgatory

2020-06-26 Thread Hari Bathini
Right now, the purgatory implementation is only minimal. But if the
purgatory code is to be enhanced to copy memory to the backup region
and verify the sha256 digest, relocations may have to be applied to
the purgatory. So, add support to relocate the purgatory in the
kexec_file_load system call by setting up the TOC pointer and applying
RELA relocations as needed.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kexec/file_load_64.c  |  338 
 arch/powerpc/purgatory/trampoline_64.S |8 +
 2 files changed, 346 insertions(+)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index d85cba4d..4430336 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -626,6 +627,242 @@ static int update_usable_mem_fdt(void *fdt, struct crash_mem *usable_mem)
 }
 
 /**
+ * get_toc_section - Look for ".toc" symbol and return the corresponding section
+ * @ehdr:ELF header.
+ *
+ * Returns TOC section on success, NULL otherwise.
+ */
+static const Elf_Shdr *get_toc_section(const Elf_Ehdr *ehdr)
+{
+   const Elf_Shdr *sechdrs;
+   const char *secstrings;
+   int i;
+
+   if (!ehdr) {
+   pr_err("Purgatory elf load info missing?\n");
+   return NULL;
+   }
+
+   sechdrs = (void *)ehdr + ehdr->e_shoff;
+   secstrings = (void *)ehdr + sechdrs[ehdr->e_shstrndx].sh_offset;
+
+   for (i = 0; i < ehdr->e_shnum; i++) {
+   if ((sechdrs[i].sh_size != 0) &&
+   (strcmp(secstrings + sechdrs[i].sh_name, ".toc") == 0)) {
+   /* Return the ".toc" section */
+   pr_debug("TOC section number is %d\n", i);
+   return &sechdrs[i];
+   }
+   }
+
+   return NULL;
+}
+
+/**
+ * get_toc_ptr - r2 is the TOC pointer: it points 0x8000 into the TOC
+ * @ehdr:ELF header.
+ *
+ * Returns r2 on success, 0 otherwise.
+ */
+static unsigned long get_toc_ptr(const Elf_Ehdr *ehdr)
+{
+   const Elf_Shdr *sechdr;
+
+   sechdr = get_toc_section(ehdr);
+   if (!sechdr) {
+   pr_err("Could not get the TOC section!\n");
+   return 0;
+   }
+
+   return sechdr->sh_addr + 0x8000;
+}
+
+/* Helper functions to apply relocations */
+static int do_relative_toc(unsigned long val, uint16_t *loc,
+  unsigned long mask, int complain_signed)
+{
+   if (complain_signed && (val + 0x8000 > 0xffff)) {
+   pr_err("TOC16 relocation overflows (%lu)\n", val);
+   return -ENOEXEC;
+   }
+
+   if ((~mask & 0xffff) & val) {
+   pr_err("Bad TOC16 relocation (%lu)\n", val);
+   return -ENOEXEC;
+   }
+
+   *loc = (*loc & ~mask) | (val & mask);
+   return 0;
+}
+#ifdef PPC64_ELF_ABI_v2
+/* PowerPC64 specific values for the Elf64_Sym st_other field.  */
+#define STO_PPC64_LOCAL_BIT5
+#define STO_PPC64_LOCAL_MASK   (7 << STO_PPC64_LOCAL_BIT)
+#define PPC64_LOCAL_ENTRY_OFFSET(other)\
+   (((1 << (((other) & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT)) \
+>> 2) << 2)
+
+static unsigned int local_entry_offset(const Elf64_Sym *sym)
+{
+   /* If this symbol has a local entry point, use it. */
+   return PPC64_LOCAL_ENTRY_OFFSET(sym->st_other);
+}
+#else
+static unsigned int local_entry_offset(const Elf64_Sym *sym)
+{
+   return 0;
+}
+#endif
+
+/**
+ * kexec_do_relocs_ppc64 - Apply relocations based on relocation type.
+ * @my_r2: TOC pointer.
+ * @sym:   Symbol to relocate.
+ * @r_type:Relocation type.
+ * @loc:   Location to modify.
+ * @val:   Relocated symbol value.
+ * @addr:  Final location after relocation.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int kexec_do_relocs_ppc64(unsigned long my_r2, const Elf_Sym *sym,
+int r_type, void *loc, unsigned long val,
+unsigned long addr)
+{
+   int ret = 0;
+
+   switch (r_type) {
+   case R_PPC64_ADDR32:
+   /* Simply set it */
+   *(uint32_t *)loc = val;
+   break;
+
+   case R_PPC64_ADDR64:
+   /* Simply set it */
+   *(uint64_t *)loc = val;
+   break;
+
+   case R_PPC64_REL64:
+   *(uint64_t *)loc = val - (uint64_t)loc;
+   break;
+
+   case R_PPC64_REL32:
+   /* Convert value to relative */
+   val -= (unsigned long)loc;
+   if (val + 0x8000 > 0xffff) {
+   pr_err("REL32 %li out of range!\n", val);
+   return -ENOEXEC;
+   }
+
+   *(uint32_t *)loc = val;
+   break;
+
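
(The remainder of the switch statement is truncated in the archive.) For
illustration, a TOC16-class relocation would funnel through the
do_relative_toc() helper defined above, measuring the distance from the
TOC pointer; a sketch modeled on the kernel's module_64.c, not quoted
from the truncated diff:

    case R_PPC64_TOC16:
            ret = do_relative_toc(val - my_r2, (uint16_t *)loc, 0xffff, 1);
            break;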

[PATCH 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel

2020-06-26 Thread Hari Bathini
The kdump kernel, used for capturing the kernel core image, is supposed
to use only specific memory regions, to avoid corrupting the image to
be captured. The regions are the crashkernel range - the memory reserved
explicitly for the kdump kernel, the memory used for the tce-table, the
OPAL region and the RTAS region, as applicable. Restrict the kdump
kernel's memory to these regions by setting up the usable-memory DT
property. Also, tell the kdump kernel to run at the loaded address by
setting the magic word at 0x5c.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kexec/file_load_64.c |  400 +
 1 file changed, 398 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index f1d7160..d85cba4d 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -17,9 +17,21 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 
+struct umem_info {
+   uint64_t *buf; /* data buffer for usable-memory property */
+   uint32_t idx;  /* current index */
+   uint32_t size; /* size allocated for the data buffer */
+
+   /* usable memory ranges to look up */
+   const struct crash_mem *umrngs;
+};
+
 const struct kexec_file_ops * const kexec_file_loaders[] = {
&kexec_elf64_ops,
NULL
@@ -75,6 +87,38 @@ static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
 }
 
 /**
+ * get_usable_memory_ranges - Get usable memory ranges. This list includes
+ *regions like crashkernel, opal/rtas & tce-table,
+ *that kdump kernel could use.
+ * @mem_ranges:   Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_usable_memory_ranges(struct crash_mem **mem_ranges)
+{
+   int ret;
+
+   /* First memory block & crashkernel region */
+   ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1);
+   if (ret)
+   goto out;
+
+   ret = add_rtas_mem_range(mem_ranges, false);
+   if (ret)
+   goto out;
+
+   ret = add_opal_mem_range(mem_ranges, false);
+   if (ret)
+   goto out;
+
+   ret = add_tce_mem_ranges(mem_ranges);
+out:
+   if (ret)
+   pr_err("Failed to setup usable memory ranges\n");
+   return ret;
+}
+
+/**
  * __locate_mem_hole_ppc64 - Tests if the memory hole between buf_min & buf_max
  *   is large enough for the buffer. If true, sets
  *   kbuf->mem to the buffer.
@@ -267,6 +311,321 @@ static int kexec_locate_mem_hole_ppc64(struct kexec_buf *kbuf)
 }
 
 /**
+ * check_realloc_usable_mem - Reallocate buffer if it can't accommodate entries
+ * @um_info:  Usable memory buffer and ranges info.
+ * @cnt:  No. of entries to accommodate.
+ *
+ * Returns the (possibly reallocated) buffer on success, NULL on error.
+ */
+static uint64_t *check_realloc_usable_mem(struct umem_info *um_info, int cnt)
+{
+   void *tbuf;
+
+   if (um_info->size >=
+   ((um_info->idx + cnt) * sizeof(*(um_info->buf))))
+   return um_info->buf;
+
+   um_info->size += MEM_RANGE_CHUNK_SZ;
+   tbuf = krealloc(um_info->buf, um_info->size, GFP_KERNEL);
+   if (!tbuf) {
+   um_info->size -= MEM_RANGE_CHUNK_SZ;
+   return NULL;
+   }
+
+   memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);
+   return tbuf;
+}
+
+/**
+ * add_usable_mem - Add the usable memory ranges within the given memory range
+ *  to the buffer
+ * @um_info:Usable memory buffer and ranges info.
+ * @base:   Base address of memory range to look for.
+ * @end:End address of memory range to look for.
+ * @cnt:No. of usable memory ranges added to buffer.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int add_usable_mem(struct umem_info *um_info, uint64_t base,
+ uint64_t end, int *cnt)
+{
+   uint64_t loc_base, loc_end, *buf;
+   const struct crash_mem *umrngs;
+   int i, add;
+
+   *cnt = 0;
+   umrngs = um_info->umrngs;
+   for (i = 0; i < umrngs->nr_ranges; i++) {
+   add = 0;
+   loc_base = umrngs->ranges[i].start;
+   loc_end = umrngs->ranges[i].end;
+   if (loc_base >= base && loc_end <= end)
+   add = 1;
+   else if (base < loc_end && end > loc_base) {
+   if (loc_base < base)
+   loc_base = base;
+   if (loc_end > end)
+   loc_end = end;
+   add = 1;
+   }
+
+   if (add) {
+   buf = check_realloc_usable_mem(um_info, 2);
+   if (!buf)
+   return -ENOMEM;
+
+
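
(The message is truncated here.) The continuation presumably appends the
clamped (base, size) pair to the buffer reserved just above; a sketch,
with the endianness conversion an assumption:

            um_info->buf = buf;
            buf[um_info->idx++] = cpu_to_be64(loc_base);
            buf[um_info->idx++] = cpu_to_be64(loc_end - loc_base + 1);
            *cnt += 2;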

[PATCH 05/11] powerpc/drmem: make lmb walk a bit more flexible

2020-06-26 Thread Hari Bathini
Currently, numa & prom are the users of the drmem LMB walk code. Loading
kdump with kexec_file also needs to walk the drmem LMBs to set up the
usable memory ranges for the kdump kernel. But there are a couple of
issues in using the code as is. One, walk_drmem_lmbs() is built into the
.init section currently, while kexec_file needs it later. Two, there is
no scope to pass data to the callback function for processing and/or
erroring out on certain conditions.

Fix that by moving the drmem LMB walk code out of the .init section,
adding scope to pass data to the callback function and bailing out when
an error is encountered in the callback function.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/drmem.h |9 ++--
 arch/powerpc/kernel/prom.c   |   13 +++---
 arch/powerpc/mm/drmem.c  |   87 +-
 arch/powerpc/mm/numa.c   |   13 +++---
 4 files changed, 78 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209..17ccc64 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -90,13 +90,14 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
 }
 
 u64 drmem_lmb_memory_max(void);
-void __init walk_drmem_lmbs(struct device_node *dn,
-   void (*func)(struct drmem_lmb *, const __be32 **));
+int walk_drmem_lmbs(struct device_node *dn, void *data,
+   int (*func)(struct drmem_lmb *, const __be32 **, void *));
 int drmem_update_dt(void);
 
 #ifdef CONFIG_PPC_PSERIES
-void __init walk_drmem_lmbs_early(unsigned long node,
-   void (*func)(struct drmem_lmb *, const __be32 **));
+int __init
+walk_drmem_lmbs_early(unsigned long node, void *data,
+ int (*func)(struct drmem_lmb *, const __be32 **, void *));
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f2..7df78de 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -468,8 +468,9 @@ static bool validate_mem_limit(u64 base, u64 *size)
  * This contains a list of memory blocks along with NUMA affinity
  * information.
  */
-static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
-   const __be32 **usm)
+static int  __init early_init_drmem_lmb(struct drmem_lmb *lmb,
+   const __be32 **usm,
+   void *data)
 {
u64 base, size;
int is_kexec_kdump = 0, rngs;
@@ -484,7 +485,7 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
 */
if ((lmb->flags & DRCONF_MEM_RESERVED) ||
!(lmb->flags & DRCONF_MEM_ASSIGNED))
-   return;
+   return 0;
 
if (*usm)
is_kexec_kdump = 1;
@@ -499,7 +500,7 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
 */
rngs = dt_mem_next_cell(dt_root_size_cells, usm);
if (!rngs) /* there are no (base, size) duple */
-   return;
+   return 0;
}
 
do {
@@ -524,6 +525,8 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
if (lmb->flags & DRCONF_MEM_HOTREMOVABLE)
memblock_mark_hotplug(base, size);
} while (--rngs);
+
+   return 0;
 }
 #endif /* CONFIG_PPC_PSERIES */
 
@@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned long node,
 #ifdef CONFIG_PPC_PSERIES
if (depth == 1 &&
strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
-   walk_drmem_lmbs_early(node, early_init_drmem_lmb);
+   walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
return 0;
}
 #endif
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327ce..b2eeea3 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -14,6 +14,8 @@
 #include 
 #include 
 
+static int n_root_addr_cells, n_root_size_cells;
+
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
@@ -189,12 +191,13 @@ int drmem_update_dt(void)
return rc;
 }
 
-static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
+static void read_drconf_v1_cell(struct drmem_lmb *lmb,
   const __be32 **prop)
 {
const __be32 *p = *prop;
 
-   lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, &p);
+   lmb->base_addr = of_read_number(p, n_root_addr_cells);
+   p += n_root_addr_cells;
lmb->drc_index = of_read_number(p++, 1);
 
p++; /* skip reserved field */
@@ -205,29 +208,33 @@ static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
*prop = p;
 }
 
-static void __init __walk_drmem_v1_lmbs(const __be32 
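
(Truncated in the archive.) For context, this is the kind of non-__init
caller the new signature enables; the callback and variable names below
are illustrative assumptions, not taken from the series:

    static int kdump_setup_usable_lmb(struct drmem_lmb *lmb,
                                      const __be32 **usm, void *data)
    {
            struct umem_info *um_info = data;

            /* record/filter the LMB; a negative return stops the walk */
            return 0;
    }

    /* later, e.g. from kexec_file code: */
    ret = walk_drmem_lmbs(dn, &um_info, kdump_setup_usable_lmb);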

[PATCH 00/11] ppc64: enable kdump support for kexec_file_load syscall

2020-06-26 Thread Hari Bathini
This patch series enables kdump support for the kexec_file_load system
call (kexec -s -p) on PPC64. The changes are inspired by kexec-tools
code but heavily modified for kernel consumption. There is scope to
expand the purgatory to verify the sha digest, but I tried to keep
purgatory changes minimal in the interest of this series.

The first patch adds a weak arch_kexec_add_buffer function to override
the locate-memory-hole logic to suit arch needs. There are some special
regions on ppc64 which should be avoided while loading the buffer, and
there are multiple callers of kexec_add_buffer, making it too complicated
to maintain range sanity and use the generic lookup at the same time.

The second patch marks ppc64-specific code within arch/powerpc/kexec
and arch/powerpc/purgatory to make the subsequent code changes easier
to understand.

The next patch adds helper functions to set up the different memory
ranges needed for loading the kdump kernel, booting into it and
exporting the crashing kernel's elfcore.

The fourth patch overrides arch_kexec_add_buffer to locate memory
hole for kdump segments by accounting for the special memory regions,
referred to as excluded memory ranges, and calls __kexec_add_buffer
with kbuf->mem set to skip the generic locate memory hole lookup.

The fifth patch moves walk_drmem_lmbs() out of .init section with
a few changes to reuse it for setting up kdump kernel's usable memory
ranges. The next patch uses walk_drmem_lmbs() to look up the LMBs
and set linux,drconf-usable-memory & linux,usable-memory properties
in order to restrict kdump kernel's memory usage.

The seventh patch adds relocation support for the purgatory. Patch 8
sets up the stack for the purgatory. The next patch sets up the
backup region as a segment while loading the kdump kernel and teaches
the purgatory to copy it from source to destination.

Patch 10 builds the elfcore header for the running kernel & passes
the info to kdump kernel via "elfcorehdr=" parameter to export as
/proc/vmcore file. The last patch sets up the memory reserve map
for the kexec kernel and also claims kdump support as all the
necessary changes are added.

Tested the changes successfully on P8, P9 lpars & an OpenPOWER box.
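
For readers of the cover letter, a rough sketch of the arrangement
patches 1 and 4 set up, as described above (simplified; the actual
function bodies in the series differ):

    /* kernel/kexec_file.c: weak default preserves today's behaviour */
    int __weak arch_kexec_add_buffer(struct kexec_buf *kbuf)
    {
            return __kexec_add_buffer(kbuf);
    }

    /* arch/powerpc/kexec/file_load_64.c: find the hole while avoiding
     * the excluded ranges, then skip the generic lookup */
    int arch_kexec_add_buffer(struct kexec_buf *kbuf)
    {
            int ret;

            ret = kexec_locate_mem_hole_ppc64(kbuf);   /* sets kbuf->mem */
            if (ret)
                    return ret;

            return __kexec_add_buffer(kbuf);
    }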

---

Hari Bathini (11):
  kexec_file: allow archs to handle special regions while locating memory 
hole
  powerpc/kexec_file: mark PPC64 specific code
  powerpc/kexec_file: add helper functions for getting memory ranges
  ppc64/kexec_file: avoid stomping memory used by special regions
  powerpc/drmem: make lmb walk a bit more flexible
  ppc64/kexec_file: restrict memory usage of kdump kernel
  ppc64/kexec_file: add support to relocate purgatory
  ppc64/kexec_file: setup the stack for purgatory
  ppc64/kexec_file: setup backup region for kdump kernel
  ppc64/kexec_file: prepare elfcore header for crashing kernel
  ppc64/kexec_file: add appropriate regions for memory reserve map


 arch/powerpc/include/asm/crashdump-ppc64.h |   15 
 arch/powerpc/include/asm/drmem.h   |9 
 arch/powerpc/include/asm/kexec.h   |   35 +
 arch/powerpc/include/asm/kexec_ranges.h|   18 
 arch/powerpc/kernel/prom.c |   13 
 arch/powerpc/kexec/Makefile|2 
 arch/powerpc/kexec/elf_64.c|   35 +
 arch/powerpc/kexec/file_load.c |   78 +
 arch/powerpc/kexec/file_load_64.c  | 1461 
 arch/powerpc/kexec/ranges.c|  387 +++
 arch/powerpc/mm/drmem.c|   87 +-
 arch/powerpc/mm/numa.c |   13 
 arch/powerpc/purgatory/Makefile|   28 -
 arch/powerpc/purgatory/purgatory_64.c  |   35 +
 arch/powerpc/purgatory/trampoline.S|  117 --
 arch/powerpc/purgatory/trampoline_64.S |  153 +++
 include/linux/kexec.h  |5 
 kernel/kexec_file.c|   37 +
 18 files changed, 2327 insertions(+), 201 deletions(-)
 create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h
 create mode 100644 arch/powerpc/include/asm/kexec_ranges.h
 create mode 100644 arch/powerpc/kexec/file_load_64.c
 create mode 100644 arch/powerpc/kexec/ranges.c
 create mode 100644 arch/powerpc/purgatory/purgatory_64.c
 delete mode 100644 arch/powerpc/purgatory/trampoline.S
 create mode 100644 arch/powerpc/purgatory/trampoline_64.S



Re: [PATCH 0/8 v2] PCI: Align return values of PCIe capability and PCI accessors

2020-06-26 Thread Bjorn Helgaas
On Mon, Jun 15, 2020 at 09:32:17AM +0200, refactormys...@gmail.com wrote:
> From: Bolarinwa Olayemi Saheed 
> 
> 
> PATCH 1/8 to 7/8:
> PCIBIOS_ error codes have positive values and they are passed down the
> call hierarchy from accessors. For functions which are meant to return
> only a negative value on failure, passing on this value is a bug.
> To mitigate this, call pcibios_err_to_errno() before passing on the
> return value from the PCIe capability accessors' call hierarchy. This
> function converts any positive PCIBIOS_ error codes to negative generic
> error values.
> 
> PATCH 8/8:
> The PCIe capability accessors can return 0, -EINVAL, or any PCIBIOS_ error
> code. The PCI accessors, on the other hand, can only return 0 or a PCIBIOS_
> error code. This inconsistency among these accessors makes it harder for
> callers to check for errors.
> Return PCIBIOS_BAD_REGISTER_NUMBER instead of -EINVAL in all PCIe
> capability accessors.
> 
> MERGING:
> These may all be merged via the PCI tree, since it is a collection of
> similar fixes. This way they all get merged at once.
> 
> Version 2:
> * cc to maintainers and mailing lists
> * Edit the Subject to conform with previous style
> * reorder "Signed by" and "Suggested by"
> * made spelling corrections
> * fixed redundant initialisation in PATCH 3/8
> * include missing call to pcibios_err_to_errno() in PATCH 6/8 and 7/8
> 
> 
> Bolarinwa Olayemi Saheed (8):
>   dmaengine: ioatdma: Convert PCIBIOS_* errors to generic -E* errors
>   IB/hfi1: Convert PCIBIOS_* errors to generic -E* errors
>   IB/hfi1: Convert PCIBIOS_* errors to generic -E* errors
>   PCI: Convert PCIBIOS_* errors to generic -E* errors
>   scsi: smartpqi: Convert PCIBIOS_* errors to generic -E* errors
>   PCI/AER: Convert PCIBIOS_* errors to generic -E* errors
>   PCI/AER: Convert PCIBIOS_* errors to generic -E* errors
>   PCI: Align return values of PCIe capability and PCI accessors
> 
>  drivers/dma/ioat/init.c   |  4 ++--
>  drivers/infiniband/hw/hfi1/pcie.c | 18 +-
>  drivers/pci/access.c  |  8 
>  drivers/pci/pci.c | 10 --
>  drivers/pci/pcie/aer.c| 12 ++--
>  drivers/scsi/smartpqi/smartpqi_init.c |  6 +-
>  6 files changed, 42 insertions(+), 16 deletions(-)

Since these are really fixing a single PCI API problem, not individual
driver-related problems, I squashed the pcibios_err_to_errno() patches
together (except IB/hfi1, since Jason will take those separately) and
applied them to pci/misc, thanks!

The squashed patch as applied is:

commit d20df83b66cc ("PCI: Convert PCIe capability PCIBIOS errors to errno")
Author: Bolarinwa Olayemi Saheed 
Date:   Mon Jun 15 09:32:18 2020 +0200

PCI: Convert PCIe capability PCIBIOS errors to errno

The PCI config accessors (pci_read_config_word(), et al) return
PCIBIOS_SUCCESSFUL (zero) or positive error values like
PCIBIOS_FUNC_NOT_SUPPORTED.

The PCIe capability accessors (pcie_capability_read_word(), et al)
similarly return PCIBIOS errors, but some callers assume they return
generic errno values like -EINVAL.

For example, the Myri-10G probe function returns a positive PCIBIOS error
if the pcie_capability_clear_and_set_word() in pcie_set_readrq() fails:

  myri10ge_probe
status = pcie_set_readrq
  return pcie_capability_clear_and_set_word
if (status)
  return status

A positive return from a PCI driver probe function would cause a "Driver
probe function unexpectedly returned" warning from local_pci_probe()
instead of the desired probe failure.

Convert PCIBIOS errors to generic errno for all callers of:

  pcie_capability_read_word
  pcie_capability_read_dword
  pcie_capability_write_word
  pcie_capability_write_dword
  pcie_capability_set_word
  pcie_capability_set_dword
  pcie_capability_clear_word
  pcie_capability_clear_dword
  pcie_capability_clear_and_set_word
  pcie_capability_clear_and_set_dword

that check the return code for anything other than zero.

[bhelgaas: commit log, squash together]
Suggested-by: Bjorn Helgaas 
Link: https://lore.kernel.org/r/20200615073225.24061-1-refactormys...@gmail.com
Signed-off-by: Bolarinwa Olayemi Saheed 
Signed-off-by: Bjorn Helgaas 

diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c
index 58d13564f88b..9a6a9ec3cf48 100644
--- a/drivers/dma/ioat/init.c
+++ b/drivers/dma/ioat/init.c
@@ -1195,13 +1195,13 @@ static int ioat3_dma_probe(struct ioatdma_device *ioat_dma, int dca)
/* disable relaxed ordering */
err = pcie_capability_read_word(pdev, IOAT_DEVCTRL_OFFSET, &val16);
if (err)
-   return err;
+   return pcibios_err_to_errno(err);
 
/* clear relaxed ordering enable */
val16 &= ~IOAT_DEVCTRL_ROE;
err = pcie_capability_write_word(pdev, 
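
(The diff is truncated here.) The same two-line pattern repeats at every
converted call site; schematically:

    err = pcie_capability_read_word(pdev, pos, &val);
    if (err)
            return pcibios_err_to_errno(err);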

Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-06-26 Thread Leonardo Bras
On Fri, 2020-06-26 at 12:23 -0300, Leonardo Bras wrote:
> On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote:
> > As of today, if a DDW is created and can't map the whole partition, it's
> > removed and the default DMA window "ibm,dma-window" is used instead.
> > 
> > Usually this DDW is bigger than the default DMA window, so it would be
> > better to make use of it instead.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> 
> I tested this change with a 256GB DDW which did not map the whole
> partition, with a MT27700 Family [ConnectX-4 Virtual Function].
> 
> I noticed the performance improvement is about the same as using DDW
> with IOMMU bypass.
> 
> 64 thread write throughput: +203.0%
> 64 thread read throughput: +17.5%
> 1 thread write throughput: +20.5%
> 1 thread read throughput: +3.43%
> Average write latency: -23.0%
> Average read latency:  -2.26%

The above improvements are based on the default DMA window, which is
currently used if DDW can't map the whole partition.

Those values are an average of 20 tests for each environment, 30
seconds each test.

I also did some intense testing, for 5 hours each:
64 thread write throughput 
64 thread read throughput

The throughput values are stable in the whole test, and I noticed no
error on dmesg / journalctl.



Re: [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00

2020-06-26 Thread Leonardo Bras
On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote:
> As of today, enable_ddw() will return a non-null DMA address if the
> created DDW maps the whole partition. If the address is valid,
> iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled.
> 
> This can cause some trouble if the DDW happens to start at 0x00.
> 
> Instead of checking if the address is non-null, check directly if
> the DDW maps the whole partition, so it can bypass iommu.
> 
> Signed-off-by: Leonardo Bras 

This patch has a bug in it. I will rework it soon.
Please keep reviewing patches 1-5.

Best regards,
Leonardo



[PATCH] selftests/powerpc: Purge extra count_pmc() calls of ebb selftests

2020-06-26 Thread Desnes A. Nunes do Rosario
An extra count on ebb_state.stats.pmc_count[PMC_INDEX(pmc)] is being
performed when count_pmc() is used to reset PMCs on a few selftests. This
extra pmc_count can occasionally invalidate results, such as the ones from
cycles_test shown hereafter. The ebb_check_count() failed with an above
the upper limit error due to the extra value on ebb_state.stats.pmc_count.

Furthermore, this extra count is also indicated by extra PMC1 trace_log on
the output of the cycle test (as well as on pmc56_overflow_test):

==
   ...
   [21]: counter = 8
   [22]: register SPRN_MMCR0 = 0x8080
   [23]: register SPRN_PMC1  = 0x8004
   [24]: counter = 9
   [25]: register SPRN_MMCR0 = 0x8080
   [26]: register SPRN_PMC1  = 0x8004
   [27]: counter = 10
   [28]: register SPRN_MMCR0 = 0x8080
   [29]: register SPRN_PMC1  = 0x8004
>> [30]: register SPRN_PMC1  = 0x451e
PMC1 count (0x28546) above upper limit 0x283e8 (+0x15e)
[FAIL] Test FAILED on line 52
failure: cycles
==

Signed-off-by: Desnes A. Nunes do Rosario 
---
 .../selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c | 2 --
 tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c  | 2 --
 .../selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c| 2 --
 .../selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c | 2 --
 tools/testing/selftests/powerpc/pmu/ebb/ebb.c  | 2 --
 .../selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c  | 2 --
 .../selftests/powerpc/pmu/ebb/lost_exception_test.c| 1 -
 .../testing/selftests/powerpc/pmu/ebb/multi_counter_test.c | 7 ---
 .../selftests/powerpc/pmu/ebb/multi_ebb_procs_test.c   | 2 --
 .../testing/selftests/powerpc/pmu/ebb/pmae_handling_test.c | 2 --
 .../selftests/powerpc/pmu/ebb/pmc56_overflow_test.c| 2 --
 11 files changed, 26 deletions(-)

diff --git a/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c b/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c
index a2d7b0e3dca9..a26ac122c759 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c
@@ -91,8 +91,6 @@ int back_to_back_ebbs(void)
ebb_global_disable();
ebb_freeze_pmcs();
 
-   count_pmc(1, sample_period);
-
dump_ebb_state();
 
event_close();
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c b/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c
index bc893813483e..bb9f587fa76e 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c
@@ -42,8 +42,6 @@ int cycles(void)
ebb_global_disable();
ebb_freeze_pmcs();
 
-   count_pmc(1, sample_period);
-
dump_ebb_state();
 
event_close();
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c
index dcd351d20328..9ae795ce314e 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c
@@ -99,8 +99,6 @@ int cycles_with_freeze(void)
ebb_global_disable();
ebb_freeze_pmcs();
 
-   count_pmc(1, sample_period);
-
dump_ebb_state();
 
printf("EBBs while frozen %d\n", ebbs_while_frozen);
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c
index 94c99c12c0f2..4b45a2e70f62 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c
@@ -71,8 +71,6 @@ int cycles_with_mmcr2(void)
ebb_global_disable();
ebb_freeze_pmcs();
 
-   count_pmc(1, sample_period);
-
dump_ebb_state();
 
event_close();
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/ebb.c b/tools/testing/selftests/powerpc/pmu/ebb/ebb.c
index dfbc5c3ad52d..21537d6eb6b7 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/ebb.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/ebb.c
@@ -396,8 +396,6 @@ int ebb_child(union pipe read_pipe, union pipe write_pipe)
ebb_global_disable();
ebb_freeze_pmcs();
 
-   count_pmc(1, sample_period);
-
dump_ebb_state();
 
event_close();
diff --git a/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c b/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c
index ca2f7d729155..b208bf6ad58d 100644
--- a/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c
+++ b/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c
@@ -38,8 +38,6 @@ static int victim_child(union pipe read_pipe, union pipe write_pipe)
ebb_global_disable();
ebb_freeze_pmcs();
 
-   count_pmc(1, sample_period);
-

Re: [PATCH v2 00/15] Documentation fixes

2020-06-26 Thread Jonathan Corbet
On Tue, 23 Jun 2020 09:08:56 +0200
Mauro Carvalho Chehab  wrote:

> As requested, this is a rebase of a previous series posted on Jan, 15.
> 
> Since then, several patches got merged via other trees or became
> obsolete. There were also 2 patches before that fits better at the
> ReST conversion patchset. So, I'll be sending it on another patch
> series together with the remaining ReST conversions.
> 
> I also added reviews/acks received.
> 
> So, the series reduced from 29 to 15 patches.
> 
> Let's hope b4 would be able to properly handle this one.

Nope.  I don't know what it is about your patch series, but b4 is never
able to put them together.

I've applied the series except for #1, which already went through the -mm
tree.

Thanks,

jon


[PATCH v2 1/3] powerpc/mm: Enable radix GTSE only if supported.

2020-06-26 Thread Bharata B Rao
Make GTSE an MMU feature and enable it by default for radix.
However, for a guest, conditionally enable it if the hypervisor
supports it via the OV5 vector. Let prom_init ask for radix GTSE
only if the support exists.

Having GTSE as an MMU feature will make it easy to enable radix
without GTSE. Currently radix assumes GTSE is enabled by default.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/mmu.h|  4 
 arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
 arch/powerpc/kernel/prom_init.c   | 13 -
 arch/powerpc/mm/init_64.c |  5 -
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index f4ac25d4df05..884d51995934 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -28,6 +28,9 @@
  * Individual features below.
  */
 
+/* Guest Translation Shootdown Enable */
+#define MMU_FTR_GTSE   ASM_CONST(0x1000)
+
 /*
  * Support for 68 bit VA space. We added that from ISA 2.05
  */
@@ -173,6 +176,7 @@ enum {
 #endif
 #ifdef CONFIG_PPC_RADIX_MMU
MMU_FTR_TYPE_RADIX |
+   MMU_FTR_GTSE |
 #ifdef CONFIG_PPC_KUAP
MMU_FTR_RADIX_KUAP |
 #endif /* CONFIG_PPC_KUAP */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index a0edeb391e3e..ac650c233cd9 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -336,6 +336,7 @@ static int __init feat_enable_mmu_radix(struct dt_cpu_feature *f)
 #ifdef CONFIG_PPC_RADIX_MMU
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
 
return 1;
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 90c604d00b7d..cbc605cfdec0 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
}
}
 
-   if (supported.radix_mmu && supported.radix_gtse &&
-   IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
-   /* Radix preferred - but we require GTSE for now */
-   prom_debug("Asking for radix with GTSE\n");
+   if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
+   /* Radix preferred - Check if GTSE is also supported */
+   prom_debug("Asking for radix\n");
ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
-   ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
+   if (supported.radix_gtse)
+   ibm_architecture_vec.vec5.radix_ext =
+   OV5_FEAT(OV5_RADIX_GTSE);
+   else
+   prom_debug("Radix GTSE isn't supported\n");
} else if (supported.hash_mmu) {
/* Default to hash mmu (if we can) */
prom_debug("Asking for hash\n");
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index bc73abf0bc25..152aa0200cef 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -407,12 +407,15 @@ static void __init early_check_vec5(void)
if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
OV5_FEAT(OV5_RADIX_GTSE))) {
pr_warn("WARNING: Hypervisor doesn't support RADIX with 
GTSE\n");
-   }
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
+   } else
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
/* Do radix anyway - the hypervisor said we had to */
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
} else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
/* Hypervisor only supports hash - disable radix */
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
}
 }
 
-- 
2.21.3
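
With GTSE now a first-class MMU feature, code elsewhere can test for it
directly instead of assuming radix implies GTSE, which is what patch 2/3
below does for H_REGISTER_PROC_TBL. A minimal illustrative sketch (the
helper name is hypothetical):

    /* hypothetical helper: may the guest issue tlbie itself? */
    static inline bool guest_can_tlbie(void)
    {
            return mmu_has_feature(MMU_FTR_GTSE);
    }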



[PATCH v2 2/3] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled

2020-06-26 Thread Bharata B Rao
H_REGISTER_PROC_TBL asks for GTSE by default. The GTSE flag bit should
be set only when GTSE is supported.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/platforms/pseries/lpar.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index fd26f3d21d7b..f82569a505f1 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1680,9 +1680,11 @@ static int pseries_lpar_register_process_table(unsigned long base,
 
if (table_size)
flags |= PROC_TABLE_NEW;
-   if (radix_enabled())
-   flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE;
-   else
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
flags |= PROC_TABLE_HPT_SLB;
for (;;) {
rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-- 
2.21.3



Re: [PATCH] powerpc/pseries: Use doorbells even if XIVE is available

2020-06-26 Thread Cédric Le Goater
Adding David, 

On 6/25/20 3:11 AM, Michael Ellerman wrote:
> Nicholas Piggin  writes:
>> KVM supports msgsndp in guests by trapping and emulating the
>> instruction, so it was decided to always use XIVE for IPIs if it is
>> available. However on PowerVM systems, msgsndp can be used and gives
>> better performance. On large systems, high XIVE interrupt rates can
>> have sub-linear scaling, and using msgsndp can reduce the load on
>> the interrupt controller.
>>
>> So switch to using core local doorbells even if XIVE is available.
>> This reduces performance for KVM guests with an SMT topology by
>> about 50% for ping-pong context switching between SMT vCPUs.
> 
> You have to take explicit steps to configure KVM in that way with qemu.
> eg. "qemu .. -smp 8" will give you 8 SMT1 CPUs by default.
> 
>> An option vector (or dt-cpu-ftrs) could be defined to disable msgsndp
>> to get KVM performance back.

An option vector would require a PAPR change. Unless the architecture 
reserves some bits for the implementation, but I don't think so. Same
for CAS.

> Qemu/KVM populates /proc/device-tree/hypervisor, so we *could* look at
> that. Though adding PowerVM/KVM specific hacks is obviously a very
> slippery slope.

QEMU could advertise a property "emulated-msgsndp", or something similar, 
which would be interpreted by Linux as a CPU feature and taken into account 
when doing the IPIs.

The CPU setup for XIVE needs a cleanup also. There is no need to allocate
interrupts for IPIs anymore in that case.

Thanks,

C. 


> 
>> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
>> index 6891710833be..a737a2f87c67 100644
>> --- a/arch/powerpc/platforms/pseries/smp.c
>> +++ b/arch/powerpc/platforms/pseries/smp.c
>> @@ -188,13 +188,14 @@ static int pseries_smp_prepare_cpu(int cpu)
>>  return 0;
>>  }
>>  
>> +static void  (*cause_ipi_offcore)(int cpu) __ro_after_init;
>> +
>>  static void smp_pseries_cause_ipi(int cpu)
> 
> This is static so the name could be more descriptive, it doesn't need
> the "smp_pseries" prefix.
> 
>>  {
>> -/* POWER9 should not use this handler */
>>  if (doorbell_try_core_ipi(cpu))
>>  return;
> 
> Seems like it would be worth making that static inline so we can avoid
> the function call overhead.
> 
>> -icp_ops->cause_ipi(cpu);
>> +cause_ipi_offcore(cpu);
>>  }
>>  
>>  static int pseries_cause_nmi_ipi(int cpu)
>> @@ -222,10 +223,7 @@ static __init void pSeries_smp_probe_xics(void)
>>  {
>>  xics_smp_probe();
>>  
>> -if (cpu_has_feature(CPU_FTR_DBELL) && !is_secure_guest())
>> -smp_ops->cause_ipi = smp_pseries_cause_ipi;
>> -else
>> -smp_ops->cause_ipi = icp_ops->cause_ipi;
>> +smp_ops->cause_ipi = icp_ops->cause_ipi;
>>  }
>>  
>>  static __init void pSeries_smp_probe(void)
>> @@ -238,6 +236,18 @@ static __init void pSeries_smp_probe(void)
> 
> The comment just above here says:
> 
>   /*
>    * Don't use P9 doorbells when XIVE is enabled. IPIs
>    * using MMIOs should be faster
>    */
>>  xive_smp_probe();
> 
> Which is no longer true.
> 
>>  else
>>  pSeries_smp_probe_xics();
> 
> I think you should just fold this in, it would make the logic slightly
> easier to follow.
> 
>> +/*
>> + * KVM emulates doorbells by reading the instruction, which
>> + * can't be done if the guest is secure. If a secure guest
>> + * runs under PowerVM, it could use msgsndp but would need a
>> + * way to distinguish.
>> + */
> 
> It's not clear what it needs to distinguish: That it's running under
> PowerVM and therefore *can* use msgsndp even though it's secure.
> 
> Also the comment just talks about the is_secure_guest() test, which is
> not obvious on first reading.
> 
>> +if (cpu_has_feature(CPU_FTR_DBELL) &&
>> +cpu_has_feature(CPU_FTR_SMT) && !is_secure_guest()) {
>> +cause_ipi_offcore = smp_ops->cause_ipi;
>> +smp_ops->cause_ipi = smp_pseries_cause_ipi;
>> +}
> 
> Because we're at the tail of the function I think this would be clearer
> if it used early returns, eg:
> 
>   // If the CPU doesn't have doorbells then we must use xics/xive
>   if (!cpu_has_feature(CPU_FTR_DBELL))
>           return;
> 
>   // If the CPU doesn't have SMT then doorbells don't help us
>   if (!cpu_has_feature(CPU_FTR_SMT))
>           return;
> 
>   // Secure guests can't use doorbells because ...
>   if (!is_secure_guest())
>           return;
> 
>   /*
>    * Otherwise we want to use doorbells for sibling threads and
>    * xics/xive for IPIs off the core, because it performs better
>    * on large systems ...
>    */
>   cause_ipi_offcore = smp_ops->cause_ipi;
>   smp_ops->cause_ipi = smp_pseries_cause_ipi;
> }
> 
> 
> cheers
> 



Re: [PATCH 09/13] x86: Remove dev->archdata.iommu pointer

2020-06-26 Thread Borislav Petkov
On Thu, Jun 25, 2020 at 03:08:32PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> There are no users left, all drivers have been converted to use the
> per-device private pointer offered by IOMMU core.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/include/asm/device.h | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/device.h b/arch/x86/include/asm/device.h
> index 49bd6cf3eec9..7c0a52ca2f4d 100644
> --- a/arch/x86/include/asm/device.h
> +++ b/arch/x86/include/asm/device.h
> @@ -3,9 +3,6 @@
>  #define _ASM_X86_DEVICE_H
>  
>  struct dev_archdata {
> -#ifdef CONFIG_IOMMU_API
> - void *iommu; /* hook for IOMMU specific extension */
> -#endif
>  };
>  
>  struct pdev_archdata {
> -- 

Acked-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


[PATCH v3 0/2] Add cpu hotplug support for powerpc/perf/hv-24x7

2020-06-26 Thread Kajol Jain
This patchset adds cpu hotplug support to the hv_24x7 driver by adding
online/offline cpu hotplug functions. It also adds a sysfs file
"cpumask" to expose the current online cpu that is used for
hv_24x7 event counting.

Changelog:
v2 -> v3
- Corrected some typos and updated the commit message
  as suggested by Gautham R Shenoy.
- Added Reviewed-by tag for the first patch in the patchset.

v1 -> v2
- Changed the function used to pick an active cpu when one goes
  offline from "cpumask_any_but" to "cpumask_last", as
  cpumask_any_but picks the very next online cpu, and in the case
  where we are sequentially off-lining multiple cpus,
  "pmu_migrate_context" can add extra latency.
  - Suggested by: Gautham R Shenoy.

- Changed the documentation for cpumask and, rather than hardcoding
  the initialization for cpumask_attr_group, added a loop to find the
  very first NULL, as suggested by Gautham R Shenoy.

Kajol Jain (2):
  powerpc/perf/hv-24x7: Add cpu hotplug support
  powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

 .../sysfs-bus-event_source-devices-hv_24x7|  7 ++
 arch/powerpc/perf/hv-24x7.c   | 79 ++-
 include/linux/cpuhotplug.h|  1 +
 3 files changed, 86 insertions(+), 1 deletion(-)

-- 
2.18.2



[PATCH v3 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

2020-06-26 Thread Kajol Jain
Patch here adds a cpumask attr to the hv_24x7 pmu along with ABI documentation.

The primary use of exposing the cpumask is for the perf tool, which can
parse the driver sysfs folder and understand the cpumask file. Having a
cpumask file reduces the number of perf command line parameters (it
avoids the "-C" option in the perf tool command line). It can also tell
the user which cpu is currently used to retrieve the counter data.

command:# cat /sys/devices/hv_24x7/cpumask
0

Signed-off-by: Kajol Jain 
---
 .../sysfs-bus-event_source-devices-hv_24x7|  7 
 arch/powerpc/perf/hv-24x7.c   | 36 +--
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
index e8698afcd952..f9dd3755b049 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -43,6 +43,13 @@ Description: read only
This sysfs interface exposes the number of cores per chip
present in the system.
 
+What:  /sys/devices/hv_24x7/cpumask
+Date:  June 2020
+Contact:   Linux on PowerPC Developer List 
+Description:   read only
+   This sysfs file exposes the cpumask which is designated to make
+   HCALLs to retrieve hv-24x7 pmu event counter data.
+
 What:  /sys/bus/event_source/devices/hv_24x7/event_descs/
 Date:  February 2014
 Contact:   Linux on PowerPC Developer List 
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index ce4739e2b407..3c699612d29f 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -448,6 +448,12 @@ static ssize_t device_show_string(struct device *dev,
return sprintf(buf, "%s\n", (char *)d->var);
 }
 
+static ssize_t cpumask_get_attr(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return cpumap_print_to_pagebuf(true, buf, &hv_24x7_cpumask);
+}
+
 static ssize_t sockets_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1116,6 +1122,17 @@ static DEVICE_ATTR_RO(sockets);
 static DEVICE_ATTR_RO(chipspersocket);
 static DEVICE_ATTR_RO(coresperchip);
 
+static DEVICE_ATTR(cpumask, S_IRUGO, cpumask_get_attr, NULL);
+
+static struct attribute *cpumask_attrs[] = {
+   &dev_attr_cpumask.attr,
+   NULL,
+};
+
+static struct attribute_group cpumask_attr_group = {
+   .attrs = cpumask_attrs,
+};
+
 static struct bin_attribute *if_bin_attrs[] = {
&bin_attr_catalog,
NULL,
@@ -1143,6 +1160,11 @@ static const struct attribute_group *attr_groups[] = {
&event_desc_group,
&event_long_desc_group,
&if_group,
+   /*
+* This NULL is a placeholder for the cpumask attr which will be updated
+* only if cpuhotplug registration is successful
+*/
+   NULL,
NULL,
 };
 
@@ -1683,7 +1705,7 @@ static int hv_24x7_cpu_hotplug_init(void)
 
 static int hv_24x7_init(void)
 {
-   int r;
+   int r, i = -1;
unsigned long hret;
struct hv_perf_caps caps;
 
@@ -1727,8 +1749,18 @@ static int hv_24x7_init(void)
 
/* init cpuhotplug */
r = hv_24x7_cpu_hotplug_init();
-   if (r)
+   if (r) {
pr_err("hv_24x7: CPU hotplug init failed\n");
+   } else {
+   /*
+* Cpu hotplug init was successful, so add the
+* cpumask file as part of the pmu attr group and
+* assign it to the very first NULL location.
+*/
+   while (attr_groups[++i])
+   /* nothing */;
+   attr_groups[i] = &cpumask_attr_group;
+   }
 
r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
if (r)
-- 
2.18.2
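
For illustration, a minimal user-space reader for the new cpumask file
could look like the sketch below (assuming only the sysfs path documented
in the ABI entry above; this is not part of the patch):

#include <stdio.h>

/* Sketch: print the cpu(s) designated for hv_24x7 HCALLs. */
int main(void)
{
	char buf[64];
	FILE *f = fopen("/sys/devices/hv_24x7/cpumask", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("hv_24x7 collector cpumask: %s", buf);
	fclose(f);
	return 0;
}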



[PATCH v3 1/2] powerpc/perf/hv-24x7: Add cpu hotplug support

2020-06-26 Thread Kajol Jain
Patch here adds cpu hotplug functions to the hv_24x7 pmu.
A new cpuhp_state "CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE" enum
is added.

The online callback function updates the cpumask only if it is
empty, as the primary intention of adding hotplug support
is to designate a CPU to make the HCALLs to collect the
counter data.

The offline function tests and clears the corresponding cpu in the
cpumask and updates the cpumask to any other active cpu.

Signed-off-by: Kajol Jain 
Reviewed-by: Gautham R. Shenoy 
---
 arch/powerpc/perf/hv-24x7.c | 45 +
 include/linux/cpuhotplug.h  |  1 +
 2 files changed, 46 insertions(+)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index db213eb7cb02..ce4739e2b407 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -31,6 +31,8 @@ static int interface_version;
 /* Whether we have to aggregate result data for some domains. */
 static bool aggregate_result_elements;
 
+static cpumask_t hv_24x7_cpumask;
+
 static bool domain_is_valid(unsigned domain)
 {
switch (domain) {
@@ -1641,6 +1643,44 @@ static struct pmu h_24x7_pmu = {
.capabilities = PERF_PMU_CAP_NO_EXCLUDE,
 };
 
+static int ppc_hv_24x7_cpu_online(unsigned int cpu)
+{
+   /* Make this CPU the designated target for counter collection */
+   if (cpumask_empty(&hv_24x7_cpumask))
+   cpumask_set_cpu(cpu, &hv_24x7_cpumask);
+
+   return 0;
+}
+
+static int ppc_hv_24x7_cpu_offline(unsigned int cpu)
+{
+   int target = -1;
+
+   /* Check if exiting cpu is used for collecting 24x7 events */
+   if (!cpumask_test_and_clear_cpu(cpu, &hv_24x7_cpumask))
+   return 0;
+
+   /* Find a new cpu to collect 24x7 events */
+   target = cpumask_last(cpu_active_mask);
+
+   if (target < 0 || target >= nr_cpu_ids)
+   return -1;
+
+   /* Migrate 24x7 events to the new target */
+   cpumask_set_cpu(target, &hv_24x7_cpumask);
+   perf_pmu_migrate_context(&h_24x7_pmu, cpu, target);
+
+   return 0;
+}
+
+static int hv_24x7_cpu_hotplug_init(void)
+{
+   return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
+ "perf/powerpc/hv_24x7:online",
+ ppc_hv_24x7_cpu_online,
+ ppc_hv_24x7_cpu_offline);
+}
+
 static int hv_24x7_init(void)
 {
int r;
@@ -1685,6 +1725,11 @@ static int hv_24x7_init(void)
if (r)
return r;
 
+   /* init cpuhotplug */
+   r = hv_24x7_cpu_hotplug_init();
+   if (r)
+   pr_err("hv_24x7: CPU hotplug init failed\n");
+
r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
if (r)
return r;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 191772d4a4d7..a2710e654b64 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -181,6 +181,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE,
CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE,
CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
+   CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
CPUHP_AP_WATCHDOG_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
-- 
2.18.2



Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-26 Thread Michal Suchánek
On Fri, May 22, 2020 at 09:01:17AM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 22 May 2020, Aneesh Kumar K.V wrote:
> 
> > On 5/22/20 3:01 PM, Michal Suchánek wrote:
> > > On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Thu, 21 May 2020, Dan Williams wrote:
> > > > 
> > > > > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V
> > > > >  wrote:
> > > > > > 
> > > > > > > Moving on to the patch itself--Aneesh, have you audited other
> > > > > > > persistent
> > > > > > > memory users in the kernel?  For example, 
> > > > > > > drivers/md/dm-writecache.c
> > > > > > > does
> > > > > > > this:
> > > > > > > 
> > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios)
> > > > > > > {
> > > > > > >         if (WC_MODE_PMEM(wc))
> > > > > > >                 wmb();            <==
> > > > > > >         else
> > > > > > >                 ssd_commit_flushed(wc, wait_for_ios);
> > > > > > > }
> > > > > > > 
> > > > > > > I believe you'll need to make modifications there.
> > > > > > > 
> > > > > > 
> > > > > > Correct. Thanks for catching that.
> > > > > > 
> > > > > > 
> > > > > > I don't understand dm much, wondering how this will work with
> > > > > > non-synchronous DAX device?
> > > > > 
> > > > > That's a good point. DM-writecache needs to be cognizant of things
> > > > > like virtio-pmem that violate the rule that persisent memory writes
> > > > > can be flushed by CPU functions rather than calling back into the
> > > > > driver. It seems we need to always make the flush case a dax_operation
> > > > > callback to account for this.
> > > > 
> > > > dm-writecache is normally sitting on the top of dm-linear, so it would
> > > > need to pass the wmb() call through the dm core and dm-linear target ...
> > > > that would slow it down ... I remember that you already did it this way
> > > > some times ago and then removed it.
> > > > 
> > > > What's the exact problem with POWER? Could the POWER system have two 
> > > > types
> > > > of persistent memory that need two different ways of flushing?
> > > 
> > > As far as I understand the discussion so far
> > > 
> > >   - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency
> > >   - on POWER $newhardware uses $newinstruction to ensure pmem consistency
> > > (compatible with $oldinstruction on $oldhardware)
> > 
> > Correct.
> > 
> > >   - on some platforms instead of barrier instruction a callback into the
> > > driver is issued to ensure consistency 
> > 
> > This is virtio-pmem only at this point IIUC.
> > 
> > -aneesh
> 
> And does the virtio-pmem driver track which pages are dirty? Or does it 
> need to specify the range of pages to flush in the flush function?
> 
> > > None of this is reflected by the dm driver.
> 
> We could make a new dax method:
> void *(dax_get_flush_function)(void);
> 
> This would return a pointer to "wmb()" on x86 and something else on Power.
> 
> The method "dax_get_flush_function" would be called only once when 
> initializing the writecache driver (because the call would be slow because 
> it would have to go through the DM stack) and then, the returned function 
> would be called each time we need write ordering. The returned function 
> would do just "sfence; ret".

Hello,

as far as I understand the code, virtio_pmem has a flush function defined
which indeed can make use of the region properties, such as the memory
range. If such a function exists you need an equivalent of sync() - a call
into the device in question. If it does not, calling arch_pmem_flush_barrier()
instead of wmb() should suffice.

I am not aware of an interface to determine if the flush function exists
for a particular region.

Thanks

Michal
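
To make the above concrete, the dispatch being discussed amounts to
something like the following sketch (the region structure and its flush
callback are hypothetical stand-ins for the virtio-pmem case, and
arch_pmem_flush_barrier() is the barrier proposed earlier in this series;
this is not an existing kernel interface):

/* Sketch only - illustrates the flush dispatch discussed above. */
struct pmem_region_sketch {
	/*
	 * Non-NULL for devices like virtio-pmem that need a call back
	 * into the driver to make writes persistent.
	 */
	void (*flush)(struct pmem_region_sketch *region);
};

static void commit_flushed_sketch(struct pmem_region_sketch *region)
{
	if (region->flush)
		region->flush(region);		/* sync()-like call into the device */
	else
		arch_pmem_flush_barrier();	/* plain CPU barrier; wmb() on x86 */
}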


[PATCH v2 3/4] powerpc sstep: introduce macros to retrieve Prefix instruction operands

2020-06-26 Thread Balamuruhan S
Retrieve the prefix instruction operands RA and the pc-relative bit R
using macros, and adopt them in sstep.c and test_emulate_step.c.

Signed-off-by: Balamuruhan S 
---
 arch/powerpc/include/asm/sstep.h |  4 
 arch/powerpc/lib/sstep.c | 12 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/sstep.h b/arch/powerpc/include/asm/sstep.h
index 3b01c69a44aa..325975b4ef30 100644
--- a/arch/powerpc/include/asm/sstep.h
+++ b/arch/powerpc/include/asm/sstep.h
@@ -104,6 +104,10 @@ enum instruction_type {
 
 #define MKOP(t, f, s)  ((t) | (f) | SIZE(s))
 
+/* Prefix instruction operands */
+#define GET_PREFIX_RA(i)   (((i) >> 16) & 0x1f)
+#define GET_PREFIX_R(i)((i) & (1ul << 20))
+
 struct instruction_op {
int type;
int reg;
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 5abe98216dc2..fb4c5767663d 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -200,8 +200,8 @@ static nokprobe_inline unsigned long mlsd_8lsd_ea(unsigned int instr,
unsigned int  dd;
unsigned long ea, d0, d1, d;
 
-   prefix_r = instr & (1ul << 20);
-   ra = (suffix >> 16) & 0x1f;
+   prefix_r = GET_PREFIX_R(instr);
+   ra = GET_PREFIX_RA(suffix);
 
d0 = instr & 0x3ffff;
d1 = suffix & 0xffff;
@@ -1339,8 +1339,8 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
switch (opcode) {
 #ifdef __powerpc64__
case 1:
-   prefix_r = word & (1ul << 20);
-   ra = (suffix >> 16) & 0x1f;
+   prefix_r = GET_PREFIX_R(word);
+   ra = GET_PREFIX_RA(suffix);
rd = (suffix >> 21) & 0x1f;
op->reg = rd;
op->val = regs->gpr[rd];
@@ -2715,8 +2715,8 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
}
break;
case 1: /* Prefixed instructions */
-   prefix_r = word & (1ul << 20);
-   ra = (suffix >> 16) & 0x1f;
+   prefix_r = GET_PREFIX_R(word);
+   ra = GET_PREFIX_RA(suffix);
op->update_reg = ra;
rd = (suffix >> 21) & 0x1f;
op->reg = rd;
-- 
2.24.1
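
As a quick sanity check of the field positions, the two macros can be
exercised stand-alone (the encodings below are made-up values for
illustration, not real instructions):

#include <stdio.h>

/* Macro definitions copied from the patch above. */
#define GET_PREFIX_RA(i)	(((i) >> 16) & 0x1f)
#define GET_PREFIX_R(i)		((i) & (1ul << 20))

int main(void)
{
	unsigned int word = 1u << 20;	/* prefix word with R (bit 20) set */
	unsigned int suffix = 9u << 16;	/* suffix with RA = 9 in bits 16-20 */

	/* Prints: R set: yes, RA = 9 */
	printf("R set: %s, RA = %u\n",
	       GET_PREFIX_R(word) ? "yes" : "no", GET_PREFIX_RA(suffix));
	return 0;
}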



[PATCH v2 1/4] powerpc test_emulate_step: enhancement to test negative scenarios

2020-06-26 Thread Balamuruhan S
Add provision to declare a test as a negative scenario, verify that
emulation fails, and avoid executing it.

Signed-off-by: Balamuruhan S 
---
 arch/powerpc/lib/test_emulate_step.c | 30 +++-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c
index 0ca2b7cc8d8c..7c30a69c174f 100644
--- a/arch/powerpc/lib/test_emulate_step.c
+++ b/arch/powerpc/lib/test_emulate_step.c
@@ -118,6 +118,7 @@
 #define IGNORE_GPR(n)  (0x1UL << (n))
 #define IGNORE_XER (0x1UL << 32)
 #define IGNORE_CCR (0x1UL << 33)
+#define NEGATIVE_TEST  (0x1UL << 63)
 
 static void __init init_pt_regs(struct pt_regs *regs)
 {
@@ -1202,8 +1203,10 @@ static struct compute_test compute_tests[] = {
 };
 
 static int __init emulate_compute_instr(struct pt_regs *regs,
-   struct ppc_inst instr)
+   struct ppc_inst instr,
+   bool negative)
 {
+   int analysed;
extern s32 patch__exec_instr;
struct instruction_op op;
 
@@ -1212,13 +1215,17 @@ static int __init emulate_compute_instr(struct pt_regs *regs,
 
regs->nip = patch_site_addr(&patch__exec_instr);
 
-   if (analyse_instr(&op, regs, instr) != 1 ||
-   GETTYPE(op.type) != COMPUTE) {
-   pr_info("execution failed, instruction = %s\n", ppc_inst_as_str(instr));
+   analysed = analyse_instr(&op, regs, instr);
+   if (analysed != 1 || GETTYPE(op.type) != COMPUTE) {
+   if (negative)
+   return -EFAULT;
+   pr_info("emulation failed, instruction = %s\n", ppc_inst_as_str(instr));
return -EFAULT;
}
-
-   emulate_update_regs(regs, &op);
+   if (analysed == 1 && negative)
+   pr_info("negative test failed, instruction = %s\n", ppc_inst_as_str(instr));
+   if (!negative)
+   emulate_update_regs(regs, &op);
return 0;
 }
 
@@ -1256,7 +1263,7 @@ static void __init run_tests_compute(void)
struct pt_regs *regs, exp, got;
unsigned int i, j, k;
struct ppc_inst instr;
-   bool ignore_gpr, ignore_xer, ignore_ccr, passed;
+   bool ignore_gpr, ignore_xer, ignore_ccr, passed, rc, negative;
 
for (i = 0; i < ARRAY_SIZE(compute_tests); i++) {
test = &compute_tests[i];
@@ -1270,6 +1277,7 @@ static void __init run_tests_compute(void)
instr = test->subtests[j].instr;
flags = test->subtests[j].flags;
regs = &test->subtests[j].regs;
+   negative = flags & NEGATIVE_TEST;
ignore_xer = flags & IGNORE_XER;
ignore_ccr = flags & IGNORE_CCR;
passed = true;
@@ -1284,8 +1292,12 @@ static void __init run_tests_compute(void)
exp.msr = MSR_KERNEL;
got.msr = MSR_KERNEL;
 
-   if (emulate_compute_instr(&exp, instr) ||
-   execute_compute_instr(&got, instr)) {
+   rc = emulate_compute_instr(&exp, instr, negative) != 0;
+   if (negative) {
+   /* skip executing instruction */
+   passed = rc;
+   goto print;
+   } else if (rc || execute_compute_instr(&got, instr)) {
passed = false;
goto print;
}
-- 
2.24.1



Re: [bug] LTP mmap03 stuck in page fault loop after c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")

2020-06-26 Thread Aneesh Kumar K.V

On 6/26/20 1:17 PM, Aneesh Kumar K.V wrote:

Hi Jan,

On 6/26/20 12:29 PM, Jan Stancek wrote:

Hi,

LTP mmap03 is getting stuck in page fault loop after commit
   c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault 
error to the user")


System is ppc64le P9 lpar [1] running v5.8-rc2-34-g3e08a95294a4.

Here's a minimized reproducer:
- 8< -
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(int ac, char **av)
{
 int page_sz = getpagesize();
 int fildes;
 char *addr;

 fildes = open("tempfile", O_WRONLY | O_CREAT, 0666);
 write(fildes, &fildes, sizeof(fildes));
 close(fildes);

 fildes = open("tempfile", O_RDONLY);
 unlink("tempfile");

 addr = mmap(0, page_sz, PROT_EXEC, MAP_FILE | MAP_PRIVATE, fildes, 0);


 printf("%d\n", *addr);
 return 0;
}
- >8 -


Thanks for the report. This is an execute-only key, where the vma has
implied read permission. So the patch does break this case. I will see how
best we can handle PROT_EXEC and the multi-threaded test that required
the change in the patch.




Can you check with this change? While checking for access permission we
are checking against the UAMOR value, which I think is wrong. We just need
to look at the AMR and IAMR values to check whether access is permitted or
not. Even if UAMOR denies the userspace management of the key, we should
still do the correct access check.

modified   arch/powerpc/mm/book3s64/pkeys.c
@@ -353,9 +353,6 @@ static bool pkey_access_permitted(int pkey, bool write, bool execute)

int pkey_shift;
u64 amr;

-   if (!is_pkey_enabled(pkey))
-   return true;
-
pkey_shift = pkeyshift(pkey);
if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
return true;
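
For context, with that check removed the permission test is driven purely
by the AMR/IAMR bits. The surrounding function then looks roughly like the
sketch below (reconstructed from the mainline source of the time for
readability; the diff above is authoritative):

static bool pkey_access_permitted(int pkey, bool write, bool execute)
{
	int pkey_shift;
	u64 amr;

	pkey_shift = pkeyshift(pkey);
	if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
		return true;

	amr = read_amr();	/* delay reading amr until needed */
	return ((!write && !(amr & (AMR_RD_BIT << pkey_shift))) ||
		(write && !(amr & (AMR_WR_BIT << pkey_shift))));
}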





Re: [PATCH] powerpc/pseries: Use doorbells even if XIVE is available

2020-06-26 Thread Cédric Le Goater
>>> An option vector (or dt-cpu-ftrs) could be defined to disable msgsndp
>>> to get KVM performance back.
> 
> An option vector would require a PAPR change. Unless the architecture 
> reserves some bits for the implementation, but I don't think so. Same
> for CAS.
> 
>> Qemu/KVM populates /proc/device-tree/hypervisor, so we *could* look at
>> that. Though adding PowerVM/KVM specific hacks is obviously a very
>> slippery slope.
> 
> QEMU could advertise a property "emulated-msgsndp", or something similar, 
> which would be interpreted by Linux as a CPU feature and taken into account 
> when doing the IPIs.

Could we remove msgsndp support from HFSCR in KVM and test it in pseries ? 

C.


Re: [PATCH v2 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

2020-06-26 Thread kajoljain



On 6/26/20 1:15 PM, Gautham R Shenoy wrote:
> On Wed, Jun 24, 2020 at 05:58:31PM +0530, Madhavan Srinivasan wrote:
>>
>>
>> On 6/24/20 4:26 PM, Gautham R Shenoy wrote:
>>> Hi Kajol,
>>>
>>> On Wed, Jun 24, 2020 at 03:47:54PM +0530, Kajol Jain wrote:
 Patch here adds a cpumask attr to hv_24x7 pmu along with ABI documentation.

 command:# cat /sys/devices/hv_24x7/cpumask
 0
>>> Since this sysfs interface is read-only, and the user cannot change
>>> the CPU which will be making the HCALLs to obtain the 24x7 counts,
>>> does the user even need to know if currently CPU X is the one which is
>>> going to make HCALLs to retrive the 24x7 counts ? Does it help in any
>>> kind of trouble-shooting ?
>> Primary use to expose the cpumask is for the perf tool.
>> Which has the capability to parse the driver sysfs folder
>> and understand the cpumask file. Having cpumask
>> file will reduce the number of perf commandline
>> parameters (will avoid "-C" option in the perf tool
>> command line). I can also notify the user which is
>> the current cpu used to retrieve the counter data.
> 
> Fair enough. Can we include this in the patch description ?

Sure, will update it in the next version of the patchset.

Thanks,
Kajol Jain

> 
>>
>>> It would have made sense if the interface was read-write, since a user
>>> can set this to a CPU which is not running user applications. This
>>> would help in minimising jitter on those active CPUs running the user
>>> applications.
>>
>> With cpumask backed by hotplug
>> notifiers, enabling user write access to it will
>> complicate the code with more additional check.
>> CPU will come to play only if the user request for
>> counter data. If not, then there will be no HCALLs made
>> using the CPU.
> 
> Well, I was wondering if you could make the interface writable because
> I couldn't think of the use of a read-only interface. With the
> perf-use case you have provided, I guess it makes sense. I am ok with
> it being a read-only interface.
> 
>>
>> Maddy
> 
> --
> Thanks and Regards
> gautham.
> 


Re: [bug] LTP mmap03 stuck in page fault loop after c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")

2020-06-26 Thread Aneesh Kumar K.V

Hi Jan,

On 6/26/20 12:29 PM, Jan Stancek wrote:

Hi,

LTP mmap03 is getting stuck in page fault loop after commit
   c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the 
user")

System is ppc64le P9 lpar [1] running v5.8-rc2-34-g3e08a95294a4.

Here's a minimized reproducer:
- 8< -
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(int ac, char **av)
{
 int page_sz = getpagesize();
 int fildes;
 char *addr;

 fildes = open("tempfile", O_WRONLY | O_CREAT, 0666);
 write(fildes, &fildes, sizeof(fildes));
 close(fildes);

 fildes = open("tempfile", O_RDONLY);
 unlink("tempfile");

 addr = mmap(0, page_sz, PROT_EXEC, MAP_FILE | MAP_PRIVATE, fildes, 0);

 printf("%d\n", *addr);
 return 0;
}
- >8 -


Thanks for the report. This is an execute-only key, where the vma has
implied read permission. So the patch does break this case. I will see how
best we can handle PROT_EXEC and the multi-threaded test that required
the change in the patch.



-aneesh


Re: [PATCH v2 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

2020-06-26 Thread Gautham R Shenoy
On Wed, Jun 24, 2020 at 05:58:31PM +0530, Madhavan Srinivasan wrote:
> 
> 
> On 6/24/20 4:26 PM, Gautham R Shenoy wrote:
> >Hi Kajol,
> >
> >On Wed, Jun 24, 2020 at 03:47:54PM +0530, Kajol Jain wrote:
> >>Patch here adds a cpumask attr to hv_24x7 pmu along with ABI documentation.
> >>
> >>command:# cat /sys/devices/hv_24x7/cpumask
> >>0
> >Since this sysfs interface is read-only, and the user cannot change
> >the CPU which will be making the HCALLs to obtain the 24x7 counts,
> >does the user even need to know if currently CPU X is the one which is
> >going to make HCALLs to retrive the 24x7 counts ? Does it help in any
> >kind of trouble-shooting ?
> Primary use to expose the cpumask is for the perf tool.
> Which has the capability to parse the driver sysfs folder
> and understand the cpumask file. Having cpumask
> file will reduce the number of perf commandline
> parameters (will avoid "-C" option in the perf tool
> command line). I can also notify the user which is
> the current cpu used to retrieve the counter data.

Fair enough. Can we include this in the patch description ?

> 
> >It would have made sense if the interface was read-write, since a user
> >can set this to a CPU which is not running user applications. This
> >would help in minimising jitter on those active CPUs running the user
> >applications.
> 
> With cpumask backed by hotplug
> notifiers, enabling user write access to it will
> complicate the code with more additional check.
> CPU will come to play only if the user request for
> counter data. If not, then there will be no HCALLs made
> using the CPU.

Well, I was wondering if you could make the interface writable because
I couldn't think of the use of a read-only interface. With the
perf-use case you have provided, I guess it makes sense. I am ok with
it being a read-only interface.

> 
> Maddy

--
Thanks and Regards
gautham.


Re: [PATCH] powerpc/pseries: Use doorbells even if XIVE is available

2020-06-26 Thread Cédric Le Goater
[ ...  ]

>>> An option vector (or dt-cpu-ftrs) could be defined to disable msgsndp
>>> to get KVM performance back.
> 
> An option vector would require a PAPR change. Unless the architecture 
> reserves some bits for the implementation, but I don't think so. Same
> for CAS.
> 
>> Qemu/KVM populates /proc/device-tree/hypervisor, so we *could* look at
>> that. Though adding PowerVM/KVM specific hacks is obviously a very
>> slippery slope.
> 
> QEMU could advertise a property "emulated-msgsndp", or something similar, 
> which would be interpreted by Linux as a CPU feature and taken into account 
> when doing the IPIs.

This is really a PowerVM optimization. 

> The CPU setup for XIVE needs a cleanup also. There is no need to allocate
> interrupts for IPIs anymore in that case.

We need to keep these for the other cores. The XIVE layer is unchanged.
 
C. 


[bug] LTP mmap03 stuck in page fault loop after c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")

2020-06-26 Thread Jan Stancek
Hi,

LTP mmap03 is getting stuck in page fault loop after commit
  c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to 
the user")

System is ppc64le P9 lpar [1] running v5.8-rc2-34-g3e08a95294a4.

Here's a minimized reproducer:
- 8< -
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(int ac, char **av)
{
int page_sz = getpagesize();
int fildes;
char *addr;

fildes = open("tempfile", O_WRONLY | O_CREAT, 0666);
write(fildes, &fildes, sizeof(fildes));
close(fildes);

fildes = open("tempfile", O_RDONLY);
unlink("tempfile");

addr = mmap(0, page_sz, PROT_EXEC, MAP_FILE | MAP_PRIVATE, fildes, 0);

printf("%d\n", *addr);
return 0;
}
- >8 -

This would previously end quickly with a segmentation fault; after
commit c46241a370a6 the test is stuck:

# perf stat timeout 5 ./a.out

 Performance counter stats for 'timeout 5 ./a.out':

  5,001.74 msec task-clock#1.000 CPUs utilized
 9  context-switches  #0.002 K/sec
 0  cpu-migrations#0.000 K/sec
 3,094,893  page-faults   #0.619 M/sec
18,940,869,512  cycles#3.787 GHz
  (33.39%)
 1,377,005,087  stalled-cycles-frontend   #7.27% frontend cycles 
idle (50.19%)
10,949,936,056  stalled-cycles-backend#   57.81% backend cycles 
idle  (16.62%)
21,133,828,748  instructions  #1.12  insn per cycle
  #0.52  stalled cycles per 
insn  (33.22%)
 4,395,016,137  branches  #  878.698 M/sec  
  (49.81%)
   164,499,002  branch-misses #3.74% of all branches
  (16.60%)

   5.001237248 seconds time elapsed

   0.321276000 seconds user
   4.680772000 seconds sys


access_pkey_error() in the page fault handler now always seems to return false:

  __do_page_fault
access_pkey_error(is_pkey: 1, is_exec: 0, is_write: 0)
  arch_vma_access_permitted
pkey_access_permitted
  if (!is_pkey_enabled(pkey))
return true
  return false

Regards,
Jan

[1]
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   1
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32 KiB
L1i cache:   32 KiB
NUMA node0 CPU(s):
NUMA node1 CPU(s):   0-7
Physical sockets:2
Physical chips:  1
Physical cores/chip: 8
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf:  Mitigation; RFI Flush, L1D private per thread
Vulnerability Mds:   Not affected
Vulnerability Meltdown:  Mitigation; RFI Flush, L1D private per thread
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:Mitigation; __user pointer sanitization, ori31 
speculation barrier enabled
Vulnerability Spectre v2:Mitigation; Indirect branch cache disabled, 
Software link stack flush
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort:   Not affected



[PATCH] crypto: af_alg - Fix regression on empty requests

2020-06-26 Thread Herbert Xu
On Tue, Jun 23, 2020 at 10:02:17AM -0700, Eric Biggers wrote:
>
> The source code for the two failing AF_ALG tests is here:
> 
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/crypto/af_alg02.c
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/crypto/af_alg05.c
> 
> They use read() and write(), not send() and recv().
> 
> af_alg02 uses read() to read from a "salsa20" request socket without writing
> anything to it.  It is expected that this returns 0, i.e. that behaves like
> encrypting an empty message.
> 
> af_alg05 uses write() to write 15 bytes to a "cbc(aes-generic)" request 
> socket,
> then read() to read 15 bytes.  It is expected that this fails with EINVAL, 
> since
> the length is not aligned to the AES block size (16 bytes).
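
For reference, the af_alg02 case boils down to a sequence like the sketch
below (error handling trimmed; the all-zero 16-byte key is only there to
get the request socket into a readable state):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

int main(void)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type   = "skcipher",
		.salg_name   = "salsa20",
	};
	char key[16] = { 0 }, buf[16];
	int tfm, req;
	ssize_t n;

	tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
	bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
	setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
	req = accept(tfm, NULL, NULL);

	/*
	 * af_alg02: read() with nothing written should return 0,
	 * i.e. behave like encrypting an empty message.
	 */
	n = read(req, buf, sizeof(buf));
	printf("read returned %zd\n", n);

	close(req);
	close(tfm);
	return 0;
}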

This patch should fix the regression:

---8<---
Some user-space programs rely on crypto requests that have no
control metadata.  This broke when a check was added to require
the presence of control metadata with the ctx->init flag.

This patch fixes the regression by removing the ctx->init flag.

This means that we do not distinguish the case of no metadata
from that of an empty request.  IOW it is always assumed that
if you call recv(2) before sending metadata, you are working
with an empty request.

Reported-by: Sachin Sant 
Reported-by: Naresh Kamboju 
Fixes: f3c802a1f300 ("crypto: algif_aead - Only wake up when...")
Signed-off-by: Herbert Xu 

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 9fcb91ea10c4..2d391117c020 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -635,7 +635,6 @@ void af_alg_pull_tsgl(struct sock *sk, size_t used, struct scatterlist *dst,
 
if (!ctx->used)
ctx->merge = 0;
-   ctx->init = ctx->more;
 }
 EXPORT_SYMBOL_GPL(af_alg_pull_tsgl);
 
@@ -757,8 +756,7 @@ int af_alg_wait_for_data(struct sock *sk, unsigned flags, unsigned min)
break;
timeout = MAX_SCHEDULE_TIMEOUT;
if (sk_wait_event(sk, &timeout,
- ctx->init && (!ctx->more ||
-   (min && ctx->used >= min)),
+ !ctx->more || (min && ctx->used >= min),
  &wait)) {
err = 0;
break;
@@ -847,7 +845,7 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
}
 
lock_sock(sk);
-   if (ctx->init && (init || !ctx->more)) {
+   if (!ctx->more && ctx->used) {
err = -EINVAL;
goto unlock;
}
@@ -858,7 +856,6 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
memcpy(ctx->iv, con.iv->iv, ivsize);
 
ctx->aead_assoclen = con.aead_assoclen;
-   ctx->init = true;
}
 
while (size) {
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index d48d2156e621..749fe42315be 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -106,7 +106,7 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
size_t usedpages = 0;   /* [in]  RX bufs to be used from user */
size_t processed = 0;   /* [in]  TX bufs to be consumed */
 
-   if (!ctx->init || ctx->more) {
+   if (ctx->more) {
err = af_alg_wait_for_data(sk, flags, 0);
if (err)
return err;
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index a51ba22fef58..5b6fa5e8c00d 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -61,7 +61,7 @@ static int _skcipher_recvmsg(struct socket *sock, struct msghdr *msg,
int err = 0;
size_t len = 0;
 
-   if (!ctx->init || (ctx->more && ctx->used < bs)) {
+   if (ctx->more && ctx->used < bs) {
err = af_alg_wait_for_data(sk, flags, bs);
if (err)
return err;
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index ee6412314f8f..08c087cc89d6 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -135,7 +135,6 @@ struct af_alg_async_req {
  * SG?
  * @enc:   Cryptographic operation to be performed when
  * recvmsg is invoked.
- * @init:  True if metadata has been sent.
  * @len:   Length of memory allocated for this data structure.
  */
 struct af_alg_ctx {
@@ -152,7 +151,6 @@ struct af_alg_ctx {
bool more;
bool merge;
bool enc;
-   bool init;
 
unsigned int len;
 };
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt