[v2 PATCH] crypto: af_alg - Fix regression on empty requests

2020-07-01 Thread Herbert Xu
On Tue, Jun 30, 2020 at 02:18:11PM +0530, Naresh Kamboju wrote:
> 
> Since we are on this subject,
> LTP af_alg02 test case fails on stable 4.9 and stable 4.4
> This is not a regression because the test case has been failing from
> the beginning.
> 
> Is this test case expected to fail on stable 4.9 and 4.4 ?
> or any chance to fix this on these older branches ?
> 
> Test output:
> af_alg02.c:52: BROK: Timed out while reading from request socket.
> 
> ref:
> https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884917/suite/ltp-crypto-tests/test/af_alg02/history/
> https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884606/suite/ltp-crypto-tests/test/af_alg02/log

Actually this test really is broken.  Even though empty requests
are legal, they should never be done with no write(2) at all,
because this fundamentally breaks the use of a blocking read(2)
to wait for more data.

Granted this has been broken since 2017 but I'm not going to
reintroduce this just because of a broken test case.

So please either remove af_alg02 or fix it by adding a control
message through sendmsg(2).
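For illustration, here is a minimal sketch of such a control message. It only builds the ALG_SET_OP cmsg for sendmsg(2); the AF_ALG socket setup is omitted, and the choice of ALG_OP_ENCRYPT is an assumption for the example, not necessarily what af_alg02 exercises.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

/* Fill *msg with an ALG_SET_OP control message and no payload.
 * cbuf must be at least CMSG_SPACE(sizeof(__u32)) bytes and aligned
 * for struct cmsghdr. */
static void build_empty_request(struct msghdr *msg, void *cbuf, size_t cbuf_len)
{
	struct cmsghdr *cmsg;
	__u32 op = ALG_OP_ENCRYPT;	/* assumed operation, for illustration */

	memset(msg, 0, sizeof(*msg));
	memset(cbuf, 0, cbuf_len);
	msg->msg_control = cbuf;
	msg->msg_controllen = cbuf_len;

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_OP;
	cmsg->cmsg_len = CMSG_LEN(sizeof(op));
	memcpy(CMSG_DATA(cmsg), &op, sizeof(op));
}
```

With no iovec attached, a subsequent sendmsg(reqfd, msg, 0) submits a zero-length request while still initialising the context, so a later blocking read(2) is well-defined.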

Thanks,

---8<---
Some user-space programs rely on crypto requests that have no
control metadata.  This broke when a check was added to require
the presence of control metadata with the ctx->init flag.

This patch fixes the regression by setting ctx->init as long as
one sendmsg(2) has been made, with or without a control message.

Reported-by: Sachin Sant 
Reported-by: Naresh Kamboju 
Fixes: f3c802a1f300 ("crypto: algif_aead - Only wake up when...")
Signed-off-by: Herbert Xu 

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 9fcb91ea10c41..5882ed46f1adb 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -851,6 +851,7 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
err = -EINVAL;
goto unlock;
}
+   ctx->init = true;
 
if (init) {
ctx->enc = enc;
@@ -858,7 +859,6 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
memcpy(ctx->iv, con.iv->iv, ivsize);
 
ctx->aead_assoclen = con.aead_assoclen;
-   ctx->init = true;
}
 
while (size) {
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH v8 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2020-07-01 Thread Daniel Axtens
Implement a limited form of KASAN for Book3S 64-bit machines running under
the Radix MMU:

 - Set aside the last 1/8th of the first contiguous block of physical
   memory to provide writable shadow for the linear map. For annoying
   reasons documented below, the memory size must be specified at compile
   time.

 - Enable the compiler instrumentation to check addresses and maintain the
   shadow region. (This is the guts of KASAN which we can easily reuse.)

 - Require kasan-vmalloc support to handle modules and anything else in
   vmalloc space.

 - KASAN needs to be able to validate all pointer accesses, but we can't
   back all kernel addresses with real memory - only linear map and
   vmalloc. On boot, set up a single page of read-only shadow that marks
   all these accesses as valid.

 - Make our stack-walking code KASAN-safe by using READ_ONCE_NOCHECK -
   generic code, arm64, s390 and x86 all do this for similar sorts of
   reasons: when unwinding a stack, we might touch memory that KASAN has
   marked as being out-of-bounds. In our case we often get this when
   checking for an exception frame because we're checking an arbitrary
   offset into the stack frame.

   See commit 20955746320e ("s390/kasan: avoid false positives during stack
   unwind"), commit bcaf669b4bdb ("arm64: disable kasan when accessing
   frame->fp in unwind_frame"), commit 91e08ab0c851 ("x86/dumpstack:
   Prevent KASAN false positive warnings") and commit 6e22c8366416
   ("tracing, kasan: Silence Kasan warning in check_stack of stack_tracer")

 - Document KASAN in both generic and powerpc docs.
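As a rough sketch of the READ_ONCE_NOCHECK pattern used by the stack walker above: the macro below is a userspace stand-in for the kernel's definition, and the frame layout is purely illustrative.

```c
#include <stdint.h>

/* Stand-in for the kernel macro: a volatile read that, in the kernel,
 * additionally suppresses KASAN instrumentation for this access. */
#define READ_ONCE_NOCHECK(x) (*(const volatile __typeof__(x) *)&(x))

/* Illustrative frame layout, not the real ppc64 ABI structure. */
struct stack_frame {
	struct stack_frame *back_chain;
	uintptr_t lr;	/* saved link register */
};

/* Walk back-chain pointers; each dereference may touch stack memory
 * that KASAN considers out-of-bounds, hence the NOCHECK reads. */
static uintptr_t last_return_addr(struct stack_frame *sp)
{
	uintptr_t lr = 0;

	while (sp) {
		lr = READ_ONCE_NOCHECK(sp->lr);
		sp = READ_ONCE_NOCHECK(sp->back_chain);
	}
	return lr;
}
```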

Background
----------

KASAN support on Book3S is a bit tricky to get right:

 - It would be good to support inline instrumentation so as to be able to
   catch stack issues that cannot be caught with outline mode.

 - Inline instrumentation requires a fixed offset.

 - Book3S runs code in real mode after booting. Most notably a lot of KVM
   runs in real mode, and it would be good to be able to instrument it.

 - Because code runs in real mode after boot, the offset has to point to
   valid memory both in and out of real mode.

[ppc64 mm note: The kernel installs a linear mapping at effective
address c000... onward. This is a one-to-one mapping with physical
memory from ... onward. Because of how memory accesses work on
powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
same memory both with translations on (accessing as an 'effective
address'), and with translations off (accessing as a 'real
address'). This works in both guests and the hypervisor. For more
details, see s5.7 of Book III of version 3 of the ISA, in particular
the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
KASAN implementation currently only supports Radix.]
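As an illustration of the aliasing described in the note above, converting between linear-map effective addresses and real addresses amounts to masking or setting the top bits. The constant mirrors the kernel's PAGE_OFFSET for this region, but the helpers themselves are illustrative, not kernel API.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xc000000000000000UL	/* start of the linear map */

/* Effective (linear map) address to real address: clear the top nibble. */
static uint64_t lin_to_phys(uint64_t ea) { return ea & 0x0fffffffffffffffUL; }

/* Real address to effective address: set the top nibble. */
static uint64_t phys_to_lin(uint64_t pa) { return pa | PAGE_OFFSET; }
```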

One approach is just to give up on inline instrumentation. This way all
checks can be delayed until after everything is set up correctly, and the
address-to-shadow calculations can be overridden. However, the features and
speed boost provided by inline instrumentation are worth trying to do
better.

If _at compile time_ it is known how much contiguous physical memory a
system has, the top 1/8th of the first block of physical memory can be set
aside for the shadow. This is a big hammer and comes with 3 big
consequences:

 - there's no nice way to handle physically discontiguous memory, so only
   the first physical memory block can be used.

 - kernels will simply fail to boot on machines with less memory than
   specified when compiling.

 - kernels running on machines with more memory than specified when
   compiling will simply ignore the extra memory.

Despite the limitations, it can still find bugs,
e.g. http://patchwork.ozlabs.org/patch/1103775/
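The carve-out arithmetic works out as follows, assuming a hypothetical 16 GiB compile-time memory size and generic KASAN's one-shadow-byte-per-8-bytes ratio.

```c
#include <stdint.h>

#define PHYS_MEM_SIZE (16ULL << 30)	/* hypothetical compile-time size */

/* Generic KASAN: one shadow byte covers 8 bytes of memory. */
static uint64_t kasan_shadow_size(uint64_t phys)  { return phys >> 3; }

/* The shadow occupies the top 1/8th of the first contiguous block. */
static uint64_t kasan_shadow_start(uint64_t phys) { return phys - (phys >> 3); }
```

So a kernel built for 16 GiB sets aside 2 GiB of shadow starting at the 14 GiB mark of the first memory block.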

At the moment, this physical memory limit must be set _even for outline
mode_. This may be changed in a later series - a different implementation
could be added for outline mode that dynamically allocates shadow at a
fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/

Suggested-by: Michael Ellerman 
Cc: Balbir Singh  # ppc64 out-of-line radix version
Cc: Christophe Leroy  # ppc32 version
Reviewed-by:  # focussed mainly on Documentation
 and things impacting PPC32
Signed-off-by: Daniel Axtens 
---
 Documentation/dev-tools/kasan.rst|   9 +-
 Documentation/powerpc/kasan.txt  | 112 ++-
 arch/powerpc/Kconfig |   3 +-
 arch/powerpc/Kconfig.debug   |  23 +++-
 arch/powerpc/Makefile|  11 ++
 arch/powerpc/include/asm/book3s/64/hash.h|   4 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |   7 ++
 arch/powerpc/include/asm/book3s/64/radix.h   |   5 +
 arch/powerpc/include/asm/kasan.h |  11 +-
 arch/powerpc/kernel/Makefile |   2 +
 arch/powerpc/kernel/process.c|  16 +--
 

[PATCH v8 3/4] powerpc/mm/kasan: rename kasan_init_32.c to init_32.c

2020-07-01 Thread Daniel Axtens
kasan is already implied by the directory name, so we don't need to
repeat it.

Suggested-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 
---
 arch/powerpc/mm/kasan/Makefile   | 2 +-
 arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)

diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile
index bb1a5408b86b..42fb628a44fd 100644
--- a/arch/powerpc/mm/kasan/Makefile
+++ b/arch/powerpc/mm/kasan/Makefile
@@ -2,6 +2,6 @@
 
 KASAN_SANITIZE := n
 
-obj-$(CONFIG_PPC32)   += kasan_init_32.o
+obj-$(CONFIG_PPC32)   += init_32.o
 obj-$(CONFIG_PPC_8xx)  += 8xx.o
 obj-$(CONFIG_PPC_BOOK3S_32)+= book3s_32.o
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/init_32.c
similarity index 100%
rename from arch/powerpc/mm/kasan/kasan_init_32.c
rename to arch/powerpc/mm/kasan/init_32.c
-- 
2.25.1



[PATCH v8 2/4] kasan: Document support on 32-bit powerpc

2020-07-01 Thread Daniel Axtens
KASAN is supported on 32-bit powerpc and the docs should reflect this.

Document s390 support while we're at it.

Suggested-by: Christophe Leroy 
Reviewed-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 
---
 Documentation/dev-tools/kasan.rst |  7 +--
 Documentation/powerpc/kasan.txt   | 12 
 2 files changed, 17 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/powerpc/kasan.txt

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index c652d740735d..554cbee1d240 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -22,7 +22,8 @@ global variables yet.
 Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later.
 
 Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and
-riscv architectures, and tag-based KASAN is supported only for arm64.
+riscv architectures. It is also supported on 32-bit powerpc kernels. Tag-based
+KASAN is supported only on arm64.
 
 Usage
-----
@@ -255,7 +256,9 @@ CONFIG_KASAN_VMALLOC
 
 
 With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
-cost of greater memory usage. Currently this is only supported on x86.
+cost of greater memory usage. Currently this is supported on x86, s390
+and 32-bit powerpc. It is optional, except on 32-bit powerpc kernels
+with module support, where it is required.
 
 This works by hooking into vmalloc and vmap, and dynamically
 allocating real shadow memory to back the mappings.
diff --git a/Documentation/powerpc/kasan.txt b/Documentation/powerpc/kasan.txt
new file mode 100644
index ..26bb0e8bb18c
--- /dev/null
+++ b/Documentation/powerpc/kasan.txt
@@ -0,0 +1,12 @@
+KASAN is supported on powerpc on 32-bit only.
+
+32 bit support
+==============
+
+KASAN is supported on both hash and nohash MMUs on 32-bit.
+
+The shadow area sits at the top of the kernel virtual memory space above the
+fixmap area and occupies one eighth of the total kernel virtual memory space.
+
+Instrumentation of the vmalloc area is optional, unless built with modules,
+in which case it is required.
-- 
2.25.1



[PATCH v8 0/4] KASAN for powerpc64 radix

2020-07-01 Thread Daniel Axtens
Building on the work of Christophe, Aneesh and Balbir, I've ported
KASAN to 64-bit Book3S kernels running on the Radix MMU.

This provides full inline instrumentation on radix, but does require
that you be able to specify the amount of physically contiguous memory
on the system at compile time. More details in patch 4.

v8 is just a rebase of v7 on a more recent powerpc/merge and a fixup
of a whitespace error.

Module globals still don't work, but that's due to some 'clever'
renaming of a section that the powerpc module loading code does to
avoid more complicated relocations/trampolines rather than anything to
do with KASAN.

Daniel Axtens (4):
  kasan: define and use MAX_PTRS_PER_* for early shadow tables
  kasan: Document support on 32-bit powerpc
  powerpc/mm/kasan: rename kasan_init_32.c to init_32.c
  powerpc: Book3S 64-bit "heavyweight" KASAN support

 Documentation/dev-tools/kasan.rst |   8 +-
 Documentation/powerpc/kasan.txt   | 122 ++
 arch/powerpc/Kconfig  |   3 +-
 arch/powerpc/Kconfig.debug|  23 +++-
 arch/powerpc/Makefile |  11 ++
 arch/powerpc/include/asm/book3s/64/hash.h |   4 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   7 +
 arch/powerpc/include/asm/book3s/64/radix.h|   5 +
 arch/powerpc/include/asm/kasan.h  |  11 +-
 arch/powerpc/kernel/Makefile  |   2 +
 arch/powerpc/kernel/process.c |  16 ++-
 arch/powerpc/kernel/prom.c|  76 ++-
 arch/powerpc/mm/kasan/Makefile|   3 +-
 .../mm/kasan/{kasan_init_32.c => init_32.c}   |   0
 arch/powerpc/mm/kasan/init_book3s_64.c|  73 +++
 arch/powerpc/mm/ptdump/ptdump.c   |  10 +-
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 include/linux/kasan.h |  18 ++-
 mm/kasan/init.c   |   6 +-
 19 files changed, 377 insertions(+), 22 deletions(-)
 create mode 100644 Documentation/powerpc/kasan.txt
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)
 create mode 100644 arch/powerpc/mm/kasan/init_book3s_64.c

-- 
2.25.1



[PATCH v8 1/4] kasan: define and use MAX_PTRS_PER_* for early shadow tables

2020-07-01 Thread Daniel Axtens
powerpc has a variable number of PTRS_PER_*, set at runtime based
on the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer compile-time constants, which
breaks the build.

Define default MAX_PTRS_PER_*s in the same style as MAX_PTRS_PER_P4D.
As KASAN is the only user at the moment, just define them in the kasan
header, and have them default to PTRS_PER_* unless overridden in arch
code.

Suggested-by: Christophe Leroy 
Suggested-by: Balbir Singh 
Reviewed-by: Christophe Leroy 
Reviewed-by: Balbir Singh 
Signed-off-by: Daniel Axtens 
---
 include/linux/kasan.h | 18 +++---
 mm/kasan/init.c   |  6 +++---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 82522e996c76..b6f94952333b 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -14,10 +14,22 @@ struct task_struct;
 #include 
 #include 
 
+#ifndef MAX_PTRS_PER_PTE
+#define MAX_PTRS_PER_PTE PTRS_PER_PTE
+#endif
+
+#ifndef MAX_PTRS_PER_PMD
+#define MAX_PTRS_PER_PMD PTRS_PER_PMD
+#endif
+
+#ifndef MAX_PTRS_PER_PUD
+#define MAX_PTRS_PER_PUD PTRS_PER_PUD
+#endif
+
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
-extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
-extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
-extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
+extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];
+extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
+extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
 extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];
 
 int kasan_populate_early_shadow(const void *shadow_start,
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..42bca3d27db8 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -46,7 +46,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
 static inline bool kasan_pud_table(p4d_t p4d)
 {
return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
@@ -58,7 +58,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
 static inline bool kasan_pmd_table(pud_t pud)
 {
return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
@@ -69,7 +69,7 @@ static inline bool kasan_pmd_table(pud_t pud)
return false;
 }
 #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;
 
 static inline bool kasan_pte_table(pmd_t pmd)
 {
-- 
2.25.1



[PATCH AUTOSEL 5.4 23/40] powerpc/kvm/book3s64: Fix kernel crash with nested kvm & DEBUG_VIRTUAL

2020-07-01 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit c1ed1754f271f6b7acb1bfdc8cfb62220fbed423 ]

With CONFIG_DEBUG_VIRTUAL=y, __pa() checks for addr value and if it's
less than PAGE_OFFSET it leads to a BUG().

  #define __pa(x)						\
  ({								\
	VIRTUAL_BUG_ON((unsigned long)(x) < PAGE_OFFSET);	\
	(unsigned long)(x) & 0x0fffffffffffffffUL;		\
  })

  kernel BUG at arch/powerpc/kvm/book3s_64_mmu_radix.c:43!
  cpu 0x70: Vector: 700 (Program Check) at [c018a2187360]
  pc: c0161b30: __kvmhv_copy_tofrom_guest_radix+0x130/0x1f0
  lr: c0161d5c: kvmhv_copy_from_guest_radix+0x3c/0x80
  ...
  kvmhv_copy_from_guest_radix+0x3c/0x80
  kvmhv_load_from_eaddr+0x48/0xc0
  kvmppc_ld+0x98/0x1e0
  kvmppc_load_last_inst+0x50/0x90
  kvmppc_hv_emulate_mmio+0x288/0x2b0
  kvmppc_book3s_radix_page_fault+0xd8/0x2b0
  kvmppc_book3s_hv_page_fault+0x37c/0x1050
  kvmppc_vcpu_run_hv+0xbb8/0x1080
  kvmppc_vcpu_run+0x34/0x50
  kvm_arch_vcpu_ioctl_run+0x2fc/0x410
  kvm_vcpu_ioctl+0x2b4/0x8f0
  ksys_ioctl+0xf4/0x150
  sys_ioctl+0x28/0x80
  system_call_exception+0x104/0x1d0
  system_call_common+0xe8/0x214

kvmhv_copy_tofrom_guest_radix() uses a NULL value for to/from to
indicate direction of copy.

Avoid calling __pa() if the value is NULL to avoid the BUG().

Signed-off-by: Aneesh Kumar K.V 
[mpe: Massage change log a bit to mention CONFIG_DEBUG_VIRTUAL]
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20200611120159.680284-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 43b56f8f6bebd..da8375437d161 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -38,7 +38,8 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
/* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
if (kvmhv_on_pseries())
return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
- __pa(to), __pa(from), n);
+ (to != NULL) ? __pa(to): 0,
+ (from != NULL) ? __pa(from): 0, n);
 
quadrant = 1;
if (!pid)
-- 
2.25.1



[PATCH AUTOSEL 5.4 22/40] ibmvnic: continue to init in CRQ reset returns H_CLOSED

2020-07-01 Thread Sasha Levin
From: Dany Madden 

[ Upstream commit 8b40eb73509f5704a0e8cd25de0163876299f1a7 ]

Continue the reset path when partner adapter is not ready or H_CLOSED is
returned from reset crq. This patch allows the CRQ init to proceed to
establish a valid CRQ for traffic to flow after reset.

Signed-off-by: Dany Madden 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 5a42ddeecfe50..143a9722ad11a 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1865,13 +1865,18 @@ static int do_reset(struct ibmvnic_adapter *adapter,
release_sub_crqs(adapter, 1);
} else {
rc = ibmvnic_reset_crq(adapter);
-   if (!rc)
+   if (rc == H_CLOSED || rc == H_SUCCESS) {
rc = vio_enable_interrupts(adapter->vdev);
+   if (rc)
+   netdev_err(adapter->netdev,
+  "Reset failed to enable interrupts. rc=%d\n",
+  rc);
+   }
}
 
if (rc) {
netdev_err(adapter->netdev,
-  "Couldn't initialize crq. rc=%d\n", rc);
+  "Reset couldn't initialize crq. rc=%d\n", rc);
goto out;
}
 
-- 
2.25.1



[PATCH AUTOSEL 5.7 29/53] powerpc/kvm/book3s64: Fix kernel crash with nested kvm & DEBUG_VIRTUAL

2020-07-01 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit c1ed1754f271f6b7acb1bfdc8cfb62220fbed423 ]

With CONFIG_DEBUG_VIRTUAL=y, __pa() checks for addr value and if it's
less than PAGE_OFFSET it leads to a BUG().

  #define __pa(x)						\
  ({								\
	VIRTUAL_BUG_ON((unsigned long)(x) < PAGE_OFFSET);	\
	(unsigned long)(x) & 0x0fffffffffffffffUL;		\
  })

  kernel BUG at arch/powerpc/kvm/book3s_64_mmu_radix.c:43!
  cpu 0x70: Vector: 700 (Program Check) at [c018a2187360]
  pc: c0161b30: __kvmhv_copy_tofrom_guest_radix+0x130/0x1f0
  lr: c0161d5c: kvmhv_copy_from_guest_radix+0x3c/0x80
  ...
  kvmhv_copy_from_guest_radix+0x3c/0x80
  kvmhv_load_from_eaddr+0x48/0xc0
  kvmppc_ld+0x98/0x1e0
  kvmppc_load_last_inst+0x50/0x90
  kvmppc_hv_emulate_mmio+0x288/0x2b0
  kvmppc_book3s_radix_page_fault+0xd8/0x2b0
  kvmppc_book3s_hv_page_fault+0x37c/0x1050
  kvmppc_vcpu_run_hv+0xbb8/0x1080
  kvmppc_vcpu_run+0x34/0x50
  kvm_arch_vcpu_ioctl_run+0x2fc/0x410
  kvm_vcpu_ioctl+0x2b4/0x8f0
  ksys_ioctl+0xf4/0x150
  sys_ioctl+0x28/0x80
  system_call_exception+0x104/0x1d0
  system_call_common+0xe8/0x214

kvmhv_copy_tofrom_guest_radix() uses a NULL value for to/from to
indicate direction of copy.

Avoid calling __pa() if the value is NULL to avoid the BUG().

Signed-off-by: Aneesh Kumar K.V 
[mpe: Massage change log a bit to mention CONFIG_DEBUG_VIRTUAL]
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20200611120159.680284-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index bc6c1aa3d0e92..ef40addd52c65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -40,7 +40,8 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
/* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
if (kvmhv_on_pseries())
return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
- __pa(to), __pa(from), n);
+ (to != NULL) ? __pa(to): 0,
+ (from != NULL) ? __pa(from): 0, n);
 
quadrant = 1;
if (!pid)
-- 
2.25.1



[PATCH AUTOSEL 5.7 28/53] ibmvnic: continue to init in CRQ reset returns H_CLOSED

2020-07-01 Thread Sasha Levin
From: Dany Madden 

[ Upstream commit 8b40eb73509f5704a0e8cd25de0163876299f1a7 ]

Continue the reset path when partner adapter is not ready or H_CLOSED is
returned from reset crq. This patch allows the CRQ init to proceed to
establish a valid CRQ for traffic to flow after reset.

Signed-off-by: Dany Madden 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 1b4d04e4474bb..2dbcbdbb0e4ad 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1958,13 +1958,18 @@ static int do_reset(struct ibmvnic_adapter *adapter,
release_sub_crqs(adapter, 1);
} else {
rc = ibmvnic_reset_crq(adapter);
-   if (!rc)
+   if (rc == H_CLOSED || rc == H_SUCCESS) {
rc = vio_enable_interrupts(adapter->vdev);
+   if (rc)
+   netdev_err(adapter->netdev,
+  "Reset failed to enable interrupts. rc=%d\n",
+  rc);
+   }
}
 
if (rc) {
netdev_err(adapter->netdev,
-  "Couldn't initialize crq. rc=%d\n", rc);
+  "Reset couldn't initialize crq. rc=%d\n", rc);
goto out;
}
 
-- 
2.25.1



Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-07-01 Thread Leonardo Bras
On Thu, 2020-07-02 at 10:31 +1000, Alexey Kardashevskiy wrote:
> > In fact, there is a lot of places in this file where it's called direct
> > window. Should I replace everything?
> > Should it be in a separated patch?
> 
> If it looks simple and you write a nice commit log explaining all that
> and why you are not reusing the existing ibm,dma-window property 
> for that - sure, do it :)

Nice, I will do that :)

> (to provide a clue what "reset" will reset to? is there any other
> reason?)

That's the main reason here. 

The way I perceive this, ibm,dma-window should only point to the
default DMA window, which is guaranteed to always be the same, even if
it's destroyed and re-created. So there I see no point destroying /
overwriting it.

On the other hand, I also thought about using a new node name for this
window, but it would be very troublesome and I could see no real gain.

Thanks !



Re: Memory: 880608K/983040K .... 36896K reserved ?

2020-07-01 Thread Michael Ellerman
Joakim Tjernlund  writes:
> I cannot figure out how the xxxK reserved item works in:
>  Memory: 880608K/983040K available (9532K kernel code, 1104K rwdata, 3348K 
> rodata, 1088K init, 1201K bss, 36896K reserved ...

It's calculated as:

(physpages - totalram_pages() - totalcma_pages)

The naming is a bit historical I guess.

But roughly physpages is the total number of pages of RAM we think exist
in the system.

totalram_pages() is the total number of pages that have been freed to
the buddy allocator.

totalcma_pages is pages used by CMA which is probably 0 for you.

So the amount "reserved" is the memory that hasn't been freed to the
buddy allocator by memblock.
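As a quick sketch, the formula reproduces a banner line like the one quoted, assuming 4 KiB pages; the page counts below are hypothetical values chosen to match it.

```c
#include <stdint.h>

#define PAGE_KIB 4ULL	/* assume 4 KiB pages */

/* "reserved" in the boot banner: pages never freed to the buddy
 * allocator (nor accounted to CMA), expressed in KiB. */
static uint64_t reserved_kib(uint64_t physpages, uint64_t totalram_pages,
			     uint64_t totalcma_pages)
{
	return (physpages - totalram_pages - totalcma_pages) * PAGE_KIB;
}
```

For instance, 983040K of RAM is 245760 pages; if 236536 of them were freed to the buddy allocator and none went to CMA, "reserved" comes out at 36896K, matching the banner.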

You should be able to see it in debugfs:

# cat /sys/kernel/debug/memblock/reserved 
   0: 0x..0x02b40e57
   1: 0x02b41000..0x02b413ff
   2: 0x02b5..0x02ba
   3: 0x0a91..0x0e93
   4: 0x0fe8..0x0fe9
   5: 0x0feac000..0x0ffebfff
   6: 0x0ffed400..0x0ffed7ff
   7: 0x0ffeda80..0x0ffeebff
   8: 0x0ffeee80..0x0ffe
   9: 0x0fff0280..0x0fff13ff
   ...

> Is there a way to tune(lower it) this memory?

Some or most of those reserved regions will be things your firmware told
you to reserve, so you need to work out what each region is. They might
be firmware things, or even holes in RAM, you need to dig deeper to find
out what is what.

cheers


Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable

2020-07-01 Thread Leonardo Bras
On Thu, 2020-07-02 at 10:43 +1000, Alexey Kardashevskiy wrote:
> > Or should one stick to #define in this case?
> 
> imho a matter of taste but after some grepping it feels like #define is
> mostly used which does not mean it is a good idea. Keep it enum and see
> if it passed mpe's filter :)

Good idea :)

Thanks !



Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable

2020-07-01 Thread Alexey Kardashevskiy



On 02/07/2020 10:36, Leonardo Bras wrote:
> On Thu, 2020-07-02 at 10:21 +1000, Alexey Kardashevskiy wrote:
>>> enum {
>>>DDW_QUERY_PE_DMA_WIN,
>>>DDW_CREATE_PE_DMA_WIN,
>>>DDW_REMOVE_PE_DMA_WIN,
>>>
>>>DDW_APPLICABLE_SIZE
>>> }
>>> IMO, it looks better than all the defines before.
>>>
>>> What do you think?
>>
>> No, not really, these come from a binary interface so the reader of this
>> cares about absolute numbers and rather wants to see them explicitly.
> 
> Makes sense to me.
> I am still getting experience on where to use enum vs define. Thanks
> for the tip!
> 
> Using something like 
> enum {
>   DDW_QUERY_PE_DMA_WIN = 0,
>   DDW_CREATE_PE_DMA_WIN = 1,
>   DDW_REMOVE_PE_DMA_WIN = 2,
> 
>   DDW_APPLICABLE_SIZE
> };
> 
> would be fine too?


This is fine too.


> Or should one stick to #define in this case?

imho a matter of taste but after some grepping it feels like #define is
mostly used which does not mean it is a good idea. Keep it enum and see
if it passed mpe's filter :)



-- 
Alexey


Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable

2020-07-01 Thread Leonardo Bras
On Thu, 2020-07-02 at 10:21 +1000, Alexey Kardashevskiy wrote:
> > enum {
> >DDW_QUERY_PE_DMA_WIN,
> >DDW_CREATE_PE_DMA_WIN,
> >DDW_REMOVE_PE_DMA_WIN,
> > 
> >DDW_APPLICABLE_SIZE
> > }
> > IMO, it looks better than all the defines before.
> > 
> > What do you think?
> 
> No, not really, these come from a binary interface so the reader of this
> cares about absolute numbers and rather wants to see them explicitly.

Makes sense to me.
I am still getting experience on where to use enum vs define. Thanks
for the tip!

Using something like 
enum {
DDW_QUERY_PE_DMA_WIN = 0,
DDW_CREATE_PE_DMA_WIN = 1,
DDW_REMOVE_PE_DMA_WIN = 2,

DDW_APPLICABLE_SIZE
};

would be fine too?
Or should one stick to #define in this case?

Thank you,



Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-07-01 Thread Alexey Kardashevskiy



On 02/07/2020 09:48, Leonardo Bras wrote:
> On Wed, 2020-07-01 at 16:57 -0300, Leonardo Bras wrote:
>>> It is not necessarily "direct" anymore as the name suggests, you may
>>> want to change that. DMA64_PROPNAME, may be. Thanks,
>>>
>>
>> Yeah, you are right.
>> I will change this for next version, also changing the string name to
>> reflect this.
>>
>> -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
>> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>>
>> Is that ok?
>>
>> Thank you for helping!
> 
> In fact, there is a lot of places in this file where it's called direct
> window. Should I replace everything?
> Should it be in a separated patch?

If it looks simple and you write a nice commit log explaining all that
and why you are not reusing the existing ibm,dma-window property (to
provide a clue what "reset" will reset to? is there any other reason?)
for that - sure, do it :)



-- 
Alexey


Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows

2020-07-01 Thread Leonardo Bras
On Thu, 2020-07-02 at 10:18 +1000, Alexey Kardashevskiy wrote:
> 
> On 02/07/2020 00:04, Leonardo Bras wrote:
> > On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote:
> > > > +#define DDW_EXT_SIZE   0
> > > > +#define DDW_EXT_RESET_DMA_WIN  1
> > > > +#define DDW_EXT_QUERY_OUT_SIZE 2
> > > 
> > > #define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1)
> > > ...
> > > 
> > > 
> > > > +
> > > >  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> > > >  {
> > > > struct iommu_table_group *table_group;
> > > > @@ -339,7 +343,7 @@ struct direct_window {
> > > >  /* Dynamic DMA Window support */
> > > >  struct ddw_query_response {
> > > > u32 windows_available;
> > > > -   u32 largest_available_block;
> > > > +   u64 largest_available_block;
> > > > u32 page_size;
> > > > u32 migration_capable;
> > > >  };
> > > > @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void)
> > > >  machine_arch_initcall(pseries, find_existing_ddw_windows);
> > > >  
> > > >  static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
> > > > -   struct ddw_query_response *query)
> > > > +struct ddw_query_response *query,
> > > > +struct device_node *parent)
> > > >  {
> > > > struct device_node *dn;
> > > > struct pci_dn *pdn;
> > > > -   u32 cfg_addr;
> > > > +   u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1];
> > > 
> > > ... and use DDW_EXT_LAST here.
> > 
> > Because of the growing nature of ddw-extensions, I intentionally let
> > this be (DDW_EXT_QUERY_OUT_SIZE + 1). If we create a DDW_EXT_LAST, it
> > will be incremented in the future if more extensions come to exist.
> > 
> > I mean, I previously saw no reason for allocating space for extensions
> > after the desired one, as they won't be used here.
> 
> Ah, my bad, you're right.
> 
> 
> > > 
> > > > u64 buid;
> > > > -   int ret;
> > > > +   int ret, out_sz;
> > > > +
> > > > +   /*
> > > > +* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule 
> > > > how many
> > > > +* output parameters ibm,query-pe-dma-windows will have, 
> > > > ranging from
> > > > +* 5 to 6.
> > > > +*/
> > > > +
> > > > +   ret = of_property_read_u32_array(parent, "ibm,ddw-extensions",
> > > > +&ddw_ext[0],
> > > > +DDW_EXT_QUERY_OUT_SIZE + 1);
> > 
> > In this case, I made sure not to cross (DDW_EXT_QUERY_OUT_SIZE + 1)
> > while reading the extensions from the property.
> > 
> > What do you think about it? 
> 
> I think you want something like:
> 
> static inline int ddw_read_ext(const struct device_node *np, int extnum,
> u32 *ret)
> {
> return of_property_read_u32_index(np, "ibm,ddw-extensions", extnum + 1, ret);
> }
> 
> These "+1"'s all over the place are confusing.

That's a great idea!

I was not aware it was possible to read a single value[index] directly
from the property, but it makes total sense to use it.

Thank you!



Re: [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW

2020-07-01 Thread Alexey Kardashevskiy



On 02/07/2020 05:48, Leonardo Bras wrote:
> On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote:
>>
>> On 24/06/2020 16:24, Leonardo Bras wrote:
>>> On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
>>> default DMA window for the device, before attempting to configure a DDW,
>>> in order to make the maximum resources available for the next DDW to be
>>> created.
>>>
>>> This is a requirement for some devices to use DDW, given they only
>>> allow one DMA window.
>>
>> Devices never know about these windows; it is purely the PHB's side of
>> things. A device can access any address on the bus; the bus can generate
>> an exception if there is no window behind the address OR some other
>> device's MMIO. We could actually create a second window in addition to
>> the first one and allocate bus addresses from both; we are just simplifying
>> this by merging two separate non-adjacent windows into one.
> 
> That's interesting, I was not aware of this. 
> I will try to improve this commit message with this info.
> Thanks for sharing!
> 
> If setting up a new DDW fails anywhere after the removal of this
>>> default DMA window, it's needed to restore the default DMA window.
>>> For this, an implementation of ibm,reset-pe-dma-windows rtas call is
>>> needed:
>>>
>>> Platforms supporting the DDW option starting with LoPAR level 2.7 implement
>>> ibm,ddw-extensions. The first extension available (index 2) carries the
>>> token for ibm,reset-pe-dma-windows rtas call, which is used to restore
>>> the default DMA window for a device, if it has been deleted.
>>>
>>> It does so by resetting the TCE table allocation for the PE to its
>>> boot time value, available in "ibm,dma-window" device tree node.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 70 ++
>>>  1 file changed, 61 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index a8840d9e1c35..4fcf00016fb1 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>>> return max_addr;
>>>  }
>>>  
>>> +/*
>>> + * Platforms supporting the DDW option starting with LoPAR level 2.7 implement
>>> + * ibm,ddw-extensions, which carries the rtas token for
>>> + * ibm,reset-pe-dma-windows.
>>> + * That rtas-call can be used to restore the default DMA window for the device.
>>> + */
>>> +static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>>> +{
>>> +   int ret;
>>> +   u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1];
>>> +   u64 buid;
>>> +   struct device_node *dn;
>>> +   struct pci_dn *pdn;
>>> +
>>> +   ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions",
>>> +&ddw_ext[0], DDW_EXT_RESET_DMA_WIN + 1);
>>> +   if (ret)
>>> +   return;
>>> +
>>> +   dn = pci_device_to_OF_node(dev);
>>> +   pdn = PCI_DN(dn);
>>> +   buid = pdn->phb->buid;
>>> +   cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
>>> +
>>> +   ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr,
>>> +   BUID_HI(buid), BUID_LO(buid));
>>> +   if (ret)
>>> +   dev_info(&dev->dev,
>>> +"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
>>> +ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid),
>>
>> s/ddw_ext[1]/ddw_ext[DDW_EXT_RESET_DMA_WIN]/
> 
> Good catch! I missed this one.
> 
>>
>>
>>> +ret);
>>> +}
>>> +
>>>  /*
>>>   * If the PE supports dynamic dma windows, and there is space for a table
>>>   * that can map all pages in a linear offset, then setup such a table,
>>> @@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> u64 dma_addr, max_addr;
>>> struct device_node *dn;
>>> u32 ddw_avail[DDW_APPLICABLE_SIZE];
>>> +
>>
>> Unrelated new empty line.
> 
> Fixed!
> 
>>
>>
>>> struct direct_window *window;
>>> -   struct property *win64;
>>> +   struct property *win64, *default_win = NULL, *ddw_ext = NULL;
>>> struct dynamic_dma_window_prop *ddwprop;
>>> struct failed_ddw_pdn *fpdn;
>>>  
>>> @@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> if (ret)
>>> goto out_failed;
>>>  
>>> -   /*
>>> +   /*
>>>  * Query if there is a second window of size to map the
>>>  * whole partition.  Query returns number of windows, largest
>>>  * block assigned to PE (partition endpoint), and two bitmasks
>>> @@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> if (ret != 0)
>>> goto out_failed;
>>>  
>>> +   /*
>>> +* If there is no window available, remove the default DMA window,
>>> +* if it's present. This will make all the resources available to the
>>> +* new DDW window.

Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable

2020-07-01 Thread Alexey Kardashevskiy



On 01/07/2020 23:28, Leonardo Bras wrote:
> On Wed, 2020-07-01 at 18:16 +1000, Alexey Kardashevskiy wrote:
>>
>> On 24/06/2020 16:24, Leonardo Bras wrote:
>>> Create defines to help handling ibm,ddw-applicable values, avoiding
>>> confusion about the index of given operations.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 40 +++---
>>>  1 file changed, 23 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index 6d47b4a3ce39..68d2aa9c71a8 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -39,6 +39,11 @@
>>>  
>>>  #include "pseries.h"
>>>  
>>> +#define DDW_QUERY_PE_DMA_WIN   0
>>> +#define DDW_CREATE_PE_DMA_WIN  1
>>> +#define DDW_REMOVE_PE_DMA_WIN  2
>>> +#define DDW_APPLICABLE_SIZE3
>>
>> #define DDW_APPLICABLE_SIZE  (DDW_REMOVE_PE_DMA_WIN + 1)
>>
>> thanks,
> 
> Thanks for the feedback!
> About this (and patch #2), would it be better to use enum ?
> enum {
>   DDW_QUERY_PE_DMA_WIN,
>   DDW_CREATE_PE_DMA_WIN,
>   DDW_REMOVE_PE_DMA_WIN,
> 
>   DDW_APPLICABLE_SIZE
> }
> IMO, it looks better than all the defines before.
> 
> What do you think?

No, not really, these come from a binary interface so the reader of this
cares about absolute numbers and rather wants to see them explicitly.


-- 
Alexey
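For readers skimming the defines-vs-enum exchange, the point of the named word offsets (and of deriving the array size from the last index, as suggested) can be sketched in plain C; the token values below are made up for illustration:

```c
#include <assert.h>

/* Word offsets of the rtas tokens inside "ibm,ddw-applicable",
 * as in the patch under review. */
#define DDW_QUERY_PE_DMA_WIN   0
#define DDW_CREATE_PE_DMA_WIN  1
#define DDW_REMOVE_PE_DMA_WIN  2
/* Deriving the size from the last index, per Alexey's suggestion. */
#define DDW_APPLICABLE_SIZE    (DDW_REMOVE_PE_DMA_WIN + 1)

/* Pick one rtas token out of the property array by its named offset. */
static unsigned int ddw_token(const unsigned int avail[DDW_APPLICABLE_SIZE],
                              int which)
{
    return avail[which];
}
```

Keeping the values as explicit defines (rather than an enum) preserves the visible correspondence to the binary interface, which is Alexey's argument above.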


Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows

2020-07-01 Thread Alexey Kardashevskiy



On 02/07/2020 00:04, Leonardo Bras wrote:
> On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote:
>>
>>> +#define DDW_EXT_SIZE   0
>>> +#define DDW_EXT_RESET_DMA_WIN  1
>>> +#define DDW_EXT_QUERY_OUT_SIZE 2
>>
>> #define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1)
>> ...
>>
>>
>>> +
>>>  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>>>  {
>>> struct iommu_table_group *table_group;
>>> @@ -339,7 +343,7 @@ struct direct_window {
>>>  /* Dynamic DMA Window support */
>>>  struct ddw_query_response {
>>> u32 windows_available;
>>> -   u32 largest_available_block;
>>> +   u64 largest_available_block;
>>> u32 page_size;
>>> u32 migration_capable;
>>>  };
>>> @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void)
>>>  machine_arch_initcall(pseries, find_existing_ddw_windows);
>>>  
>>>  static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
>>> -   struct ddw_query_response *query)
>>> +struct ddw_query_response *query,
>>> +struct device_node *parent)
>>>  {
>>> struct device_node *dn;
>>> struct pci_dn *pdn;
>>> -   u32 cfg_addr;
>>> +   u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1];
>>
>> ... and use DDW_EXT_LAST here.
> 
> Because of the growing nature of ddw-extensions, I intentionally let
> this be (DDW_EXT_QUERY_OUT_SIZE + 1). If we create a DDW_EXT_LAST, it
> will be incremented in the future if more extensions come to exist.
> 
> I mean, I previously saw no reason for allocating space for extensions
> after the desired one, as they won't be used here.

Ah, my bad, you're right.


> 
>>
>>
>>> u64 buid;
>>> -   int ret;
>>> +   int ret, out_sz;
>>> +
>>> +   /*
>>> +* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many
>>> +* output parameters ibm,query-pe-dma-windows will have, ranging from
>>> +* 5 to 6.
>>> +*/
>>> +
>>> +   ret = of_property_read_u32_array(parent, "ibm,ddw-extensions",
>>> +&ddw_ext[0],
>>> +DDW_EXT_QUERY_OUT_SIZE + 1);
> 
> In this case, I made sure not to cross (DDW_EXT_QUERY_OUT_SIZE + 1)
> while reading the extensions from the property.
> 
> What do you think about it? 

I think you want something like:

static inline int ddw_read_ext(const struct device_node *np, int extnum,
u32 *ret)
{
return of_property_read_u32_index(np, "ibm,ddw-extensions", extnum + 1, ret);
}

These "+1"'s all over the place are confusing.


-- 
Alexey


Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-07-01 Thread Leonardo Bras
On Wed, 2020-07-01 at 16:57 -0300, Leonardo Bras wrote:
> > It is not necessarily "direct" anymore as the name suggests, you may
> > want to change that. DMA64_PROPNAME, may be. Thanks,
> > 
> 
> Yeah, you are right.
> I will change this for next version, also changing the string name to
> reflect this.
> 
> -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> 
> Is that ok?
> 
> Thank you for helping!

In fact, there are a lot of places in this file where it's called a direct
window. Should I replace everything?
Should it be in a separate patch?

Best regards,
Leonardo



Re: [PATCH] ASoC: fsl_sai: Refine regcache usage with pm runtime

2020-07-01 Thread Mark Brown
On Mon, 29 Jun 2020 14:42:33 +0800, Shengjiu Wang wrote:
> When there is dedicated power domain bound with device, after probing
> the power will be disabled, then registers are not accessible in
> fsl_sai_dai_probe(), so regcache only need to be enabled in end of
> probe() and regcache_mark_dirty should be moved to pm runtime resume
> callback function.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_sai: Refine regcache usage with pm runtime
  commit: d8d702e19e997cf3f172487e0659d0e68aa5ede5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


Re: [PATCH v3] ASoC: fsl_asrc: Add an option to select internal ratio mode

2020-07-01 Thread Mark Brown
On Tue, 30 Jun 2020 21:56:07 +0800, Shengjiu Wang wrote:
> The ASRC not only supports ideal ratio mode, but also supports
> internal ratio mode.
> 
> For internal rato mode, the rate of clock source should be divided
> with no remainder by sample rate, otherwise there is sound
> distortion.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_asrc: Add an option to select internal ratio mode
  commit: d0250cf4f2abfbea64ed247230f08f5ae23979f0

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


Re: [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00

2020-07-01 Thread Leonardo Bras
On Wed, 2020-07-01 at 18:04 +1000, Alexey Kardashevskiy wrote:
> 
> On 27/06/2020 03:46, Leonardo Bras wrote:
> > On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote:
> > > As of today, enable_ddw() will return a non-null DMA address if the
> > > created DDW maps the whole partition. If the address is valid,
> > > iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled.
> > > 
> > > This can cause some trouble if the DDW happens to start at 0x00.
> > > 
> > > Instead if checking if the address is non-null, check directly if
> > > the DDW maps the whole partition, so it can bypass iommu.
> > > 
> > > Signed-off-by: Leonardo Bras 
> > 
> > This patch has a bug in it. I will rework it soon.
> 
> I'd rather suggest this:
> 
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20180725095032.2196-2-...@ozlabs.ru/
> 
> Although it does not look like you are actually going to have windows
> starting at 0. Thanks,

Yeah, agree. 
I am thinking of dropping this one, as I don't see much good to be done
here.

Thank you!



Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-07-01 Thread Leonardo Bras
On Wed, 2020-07-01 at 18:16 +1000, Alexey Kardashevskiy wrote:
> 
> On 24/06/2020 16:24, Leonardo Bras wrote:
> > As of today, if a DDW is created and can't map the whole partition, it's
> > removed and the default DMA window "ibm,dma-window" is used instead.
> > 
> > Usually this DDW is bigger than the default DMA window, so it would be
> > better to make use of it instead.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 28 +-
> >  1 file changed, 19 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 4fcf00016fb1..2d217cda4075 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -685,7 +685,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
> > struct iommu_table *tbl;
> > struct device_node *dn, *pdn;
> > struct pci_dn *ppci;
> > -   const __be32 *dma_window = NULL;
> > +   const __be32 *dma_window = NULL, *alt_dma_window = NULL;
> >  
> > dn = pci_bus_to_OF_node(bus);
> >  
> > @@ -699,8 +699,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
> > break;
> > }
> >  
> > +   /* If there is a DDW available, use it instead */
> > +   alt_dma_window = of_get_property(pdn, DIRECT64_PROPNAME, NULL);
> 
> It is not necessarily "direct" anymore as the name suggests, you may
> want to change that. DMA64_PROPNAME, may be. Thanks,
> 

Yeah, you are right.
I will change this for next version, also changing the string name to
reflect this.

-#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"

Is that ok?

Thank you for helping!




Re: [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW

2020-07-01 Thread Leonardo Bras
On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote:
> 
> On 24/06/2020 16:24, Leonardo Bras wrote:
> > On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
> > default DMA window for the device, before attempting to configure a DDW,
> > in order to make the maximum resources available for the next DDW to be
> > created.
> > 
> > This is a requirement for some devices to use DDW, given they only
> > allow one DMA window.
> 
> Devices never know about these windows; it is purely the PHB's side of
> things. A device can access any address on the bus; the bus can generate
> an exception if there is no window behind the address OR some other
> device's MMIO. We could actually create a second window in addition to
> the first one and allocate bus addresses from both; we are just simplifying
> this by merging two separate non-adjacent windows into one.

That's interesting, I was not aware of this. 
I will try to improve this commit message with this info.
Thanks for sharing!

> > If setting up a new DDW fails anywhere after the removal of this
> > default DMA window, it's needed to restore the default DMA window.
> > For this, an implementation of ibm,reset-pe-dma-windows rtas call is
> > needed:
> > 
> > Platforms supporting the DDW option starting with LoPAR level 2.7 implement
> > ibm,ddw-extensions. The first extension available (index 2) carries the
> > token for ibm,reset-pe-dma-windows rtas call, which is used to restore
> > the default DMA window for a device, if it has been deleted.
> > 
> > It does so by resetting the TCE table allocation for the PE to its
> > boot time value, available in "ibm,dma-window" device tree node.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 70 ++
> >  1 file changed, 61 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index a8840d9e1c35..4fcf00016fb1 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void)
> > return max_addr;
> >  }
> >  
> > +/*
> > + * Platforms supporting the DDW option starting with LoPAR level 2.7 implement
> > + * ibm,ddw-extensions, which carries the rtas token for
> > + * ibm,reset-pe-dma-windows.
> > + * That rtas-call can be used to restore the default DMA window for the device.
> > + */
> > +static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > +{
> > +   int ret;
> > +   u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1];
> > +   u64 buid;
> > +   struct device_node *dn;
> > +   struct pci_dn *pdn;
> > +
> > +   ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions",
> > +&ddw_ext[0], DDW_EXT_RESET_DMA_WIN + 1);
> > +   if (ret)
> > +   return;
> > +
> > +   dn = pci_device_to_OF_node(dev);
> > +   pdn = PCI_DN(dn);
> > +   buid = pdn->phb->buid;
> > +   cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
> > +
> > +   ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr,
> > +   BUID_HI(buid), BUID_LO(buid));
> > +   if (ret)
> > +   dev_info(&dev->dev,
> > +"ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
> > +ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid),
> 
> s/ddw_ext[1]/ddw_ext[DDW_EXT_RESET_DMA_WIN]/

Good catch! I missed this one.

> 
> 
> > +ret);
> > +}
> > +
> >  /*
> >   * If the PE supports dynamic dma windows, and there is space for a table
> >   * that can map all pages in a linear offset, then setup such a table,
> > @@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > u64 dma_addr, max_addr;
> > struct device_node *dn;
> > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > +
> 
> Unrelated new empty line.

Fixed!

> 
> 
> > struct direct_window *window;
> > -   struct property *win64;
> > +   struct property *win64, *default_win = NULL, *ddw_ext = NULL;
> > struct dynamic_dma_window_prop *ddwprop;
> > struct failed_ddw_pdn *fpdn;
> >  
> > @@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > if (ret)
> > goto out_failed;
> >  
> > -   /*
> > +   /*
> >  * Query if there is a second window of size to map the
> >  * whole partition.  Query returns number of windows, largest
> >  * block assigned to PE (partition endpoint), and two bitmasks
> > @@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > if (ret != 0)
> > goto out_failed;
> >  
> > +   /*
> > +* If there is no window available, remove the default DMA window,
> > +* if it's present. This will make all the resources available to the
> > +* new DDW window.
> 

Memory: 880608K/983040K .... 36896K reserved ?

2020-07-01 Thread Joakim Tjernlund
I cannot figure out how the xxxK reserved item works in:
 Memory: 880608K/983040K available (9532K kernel code, 1104K rwdata, 3348K rodata, 1088K init, 1201K bss, 36896K reserved ...

Is there a way to tune (lower) this reserved memory?

 Jocke


Re: [PATCH 01/11] kexec_file: allow archs to handle special regions while locating memory hole

2020-07-01 Thread Hari Bathini



On 01/07/20 1:16 pm, Dave Young wrote:
> On 06/29/20 at 05:26pm, Hari Bathini wrote:
>> Hi Petr,
>>
>> On 29/06/20 5:09 pm, Petr Tesarik wrote:
>>> Hi Hari,
>>>
>>> is there any good reason to add two more functions with a very similar
>>> name to an existing function? AFAICS all you need is a way to call a
>>> PPC64-specific function from within kexec_add_buffer (PATCH 4/11), so
>>> you could add something like this:
>>>
>>> int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
>>> {
>>> return 0;
>>> }
>>>
>>> Call this function from kexec_add_buffer where appropriate and then
>>> override it for PPC64 (it roughly corresponds to your
>>> kexec_locate_mem_hole_ppc64() from PATCH 4/11).
>>>
>>> FWIW it would make it easier for me to follow the resulting code.
>>
>> Right, Petr.
>>
>> I was trying out a few things before I ended up with what I sent here.
>> But yeah.. I did realize arch_kexec_locate_mem_hole() would have been better
>> after sending out v1. Will take care of that in v2.
> 
> Another way is to use an arch-private function to locate the mem hole, then
> set kbuf->mem, and then call kexec_add_buffer; it will skip the common
> locate-hole function.

Dave, I did think about it. But there are a couple of places this can get
tricky. One is ima_add_kexec_buffer() and the other is kexec_elf_load().
These call sites could be updated to set kbuf->mem before kexec_add_buffer().
But the current approach seemed like the better option for it creates a
single point of control in setting up segment buffers and also, makes adding
any new segments simpler, arch-specific segments or otherwise.

Thanks
Hari
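Petr's arch_kexec_locate_mem_hole() suggestion relies on the weak-symbol override pattern: generic code supplies a do-nothing weak default, and an arch object file can ship a strong definition that replaces it at link time. A minimal user-space sketch of that pattern (the struct, the fallback address, and the function bodies are stand-ins, not the kexec API):

```c
#include <assert.h>

/* Reduced stand-in for the kernel's struct kexec_buf. */
struct kexec_buf {
    unsigned long mem;      /* 0 = buffer location not decided yet */
};

/*
 * Generic weak default: no arch opinion, return 0 and let the common
 * locate-hole logic run.  An arch object file (e.g. a PPC64 one) would
 * define a strong arch_kexec_locate_mem_hole() and win at link time.
 */
int __attribute__((weak)) arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
{
    (void)kbuf;
    return 0;
}

/* Stand-in for kexec_add_buffer(): ask the arch first, then fall back. */
static int kexec_add_buffer(struct kexec_buf *kbuf)
{
    int ret = arch_kexec_locate_mem_hole(kbuf);

    if (ret)
        return ret;
    if (!kbuf->mem)
        kbuf->mem = 0x2000;  /* pretend the common logic found a hole */
    return 0;
}
```

The single hook point is what gives the "single point of control" Hari describes: callers like ima_add_kexec_buffer() never need to set kbuf->mem themselves.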


Re: [PATCH 10/20] dm: stop using ->queuedata

2020-07-01 Thread Mike Snitzer
On Wed, Jul 01 2020 at  4:59am -0400,
Christoph Hellwig  wrote:

> Instead of setting up the queuedata as well just use one private data
> field.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Mike Snitzer 



Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-01 Thread Hari Bathini



On 01/07/20 1:10 pm, Dave Young wrote:
> Hi Hari,
> On 06/27/20 at 12:35am, Hari Bathini wrote:
>> crashkernel region could have an overlap with special memory regions
>> like  opal, rtas, tce-table & such. These regions are referred to as
>> exclude memory ranges. Set up these ranges during image probe in order
>> to avoid them while finding the buffer for different kdump segments.
>> Implement kexec_locate_mem_hole_ppc64() that locates a memory hole
>> accounting for these ranges. Also, override arch_kexec_add_buffer()
>> to locate a memory hole & later call __kexec_add_buffer() function
>> with kbuf->mem set to skip the generic locate memory hole lookup.
>>
>> Signed-off-by: Hari Bathini 
>> ---
>>  arch/powerpc/include/asm/crashdump-ppc64.h |   10 +
>>  arch/powerpc/include/asm/kexec.h   |7 -
>>  arch/powerpc/kexec/elf_64.c|7 +
>>  arch/powerpc/kexec/file_load_64.c  |  292 
>> 
>>  4 files changed, 312 insertions(+), 4 deletions(-)
>>  create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h
>>
> [snip]
>>  /**
>> + * get_exclude_memory_ranges - Get exclude memory ranges. This list includes
>> + * regions like opal/rtas, tce-table, initrd,
>> + * kernel, htab which should be avoided while
>> + * setting up kexec load segments.
>> + * @mem_ranges: Range list to add the memory ranges to.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
>> +{
>> +int ret;
>> +
>> +ret = add_tce_mem_ranges(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_initrd_mem_range(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_htab_mem_range(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_kernel_mem_range(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_rtas_mem_range(mem_ranges, false);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_opal_mem_range(mem_ranges, false);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_reserved_ranges(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +/* exclude memory ranges should be sorted for easy lookup */
>> +sort_memory_ranges(*mem_ranges);
>> +out:
>> +if (ret)
>> +pr_err("Failed to setup exclude memory ranges\n");
>> +return ret;
>> +}
> 
> I'm confused about the "overlap with crashkernel memory", does that mean
> those normal kernel used memory could be put in crashkernel reserved

There are regions that could overlap with crashkernel region but they are
not normal kernel used memory though. These are regions that kernel and/or
f/w chose to place at a particular address for real mode accessibility
and/or memory layout between kernel & f/w kind of thing.

> memory range?  If so why can't just skip those areas while crashkernel
> doing the reservation?

The crashkernel region must lie in the first memory block for it to be
accessible in real mode. Accommodating this requirement while addressing the
other requirements leads to something like what we have now: a list of
possible special memory regions in the crashkernel region to take care of.

I have plans to split the crashkernel region into low & high, to have
exclusive regions for crashkernel, even if that means having two of them. But
that is for another day, with its own set of complexities to deal with...

Thanks
Hari
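The "exclude memory ranges should be sorted for easy lookup" comment in the quoted patch boils down to an overlap test along these lines. This is only a sketch; the range struct and function names are illustrative, not the series' actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative range; end is exclusive. */
struct mem_range {
    unsigned long start, end;
};

/*
 * Given exclude ranges sorted by start address, test whether a candidate
 * buffer [base, base + size) overlaps any of them.  Sorting lets the
 * scan stop at the first range that begins past the candidate's end.
 */
static int overlaps_exclude(const struct mem_range *ranges, size_t n,
                            unsigned long base, unsigned long size)
{
    for (size_t i = 0; i < n; i++) {
        if (ranges[i].start >= base + size)
            break;              /* sorted: nothing later can overlap */
        if (ranges[i].end > base)
            return 1;
    }
    return 0;
}
```

A locate-mem-hole implementation would call this for each candidate address and skip forward past the offending exclude range on a hit.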


Re: [PATCH 16/20] block: move ->make_request_fn to struct block_device_operations

2020-07-01 Thread Dan Williams
On Wed, Jul 1, 2020 at 2:01 AM Christoph Hellwig  wrote:
>
> The make_request_fn is a little weird in that it sits directly in
> struct request_queue instead of an operation vector.  Replace it with
> a block_device_operations method called submit_bio (which describes much
> better what it does).  Also remove the request_queue argument to it, as
> the queue can be derived pretty trivially from the bio.
>
> Signed-off-by: Christoph Hellwig 
> ---
[..]
>  drivers/nvdimm/blk.c  |  5 +-
>  drivers/nvdimm/btt.c  |  5 +-
>  drivers/nvdimm/pmem.c |  5 +-

For drivers/nvdimm

Acked-by: Dan Williams 


Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows

2020-07-01 Thread Leonardo Bras
On Wed, 2020-07-01 at 18:17 +1000, Alexey Kardashevskiy wrote:
> 
> > +#define DDW_EXT_SIZE   0
> > +#define DDW_EXT_RESET_DMA_WIN  1
> > +#define DDW_EXT_QUERY_OUT_SIZE 2
> 
> #define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1)
> ...
> 
> 
> > +
> >  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> >  {
> > struct iommu_table_group *table_group;
> > @@ -339,7 +343,7 @@ struct direct_window {
> >  /* Dynamic DMA Window support */
> >  struct ddw_query_response {
> > u32 windows_available;
> > -   u32 largest_available_block;
> > +   u64 largest_available_block;
> > u32 page_size;
> > u32 migration_capable;
> >  };
> > @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void)
> >  machine_arch_initcall(pseries, find_existing_ddw_windows);
> >  
> >  static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
> > -   struct ddw_query_response *query)
> > +struct ddw_query_response *query,
> > +struct device_node *parent)
> >  {
> > struct device_node *dn;
> > struct pci_dn *pdn;
> > -   u32 cfg_addr;
> > +   u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1];
> 
> ... and use DDW_EXT_LAST here.

Because of the growing nature of ddw-extensions, I intentionally let
this be (DDW_EXT_QUERY_OUT_SIZE + 1). If we create a DDW_EXT_LAST, it
will be incremented in the future if more extensions come to exist.

I mean, I previously saw no reason for allocating space for extensions
after the desired one, as they won't be used here.

> 
> 
> > u64 buid;
> > -   int ret;
> > +   int ret, out_sz;
> > +
> > +   /*
> > +* From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many
> > +* output parameters ibm,query-pe-dma-windows will have, ranging from
> > +* 5 to 6.
> > +*/
> > +
> > +   ret = of_property_read_u32_array(parent, "ibm,ddw-extensions",
> > +&ddw_ext[0],
> > +DDW_EXT_QUERY_OUT_SIZE + 1);

In this case, I made sure not to cross (DDW_EXT_QUERY_OUT_SIZE + 1)
while reading the extensions from the property.

What do you think about it? 

Best regards,
Leonardo 
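The 5-vs-6 output discussion above can be illustrated with a small parser sketch. The field order follows the quoted ddw_query_response struct, but the names and the cell layout (largest block spanning two 32-bit cells, high word first, with the rtas status word excluded) are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the quoted struct: largest_available_block widened to u64. */
struct ddw_query_response {
    uint32_t windows_available;
    uint64_t largest_available_block;
    uint32_t page_size;
    uint32_t migration_capable;
};

/*
 * Parse the output cells of ibm,query-pe-dma-windows.  In the extended
 * (6-output) form the largest-block value spans two 32-bit cells,
 * high word first; in the older form it is a single cell.
 */
static void parse_query_out(const uint32_t *out, int extended,
                            struct ddw_query_response *q)
{
    int i = 0;

    q->windows_available = out[i++];
    if (extended) {
        q->largest_available_block = ((uint64_t)out[i] << 32) | out[i + 1];
        i += 2;
    } else {
        q->largest_available_block = out[i++];
    }
    q->page_size = out[i++];
    q->migration_capable = out[i];
}
```

Selecting between the two forms from extension index 3, as the quoted comment describes, then reduces to choosing the `extended` flag before the rtas call.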



Re: rename ->make_request_fn and move it to the block_device_operations v2

2020-07-01 Thread Jens Axboe
On 7/1/20 2:59 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> this series moves the make_request_fn method into block_device_operations
> with the much more descriptive ->submit_bio name.  It then also gives
> generic_make_request a more descriptive name, and further optimize the
> path to issue to blk-mq, removing the need for the direct_make_request
> bypass.

Thanks, I'll try this again.

-- 
Jens Axboe



[PATCH v2 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-07-01 Thread Vaibhav Jain
We add support for reporting the 'fuel-gauge' NVDIMM metric via the
PAPR_PDSM_HEALTH pdsm payload. The 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS hcall.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by the
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocates an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'

Signed-off-by: Vaibhav Jain 
---
Changelog:

v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_psdm_health() [ Ira ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 49 +++
 2 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
 #define PAPR_PDSM_DIMM_CRITICAL  2
 #define PAPR_PDSM_DIMM_FATAL 3
 
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
  * dimm_locked : Contents of the dimm cant be modified until CEC reboot
  * dimm_encrypted  : Contents of dimm are encrypted.
  * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_
+ * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
  */
 struct nd_papr_pdsm_health {
union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
__u8 dimm_locked;
__u8 dimm_encrypted;
__u16 dimm_health;
+
+   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+   __u16 dimm_fuel_gauge;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index bde2433822b2..b02bed8475f9 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -498,6 +498,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned 
int cmd, void *buf,
return 0;
 }
 
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+   union nd_pdsm_payload *payload)
+{
+   int rc, size;
+   u64 statval;
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+
+   /* Silently fail if fetching performance metrics isn't supported */
+   if (!p->stat_buffer_len)
+   return 0;
+
+   /* Allocate a request buffer large enough to hold one performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENOMEM;
+
+   stat = &stats->scm_statistic[0];
+   memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
+   stat->stat_val = 0;
+
+   /* Fetch the fuel gauge and populate it in payload */
+   rc = drc_pmem_query_stats(p, stats, size, 1, NULL);
+   if (rc) {
+   dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+   goto free_stats;
+   }
+
+   statval = be64_to_cpu(stat->stat_val);
+   dev_dbg(&p->pdev->dev,
+   "Fetched fuel-gauge %llu", statval);
+   payload->health.extension_flags |=
+ PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   payload->health.dimm_fuel_gauge = statval;
+
+   rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+   kfree(stats);
+   return rc;
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -538,6 +583,10 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
/* struct populated hence can release the mutex now */
	mutex_unlock(&p->health_mutex);

[PATCH v2 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP

2020-07-01 Thread Vaibhav Jain
Update papr_scm.c to query dimm performance statistics from PHYP via
H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provides sysfs ABI documentation for the stats being reported and
their meanings.

At NVDIMM probe time, in papr_scm_nvdimm_init(), a special variant of
the H_SCM_PERFORMANCE_STATS hcall is issued to check whether collection
of performance statistics is supported. On success, PHYP returns the
maximum buffer length needed to read all performance stats. This
returned value is stored in the per-nvdimm attribute
'stat_buffer_len'.

The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.

The sysfs access function perf_stats_show() uses the value of
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally, the statistics reported in
the buffer are formatted into the sysfs access function's output
buffer.

Signed-off-by: Vaibhav Jain 
---
Changelog:

v2:
* Fixed a bug in drc_pmem_query_stats() that was passing a NULL pointer
to virt_to_pfn() [ Ira ]
* Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
to use big-endian types. [ Aneesh ]
* s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
* s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
* Conversion from big endian to CPU endian now happens later rather
than just after the value is fetched from PHYP.
* Changed a log statement to unambiguously report dimm performance
stats are not available for the given nvdimm [ Ira ]
* Restructured some code to handle error case first [ Ira ]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 
 arch/powerpc/platforms/pseries/papr_scm.c | 134 ++
 2 files changed, 161 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 5b10d036a8d4..c1a67275c43f 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -25,3 +25,30 @@ Description:
  NVDIMM have been scrubbed.
* "locked"  : Indicating that NVDIMM contents cant
  be modified until next power cycle.
+
+What:  /sys/bus/nd/devices/nmemX/papr/perf_stats
+Date:  May, 2020
+KernelVersion: v5.9
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:
+   (RO) Report various performance stats related to papr-scm NVDIMM
+   device.  Each stat is reported on a new line with each line
+   composed of a stat-identifier followed by its value. Below are
+   currently known dimm performance stats which are reported:
+
+   * "CtlResCt" : Controller Reset Count
+   * "CtlResTm" : Controller Reset Elapsed Time
+   * "PonSecs " : Power-on Seconds
+   * "MemLife " : Life Remaining
+   * "CritRscU" : Critical Resource Utilization
+   * "HostLCnt" : Host Load Count
+   * "HostSCnt" : Host Store Count
+   * "HostSDur" : Host Store Duration
+   * "HostLDur" : Host Load Duration
+   * "MedRCnt " : Media Read Count
+   * "MedWCnt " : Media Write Count
+   * "MedRDur " : Media Read Duration
+   * "MedWDur " : Media Write Duration
+   * "CchRHCnt" : Cache Read Hit Count
+   * "CchWHCnt" : Cache Write Hit Count
+   * "FastWCnt" : Fast Write Count
\ No newline at end of file
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09f..bde2433822b2 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -62,6 +62,26 @@
PAPR_PMEM_HEALTH_FATAL |\
PAPR_PMEM_HEALTH_UNHEALTHY)
 
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+#define PAPR_SCM_PERF_STATS_VERSION 0x1
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+   u8 stat_id[8];
+   __be64 stat_val;
+} __packed;
+
+/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
+struct papr_scm_perf_stats {
+   u8 eye_catcher[8];
+   /* Should be PAPR_SCM_PERF_STATS_VERSION */
+   __be32 stats_version;
+   /* Number of stats following */
+   __be32 num_statistics;
+   /* zero or more performance metrics */
+   struct papr_scm_perf_stat scm_statistic[];
+} __packed;
+
 /* private struct associated with each region */
 struct papr_scm_priv {
struct 

[PATCH v2 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric

2020-07-01 Thread Vaibhav Jain
Changes since v1 [1]:

* Minor restructuring of code as suggested by Ira
* Renaming of a few members of 'struct papr_scm_perf_[stat|stats]'
* Fixed a bug where a NULL pointer was potentially passed to
virt_to_phys().
* Using big-endian types rather than CPU-native types to receive data
from PHYP in 'struct papr_scm_perf_[stat|stats]'
* Some minor log message improvements.

[1] 
https://lore.kernel.org/linux-nvdimm/20200622042451.22448-1-vaib...@linux.ibm.com
---

This small patchset implements kernel side support for reporting
'life_used_percentage' metric in NDCTL with dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL side changes [2], the
output should look like:

$ sudo ndctl list -DH
[
  {
"dev":"nmem0",
"health":{
  "health_state":"ok",
  "life_used_percentage":0,
  "shutdown_state":"clean"
}
  }
]

PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. The 'fuel_gauge' metric indicates the usable life remaining
of an NVDIMM, expressed as a percentage, so 'life_used_percentage' can
be calculated as 'life_used_percentage = 100 - fuel_gauge'.

Structure of the patchset
=========================
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.

Second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload to add a new
field named 'dimm_fuel_gauge' to it.

References
==========
[2]
https://github.com/vaibhav92/ndctl/tree/papr_scm_health_v13_run_guage

Vaibhav Jain (2):
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h |   9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 183 ++
 3 files changed, 219 insertions(+)

-- 
2.26.2



Re: [PATCH v2 30/30] misc: cxl: flash: Remove unused pointer

2020-07-01 Thread Lee Jones
On Wed, 01 Jul 2020, Greg KH wrote:

> On Wed, Jul 01, 2020 at 09:31:18AM +0100, Lee Jones wrote:
> > The DRC index pointer is updated on an OPCODE_ADD, but never
> > actually read.  Remove the unused pointer and shift up OPCODE_ADD
> > to group with OPCODE_DELETE which also provides a noop.
> > 
> > Fixes the following W=1 kernel build warning:
> > 
> >  drivers/misc/cxl/flash.c: In function ‘update_devicetree’:
> >  drivers/misc/cxl/flash.c:178:16: warning: variable ‘drc_index’ set but not 
> > used [-Wunused-but-set-variable]
> >  178 | __be32 *data, drc_index, phandle;
> >  | ^
> > 
> > Cc: Frederic Barrat 
> > Cc: Andrew Donnellan 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Lee Jones 
> > ---
> >  drivers/misc/cxl/flash.c | 7 ++-
> >  1 file changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c
> > index cb9cca35a2263..24e3dfcc91a74 100644
> > --- a/drivers/misc/cxl/flash.c
> > +++ b/drivers/misc/cxl/flash.c
> > @@ -175,7 +175,7 @@ static int update_devicetree(struct cxl *adapter, s32 
> > scope)
> > struct update_nodes_workarea *unwa;
> > u32 action, node_count;
> > int token, rc, i;
> > -   __be32 *data, drc_index, phandle;
> > +   __be32 *data, phandle;
> > char *buf;
> >  
> > token = rtas_token("ibm,update-nodes");
> > @@ -206,15 +206,12 @@ static int update_devicetree(struct cxl *adapter, s32 
> > scope)
> >  
> > switch (action) {
> > case OPCODE_DELETE:
> > +   case OPCODE_ADD:
> > /* nothing to do */
> > break;
> > case OPCODE_UPDATE:
> > update_node(phandle, scope);
> > break;
> > -   case OPCODE_ADD:
> > -   /* nothing to do, just move pointer */
> > -   drc_index = *data++;
> > -   break;
> 
> I think this is not correct, as *data++ changes the value there, and so
> this changes the logic of the code.

Great spot.

Looks like I overlooked that the pointer itself is being incremented.

> Dropping this one from my queue.

Sounds good.

I'll fix-up and send this separately at a later date.

-- 
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable

2020-07-01 Thread Leonardo Bras
On Wed, 2020-07-01 at 18:16 +1000, Alexey Kardashevskiy wrote:
> 
> On 24/06/2020 16:24, Leonardo Bras wrote:
> > Create defines to help handling ibm,ddw-applicable values, avoiding
> > confusion about the index of given operations.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 40 +++---
> >  1 file changed, 23 insertions(+), 17 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 6d47b4a3ce39..68d2aa9c71a8 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -39,6 +39,11 @@
> >  
> >  #include "pseries.h"
> >  
> > +#define DDW_QUERY_PE_DMA_WIN   0
> > +#define DDW_CREATE_PE_DMA_WIN  1
> > +#define DDW_REMOVE_PE_DMA_WIN  2
> > +#define DDW_APPLICABLE_SIZE3
> 
> #define DDW_APPLICABLE_SIZE  (DDW_REMOVE_PE_DMA_WIN + 1)
> 
> thanks,

Thanks for the feedback!
About this (and patch #2), would it be better to use enum ?
enum {
DDW_QUERY_PE_DMA_WIN,
DDW_CREATE_PE_DMA_WIN,
DDW_REMOVE_PE_DMA_WIN,

DDW_APPLICABLE_SIZE
};
IMO, it looks better than all the defines before.

What do you think?

Best regards,



Re: [PATCH v2 30/30] misc: cxl: flash: Remove unused pointer

2020-07-01 Thread Greg KH
On Wed, Jul 01, 2020 at 09:31:18AM +0100, Lee Jones wrote:
> The DRC index pointer is updated on an OPCODE_ADD, but never
> actually read.  Remove the unused pointer and shift up OPCODE_ADD
> to group with OPCODE_DELETE which also provides a noop.
> 
> Fixes the following W=1 kernel build warning:
> 
>  drivers/misc/cxl/flash.c: In function ‘update_devicetree’:
>  drivers/misc/cxl/flash.c:178:16: warning: variable ‘drc_index’ set but not 
> used [-Wunused-but-set-variable]
>  178 | __be32 *data, drc_index, phandle;
>  | ^
> 
> Cc: Frederic Barrat 
> Cc: Andrew Donnellan 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Lee Jones 
> ---
>  drivers/misc/cxl/flash.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c
> index cb9cca35a2263..24e3dfcc91a74 100644
> --- a/drivers/misc/cxl/flash.c
> +++ b/drivers/misc/cxl/flash.c
> @@ -175,7 +175,7 @@ static int update_devicetree(struct cxl *adapter, s32 
> scope)
>   struct update_nodes_workarea *unwa;
>   u32 action, node_count;
>   int token, rc, i;
> - __be32 *data, drc_index, phandle;
> + __be32 *data, phandle;
>   char *buf;
>  
>   token = rtas_token("ibm,update-nodes");
> @@ -206,15 +206,12 @@ static int update_devicetree(struct cxl *adapter, s32 
> scope)
>  
>   switch (action) {
>   case OPCODE_DELETE:
> + case OPCODE_ADD:
>   /* nothing to do */
>   break;
>   case OPCODE_UPDATE:
>   update_node(phandle, scope);
>   break;
> - case OPCODE_ADD:
> - /* nothing to do, just move pointer */
> - drc_index = *data++;
> - break;

I think this is not correct, as *data++ changes the value there, and so
this changes the logic of the code.

Dropping this one from my queue.

thanks,

greg k-h


Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-01 Thread piliu



On 07/01/2020 03:40 PM, Dave Young wrote:
> Hi Hari,
> On 06/27/20 at 12:35am, Hari Bathini wrote:
>> crashkernel region could have an overlap with special memory regions
>> like opal, rtas, tce-table & such. These regions are referred to as
>> exclude memory ranges. Set up these ranges during image probe in order
>> to avoid them while finding the buffer for different kdump segments.
>> Implement kexec_locate_mem_hole_ppc64() that locates a memory hole
>> accounting for these ranges. Also, override arch_kexec_add_buffer()
>> to locate a memory hole & later call __kexec_add_buffer() function
>> with kbuf->mem set to skip the generic locate memory hole lookup.
>>
>> Signed-off-by: Hari Bathini 
>> ---
>>  arch/powerpc/include/asm/crashdump-ppc64.h |   10 +
>>  arch/powerpc/include/asm/kexec.h   |7 -
>>  arch/powerpc/kexec/elf_64.c|7 +
>>  arch/powerpc/kexec/file_load_64.c  |  292 
>> 
>>  4 files changed, 312 insertions(+), 4 deletions(-)
>>  create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h
>>
> [snip]
>>  /**
>> + * get_exclude_memory_ranges - Get exclude memory ranges. This list includes
>> + * regions like opal/rtas, tce-table, initrd,
>> + * kernel, htab which should be avoided while
>> + * setting up kexec load segments.
>> + * @mem_ranges:Range list to add the memory ranges to.
>> + *
>> + * Returns 0 on success, negative errno on error.
>> + */
>> +static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
>> +{
>> +int ret;
>> +
>> +ret = add_tce_mem_ranges(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_initrd_mem_range(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_htab_mem_range(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_kernel_mem_range(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_rtas_mem_range(mem_ranges, false);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_opal_mem_range(mem_ranges, false);
>> +if (ret)
>> +goto out;
>> +
>> +ret = add_reserved_ranges(mem_ranges);
>> +if (ret)
>> +goto out;
>> +
>> +/* exclude memory ranges should be sorted for easy lookup */
>> +sort_memory_ranges(*mem_ranges);
>> +out:
>> +if (ret)
>> +pr_err("Failed to setup exclude memory ranges\n");
>> +return ret;
>> +}
> 
> I'm confused about the "overlap with crashkernel memory", does that mean
> those normal kernel used memory could be put in crashkernel reserved
> memory range?  If so why can't just skip those areas while crashkernel
> doing the reservation?
I raised the same question in another mail. As per Hari's answer,
"kexec -p" skips these ranges in user space, and the same logic should
be done for "kexec -s -p".

Regards,
Pingfan



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko
On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote:
> * Christopher Lameter  [2020-06-29 14:58:40]:
> 
> > On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> > 
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot.  However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> > 
> > Maybe add something to explain why you are not simply mapping the
> > existing memory to NUMA node 0 which is after all just a numbering scheme
> > used by the kernel and can be used arbitrarily?
> > 
> 
> I thought Michal Hocko already gave a clear picture on why mapping is a bad
> idea. https://lore.kernel.org/lkml/20200316085425.gb11...@dhcp22.suse.cz/t/#u
> Are you suggesting that we add that as part of the changelog?

Well, I was not aware x86 already does renumber, so there is a certain
precedent. As I've said I do not really like that but this is what
already is happening. If renumbering is not an option then just handle
that in the ppc code explicitly. Generic solution would be preferable of
course but as I've said it is really hard to check for correctness and
potential subtle issues.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko
On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> On 01.07.20 13:06, David Hildenbrand wrote:
> > On 01.07.20 13:01, Srikar Dronamraju wrote:
> >> * David Hildenbrand  [2020-07-01 12:15:54]:
> >>
> >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>  * Michal Hocko  [2020-07-01 10:42:00]:
> 
> >
> >>
> >> 2. Also existence of dummy node also leads to inconsistent 
> >> information. The
> >> number of online nodes is inconsistent with the information in the
> >> device-tree and resource-dump
> >>
> >> 3. When the dummy node is present, single node non-Numa systems end up 
> >> showing
> >> up as NUMA systems and numa_balancing gets enabled. This will mean we 
> >> take
> >> the hit from the unnecessary numa hinting faults.
> >
> > I have to say that I dislike the node online/offline state and directly
> > exporting that to the userspace. Users should only care whether the node
> > has memory/cpus. Numa nodes can be online without any memory. Just
> > offline all the present memory blocks but do not physically hot remove
> > them and you are in the same situation. If users are confused by an
> > output of tools like numactl -H then those could be updated and hide
> > nodes without any memory
> >
> > The autonuma problem sounds interesting but again this patch doesn't
> > really solve the underlying problem because I strongly suspect that the
> > problem is still there when a numa node gets all its memory offline as
> > mentioned above.

I would really appreciate a feedback to these two as well.

> > While I completely agree that making node 0 special is wrong, I have
> > still hard time to review this very simply looking patch because all the
> > numa initialization is so spread around that this might just blow up
> > at unexpected places. IIRC we have discussed testing in the previous
> > version and David has provided a way to emulate these configurations
> > on x86. Did you manage to use those instruction for additional testing
> > on other than ppc architectures?
> >
> 
>  I have tried all the steps that David mentioned and reported back at
>  https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> 
>  As a summary, David's steps are still not creating a memoryless/cpuless 
>  on
>  x86 VM.
> >>>
> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> >>> online*. Once you hotplug some memory, it will switch online. Once you
> >>> remove memory, it will switch back offline.
> >>>
> >>
> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> >> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> >> boot.
> > 
> > I was just correcting your statement, because it was wrong.
> > 
> > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > have CPUs nor memory. That would imply that we can, in fact, never have
> > node 0 offline during boot.
> > 
> 
> Yep, looks like it.
> 
> [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]

This begs the question whether ppc can do the same thing?

I would swear that we've had x86 system with node 0 but I cannot really
find it and it is possible that it was not x86 after all...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread David Hildenbrand
On 01.07.20 13:06, David Hildenbrand wrote:
> On 01.07.20 13:01, Srikar Dronamraju wrote:
>> * David Hildenbrand  [2020-07-01 12:15:54]:
>>
>>> On 01.07.20 12:04, Srikar Dronamraju wrote:
 * Michal Hocko  [2020-07-01 10:42:00]:

>
>>
>> 2. Also existence of dummy node also leads to inconsistent information. 
>> The
>> number of online nodes is inconsistent with the information in the
>> device-tree and resource-dump
>>
>> 3. When the dummy node is present, single node non-Numa systems end up 
>> showing
>> up as NUMA systems and numa_balancing gets enabled. This will mean we 
>> take
>> the hit from the unnecessary numa hinting faults.
>
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory
>
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
>
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
>

 I have tried all the steps that David mentioned and reported back at
 https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u

 As a summary, David's steps are still not creating a memoryless/cpuless on
 x86 VM.
>>>
>>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>>> online*. Once you hotplug some memory, it will switch online. Once you
>>> remove memory, it will switch back offline.
>>>
>>
>> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
>> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
>> boot.
> 
> I was just correcting your statement, because it was wrong.
> 
> Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> have CPUs nor memory. That would imply that we can, in fact, never have
> node 0 offline during boot.
> 

Yep, looks like it.

[0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
[0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
[0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
[0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
[0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
[0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
[0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]



-- 
Thanks,

David / dhildenb



Re: [PATCH v2 02/10] KVM: PPC: Book3S HV: Save/restore new PMU registers

2020-07-01 Thread Paul Mackerras
On Wed, Jul 01, 2020 at 05:20:54AM -0400, Athira Rajeev wrote:
> PowerISA v3.1 has added new performance monitoring unit (PMU)
> special purpose registers (SPRs). They are
> 
> Monitor Mode Control Register 3 (MMCR3)
> Sampled Instruction Event Register A (SIER2)
> Sampled Instruction Event Register B (SIER3)
> 
> The patch adds support to save/restore these new
> SPRs while entering/exiting a guest.

This mostly looks reasonable, at a quick glance at least, but I am
puzzled by two of the changes you are making.  See below.

> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 6bf66649..c265800 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1698,7 +1698,8 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
> u64 id,
>   *val = get_reg_val(id, vcpu->arch.sdar);
>   break;
>   case KVM_REG_PPC_SIER:
> - *val = get_reg_val(id, vcpu->arch.sier);
> + i = id - KVM_REG_PPC_SIER;
> + *val = get_reg_val(id, vcpu->arch.sier[i]);

This is inside a switch (id) statement, so here we know that id is
KVM_REG_PPC_SIER.  In other words i will always be zero, so what is
the point of doing the subtraction?

>   break;
>   case KVM_REG_PPC_IAMR:
>   *val = get_reg_val(id, vcpu->arch.iamr);
> @@ -1919,7 +1920,8 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
> u64 id,
>   vcpu->arch.sdar = set_reg_val(id, *val);
>   break;
>   case KVM_REG_PPC_SIER:
> - vcpu->arch.sier = set_reg_val(id, *val);
> + i = id - KVM_REG_PPC_SIER;
> + vcpu->arch.sier[i] = set_reg_val(id, *val);

Same comment here.

I think that new defines for the new registers will need to be added
to arch/powerpc/include/uapi/asm/kvm.h and
Documentation/virt/kvm/api.rst, and then new cases will need to be
added to these switch statements.

By the way, please cc kvm-...@vger.kernel.org and k...@vger.kernel.org
on KVM patches.

Paul.


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread David Hildenbrand
On 01.07.20 13:01, Srikar Dronamraju wrote:
> * David Hildenbrand  [2020-07-01 12:15:54]:
> 
>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>> * Michal Hocko  [2020-07-01 10:42:00]:
>>>

>
> 2. Also existence of dummy node also leads to inconsistent information. 
> The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
>
> 3. When the dummy node is present, single node non-Numa systems end up 
> showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

 I have to say that I dislike the node online/offline state and directly
 exporting that to the userspace. Users should only care whether the node
 has memory/cpus. Numa nodes can be online without any memory. Just
 offline all the present memory blocks but do not physically hot remove
 them and you are in the same situation. If users are confused by an
 output of tools like numactl -H then those could be updated and hide
 nodes without any memory

 The autonuma problem sounds interesting but again this patch doesn't
 really solve the underlying problem because I strongly suspect that the
 problem is still there when a numa node gets all its memory offline as
 mentioned above.

 While I completely agree that making node 0 special is wrong, I have
 still hard time to review this very simply looking patch because all the
 numa initialization is so spread around that this might just blow up
 at unexpected places. IIRC we have discussed testing in the previous
 version and David has provided a way to emulate these configurations
 on x86. Did you manage to use those instruction for additional testing
 on other than ppc architectures?

>>>
>>> I have tried all the steps that David mentioned and reported back at
>>> https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
>>>
>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>> x86 VM.
>>
>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>> online*. Once you hotplug some memory, it will switch online. Once you
>> remove memory, it will switch back offline.
>>
> 
> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> boot.

I was just correcting your statement, because it was wrong.

Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
have CPUs nor memory. That would imply that we can, in fact, never have
node 0 offline during boot.

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Srikar Dronamraju
* David Hildenbrand  [2020-07-01 12:15:54]:

> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > * Michal Hocko  [2020-07-01 10:42:00]:
> > 
> >>
> >>>
> >>> 2. Also existence of dummy node also leads to inconsistent information. 
> >>> The
> >>> number of online nodes is inconsistent with the information in the
> >>> device-tree and resource-dump
> >>>
> >>> 3. When the dummy node is present, single node non-Numa systems end up 
> >>> showing
> >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>> the hit from the unnecessary numa hinting faults.
> >>
> >> I have to say that I dislike the node online/offline state and directly
> >> exporting that to the userspace. Users should only care whether the node
> >> has memory/cpus. Numa nodes can be online without any memory. Just
> >> offline all the present memory blocks but do not physically hot remove
> >> them and you are in the same situation. If users are confused by an
> >> output of tools like numactl -H then those could be updated and hide
> >> nodes without any memory
> >>
> >> The autonuma problem sounds interesting but again this patch doesn't
> >> really solve the underlying problem because I strongly suspect that the
> >> problem is still there when a numa node gets all its memory offline as
> >> mentioned above.
> >>
> >> While I completely agree that making node 0 special is wrong, I have
> >> still hard time to review this very simply looking patch because all the
> >> numa initialization is so spread around that this might just blow up
> >> at unexpected places. IIRC we have discussed testing in the previous
> >> version and David has provided a way to emulate these configurations
> >> on x86. Did you manage to use those instruction for additional testing
> >> on other than ppc architectures?
> >>
> > 
> > I have tried all the steps that David mentioned and reported back at
> > https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> > 
> > As a summary, David's steps are still not creating a memoryless/cpuless on
> > x86 VM.
> 
> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> online*. Once you hotplug some memory, it will switch online. Once you
> remove memory, it will switch back offline.
> 

Let me clarify: we are looking for a node 0 that is cpuless/memoryless at
boot. That is the case the code in question tries to handle.

With the steps that you gave, node 0 was always populated; node 1 or some
other node would be memoryless/cpuless and offline. That case is not
affected by this patch.

I don't see how adding/hotplugging/removing memory to a node after boot
would affect the changes that I have made. Please do correct me if I have
misunderstood.


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread David Hildenbrand
On 01.07.20 12:04, Srikar Dronamraju wrote:
> * Michal Hocko  [2020-07-01 10:42:00]:
> 
>>
>>>
>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>> number of online nodes is inconsistent with the information in the
>>> device-tree and resource-dump
>>>
>>> 3. When the dummy node is present, single node non-Numa systems end up 
>>> showing
>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>> the hit from the unnecessary numa hinting faults.
>>
>> I have to say that I dislike the node online/offline state and directly
>> exporting that to the userspace. Users should only care whether the node
>> has memory/cpus. Numa nodes can be online without any memory. Just
>> offline all the present memory blocks but do not physically hot remove
>> them and you are in the same situation. If users are confused by an
>> output of tools like numactl -H then those could be updated and hide
>> nodes without any memory
>>
>> The autonuma problem sounds interesting but again this patch doesn't
>> really solve the underlying problem because I strongly suspect that the
>> problem is still there when a numa node gets all its memory offline as
>> mentioned above.
>>
>> While I completely agree that making node 0 special is wrong, I have
>> still hard time to review this very simply looking patch because all the
>> numa initialization is so spread around that this might just blow up
>> at unexpected places. IIRC we have discussed testing in the previous
>> version and David has provided a way to emulate these configurations
>> on x86. Did you manage to use those instruction for additional testing
>> on other than ppc architectures?
>>
> 
> I have tried all the steps that David mentioned and reported back at
> https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> 
> As a summary, David's steps are still not creating a memoryless/cpuless on
> x86 VM.

Now, that is wrong. You get a memoryless/cpuless node, which is *not
online*. Once you hotplug some memory, it will switch online. Once you
remove memory, it will switch back offline.

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Srikar Dronamraju
* Michal Hocko  [2020-07-01 10:42:00]:

> 
> > 
> > 2. Also existence of dummy node also leads to inconsistent information. The
> > number of online nodes is inconsistent with the information in the
> > device-tree and resource-dump
> > 
> > 3. When the dummy node is present, single node non-Numa systems end up 
> > showing
> > up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > the hit from the unnecessary numa hinting faults.
> 
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory
> 
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
> 
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
> 

I have tried all the steps that David mentioned and reported back at
https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u

As a summary, David's steps still do not create a memoryless/cpuless node 0
on an x86 VM. I have tried booting with NUMA and non-NUMA on all the x86
machines that I could get access to.

-- 
Thanks and Regards
Srikar Dronamraju


[powerpc:merge] BUILD SUCCESS 3c0356e8994ed88e5234897c0ffee4188f8b9287

2020-07-01 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: 3c0356e8994ed88e5234897c0ffee4188f8b9287  Automatic merge of 'master', 'next' and 'fixes' (2020-06-30 22:07)

elapsed time: 1282m

configs tested: 98
configs skipped: 1

The following configs have been built successfully.
More configs may be tested in the coming days.

arm          defconfig
arm          allyesconfig
arm          allmodconfig
arm          allnoconfig
arm64        allyesconfig
arm64        defconfig
arm64        allmodconfig
arm64        allnoconfig
i386         allyesconfig
i386         defconfig
i386         debian-10.3
i386         allnoconfig
ia64         allmodconfig
ia64         defconfig
ia64         allnoconfig
ia64         allyesconfig
m68k         allmodconfig
m68k         allnoconfig
m68k         sun3_defconfig
m68k         defconfig
m68k         allyesconfig
nios2        defconfig
nios2        allyesconfig
openrisc     defconfig
c6x          allyesconfig
c6x          allnoconfig
openrisc     allyesconfig
nds32        defconfig
nds32        allnoconfig
csky         allyesconfig
csky         defconfig
alpha        defconfig
alpha        allyesconfig
xtensa       allyesconfig
h8300        allyesconfig
h8300        allmodconfig
xtensa       defconfig
arc          defconfig
arc          allyesconfig
sh           allmodconfig
sh           allnoconfig
microblaze   allnoconfig
mips         allyesconfig
mips         allnoconfig
mips         allmodconfig
parisc       allnoconfig
parisc       defconfig
parisc       allyesconfig
parisc       allmodconfig
powerpc      defconfig
powerpc      allyesconfig
powerpc      rhel-kconfig
powerpc      allmodconfig
powerpc      allnoconfig
i386         randconfig-a001-20200630
i386         randconfig-a003-20200630
i386         randconfig-a002-20200630
i386         randconfig-a004-20200630
i386         randconfig-a005-20200630
i386         randconfig-a006-20200630
x86_64       randconfig-a011-20200630
x86_64       randconfig-a014-20200630
x86_64       randconfig-a013-20200630
x86_64       randconfig-a015-20200630
x86_64       randconfig-a016-20200630
x86_64       randconfig-a012-20200630
i386         randconfig-a011-20200630
i386         randconfig-a016-20200630
i386         randconfig-a015-20200630
i386         randconfig-a012-20200630
i386         randconfig-a014-20200630
i386         randconfig-a013-20200630
riscv        allyesconfig
riscv        allnoconfig
riscv        defconfig
riscv        allmodconfig
s390         allyesconfig
s390         allnoconfig
s390         allmodconfig
s390         defconfig
sparc        allyesconfig
sparc        defconfig
sparc64      defconfig
sparc64      allnoconfig
sparc64      allyesconfig
sparc64      allmodconfig
um           allmodconfig
um           allnoconfig
um           allyesconfig
um           defconfig
x86_64       rhel
x86_64       rhel-7.2-clear
x86_64       lkp
x86_64       fedora-25
x86_64       rhel-7.6
x86_64       rhel-7.6-kselftests
x86_64       rhel-8.3
x86_64       kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[PATCH v2 10/10] powerpc/perf: Add extended regs support for power10 platform

2020-07-01 Thread Athira Rajeev
Include the capability flag `PERF_PMU_CAP_EXTENDED_REGS` for power10
and expose the MMCR3, SIER2 and SIER3 registers as part of extended regs.
Also introduce `PERF_REG_PMU_MASK_31` to define the extended mask
value at runtime for power10.

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/include/uapi/asm/perf_regs.h   |  6 ++
 arch/powerpc/perf/perf_regs.c   | 10 +-
 arch/powerpc/perf/power10-pmu.c |  6 ++
 tools/arch/powerpc/include/uapi/asm/perf_regs.h |  6 ++
 tools/perf/arch/powerpc/include/perf_regs.h |  3 +++
 tools/perf/arch/powerpc/util/perf_regs.c|  6 ++
 6 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
b/arch/powerpc/include/uapi/asm/perf_regs.h
index 485b1d5..020b51c 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_MMCR0,
PERF_REG_POWERPC_MMCR1,
PERF_REG_POWERPC_MMCR2,
+   PERF_REG_POWERPC_MMCR3,
+   PERF_REG_POWERPC_SIER2,
+   PERF_REG_POWERPC_SIER3,
/* Max regs without the extended regs */
PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
@@ -62,4 +65,7 @@ enum perf_event_powerpc_regs {
 #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
- PERF_REG_PMU_MASK)
 
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
+#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) \
+   - PERF_REG_PMU_MASK)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index c8a7e8c..c969935 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -81,6 +81,12 @@ static u64 get_ext_regs_value(int idx)
return mfspr(SPRN_MMCR1);
case PERF_REG_POWERPC_MMCR2:
return mfspr(SPRN_MMCR2);
+   case PERF_REG_POWERPC_MMCR3:
+   return mfspr(SPRN_MMCR3);
+   case PERF_REG_POWERPC_SIER2:
+   return mfspr(SPRN_SIER2);
+   case PERF_REG_POWERPC_SIER3:
+   return mfspr(SPRN_SIER3);
default: return 0;
}
 }
@@ -89,7 +95,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
 {
u64 PERF_REG_EXTENDED_MAX;
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   PERF_REG_EXTENDED_MAX = PERF_REG_POWERPC_SIER3 + 1;
+   else if (cpu_has_feature(CPU_FTR_ARCH_300))
PERF_REG_EXTENDED_MAX = PERF_REG_POWERPC_MMCR2 + 1;
 
if (idx == PERF_REG_POWERPC_SIER &&
diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
index 07fb919..51082d6 100644
--- a/arch/powerpc/perf/power10-pmu.c
+++ b/arch/powerpc/perf/power10-pmu.c
@@ -86,6 +86,8 @@
 #define POWER10_MMCRA_IFM3 0xC000UL
 #define POWER10_MMCRA_BHRB_MASK0xC000UL
 
+extern u64 mask_var;
+
 /* Table of alternatives, sorted by column 0 */
 static const unsigned int power10_event_alternatives[][MAX_ALT] = {
{ PM_RUN_CYC_ALT,   PM_RUN_CYC },
@@ -397,6 +399,7 @@ static void power10_config_bhrb(u64 pmu_bhrb_filter)
	.cache_events   = &power10_cache_events,
.attr_groups= power10_pmu_attr_groups,
.bhrb_nr= 32,
+   .capabilities   = PERF_PMU_CAP_EXTENDED_REGS,
 };
 
 int init_power10_pmu(void)
@@ -408,6 +411,9 @@ int init_power10_pmu(void)
strcmp(cur_cpu_spec->oprofile_cpu_type, "ppc64/power10"))
return -ENODEV;
 
+   /* Set the PERF_REG_EXTENDED_MASK here */
+   mask_var = PERF_REG_PMU_MASK_31;
+
	rc = register_power_pmu(&power10_pmu);
if (rc)
return rc;
diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h 
b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index 485b1d5..020b51c 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_MMCR0,
PERF_REG_POWERPC_MMCR1,
PERF_REG_POWERPC_MMCR2,
+   PERF_REG_POWERPC_MMCR3,
+   PERF_REG_POWERPC_SIER2,
+   PERF_REG_POWERPC_SIER3,
/* Max regs without the extended regs */
PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
@@ -62,4 +65,7 @@ enum perf_event_powerpc_regs {
 #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
- PERF_REG_PMU_MASK)
 
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
+#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) \
+   - PERF_REG_PMU_MASK)
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git 

[PATCH v2 09/10] tools/perf: Add perf tools support for extended register capability in powerpc

2020-07-01 Thread Athira Rajeev
From: Anju T Sudhakar 

Add extended regs to sample_reg_mask on the tool side for use with the
`-I?` option. The perf tools side uses the extended mask to display the
platform-supported register names (with the -I? option) to the user and
also sends this mask to the kernel to capture the extended registers in
each sample. Hence the mask value is decided at run time based on the
processor version.

Signed-off-by: Anju T Sudhakar 
[Decide extended mask at run time based on platform]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
 tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++-
 tools/perf/arch/powerpc/include/perf_regs.h |  5 ++-
 tools/perf/arch/powerpc/util/perf_regs.c| 55 +
 3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h 
b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..485b1d5 100644
--- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
+   - PERF_REG_PMU_MASK)
+
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/tools/perf/arch/powerpc/include/perf_regs.h 
b/tools/perf/arch/powerpc/include/perf_regs.h
index e18a355..46ed00d 100644
--- a/tools/perf/arch/powerpc/include/perf_regs.h
+++ b/tools/perf/arch/powerpc/include/perf_regs.h
@@ -64,7 +64,10 @@
[PERF_REG_POWERPC_DAR] = "dar",
[PERF_REG_POWERPC_DSISR] = "dsisr",
[PERF_REG_POWERPC_SIER] = "sier",
-   [PERF_REG_POWERPC_MMCRA] = "mmcra"
+   [PERF_REG_POWERPC_MMCRA] = "mmcra",
+   [PERF_REG_POWERPC_MMCR0] = "mmcr0",
+   [PERF_REG_POWERPC_MMCR1] = "mmcr1",
+   [PERF_REG_POWERPC_MMCR2] = "mmcr2",
 };
 
 static inline const char *perf_reg_name(int id)
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c 
b/tools/perf/arch/powerpc/util/perf_regs.c
index 0a52429..9179230 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -6,9 +6,14 @@
 
 #include "../../../util/perf_regs.h"
 #include "../../../util/debug.h"
+#include "../../../util/event.h"
+#include "../../../util/header.h"
+#include "../../../perf-sys.h"
 
 #include 
 
+#define PVR_POWER9 0x004E
+
 const struct sample_reg sample_reg_masks[] = {
SMPL_REG(r0, PERF_REG_POWERPC_R0),
SMPL_REG(r1, PERF_REG_POWERPC_R1),
@@ -55,6 +60,9 @@
SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
SMPL_REG(sier, PERF_REG_POWERPC_SIER),
SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
+   SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
+   SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
+   SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
SMPL_REG_END
 };
 
@@ -163,3 +171,50 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
 
return SDT_ARG_VALID;
 }
+
+uint64_t arch__intr_reg_mask(void)
+{
+   struct perf_event_attr attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .sample_type= PERF_SAMPLE_REGS_INTR,
+   .precise_ip = 1,
+   .disabled   = 1,
+   .exclude_kernel = 1,
+   };
+   int fd, ret;
+   char buffer[64];
+   u32 version;
+   u64 extended_mask = 0;
+
+   /* Get the PVR value to set the extended
+* mask specific to platform
+*/
+   get_cpuid(buffer, sizeof(buffer));
+   ret = sscanf(buffer, "%u,", &version);
+
+   if (ret != 1) {
+   pr_debug("Failed to get the processor version, unable to output extended registers\n");
+   return PERF_REGS_MASK;
+   }
+
+   if (version == PVR_POWER9)
+   extended_mask = PERF_REG_PMU_MASK_300;
+   else
+   return PERF_REGS_MASK;
+
+   attr.sample_regs_intr = extended_mask;
+   attr.sample_period = 1;
+   event_attr_init(&attr);
+
+   /*
+* check if the pmu supports perf extended regs, before
+* returning the register mask to sample.
+*/
+   fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+   if (fd != -1) {
+   close(fd);
+   return (extended_mask | PERF_REGS_MASK);
+   }
+   return PERF_REGS_MASK;
+}
-- 
1.8.3.1



[PATCH v2 08/10] powerpc/perf: Add support for outputting extended regs in perf intr_regs

2020-07-01 Thread Athira Rajeev
From: Anju T Sudhakar 

Add support for the perf extended register capability in powerpc.
The capability flag PERF_PMU_CAP_EXTENDED_REGS is used to indicate a
PMU which supports extended registers. The generic code defines the
mask of extended registers as 0 for unsupported architectures.

The patch adds extended regs support for the power9 platform by
exposing the MMCR0, MMCR1 and MMCR2 registers.

The REG_RESERVED mask needs an update to include the extended regs.
`PERF_REG_EXTENDED_MASK`, which contains the mask value of the
supported registers, is defined at runtime in the kernel based on the
platform, since the supported registers (and hence the mask value) may
differ from one processor version to another.

with patch
--

available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2

PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
... intr regs: mask 0x ABI 64-bit
 r00xc012b77c
 r10xc03fe5e03930
 r20xc1b0e000
 r30xc03fdcddf800
 r40xc03fc788
 r50x9c422724be
 r60xc03fe5e03908
 r70xff63bddc8706
 r80x9e4
 r90x0
 r10   0x1
 r11   0x0
 r12   0xc01299c0
 r13   0xc03c4800
 r14   0x0
 r15   0x7fffdd8b8b00
 r16   0x0
 r17   0x7fffdd8be6b8
 r18   0x7e7076607730
 r19   0x2f
 r20   0xc0001fc26c68
 r21   0xc0002041e4227e00
 r22   0xc0002018fb60
 r23   0x1
 r24   0xc03ffec4d900
 r25   0x8000
 r26   0x0
 r27   0x1
 r28   0x1
 r29   0xc1be1260
 r30   0x6008010
 r31   0xc03ffebb7218
 nip   0xc012b910
 msr   0x90009033
 orig_r3 0xc012b86c
 ctr   0xc01299c0
 link  0xc012b77c
 xer   0x0
 ccr   0x2800
 softe 0x1
 trap  0xf00
 dar   0x0
 dsisr 0x800
 sier  0x0
 mmcra 0x800
 mmcr0 0x82008090
 mmcr1 0x1e00
 mmcr2 0x0
 ... thread: perf:4784

Signed-off-by: Anju T Sudhakar 
[Defined PERF_REG_EXTENDED_MASK at run time to add support for different platforms]
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/perf_event_server.h |  8 +++
 arch/powerpc/include/uapi/asm/perf_regs.h| 14 +++-
 arch/powerpc/perf/core-book3s.c  |  1 +
 arch/powerpc/perf/perf_regs.c| 34 +---
 arch/powerpc/perf/power9-pmu.c   |  6 +
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index cb207f8..e8d35b6 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -15,6 +15,9 @@
 #define MAX_EVENT_ALTERNATIVES 8
 #define MAX_LIMITED_HWCOUNTERS 2
 
+extern u64 mask_var;
+#define PERF_REG_EXTENDED_MASK  mask_var
+
 struct perf_event;
 
 /*
@@ -55,6 +58,11 @@ struct power_pmu {
int *blacklist_ev;
/* BHRB entries in the PMU */
int bhrb_nr;
+   /*
+* set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
+* the pmu supports extended perf regs capability
+*/
+   int capabilities;
 };
 
 /*
diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
b/arch/powerpc/include/uapi/asm/perf_regs.h
index f599064..485b1d5 100644
--- a/arch/powerpc/include/uapi/asm/perf_regs.h
+++ b/arch/powerpc/include/uapi/asm/perf_regs.h
@@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
PERF_REG_POWERPC_DSISR,
PERF_REG_POWERPC_SIER,
PERF_REG_POWERPC_MMCRA,
-   PERF_REG_POWERPC_MAX,
+   /* Extended registers */
+   PERF_REG_POWERPC_MMCR0,
+   PERF_REG_POWERPC_MMCR1,
+   PERF_REG_POWERPC_MMCR2,
+   /* Max regs without the extended regs */
+   PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
 };
+
+#define PERF_REG_PMU_MASK  ((1ULL << PERF_REG_POWERPC_MAX) - 1)
+
+/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
+#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) \
+   - PERF_REG_PMU_MASK)
+
 #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 9709606..382d770 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2317,6 +2317,7 @@ int register_power_pmu(struct power_pmu *pmu)
pmu->name);
 
power_pmu.attr_groups = ppmu->attr_groups;
+   power_pmu.capabilities |= (ppmu->capabilities & 
PERF_PMU_CAP_EXTENDED_REGS);
 
 #ifdef MSR_HV
/*
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index a213a0a..c8a7e8c 

[PATCH v2 07/10] powerpc/perf: support BHRB disable bit and new filtering modes

2020-07-01 Thread Athira Rajeev
PowerISA v3.1 has a few updates for the Branch History Rolling Buffer (BHRB).
The first is the addition of a BHRB disable bit, and the second is new
filtering modes for BHRB.

BHRB disable is controlled via Monitor Mode Control Register A (MMCRA)
bit 26, namely "BHRB Recording Disable (BHRBRD)". This field controls
whether BHRB entries are written when BHRB recording is enabled by other
bits. This patch implements support for the BHRB disable bit.

Secondly, PowerISA v3.1 introduces filtering support for
PERF_SAMPLE_BRANCH_IND_CALL/COND. The patch adds BHRB filter support
for "ind_call" and "cond" in power10_bhrb_filter_map().

Commit bb19af816025 ("powerpc/perf: Prevent kernel address leak to
userspace via BHRB buffer") added a check in bhrb_read() to filter
kernel addresses from the BHRB buffer. This patch modifies it to skip
that check for PowerISA v3.1 based processors, since PowerISA v3.1
allows only MSR[PR]=1 addresses to be written to the BHRB buffer.

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/perf/core-book3s.c   | 27 +--
 arch/powerpc/perf/isa207-common.c | 13 +
 arch/powerpc/perf/power10-pmu.c   | 13 +++--
 arch/powerpc/platforms/powernv/idle.c | 14 ++
 4 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index fad5159..9709606 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -466,9 +466,13 @@ static void power_pmu_bhrb_read(struct perf_event *event, 
struct cpu_hw_events *
 * addresses at this point. Check the privileges before
 * exporting it to userspace (avoid exposure of regions
 * where we could have speculative execution)
* In case of ISA v3.1, BHRB will capture only user-space
* addresses, hence include a check before the filtering code
 */
-   if (is_kernel_addr(addr) && perf_allow_kernel(&event->attr) != 0)
-   continue;
+   if (!(ppmu->flags & PPMU_ARCH_310S))
+   if (is_kernel_addr(addr) &&
+   perf_allow_kernel(&event->attr) != 0)
+   continue;
 
/* Branches are read most recent first (ie. mfbhrb 0 is
 * the most recent branch).
@@ -1212,7 +1216,7 @@ static void write_mmcr0(struct cpu_hw_events *cpuhw, 
unsigned long mmcr0)
 static void power_pmu_disable(struct pmu *pmu)
 {
struct cpu_hw_events *cpuhw;
-   unsigned long flags, mmcr0, val;
+   unsigned long flags, mmcr0, val, mmcra = 0;
 
if (!ppmu)
return;
@@ -1245,12 +1249,23 @@ static void power_pmu_disable(struct pmu *pmu)
mb();
isync();
 
+   val = mmcra = cpuhw->mmcr[2];
+
/*
 * Disable instruction sampling if it was enabled
 */
-   if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) {
-   mtspr(SPRN_MMCRA,
- cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
+   if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE)
+   mmcra = cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE;
+
+   /* Disable BHRB via mmcra [:26] for p10 if needed */
+   if (!(cpuhw->mmcr[2] & MMCRA_BHRB_DISABLE))
+   mmcra |= MMCRA_BHRB_DISABLE;
+
+   /* Write SPRN_MMCRA if mmcra has either disabled
+* instruction sampling or BHRB
+*/
+   if (val != mmcra) {
+   mtspr(SPRN_MMCRA, mmcra);
mb();
isync();
}
diff --git a/arch/powerpc/perf/isa207-common.c 
b/arch/powerpc/perf/isa207-common.c
index 7d4839e..463d925 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -404,6 +404,12 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
 
mmcra = mmcr1 = mmcr2 = mmcr3 = 0;
 
+   /* Disable bhrb unless explicitly requested
+* by setting MMCRA [:26] bit.
+*/
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   mmcra |= MMCRA_BHRB_DISABLE;
+
/* Second pass: assign PMCs, set all MMCR1 fields */
for (i = 0; i < n_ev; ++i) {
pmc = (event[i] >> EVENT_PMC_SHIFT) & EVENT_PMC_MASK;
@@ -475,10 +481,17 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
}
 
if (event[i] & EVENT_WANTS_BHRB) {
+   /* set MMCRA[:26] to 0 for Power10 to enable BHRB */
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   mmcra &= ~MMCRA_BHRB_DISABLE;
val = (event[i] >> EVENT_IFM_SHIFT) & EVENT_IFM_MASK;
mmcra |= val 

[PATCH v2 05/10] powerpc/perf: Update Power PMU cache_events to u64 type

2020-07-01 Thread Athira Rajeev
Events of type PERF_TYPE_HW_CACHE were described for the Power PMU
as: int (*cache_events)[type][op][result];

where the type, op and result values unpacked from the event attribute
config value are used to generate the raw event code at runtime.

So far the event code values used to create these cache-related
events fit within 32 bits, so the `int` type worked. In power10,
some of the event codes are 64-bit values, hence update the
Power PMU cache_events to the `u64` type in the `power_pmu` struct.
Also propagate this change to all existing PMU driver code paths
that use ppmu->cache_events.

Signed-off-by: Athira Rajeev
---
 arch/powerpc/include/asm/perf_event_server.h | 2 +-
 arch/powerpc/perf/core-book3s.c  | 2 +-
 arch/powerpc/perf/generic-compat-pmu.c   | 2 +-
 arch/powerpc/perf/mpc7450-pmu.c  | 2 +-
 arch/powerpc/perf/power5+-pmu.c  | 2 +-
 arch/powerpc/perf/power5-pmu.c   | 2 +-
 arch/powerpc/perf/power6-pmu.c   | 2 +-
 arch/powerpc/perf/power7-pmu.c   | 2 +-
 arch/powerpc/perf/power8-pmu.c   | 2 +-
 arch/powerpc/perf/power9-pmu.c   | 2 +-
 arch/powerpc/perf/ppc970-pmu.c   | 2 +-
 11 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h 
b/arch/powerpc/include/asm/perf_event_server.h
index 895aeaa..cb207f8 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -47,7 +47,7 @@ struct power_pmu {
const struct attribute_group**attr_groups;
int n_generic;
int *generic_events;
-   int (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
+   u64 (*cache_events)[PERF_COUNT_HW_CACHE_MAX]
   [PERF_COUNT_HW_CACHE_OP_MAX]
   [PERF_COUNT_HW_CACHE_RESULT_MAX];
 
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 5c64bd3..58bfb9a 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1820,7 +1820,7 @@ static void hw_perf_event_destroy(struct perf_event 
*event)
 static int hw_perf_cache_event(u64 config, u64 *eventp)
 {
unsigned long type, op, result;
-   int ev;
+   u64 ev;
 
if (!ppmu->cache_events)
return -EINVAL;
diff --git a/arch/powerpc/perf/generic-compat-pmu.c 
b/arch/powerpc/perf/generic-compat-pmu.c
index 5e5a54d..eb8a6aaf 100644
--- a/arch/powerpc/perf/generic-compat-pmu.c
+++ b/arch/powerpc/perf/generic-compat-pmu.c
@@ -101,7 +101,7 @@ enum {
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int generic_compat_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 generic_compat_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[ C(L1D) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = 0,
diff --git a/arch/powerpc/perf/mpc7450-pmu.c b/arch/powerpc/perf/mpc7450-pmu.c
index 4d5ef92..cf1eb89 100644
--- a/arch/powerpc/perf/mpc7450-pmu.c
+++ b/arch/powerpc/perf/mpc7450-pmu.c
@@ -354,7 +354,7 @@ static void mpc7450_disable_pmc(unsigned int pmc, unsigned 
long mmcr[])
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 mpc7450_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(L1D)] = {/*  RESULT_ACCESS   RESULT_MISS */
[C(OP_READ)] = {0,  0x225   },
[C(OP_WRITE)] = {   0,  0x227   },
diff --git a/arch/powerpc/perf/power5+-pmu.c b/arch/powerpc/perf/power5+-pmu.c
index f857454..9252281 100644
--- a/arch/powerpc/perf/power5+-pmu.c
+++ b/arch/powerpc/perf/power5+-pmu.c
@@ -618,7 +618,7 @@ static void power5p_disable_pmc(unsigned int pmc, unsigned 
long mmcr[])
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 power5p_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(L1D)] = {/*  RESULT_ACCESS   RESULT_MISS */
[C(OP_READ)] = {0x1c10a8,   0x3c1088},
[C(OP_WRITE)] = {   0x2c10a8,   0xc10c3 },
diff --git a/arch/powerpc/perf/power5-pmu.c b/arch/powerpc/perf/power5-pmu.c
index da52eca..3b36630 100644
--- a/arch/powerpc/perf/power5-pmu.c
+++ b/arch/powerpc/perf/power5-pmu.c
@@ -560,7 +560,7 @@ static void power5_disable_pmc(unsigned int pmc, unsigned 
long mmcr[])
  * 0 means not supported, -1 means nonsensical, other values
  * are event codes.
  */
-static int power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
+static u64 power5_cache_events[C(MAX)][C(OP_MAX)][C(RESULT_MAX)] = {
[C(L1D)] = {/*  RESULT_ACCESS   RESULT_MISS 

[PATCH v2 06/10] powerpc/perf: power10 Performance Monitoring support

2020-07-01 Thread Athira Rajeev
Base enablement patch to register performance monitoring hardware
support for power10. The patch introduces the raw event encoding format,
defines the supported list of events and the config fields for the event
attributes, and gives their corresponding bit values, which are exported
via sysfs.

The patch also enhances the support functions in isa207_common.c to
cover the power10 PMU hardware.

[Enablement of base PMU driver code]
Signed-off-by: Madhavan Srinivasan 
[Addition of ISA macros for counter support functions]
Signed-off-by: Athira Rajeev 
---
 arch/powerpc/perf/Makefile  |   2 +-
 arch/powerpc/perf/core-book3s.c |   2 +
 arch/powerpc/perf/internal.h|   1 +
 arch/powerpc/perf/isa207-common.c   |  59 -
 arch/powerpc/perf/isa207-common.h   |  33 ++-
 arch/powerpc/perf/power10-events-list.h |  70 ++
 arch/powerpc/perf/power10-pmu.c | 410 
 7 files changed, 566 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/perf/power10-events-list.h
 create mode 100644 arch/powerpc/perf/power10-pmu.c

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 53d614e..c02854d 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   isa207-common.o power8-pmu.o power9-pmu.o \
-  generic-compat-pmu.o
+  generic-compat-pmu.o power10-pmu.o
 obj32-$(CONFIG_PPC_PERF_CTRS)  += mpc7450-pmu.o
 
 obj-$(CONFIG_PPC_POWERNV)  += imc-pmu.o
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 58bfb9a..fad5159 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2333,6 +2333,8 @@ static int __init init_ppc64_pmu(void)
return 0;
else if (!init_power9_pmu())
return 0;
+   else if (!init_power10_pmu())
+   return 0;
else if (!init_ppc970_pmu())
return 0;
else
diff --git a/arch/powerpc/perf/internal.h b/arch/powerpc/perf/internal.h
index f755c64..80bbf72 100644
--- a/arch/powerpc/perf/internal.h
+++ b/arch/powerpc/perf/internal.h
@@ -9,4 +9,5 @@
 extern int init_power7_pmu(void);
 extern int init_power8_pmu(void);
 extern int init_power9_pmu(void);
+extern int init_power10_pmu(void);
 extern int init_generic_compat_pmu(void);
diff --git a/arch/powerpc/perf/isa207-common.c 
b/arch/powerpc/perf/isa207-common.c
index 4c86da5..7d4839e 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -55,7 +55,9 @@ static bool is_event_valid(u64 event)
 {
u64 valid_mask = EVENT_VALID_MASK;
 
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   valid_mask = p10_EVENT_VALID_MASK;
+   else if (cpu_has_feature(CPU_FTR_ARCH_300))
valid_mask = p9_EVENT_VALID_MASK;
 
return !(event & ~valid_mask);
@@ -69,6 +71,14 @@ static inline bool is_event_marked(u64 event)
return false;
 }
 
+static unsigned long sdar_mod_val(u64 event)
+{
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   return p10_SDAR_MODE(event);
+
+   return p9_SDAR_MODE(event);
+}
+
 static void mmcra_sdar_mode(u64 event, unsigned long *mmcra)
 {
/*
@@ -79,7 +89,7 @@ static void mmcra_sdar_mode(u64 event, unsigned long *mmcra)
 * MMCRA[SDAR_MODE] will be programmed as "0b01" for continous sampling
 * mode and will be un-changed when setting MMCRA[63] (Marked events).
 *
-* Incase of Power9:
+* Incase of Power9/power10:
 * Marked event: MMCRA[SDAR_MODE] will be set to 0b00 ('No Updates'),
 *   or if group already have any marked events.
 * For rest
@@ -90,8 +100,8 @@ static void mmcra_sdar_mode(u64 event, unsigned long *mmcra)
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
if (is_event_marked(event) || (*mmcra & MMCRA_SAMPLE_ENABLE))
*mmcra &= MMCRA_SDAR_MODE_NO_UPDATES;
-   else if (p9_SDAR_MODE(event))
-   *mmcra |=  p9_SDAR_MODE(event) << MMCRA_SDAR_MODE_SHIFT;
+   else if (sdar_mod_val(event))
+   *mmcra |= sdar_mod_val(event) << MMCRA_SDAR_MODE_SHIFT;
else
*mmcra |= MMCRA_SDAR_MODE_DCACHE;
} else
@@ -134,7 +144,11 @@ static bool is_thresh_cmp_valid(u64 event)
/*
 * Check the mantissa upper two bits are not zero, unless the
 * exponent is also zero. See the THRESH_CMP_MANTISSA doc.
+* Power10: thresh_cmp is replaced by l2_l3 event select.
 */
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   

[PATCH v2 04/10] powerpc/perf: Add power10_feat to dt_cpu_ftrs

2020-07-01 Thread Athira Rajeev
From: Madhavan Srinivasan 

Add a power10 feature function to dt_cpu_ftrs.c along
with a power10-specific init() to initialize the PMU SPRs.

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/reg.h|  3 +++
 arch/powerpc/kernel/cpu_setup_power.S |  7 +++
 arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++
 3 files changed, 36 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 21a1b2d..900ada1 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1068,6 +1068,9 @@
 #define MMCR0_PMC2_LOADMISSTIME 0x5
 #endif
 
+/* BHRB disable bit for PowerISA v3.10 */
+#define MMCRA_BHRB_DISABLE 0x0000002000000000UL
+
 /*
  * SPRG usage:
  *
diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index efdcfa7..e8b3370c 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -233,3 +233,10 @@ __init_PMU_ISA207:
li  r5,0
mtspr   SPRN_MMCRS,r5
blr
+
+__init_PMU_ISA31:
+   li  r5,0
+   mtspr   SPRN_MMCR3,r5
+   LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
+   mtspr   SPRN_MMCRA,r5
+   blr
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index a0edeb3..14a513f 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -449,6 +449,31 @@ static int __init feat_enable_pmu_power9(struct dt_cpu_feature *f)
return 1;
 }
 
+static void init_pmu_power10(void)
+{
+   init_pmu_power9();
+
+   mtspr(SPRN_MMCR3, 0);
+   mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
+}
+
+static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
+{
+   hfscr_pmu_enable();
+
+   init_pmu_power10();
+   init_pmu_registers = init_pmu_power10;
+
+   cur_cpu_spec->cpu_features |= CPU_FTR_MMCRA;
+   cur_cpu_spec->cpu_user_features |= PPC_FEATURE_PSERIES_PERFMON_COMPAT;
+
+   cur_cpu_spec->num_pmcs  = 6;
+   cur_cpu_spec->pmc_type  = PPC_PMC_IBM;
+   cur_cpu_spec->oprofile_cpu_type = "ppc64/power10";
+
+   return 1;
+}
+
 static int __init feat_enable_tm(struct dt_cpu_feature *f)
 {
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
@@ -638,6 +663,7 @@ struct dt_cpu_feature_match {
{"pc-relative-addressing", feat_enable, 0},
{"machine-check-power9", feat_enable_mce_power9, 0},
{"performance-monitor-power9", feat_enable_pmu_power9, 0},
+   {"performance-monitor-power10", feat_enable_pmu_power10, 0},
{"event-based-branch-v3", feat_enable, 0},
{"random-number-generator", feat_enable, 0},
{"system-call-vectored", feat_disable, 0},
-- 
1.8.3.1



[PATCH v2 03/10] powerpc/xmon: Add PowerISA v3.1 PMU SPRs

2020-07-01 Thread Athira Rajeev
From: Madhavan Srinivasan 

PowerISA v3.1 added three new performance
monitoring unit (PMU) special purpose registers (SPRs).
They are Monitor Mode Control Register 3 (MMCR3),
Sampled Instruction Event Register 2 (SIER2),
Sampled Instruction Event Register 3 (SIER3).

Patch here adds a new dump function dump_310_sprs
to print these SPR values.

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/xmon/xmon.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 7efe4bc..8917fe8 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2022,6 +2022,20 @@ static void dump_300_sprs(void)
 #endif
 }
 
+static void dump_310_sprs(void)
+{
+#ifdef CONFIG_PPC64
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return;
+
+   printf("mmcr3  = %.16lx\n",
+   mfspr(SPRN_MMCR3));
+
+   printf("sier2  = %.16lx  sier3  = %.16lx\n",
+   mfspr(SPRN_SIER2), mfspr(SPRN_SIER3));
+#endif
+}
+
 static void dump_one_spr(int spr, bool show_unimplemented)
 {
unsigned long val;
@@ -2076,6 +2090,7 @@ static void super_regs(void)
dump_206_sprs();
dump_207_sprs();
dump_300_sprs();
+   dump_310_sprs();
 
return;
}
-- 
1.8.3.1



[PATCH v2 02/10] KVM: PPC: Book3S HV: Save/restore new PMU registers

2020-07-01 Thread Athira Rajeev
PowerISA v3.1 has added new performance monitoring unit (PMU)
special purpose registers (SPRs). They are

Monitor Mode Control Register 3 (MMCR3)
Sampled Instruction Event Register 2 (SIER2)
Sampled Instruction Event Register 3 (SIER3)

Patch adds support to save/restore these new
SPRs while entering/exiting the guest.

Signed-off-by: Athira Rajeev 
---
 arch/powerpc/include/asm/kvm_book3s_asm.h |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  4 ++--
 arch/powerpc/kernel/asm-offsets.c |  3 +++
 arch/powerpc/kvm/book3s_hv.c  |  6 --
 arch/powerpc/kvm/book3s_hv_interrupts.S   |  8 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 24 
 6 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 45704f2..078f464 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -119,7 +119,7 @@ struct kvmppc_host_state {
void __iomem *xive_tima_virt;
u32 saved_xirr;
u64 dabr;
-   u64 host_mmcr[7];   /* MMCR 0,1,A, SIAR, SDAR, MMCR2, SIER */
+   u64 host_mmcr[10];  /* MMCR 0,1,A, SIAR, SDAR, MMCR2, SIER, MMCR3, SIER2/3 */
u32 host_pmc[8];
u64 host_purr;
u64 host_spurr;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7e2d061..d718061 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -637,12 +637,12 @@ struct kvm_vcpu_arch {
u32 ccr1;
u32 dbsr;
 
-   u64 mmcr[5];
+   u64 mmcr[6];
u32 pmc[8];
u32 spmc[2];
u64 siar;
u64 sdar;
-   u64 sier;
+   u64 sier[3];
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
u64 tfhar;
u64 texasr;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 6657dc6..20a8b1e 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -696,6 +696,9 @@ int main(void)
HSTATE_FIELD(HSTATE_SDAR, host_mmcr[4]);
HSTATE_FIELD(HSTATE_MMCR2, host_mmcr[5]);
HSTATE_FIELD(HSTATE_SIER, host_mmcr[6]);
+   HSTATE_FIELD(HSTATE_MMCR3, host_mmcr[7]);
+   HSTATE_FIELD(HSTATE_SIER2, host_mmcr[8]);
+   HSTATE_FIELD(HSTATE_SIER3, host_mmcr[9]);
HSTATE_FIELD(HSTATE_PMC1, host_pmc[0]);
HSTATE_FIELD(HSTATE_PMC2, host_pmc[1]);
HSTATE_FIELD(HSTATE_PMC3, host_pmc[2]);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6bf66649..c265800 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1698,7 +1698,8 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
*val = get_reg_val(id, vcpu->arch.sdar);
break;
case KVM_REG_PPC_SIER:
-   *val = get_reg_val(id, vcpu->arch.sier);
+   i = id - KVM_REG_PPC_SIER;
+   *val = get_reg_val(id, vcpu->arch.sier[i]);
break;
case KVM_REG_PPC_IAMR:
*val = get_reg_val(id, vcpu->arch.iamr);
@@ -1919,7 +1920,8 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
vcpu->arch.sdar = set_reg_val(id, *val);
break;
case KVM_REG_PPC_SIER:
-   vcpu->arch.sier = set_reg_val(id, *val);
+   i = id - KVM_REG_PPC_SIER;
+   vcpu->arch.sier[i] = set_reg_val(id, *val);
break;
case KVM_REG_PPC_IAMR:
vcpu->arch.iamr = set_reg_val(id, *val);
diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 63fd81f..59822cb 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -140,6 +140,14 @@ BEGIN_FTR_SECTION
std r8, HSTATE_MMCR2(r13)
std r9, HSTATE_SIER(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+BEGIN_FTR_SECTION
+   mfspr   r5, SPRN_MMCR3
+   mfspr   r6, SPRN_SIER2
+   mfspr   r7, SPRN_SIER3
+   std r5, HSTATE_MMCR3(r13)
+   std r6, HSTATE_SIER2(r13)
+   std r7, HSTATE_SIER3(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_31)
mfspr   r3, SPRN_PMC1
mfspr   r5, SPRN_PMC2
mfspr   r6, SPRN_PMC3
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 7194389..57b6c14 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -3436,6 +3436,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG)
mtspr   SPRN_SIAR, r7
mtspr   SPRN_SDAR, r8
 BEGIN_FTR_SECTION
+   ld  r5, VCPU_MMCR + 40(r4)
+   ld  r6, VCPU_SIER + 8(r4)
+   ld  r7, VCPU_SIER + 16(r4)
+   mtspr   SPRN_MMCR3, r5
+   mtspr   SPRN_SIER2, r6
+   mtspr   SPRN_SIER3, r7
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_31)
+BEGIN_FTR_SECTION
   

[PATCH v2 01/10] powerpc/perf: Add support for ISA3.1 PMU SPRs

2020-07-01 Thread Athira Rajeev
From: Madhavan Srinivasan 

PowerISA v3.1 includes new performance monitoring unit (PMU)
special purpose registers (SPRs). They are

Monitor Mode Control Register 3 (MMCR3)
Sampled Instruction Event Register 2 (SIER2)
Sampled Instruction Event Register 3 (SIER3)

MMCR3 is added for further sampling related configuration
control. SIER2/SIER3 are added to provide additional
information about the sampled instruction.

Patch adds a new PPMU flag called "PPMU_ARCH_310S" to support
handling of these new SPRs, updates struct thread_struct
to include them, and increases the size of the mmcr[] array
by one to include MMCR3 in struct cpu_hw_event. This is needed
to support programming of the MMCR3 SPR during event_[enable/disable].
Patch also adds sysfs support for the MMCR3 SPR along with
SPRN_ macros for these new PMU SPRs.

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/perf_event_server.h |  1 +
 arch/powerpc/include/asm/processor.h |  4 
 arch/powerpc/include/asm/reg.h   |  6 ++
 arch/powerpc/kernel/sysfs.c  |  8 
 arch/powerpc/perf/core-book3s.c  | 29 ++--
 5 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 3e9703f..895aeaa 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -69,6 +69,7 @@ struct power_pmu {
 #define PPMU_HAS_SIER  0x00000040 /* Has SIER */
 #define PPMU_ARCH_207S 0x00000080 /* PMC is architecture v2.07S */
 #define PPMU_NO_SIAR   0x00000100 /* Do not use SIAR */
+#define PPMU_ARCH_310S 0x00000200 /* Has MMCR3, SIER2 and SIER3 */
 
 /*
  * Values for flags to get_alternatives()
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 52a6783..a466e94 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -272,6 +272,10 @@ struct thread_struct {
unsignedmmcr0;
 
unsignedused_ebb;
+   unsigned long   mmcr3;
+   unsigned long   sier2;
+   unsigned long   sier3;
+
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 88e6c78..21a1b2d 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -876,7 +876,9 @@
 #define   MMCR0_FCHV   0x00000001UL /* freeze conditions in hypervisor mode */
 #define SPRN_MMCR1 798
 #define SPRN_MMCR2 785
+#define SPRN_MMCR3 754
 #define SPRN_UMMCR2 769
+#define SPRN_UMMCR3 738
 #define SPRN_MMCRA 0x312
 #define   MMCRA_SDSYNC 0x80000000UL /* SDAR synced with SIAR */
 #define   MMCRA_SDAR_DCACHE_MISS 0x40000000UL
@@ -918,6 +920,10 @@
 #define   SIER_SIHV 0x1000000   /* Sampled MSR_HV */
 #define   SIER_SIAR_VALID 0x0400000   /* SIAR contents valid */
 #define   SIER_SDAR_VALID 0x0200000   /* SDAR contents valid */
+#define SPRN_SIER2 752
+#define SPRN_SIER3 753
+#define SPRN_USIER2 736
+#define SPRN_USIER3 737
 #define SPRN_SIAR  796
 #define SPRN_SDAR  797
 #define SPRN_TACR  888
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 571b325..46b4ebc 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -622,8 +622,10 @@ void ppc_enable_pmcs(void)
 SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
 
 SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
+SYSFS_PMCSETUP(mmcr3, SPRN_MMCR3);
 
 static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
+static DEVICE_ATTR(mmcr3, 0600, show_mmcr3, store_mmcr3);
 #endif /* HAS_PPC_PMC56 */
 
 
@@ -886,6 +888,9 @@ static int register_cpu_online(unsigned int cpu)
 #ifdef CONFIG_PMU_SYSFS
if (cpu_has_feature(CPU_FTR_MMCRA))
device_create_file(s, &dev_attr_mmcra);
+
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   device_create_file(s, &dev_attr_mmcr3);
 #endif /* CONFIG_PMU_SYSFS */
 
if (cpu_has_feature(CPU_FTR_PURR)) {
@@ -980,6 +985,9 @@ static int unregister_cpu_online(unsigned int cpu)
 #ifdef CONFIG_PMU_SYSFS
if (cpu_has_feature(CPU_FTR_MMCRA))
device_remove_file(s, &dev_attr_mmcra);
+
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   device_remove_file(s, &dev_attr_mmcr3);
 #endif /* CONFIG_PMU_SYSFS */
 
if (cpu_has_feature(CPU_FTR_PURR)) {
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index cd6a742..5c64bd3 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -39,10 +39,10 @@ struct cpu_hw_events {
unsigned int flags[MAX_HWEVENTS];
/*
 * The order of the MMCR array is:
-*  - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2
+*  - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2, MMCR3
 *  - 32-bit, MMCR0, MMCR1, MMCR2
 */
-   unsigned long mmcr[4];
+   unsigned 

[PATCH v2 00/10] powerpc/perf: Add support for power10 PMU Hardware

2020-07-01 Thread Athira Rajeev
The patch series adds support for power10 PMU hardware.

Anju T Sudhakar (2):
  powerpc/perf: Add support for outputting extended regs in perf
intr_regs
  tools/perf: Add perf tools support for extended register capability in
powerpc

Athira Rajeev (5):
  KVM: PPC: Book3S HV: Save/restore new PMU registers
  powerpc/perf: Update Power PMU cache_events to u64 type
  powerpc/perf: power10 Performance Monitoring support
  powerpc/perf: support BHRB disable bit and new filtering modes
  powerpc/perf: Add extended regs support for power10 platform

Madhavan Srinivasan (3):
  powerpc/perf: Add support for ISA3.1 PMU SPRs
  powerpc/xmon: Add PowerISA v3.1 PMU SPRs
  powerpc/perf: Add power10_feat to dt_cpu_ftrs

---

Changes from v1 -> v2
- Added support for extended regs in powerpc
  for power9/power10 platform ( patches 8, 9, 10)
- Addressed change/removal of some event codes
  in the PMU driver

---

 arch/powerpc/include/asm/kvm_book3s_asm.h   |   2 +-
 arch/powerpc/include/asm/kvm_host.h |   4 +-
 arch/powerpc/include/asm/perf_event_server.h|  11 +-
 arch/powerpc/include/asm/processor.h|   4 +
 arch/powerpc/include/asm/reg.h  |   9 +
 arch/powerpc/include/uapi/asm/perf_regs.h   |  20 +-
 arch/powerpc/kernel/asm-offsets.c   |   3 +
 arch/powerpc/kernel/cpu_setup_power.S   |   7 +
 arch/powerpc/kernel/dt_cpu_ftrs.c   |  26 ++
 arch/powerpc/kernel/sysfs.c |   8 +
 arch/powerpc/kvm/book3s_hv.c|   6 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S |   8 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  24 ++
 arch/powerpc/perf/Makefile  |   2 +-
 arch/powerpc/perf/core-book3s.c |  61 +++-
 arch/powerpc/perf/generic-compat-pmu.c  |   2 +-
 arch/powerpc/perf/internal.h|   1 +
 arch/powerpc/perf/isa207-common.c   |  72 +++-
 arch/powerpc/perf/isa207-common.h   |  33 +-
 arch/powerpc/perf/mpc7450-pmu.c |   2 +-
 arch/powerpc/perf/perf_regs.c   |  42 ++-
 arch/powerpc/perf/power10-events-list.h |  70 
 arch/powerpc/perf/power10-pmu.c | 425 
 arch/powerpc/perf/power5+-pmu.c |   2 +-
 arch/powerpc/perf/power5-pmu.c  |   2 +-
 arch/powerpc/perf/power6-pmu.c  |   2 +-
 arch/powerpc/perf/power7-pmu.c  |   2 +-
 arch/powerpc/perf/power8-pmu.c  |   2 +-
 arch/powerpc/perf/power9-pmu.c  |   8 +-
 arch/powerpc/perf/ppc970-pmu.c  |   2 +-
 arch/powerpc/platforms/powernv/idle.c   |  14 +
 arch/powerpc/xmon/xmon.c|  15 +
 tools/arch/powerpc/include/uapi/asm/perf_regs.h |  20 +-
 tools/perf/arch/powerpc/include/perf_regs.h |   8 +-
 tools/perf/arch/powerpc/util/perf_regs.c|  61 
 35 files changed, 939 insertions(+), 41 deletions(-)
 create mode 100644 arch/powerpc/perf/power10-events-list.h
 create mode 100644 arch/powerpc/perf/power10-pmu.c

-- 
1.8.3.1



[PATCH 20/20] block: remove direct_make_request

2020-07-01 Thread Christoph Hellwig
Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c  | 28 
 drivers/md/dm.c   |  5 +
 drivers/nvme/host/multipath.c |  2 +-
 include/linux/blkdev.h|  1 -
 4 files changed, 2 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2ff166f0d24ee3..bf882b8d84450c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1211,34 +1211,6 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio_noacct);
 
-/**
- * direct_make_request - hand a buffer directly to its device driver for I/O
- * @bio:  The bio describing the location in memory and on the device.
- *
- * This function behaves like submit_bio_noacct(), but does not protect
- * against recursion.  Must only be used if the called driver is known
- * to be blk-mq based.
- */
-blk_qc_t direct_make_request(struct bio *bio)
-{
-   struct gendisk *disk = bio->bi_disk;
-
-   if (WARN_ON_ONCE(!disk->queue->mq_ops)) {
-   bio_io_error(bio);
-   return BLK_QC_T_NONE;
-   }
-   if (!submit_bio_checks(bio))
-   return BLK_QC_T_NONE;
-   if (unlikely(bio_queue_enter(bio)))
-   return BLK_QC_T_NONE;
-   if (!blk_crypto_bio_prep(&bio)) {
-   blk_queue_exit(disk->queue);
-   return BLK_QC_T_NONE;
-   }
-   return blk_mq_submit_bio(bio);
-}
-EXPORT_SYMBOL_GPL(direct_make_request);
-
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The  bio which describes the I/O
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b32b539dbace56..2cb33896198c4c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1302,10 +1302,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
/* the bio has been remapped so dispatch it */
trace_block_bio_remap(clone->bi_disk->queue, clone,
  bio_dev(io->orig_bio), sector);
-   if (md->type == DM_TYPE_NVME_BIO_BASED)
-   ret = direct_make_request(clone);
-   else
-   ret = submit_bio_noacct(clone);
+   ret = submit_bio_noacct(clone);
break;
case DM_MAPIO_KILL:
free_tio(tio);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f07fa47c251d9d..a986ac52c4cc7f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -314,7 +314,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
trace_block_bio_remap(bio->bi_disk->queue, bio,
  disk_devt(ns->head->disk),
  bio->bi_iter.bi_sector);
-   ret = direct_make_request(bio);
+   ret = submit_bio_noacct(bio);
} else if (nvme_available_path(head)) {
dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b73cfa6a5141df..1cc913ffdbe21e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -853,7 +853,6 @@ static inline void rq_flush_dcache_pages(struct request *rq)
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 blk_qc_t submit_bio_noacct(struct bio *bio);
-extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_put_request(struct request *);
 extern struct request *blk_get_request(struct request_queue *, unsigned int op,
-- 
2.26.2



[PATCH 19/20] block: shortcut __submit_bio_noacct for blk-mq drivers

2020-07-01 Thread Christoph Hellwig
For blk-mq drivers bios can only be inserted for the same queue.  So
bypass the complicated sorting logic in __submit_bio_noacct with
a simpler blk-mq submission helper.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 57b5dc00d44cb1..2ff166f0d24ee3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1152,6 +1152,34 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
return ret;
 }
 
+static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
+{
+   struct gendisk *disk = bio->bi_disk;
+   struct bio_list bio_list;
+   blk_qc_t ret = BLK_QC_T_NONE;
+
+   bio_list_init(&bio_list);
+   current->bio_list = &bio_list;
+
+   do {
+   WARN_ON_ONCE(bio->bi_disk != disk);
+
+   if (unlikely(bio_queue_enter(bio) != 0))
+   continue;
+
+   if (!blk_crypto_bio_prep(&bio)) {
+   blk_queue_exit(disk->queue);
+   ret = BLK_QC_T_NONE;
+   continue;
+   }
+
+   ret = blk_mq_submit_bio(bio);
+   } while ((bio = bio_list_pop(&bio_list)));
+
+   current->bio_list = NULL;
+   return ret;
+}
+
 /**
  * submit_bio_noacct - re-submit a bio to the block device layer for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -1177,6 +1205,8 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
return BLK_QC_T_NONE;
}
 
+   if (!bio->bi_disk->fops->submit_bio)
+   return __submit_bio_noacct_mq(bio);
return __submit_bio_noacct(bio);
 }
 EXPORT_SYMBOL(submit_bio_noacct);
-- 
2.26.2



[PATCH 17/20] block: rename generic_make_request to submit_bio_noacct

2020-07-01 Thread Christoph Hellwig
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig 
---
 Documentation/block/biodoc.rst|  2 +-
 .../fault-injection/fault-injection.rst   |  2 +-
 Documentation/trace/ftrace.rst|  4 +--
 block/bio.c   | 14 
 block/blk-core.c  | 32 +--
 block/blk-crypto-fallback.c   |  2 +-
 block/blk-crypto.c|  2 +-
 block/blk-merge.c |  2 +-
 block/blk-throttle.c  |  4 +--
 block/bounce.c|  2 +-
 drivers/block/drbd/drbd_int.h |  6 ++--
 drivers/block/drbd/drbd_main.c|  2 +-
 drivers/block/drbd/drbd_receiver.c|  2 +-
 drivers/block/drbd/drbd_req.c |  2 +-
 drivers/block/drbd/drbd_worker.c  |  2 +-
 drivers/block/pktcdvd.c   |  2 +-
 drivers/lightnvm/pblk-read.c  |  2 +-
 drivers/md/bcache/bcache.h|  2 +-
 drivers/md/bcache/btree.c |  2 +-
 drivers/md/bcache/request.c   |  7 ++--
 drivers/md/dm-cache-target.c  |  6 ++--
 drivers/md/dm-clone-target.c  | 10 +++---
 drivers/md/dm-crypt.c |  6 ++--
 drivers/md/dm-delay.c |  2 +-
 drivers/md/dm-era-target.c|  2 +-
 drivers/md/dm-integrity.c |  4 +--
 drivers/md/dm-mpath.c |  2 +-
 drivers/md/dm-raid1.c |  2 +-
 drivers/md/dm-snap-persistent.c   |  2 +-
 drivers/md/dm-snap.c  |  6 ++--
 drivers/md/dm-thin.c  |  4 +--
 drivers/md/dm-verity-target.c |  2 +-
 drivers/md/dm-writecache.c|  2 +-
 drivers/md/dm-zoned-target.c  |  2 +-
 drivers/md/dm.c   | 10 +++---
 drivers/md/md-faulty.c|  4 +--
 drivers/md/md-linear.c|  4 +--
 drivers/md/md-multipath.c |  4 +--
 drivers/md/raid0.c|  8 ++---
 drivers/md/raid1.c| 14 
 drivers/md/raid10.c   | 28 
 drivers/md/raid5.c| 10 +++---
 drivers/nvme/host/multipath.c |  2 +-
 include/linux/blkdev.h|  2 +-
 44 files changed, 115 insertions(+), 118 deletions(-)

diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst
index 267384159bf793..afda5e30a82e5a 100644
--- a/Documentation/block/biodoc.rst
+++ b/Documentation/block/biodoc.rst
@@ -1036,7 +1036,7 @@ Now the generic block layer performs partition-remapping early and thus
 provides drivers with a sector number relative to whole device, rather than
 having to take partition number into account in order to arrive at the true
 sector number. The routine blk_partition_remap() is invoked by
-generic_make_request even before invoking the queue specific ->submit_bio,
+submit_bio_noacct even before invoking the queue specific ->submit_bio,
 so the i/o scheduler also gets to operate on whole disk sector numbers. This
 should typically not require changes to block drivers, it just never gets
 to invoke its own partition sector offset calculations since all bios
diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
index f51bb21d20e44b..f850ad018b70a8 100644
--- a/Documentation/fault-injection/fault-injection.rst
+++ b/Documentation/fault-injection/fault-injection.rst
@@ -24,7 +24,7 @@ Available fault injection capabilities
 
   injects disk IO errors on devices permitted by setting
   /sys/block//make-it-fail or
-  /sys/block///make-it-fail. (generic_make_request())
+  /sys/block///make-it-fail. (submit_bio_noacct())
 
 - fail_mmc_request
 
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 430a16283103d4..80ba765a82379e 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1453,7 +1453,7 @@ function-trace, we get a much larger output::
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
-   => generic_make_request
+   => submit_bio_noacct
=> submit_bio
=> submit_bh
=> __ext3_get_inode_loc
@@ -1738,7 +1738,7 @@ tracers.
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
-   => generic_make_request
+   => submit_bio_noacct
=> submit_bio
=> submit_bh
=> ext3_bread
diff --git a/block/bio.c b/block/bio.c
index fc1299f9d86a24..ef91782fd668ce 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -358,7 +358,7 @@ 

[PATCH 18/20] block: refactor submit_bio_noacct

2020-07-01 Thread Christoph Hellwig
Split out a __submit_bio_noacct helper for the actual de-recursion
algorithm, and simplify the loop by using a continue when we can't
enter the queue for a bio.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 143 +--
 1 file changed, 75 insertions(+), 68 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ff9a88d2d244cb..57b5dc00d44cb1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1084,6 +1084,74 @@ static blk_qc_t __submit_bio(struct bio *bio)
return ret;
 }
 
+/*
+ * The loop in this function may be a bit non-obvious, and so deserves some
+ * explanation:
+ *
+ *  - Before entering the loop, bio->bi_next is NULL (as all callers ensure
+ *that), so we have a list with a single bio.
+ *  - We pretend that we have just taken it off a longer list, so we assign
+ *bio_list to a pointer to the bio_list_on_stack, thus initialising the
+ *bio_list of new bios to be added.  ->submit_bio() may indeed add some more
+ *bios through a recursive call to submit_bio_noacct.  If it did, we find a
+ *non-NULL value in bio_list and re-enter the loop from the top.
+ *  - In this case we really did just take the bio of the top of the list (no
+ *pretending) and so remove it from bio_list, and call into ->submit_bio()
+ *again.
+ *
+ * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
+ * bio_list_on_stack[1] contains bios that were submitted before the current
+ * ->submit_bio_bio, but that haven't been processed yet.
+ */
+static blk_qc_t __submit_bio_noacct(struct bio *bio)
+{
+   struct bio_list bio_list_on_stack[2];
+   blk_qc_t ret = BLK_QC_T_NONE;
+
+   BUG_ON(bio->bi_next);
+
+   bio_list_init(&bio_list_on_stack[0]);
+   current->bio_list = bio_list_on_stack;
+
+   do {
+   struct request_queue *q = bio->bi_disk->queue;
+   struct bio_list lower, same;
+
+   if (unlikely(bio_queue_enter(bio) != 0))
+   continue;
+
+   /*
+* Create a fresh bio_list for all subordinate requests.
+*/
+   bio_list_on_stack[1] = bio_list_on_stack[0];
+   bio_list_init(&bio_list_on_stack[0]);
+
+   ret = __submit_bio(bio);
+
+   /*
+* Sort new bios into those for a lower level and those for the
+* same level.
+*/
+   bio_list_init(&lower);
+   bio_list_init(&same);
+   while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
+   if (q == bio->bi_disk->queue)
+   bio_list_add(&same, bio);
+   else
+   bio_list_add(&lower, bio);
+
+   /*
+* Now assemble so we handle the lowest level first.
+*/
+   bio_list_merge(&bio_list_on_stack[0], &lower);
+   bio_list_merge(&bio_list_on_stack[0], &same);
+   bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
+   } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
+
+   current->bio_list = NULL;
+   return ret;
+}
+
 /**
  * submit_bio_noacct - re-submit a bio to the block device layer for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -1095,82 +1163,21 @@ static blk_qc_t __submit_bio(struct bio *bio)
  */
 blk_qc_t submit_bio_noacct(struct bio *bio)
 {
-   /*
-* bio_list_on_stack[0] contains bios submitted by the current
-* ->submit_bio.
-* bio_list_on_stack[1] contains bios that were submitted before the
-* current ->submit_bio_bio, but that haven't been processed yet.
-*/
-   struct bio_list bio_list_on_stack[2];
-   blk_qc_t ret = BLK_QC_T_NONE;
-
if (!submit_bio_checks(bio))
-   goto out;
+   return BLK_QC_T_NONE;
 
/*
-* We only want one ->submit_bio to be active at a time, else
-* stack usage with stacked devices could be a problem.  So use
-* current->bio_list to keep a list of requests submited by a
-* ->submit_bio method.  current->bio_list is also used as a
-* flag to say if submit_bio_noacct is currently active in this
-* task or not.  If it is NULL, then no make_request is active.  If
-* it is non-NULL, then a make_request is active, and new requests
-* should be added at the tail
+* We only want one ->submit_bio to be active at a time, else stack
+* usage with stacked devices could be a problem.  Use current->bio_list
+* to collect a list of requests submited by a ->submit_bio method while
+* it is active, and then process them after it returned.
 */
if (current->bio_list) {
bio_list_add(&current->bio_list[0], bio);
-   goto out;
+   return BLK_QC_T_NONE;
}
 
-   /* following loop 

[PATCH 16/20] block: move ->make_request_fn to struct block_device_operations

2020-07-01 Thread Christoph Hellwig
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig 
---
 Documentation/block/biodoc.rst|  2 +-
 .../block/writeback_cache_control.rst |  2 +-
 arch/m68k/emu/nfblock.c   |  5 +-
 arch/xtensa/platforms/iss/simdisk.c   |  5 +-
 block/blk-cgroup.c|  2 +-
 block/blk-core.c  | 53 +++
 block/blk-mq.c| 10 ++--
 block/blk.h   |  2 -
 drivers/block/brd.c   |  5 +-
 drivers/block/drbd/drbd_int.h |  2 +-
 drivers/block/drbd/drbd_main.c|  9 ++--
 drivers/block/drbd/drbd_req.c |  2 +-
 drivers/block/null_blk_main.c | 17 --
 drivers/block/pktcdvd.c   | 11 ++--
 drivers/block/ps3vram.c   | 15 +++---
 drivers/block/rsxx/dev.c  |  7 ++-
 drivers/block/umem.c  |  5 +-
 drivers/block/zram/zram_drv.c | 11 ++--
 drivers/lightnvm/core.c   |  8 +--
 drivers/lightnvm/pblk-init.c  | 12 +++--
 drivers/md/bcache/request.c   |  4 +-
 drivers/md/bcache/request.h   |  4 +-
 drivers/md/bcache/super.c | 23 +---
 drivers/md/dm.c   | 23 
 drivers/md/md.c   |  5 +-
 drivers/nvdimm/blk.c  |  5 +-
 drivers/nvdimm/btt.c  |  5 +-
 drivers/nvdimm/pmem.c |  5 +-
 drivers/nvme/host/core.c  |  1 +
 drivers/nvme/host/multipath.c |  5 +-
 drivers/nvme/host/nvme.h  |  1 +
 drivers/s390/block/dcssblk.c  |  9 ++--
 drivers/s390/block/xpram.c|  6 +--
 include/linux/blk-mq.h|  2 +-
 include/linux/blkdev.h|  7 +--
 include/linux/lightnvm.h  |  3 +-
 36 files changed, 153 insertions(+), 140 deletions(-)

diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst
index b964796ec9c780..267384159bf793 100644
--- a/Documentation/block/biodoc.rst
+++ b/Documentation/block/biodoc.rst
@@ -1036,7 +1036,7 @@ Now the generic block layer performs partition-remapping early and thus
 provides drivers with a sector number relative to whole device, rather than
 having to take partition number into account in order to arrive at the true
 sector number. The routine blk_partition_remap() is invoked by
-generic_make_request even before invoking the queue specific make_request_fn,
+generic_make_request even before invoking the queue specific ->submit_bio,
 so the i/o scheduler also gets to operate on whole disk sector numbers. This
 should typically not require changes to block drivers, it just never gets
 to invoke its own partition sector offset calculations since all bios
diff --git a/Documentation/block/writeback_cache_control.rst b/Documentation/block/writeback_cache_control.rst
index 2c752c57c14c62..b208488d0aae85 100644
--- a/Documentation/block/writeback_cache_control.rst
+++ b/Documentation/block/writeback_cache_control.rst
@@ -47,7 +47,7 @@ the Forced Unit Access is implemented.  The REQ_PREFLUSH and REQ_FUA flags
 may both be set on a single bio.
 
 
-Implementation details for make_request_fn based block drivers
+Implementation details for bio based block drivers
 --
 
 These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit
diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c
index 87e8b1700acd28..92d26c81244134 100644
--- a/arch/m68k/emu/nfblock.c
+++ b/arch/m68k/emu/nfblock.c
@@ -59,7 +59,7 @@ struct nfhd_device {
struct gendisk *disk;
 };
 
-static blk_qc_t nfhd_make_request(struct request_queue *queue, struct bio *bio)
+static blk_qc_t nfhd_submit_bio(struct bio *bio)
 {
struct nfhd_device *dev = bio->bi_disk->private_data;
struct bio_vec bvec;
@@ -93,6 +93,7 @@ static int nfhd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 
 static const struct block_device_operations nfhd_ops = {
.owner  = THIS_MODULE,
+   .submit_bio = nfhd_submit_bio,
.getgeo = nfhd_getgeo,
 };
 
@@ -118,7 +119,7 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 bsize)
dev->bsize = bsize;
dev->bshift = ffs(bsize) - 10;
 
-   dev->queue = blk_alloc_queue(nfhd_make_request, NUMA_NO_NODE);
+   dev->queue = 

[PATCH 15/20] block: remove the nr_sectors variable in generic_make_request_checks

2020-07-01 Thread Christoph Hellwig
The variable is only used once, so just open code the bio_sectors()
call there.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 37435d0d433564..28f60985dc75cc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -974,7 +974,6 @@ static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
struct request_queue *q = bio->bi_disk->queue;
-   int nr_sectors = bio_sectors(bio);
blk_status_t status = BLK_STS_IOERR;
 
might_sleep();
@@ -1007,7 +1006,7 @@ generic_make_request_checks(struct bio *bio)
if (op_is_flush(bio->bi_opf) &&
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
-   if (!nr_sectors) {
+   if (!bio_sectors(bio)) {
status = BLK_STS_OK;
goto end_io;
}
-- 
2.26.2



[PATCH 12/20] block: remove the request_queue argument from blk_queue_split

2020-07-01 Thread Christoph Hellwig
The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig 
---
 block/blk-merge.c | 21 ++---
 block/blk-mq.c|  2 +-
 block/blk.h   |  3 +--
 drivers/block/drbd/drbd_req.c |  2 +-
 drivers/block/pktcdvd.c   |  2 +-
 drivers/block/ps3vram.c   |  2 +-
 drivers/block/rsxx/dev.c  |  2 +-
 drivers/block/umem.c  |  2 +-
 drivers/lightnvm/pblk-init.c  |  4 ++--
 drivers/md/dm.c   |  2 +-
 drivers/md/md.c   |  2 +-
 drivers/nvme/host/multipath.c |  9 -
 drivers/s390/block/dcssblk.c  |  2 +-
 drivers/s390/block/xpram.c|  2 +-
 include/linux/blkdev.h|  2 +-
 15 files changed, 28 insertions(+), 31 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 9c9fb21584b64e..20fa2290604105 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -283,20 +283,20 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 
 /**
  * __blk_queue_split - split a bio and submit the second half
- * @q:   [in] request queue pointer
  * @bio: [in, out] bio to be split
  * @nr_segs: [out] number of segments in the first bio
  *
  * Split a bio into two bios, chain the two bios, submit the second half and
  * store a pointer to the first half in *@bio. If the second bio is still too
  * big it will be split by a recursive call to this function. Since this
- * function may allocate a new bio from @q->bio_split, it is the responsibility
- * of the caller to ensure that @q is only released after processing of the
+ * function may allocate a new bio from @bio->bi_disk->queue->bio_split, it is
+ * the responsibility of the caller to ensure that
+ * @bio->bi_disk->queue->bio_split is only released after processing of the
  * split bio has finished.
  */
-void __blk_queue_split(struct request_queue *q, struct bio **bio,
-   unsigned int *nr_segs)
+void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
 {
+   struct request_queue *q = (*bio)->bi_disk->queue;
struct bio *split = NULL;
 
switch (bio_op(*bio)) {
@@ -345,20 +345,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 
 /**
  * blk_queue_split - split a bio and submit the second half
- * @q:   [in] request queue pointer
  * @bio: [in, out] bio to be split
  *
  * Split a bio into two bios, chains the two bios, submit the second half and
  * store a pointer to the first half in *@bio. Since this function may allocate
- * a new bio from @q->bio_split, it is the responsibility of the caller to
- * ensure that @q is only released after processing of the split bio has
- * finished.
+ * a new bio from @bio->bi_disk->queue->bio_split, it is the responsibility of
+ * the caller to ensure that @bio->bi_disk->queue->bio_split is only released
+ * after processing of the split bio has finished.
  */
-void blk_queue_split(struct request_queue *q, struct bio **bio)
+void blk_queue_split(struct bio **bio)
 {
unsigned int nr_segs;
 
-   __blk_queue_split(q, bio, &nr_segs);
+   __blk_queue_split(bio, &nr_segs);
 }
 EXPORT_SYMBOL(blk_queue_split);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 65e0846fd06519..dbadb7defd618a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2166,7 +2166,7 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
blk_status_t ret;
 
blk_queue_bounce(q, &bio);
-   __blk_queue_split(q, &bio, &nr_segs);
+   __blk_queue_split(&bio, &nr_segs);
 
if (!bio_integrity_prep(bio))
goto queue_exit;
diff --git a/block/blk.h b/block/blk.h
index 0184a31fe4dfaf..0114fd92c8a05b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -220,8 +220,7 @@ ssize_t part_timeout_show(struct device *, struct device_attribute *, char *);
 ssize_t part_timeout_store(struct device *, struct device_attribute *,
const char *, size_t);
 
-void __blk_queue_split(struct request_queue *q, struct bio **bio,
-   unsigned int *nr_segs);
+void __blk_queue_split(struct bio **bio, unsigned int *nr_segs);
 int ll_back_merge_fn(struct request *req, struct bio *bio,
unsigned int nr_segs);
 int ll_front_merge_fn(struct request *req,  struct bio *bio,
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 3f09b2ab977822..9368680474223a 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1598,7 +1598,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
struct drbd_device *device = bio->bi_disk->private_data;
unsigned long start_jif;
 
-   blk_queue_split(q, &bio);
+   blk_queue_split(&bio);
 
start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 27a33adc41e487..29b0c62dc86c1f 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2434,7 +2434,7 @@ static blk_qc_t 

[PATCH 14/20] block: remove the NULL queue check in generic_make_request_checks

2020-07-01 Thread Christoph Hellwig
All registered disks must have a valid queue pointer, so don't bother to
log a warning for that case.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 95dca74534ff73..37435d0d433564 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -973,22 +973,12 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
-   struct request_queue *q;
+   struct request_queue *q = bio->bi_disk->queue;
int nr_sectors = bio_sectors(bio);
blk_status_t status = BLK_STS_IOERR;
-   char b[BDEVNAME_SIZE];
 
might_sleep();
 
-   q = bio->bi_disk->queue;
-   if (unlikely(!q)) {
-   printk(KERN_ERR
-  "generic_make_request: Trying to access "
-   "nonexistent block-device %s (%Lu)\n",
-   bio_devname(bio, b), (long long)bio->bi_iter.bi_sector);
-   goto end_io;
-   }
-
/*
 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
 * if queue is not a request based queue.
-- 
2.26.2



[PATCH 13/20] block: tidy up a warning in bio_check_ro

2020-07-01 Thread Christoph Hellwig
The "generic_make_request: " prefix has no value, and will soon become
stale.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 76cfd5709f66cd..95dca74534ff73 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -869,8 +869,7 @@ static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part)
return false;
 
WARN_ONCE(1,
-  "generic_make_request: Trying to write "
-   "to read-only block-device %s (partno %d)\n",
+  "Trying to write to read-only block-device %s (partno %d)\n",
bio_devname(bio, b), part->partno);
/* Older lvm-tools actually trigger this */
return false;
-- 
2.26.2



[PATCH 11/20] fs: remove a weird comment in submit_bh_wbc

2020-07-01 Thread Christoph Hellwig
All bios can get remapped if submitted to partitions.  No need to
comment on that.

Signed-off-by: Christoph Hellwig 
---
 fs/buffer.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 64fe82ec65ff1f..2725ebbcfdc246 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3040,12 +3040,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
if (test_set_buffer_req(bh) && (op == REQ_OP_WRITE))
clear_buffer_write_io_error(bh);
 
-   /*
-* from here on down, it's all bio -- do the initial mapping,
-* submit_bio -> generic_make_request may further map this bio around
-*/
bio = bio_alloc(GFP_NOIO, 1);
-
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio_set_dev(bio, bh->b_bdev);
bio->bi_write_hint = write_hint;
-- 
2.26.2



[PATCH 10/20] dm: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e44473fe0f4873..c8d91f271c272e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1789,7 +1789,7 @@ static blk_qc_t dm_process_bio(struct mapped_device *md,
 
 static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct mapped_device *md = q->queuedata;
+   struct mapped_device *md = bio->bi_disk->private_data;
blk_qc_t ret = BLK_QC_T_NONE;
int srcu_idx;
struct dm_table *map;
@@ -1995,7 +1995,6 @@ static struct mapped_device *alloc_dev(int minor)
md->queue = blk_alloc_queue(dm_make_request, numa_node_id);
if (!md->queue)
goto bad;
-   md->queue->queuedata = md;
 
md->disk = alloc_disk_node(1, md->numa_node_id);
if (!md->disk)
-- 
2.26.2



[PATCH 02/20] simdisk: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 arch/xtensa/platforms/iss/simdisk.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/xtensa/platforms/iss/simdisk.c b/arch/xtensa/platforms/iss/simdisk.c
index 49322b66cda931..31b5020077a059 100644
--- a/arch/xtensa/platforms/iss/simdisk.c
+++ b/arch/xtensa/platforms/iss/simdisk.c
@@ -103,7 +103,7 @@ static void simdisk_transfer(struct simdisk *dev, unsigned long sector,
 
 static blk_qc_t simdisk_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct simdisk *dev = q->queuedata;
+   struct simdisk *dev = bio->bi_disk->private_data;
struct bio_vec bvec;
struct bvec_iter iter;
sector_t sector = bio->bi_iter.bi_sector;
@@ -273,8 +273,6 @@ static int __init simdisk_setup(struct simdisk *dev, int which,
goto out_alloc_queue;
}
 
-   dev->queue->queuedata = dev;
-
dev->gd = alloc_disk(SIMDISK_MINORS);
if (dev->gd == NULL) {
pr_err("alloc_disk failed\n");
-- 
2.26.2



[PATCH 09/20] bcache: stop setting ->queuedata

2020-07-01 Thread Christoph Hellwig
Nothing in bcache actually uses the ->queuedata field.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/bcache/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 2014016f9a60d3..21aa168113d30b 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -876,7 +876,6 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
return -ENOMEM;
 
d->disk->queue  = q;
-   q->queuedata= d;
q->backing_dev_info->congested_data = d;
q->limits.max_hw_sectors= UINT_MAX;
q->limits.max_sectors   = UINT_MAX;
-- 
2.26.2



rename ->make_request_fn and move it to the block_device_operations v2

2020-07-01 Thread Christoph Hellwig
Hi Jens,

this series moves the make_request_fn method into block_device_operations
with the much more descriptive ->submit_bio name.  It then also gives
generic_make_request a more descriptive name, and further optimizes the
path to issue to blk-mq, removing the need for the direct_make_request
bypass.

Changes since v1:
 - fix a null pointer dereference when dispatching from bio to request
   based drivers
 - clean up a few more comments

Diffstat:
 Documentation/block/biodoc.rst|2 
 Documentation/block/writeback_cache_control.rst   |2 
 Documentation/fault-injection/fault-injection.rst |2 
 Documentation/trace/ftrace.rst|4 
 arch/m68k/emu/nfblock.c   |8 
 arch/xtensa/platforms/iss/simdisk.c   |9 
 block/bio.c   |   14 -
 block/blk-cgroup.c|2 
 block/blk-core.c  |  255 +-
 block/blk-crypto-fallback.c   |2 
 block/blk-crypto.c|2 
 block/blk-merge.c |   23 -
 block/blk-mq.c|   12 -
 block/blk-throttle.c  |4 
 block/blk.h   |5 
 block/bounce.c|2 
 drivers/block/brd.c   |5 
 drivers/block/drbd/drbd_int.h |8 
 drivers/block/drbd/drbd_main.c|   12 -
 drivers/block/drbd/drbd_receiver.c|2 
 drivers/block/drbd/drbd_req.c |8 
 drivers/block/drbd/drbd_worker.c  |2 
 drivers/block/null_blk_main.c |   19 +
 drivers/block/pktcdvd.c   |   15 -
 drivers/block/ps3vram.c   |   20 -
 drivers/block/rsxx/dev.c  |   14 -
 drivers/block/umem.c  |   11 
 drivers/block/zram/zram_drv.c |   14 -
 drivers/lightnvm/core.c   |8 
 drivers/lightnvm/pblk-init.c  |   16 -
 drivers/lightnvm/pblk-read.c  |2 
 drivers/md/bcache/bcache.h|2 
 drivers/md/bcache/btree.c |2 
 drivers/md/bcache/request.c   |   11 
 drivers/md/bcache/request.h   |4 
 drivers/md/bcache/super.c |   24 +-
 drivers/md/dm-cache-target.c  |6 
 drivers/md/dm-clone-target.c  |   10 
 drivers/md/dm-crypt.c |6 
 drivers/md/dm-delay.c |2 
 drivers/md/dm-era-target.c|2 
 drivers/md/dm-integrity.c |4 
 drivers/md/dm-mpath.c |2 
 drivers/md/dm-raid1.c |2 
 drivers/md/dm-snap-persistent.c   |2 
 drivers/md/dm-snap.c  |6 
 drivers/md/dm-thin.c  |4 
 drivers/md/dm-verity-target.c |2 
 drivers/md/dm-writecache.c|2 
 drivers/md/dm-zoned-target.c  |2 
 drivers/md/dm.c   |   41 +--
 drivers/md/md-faulty.c|4 
 drivers/md/md-linear.c|4 
 drivers/md/md-multipath.c |4 
 drivers/md/md.c   |7 
 drivers/md/raid0.c|8 
 drivers/md/raid1.c|   14 -
 drivers/md/raid10.c   |   28 +-
 drivers/md/raid5.c|   10 
 drivers/nvdimm/blk.c  |5 
 drivers/nvdimm/btt.c  |5 
 drivers/nvdimm/pmem.c |5 
 drivers/nvme/host/core.c  |1 
 drivers/nvme/host/multipath.c |   18 -
 drivers/nvme/host/nvme.h  |1 
 drivers/s390/block/dcssblk.c  |   11 
 drivers/s390/block/xpram.c|8 
 fs/buffer.c   |5 
 include/linux/blk-mq.h|2 
 include/linux/blkdev.h|   12 -
 include/linux/lightnvm.h  |3 
 71 files changed, 387 insertions(+), 408 deletions(-)


[PATCH 03/20] drbd: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_main.c | 1 -
 drivers/block/drbd/drbd_req.c  | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 45fbd526c453bc..26f4e0aa7393b4 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2805,7 +2805,6 @@ enum drbd_ret_code drbd_create_device(struct 
drbd_config_context *adm_ctx, unsig
if (!q)
goto out_no_q;
device->rq_queue = q;
-   q->queuedata   = device;
 
disk = alloc_disk(1);
if (!disk)
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index c80a2f1c3c2a73..3f09b2ab977822 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1595,7 +1595,7 @@ void do_submit(struct work_struct *ws)
 
 blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct drbd_device *device = (struct drbd_device *) q->queuedata;
+   struct drbd_device *device = bio->bi_disk->private_data;
unsigned long start_jif;
 
blk_queue_split(q, &bio);
-- 
2.26.2



[PATCH 06/20] rsxx: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rsxx/dev.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 3ba07ab30c84f5..6a4d8d26e32cbd 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -119,7 +119,7 @@ static void bio_dma_done_cb(struct rsxx_cardinfo *card,
 
 static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct rsxx_cardinfo *card = q->queuedata;
+   struct rsxx_cardinfo *card = bio->bi_disk->private_data;
struct rsxx_bio_meta *bio_meta;
blk_status_t st = BLK_STS_IOERR;
 
@@ -267,8 +267,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
card->queue->limits.discard_alignment   = RSXX_HW_BLK_SIZE;
}
 
-   card->queue->queuedata = card;
-
snprintf(card->gendisk->disk_name, sizeof(card->gendisk->disk_name),
 "rsxx%d", card->disk_id);
card->gendisk->major = card->major;
@@ -289,7 +287,6 @@ void rsxx_destroy_dev(struct rsxx_cardinfo *card)
card->gendisk = NULL;
 
blk_cleanup_queue(card->queue);
-   card->queue->queuedata = NULL;
unregister_blkdev(card->major, DRIVER_NAME);
 }
 
-- 
2.26.2



[PATCH 07/20] umem: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/umem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 1e2aa5ae27963c..5498f1cf36b3fe 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -521,7 +521,8 @@ static int mm_check_plugged(struct cardinfo *card)
 
 static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct cardinfo *card = q->queuedata;
+   struct cardinfo *card = bio->bi_disk->private_data;
+
pr_debug("mm_make_request %llu %u\n",
 (unsigned long long)bio->bi_iter.bi_sector,
 bio->bi_iter.bi_size);
@@ -888,7 +889,6 @@ static int mm_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
card->queue = blk_alloc_queue(mm_make_request, NUMA_NO_NODE);
if (!card->queue)
goto failed_alloc;
-   card->queue->queuedata = card;
 
tasklet_init(&card->tasklet, process_page, (unsigned long)card);
 
-- 
2.26.2



[PATCH 04/20] null_blk: stop using ->queuedata for bio mode

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/null_blk_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 82259242b9b5c9..93ce0a00b2ae01 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1392,7 +1392,7 @@ static blk_qc_t null_queue_bio(struct request_queue *q, struct bio *bio)
 {
sector_t sector = bio->bi_iter.bi_sector;
sector_t nr_sectors = bio_sectors(bio);
-   struct nullb *nullb = q->queuedata;
+   struct nullb *nullb = bio->bi_disk->private_data;
struct nullb_queue *nq = nullb_to_queue(nullb);
struct nullb_cmd *cmd;
 
-- 
2.26.2



[PATCH 01/20] nfblock: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Geert Uytterhoeven 
Acked-by: Geert Uytterhoeven 
---
 arch/m68k/emu/nfblock.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c
index c3a630440512e9..87e8b1700acd28 100644
--- a/arch/m68k/emu/nfblock.c
+++ b/arch/m68k/emu/nfblock.c
@@ -61,7 +61,7 @@ struct nfhd_device {
 
 static blk_qc_t nfhd_make_request(struct request_queue *queue, struct bio *bio)
 {
-   struct nfhd_device *dev = queue->queuedata;
+   struct nfhd_device *dev = bio->bi_disk->private_data;
struct bio_vec bvec;
struct bvec_iter iter;
int dir, len, shift;
@@ -122,7 +122,6 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 bsize)
if (dev->queue == NULL)
goto free_dev;
 
-   dev->queue->queuedata = dev;
blk_queue_logical_block_size(dev->queue, bsize);
 
dev->disk = alloc_disk(16);
-- 
2.26.2



[PATCH 05/20] ps3vram: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/ps3vram.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 821d4d8b1d763e..5a1d1d137c7248 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -587,7 +587,7 @@ static struct bio *ps3vram_do_bio(struct ps3_system_bus_device *dev,
 
 static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct ps3_system_bus_device *dev = q->queuedata;
+   struct ps3_system_bus_device *dev = bio->bi_disk->private_data;
struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
int busy;
 
@@ -745,7 +745,6 @@ static int ps3vram_probe(struct ps3_system_bus_device *dev)
}
 
priv->queue = queue;
-   queue->queuedata = dev;
blk_queue_max_segments(queue, BLK_MAX_SEGMENTS);
blk_queue_max_segment_size(queue, BLK_MAX_SEGMENT_SIZE);
blk_queue_max_hw_sectors(queue, BLK_SAFE_MAX_SECTORS);
-- 
2.26.2



[PATCH 08/20] zram: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well, just use the gendisk's
private data field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/zram/zram_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6e2ad90b17a376..0564e3f384089e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1586,7 +1586,7 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
  */
 static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
 {
-   struct zram *zram = queue->queuedata;
+   struct zram *zram = bio->bi_disk->private_data;
 
if (!valid_io_request(zram, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size)) {
@@ -1912,7 +1912,6 @@ static int zram_add(void)
zram->disk->first_minor = device_id;
zram->disk->fops = _devops;
zram->disk->queue = queue;
-   zram->disk->queue->queuedata = zram;
zram->disk->private_data = zram;
snprintf(zram->disk->disk_name, 16, "zram%d", device_id);
 
-- 
2.26.2



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko


On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote:
> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.
> 
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
> 
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
> 
> v5.8-rc2
>  available: 2 nodes (0,2)
>  node 0 cpus:
>  node 0 size: 0 MB
>  node 0 free: 0 MB
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31490 MB
>  node distances:
>  node   0   2
>0:  10  20
>2:  20  10
> 
> proc and sys files
> --
>  /sys/devices/system/node/online:0,2
>  /proc/sys/kernel/numa_balancing:1
>  /sys/devices/system/node/has_cpu:   2
>  /sys/devices/system/node/has_memory:2
>  /sys/devices/system/node/has_normal_memory: 2
>  /sys/devices/system/node/possible:  0-31
> 
> v5.8-rc2 + patch
> --
>  available: 1 nodes (2)
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31487 MB
>  node distances:
>  node   2
>2:  10
> 
> proc and sys files
> --
> /sys/devices/system/node/online:2
> /proc/sys/kernel/numa_balancing:0
> /sys/devices/system/node/has_cpu:   2
> /sys/devices/system/node/has_memory:2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible:  0-31
> 
> Note: On Powerpc, cpu_to_node of possible but not present cpus would
> previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
> numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
> queried from vphn"). Without the 2 commits, Powerpc system might crash.
> 
> 1. User space applications like Numactl, lscpu, that parse the sysfs tend to
> believe there is an extra online node. This tends to confuse users and
> applications. Other user space applications start believing that system was
> not able to use all the resources (i.e missing resources) or the system was
> not setup correctly.
> 
> 2. Also existence of dummy node also leads to inconsistent information. The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
> 
> 3. When the dummy node is present, single node non-Numa systems end up showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

I have to say that I dislike the node online/offline state and directly
exporting that to the userspace. Users should only care whether the node
has memory/cpus. Numa nodes can be online without any memory. Just
offline all the present memory blocks but do not physically hot remove
them and you are in the same situation. If users are confused by the
output of tools like numactl -H, then those tools could be updated to
hide nodes without any memory.

The autonuma problem sounds interesting, but again this patch doesn't
really solve the underlying problem, because I strongly suspect that the
problem is still there when a numa node gets all its memory offlined as
mentioned above.

While I completely agree that making node 0 special is wrong, I still
have a hard time reviewing this very simple-looking patch because all the
numa initialization is so spread around that this might just blow up
at unexpected places. IIRC we have discussed testing in the previous
version and David has provided a way to emulate these configurations
on x86. Did you manage to use those instructions for additional testing
on architectures other than ppc?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: "Kirill A. Shutemov" 
> Cc: Christopher Lameter 
> Cc: Michael Ellerman 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Gautham R Shenoy 
> Cc: Satheesh Rajendran 
> Cc: David Hildenbrand 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v4:->v5:
> - rebased to v5.8-rc2
> link v4: 
> http://lore.kernel.org/lkml/20200512132937.19295-1-sri...@linux.vnet.ibm.com/t/#u
> 
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> Link v2: 
> https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u
> 
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..5187664558e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>   [N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> + [N_ONLINE] = 

[PATCH v2 30/30] misc: cxl: flash: Remove unused pointer

2020-07-01 Thread Lee Jones
The DRC index pointer is updated on an OPCODE_ADD, but never
actually read.  Remove the unused pointer and shift OPCODE_ADD up
to group it with OPCODE_DELETE, which is also a no-op.

Fixes the following W=1 kernel build warning:

 drivers/misc/cxl/flash.c: In function ‘update_devicetree’:
 drivers/misc/cxl/flash.c:178:16: warning: variable ‘drc_index’ set but not used [-Wunused-but-set-variable]
 178 | __be32 *data, drc_index, phandle;
 | ^

Cc: Frederic Barrat 
Cc: Andrew Donnellan 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/misc/cxl/flash.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/cxl/flash.c b/drivers/misc/cxl/flash.c
index cb9cca35a2263..24e3dfcc91a74 100644
--- a/drivers/misc/cxl/flash.c
+++ b/drivers/misc/cxl/flash.c
@@ -175,7 +175,7 @@ static int update_devicetree(struct cxl *adapter, s32 scope)
struct update_nodes_workarea *unwa;
u32 action, node_count;
int token, rc, i;
-   __be32 *data, drc_index, phandle;
+   __be32 *data, phandle;
char *buf;
 
token = rtas_token("ibm,update-nodes");
@@ -206,15 +206,12 @@ static int update_devicetree(struct cxl *adapter, s32 scope)
 
switch (action) {
case OPCODE_DELETE:
+   case OPCODE_ADD:
/* nothing to do */
break;
case OPCODE_UPDATE:
update_node(phandle, scope);
break;
-   case OPCODE_ADD:
-   /* nothing to do, just move pointer */
-   drc_index = *data++;
-   break;
}
}
}
-- 
2.25.1



[PATCH v2 28/30] misc: ocxl: config: Provide correct formatting to function headers

2020-07-01 Thread Lee Jones
A nice attempt was made to provide kerneldoc headers for
read_template_version() and read_afu_lpc_memory_info(); however,
the provided formatting does not match what kerneldoc expects.

Fixes the following W=1 warnings:

 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 'dev' 
not described in 'read_template_version'
 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 'fn' not 
described in 'read_template_version'
 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 'len' 
not described in 'read_template_version'
 drivers/misc/ocxl/config.c:286: warning: Function parameter or member 
'version' not described in 'read_template_version'
 drivers/misc/ocxl/config.c:489: warning: Function parameter or member 'dev' 
not described in 'read_afu_lpc_memory_info'
 drivers/misc/ocxl/config.c:489: warning: Function parameter or member 'fn' not 
described in 'read_afu_lpc_memory_info'
 drivers/misc/ocxl/config.c:489: warning: Function parameter or member 'afu' 
not described in 'read_afu_lpc_memory_info'

Cc: Frederic Barrat 
Cc: Andrew Donnellan 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/misc/ocxl/config.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index c8e19bfb5ef90..e3b99a39d207e 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -273,11 +273,11 @@ static int read_afu_info(struct pci_dev *dev, struct 
ocxl_fn_config *fn,
 }
 
 /**
- * Read the template version from the AFU
- * dev: the device for the AFU
- * fn: the AFU offsets
- * len: outputs the template length
- * version: outputs the major<<8,minor version
+ * read_template_version - Read the template version from the AFU
+ * @dev: the device for the AFU
+ * @fn: the AFU offsets
+ * @len: outputs the template length
+ * @version: outputs the major<<8,minor version
  *
  * Returns 0 on success, negative on failure
  */
@@ -476,10 +476,10 @@ static int validate_afu(struct pci_dev *dev, struct 
ocxl_afu_config *afu)
 }
 
 /**
- * Populate AFU metadata regarding LPC memory
- * dev: the device for the AFU
- * fn: the AFU offsets
- * afu: the AFU struct to populate the LPC metadata into
+ * read_afu_lpc_memory_info - Populate AFU metadata regarding LPC memory
+ * @dev: the device for the AFU
+ * @fn: the AFU offsets
+ * @afu: the AFU struct to populate the LPC metadata into
  *
  * Returns 0 on success, negative on failure
  */
-- 
2.25.1



[PATCH v2 27/30] misc: cxl: hcalls: Demote half-assed kerneldoc attempt

2020-07-01 Thread Lee Jones
Function headers will need a lot of work before they reach the
standards expected of kerneldoc.  Demote them down to basic
comments/headers, for now at least.

Fixes the following W=1 kernel build warnings:

 drivers/misc/cxl/hcalls.c:175: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_detach_process'
 drivers/misc/cxl/hcalls.c:175: warning: Function parameter or member 
'process_token' not described in 'cxl_h_detach_process'
 drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_control_function'
 drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'op' not 
described in 'cxl_h_control_function'
 drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p1' not 
described in 'cxl_h_control_function'
 drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p2' not 
described in 'cxl_h_control_function'
 drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p3' not 
described in 'cxl_h_control_function'
 drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'p4' not 
described in 'cxl_h_control_function'
 drivers/misc/cxl/hcalls.c:207: warning: Function parameter or member 'out' not 
described in 'cxl_h_control_function'
 drivers/misc/cxl/hcalls.c:245: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_reset_afu'
 drivers/misc/cxl/hcalls.c:258: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_suspend_process'
 drivers/misc/cxl/hcalls.c:258: warning: Function parameter or member 
'process_token' not described in 'cxl_h_suspend_process'
 drivers/misc/cxl/hcalls.c:271: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_resume_process'
 drivers/misc/cxl/hcalls.c:271: warning: Function parameter or member 
'process_token' not described in 'cxl_h_resume_process'
 drivers/misc/cxl/hcalls.c:284: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_read_error_state'
 drivers/misc/cxl/hcalls.c:284: warning: Function parameter or member 'state' 
not described in 'cxl_h_read_error_state'
 drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_get_afu_err'
 drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 'offset' 
not described in 'cxl_h_get_afu_err'
 drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 
'buf_address' not described in 'cxl_h_get_afu_err'
 drivers/misc/cxl/hcalls.c:300: warning: Function parameter or member 'len' not 
described in 'cxl_h_get_afu_err'
 drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_get_config'
 drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'cr_num' 
not described in 'cxl_h_get_config'
 drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'offset' 
not described in 'cxl_h_get_config'
 drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 
'buf_address' not described in 'cxl_h_get_config'
 drivers/misc/cxl/hcalls.c:320: warning: Function parameter or member 'len' not 
described in 'cxl_h_get_config'
 drivers/misc/cxl/hcalls.c:333: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_terminate_process'
 drivers/misc/cxl/hcalls.c:333: warning: Function parameter or member 
'process_token' not described in 'cxl_h_terminate_process'
 drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_collect_vpd'
 drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'record' 
not described in 'cxl_h_collect_vpd'
 drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 
'list_address' not described in 'cxl_h_collect_vpd'
 drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'num' not 
described in 'cxl_h_collect_vpd'
 drivers/misc/cxl/hcalls.c:351: warning: Function parameter or member 'out' not 
described in 'cxl_h_collect_vpd'
 drivers/misc/cxl/hcalls.c:362: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_get_fn_error_interrupt'
 drivers/misc/cxl/hcalls.c:362: warning: Function parameter or member 'reg' not 
described in 'cxl_h_get_fn_error_interrupt'
 drivers/misc/cxl/hcalls.c:374: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_ack_fn_error_interrupt'
 drivers/misc/cxl/hcalls.c:374: warning: Function parameter or member 'value' 
not described in 'cxl_h_ack_fn_error_interrupt'
 drivers/misc/cxl/hcalls.c:386: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_get_error_log'
 drivers/misc/cxl/hcalls.c:386: warning: Function parameter or member 'value' 
not described in 'cxl_h_get_error_log'
 drivers/misc/cxl/hcalls.c:399: warning: Function parameter or member 
'unit_address' not described in 'cxl_h_collect_int_info'
 

Re: [PATCH v2 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW

2020-07-01 Thread Alexey Kardashevskiy



On 24/06/2020 16:24, Leonardo Bras wrote:
> On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
> default DMA window for the device, before attempting to configure a DDW,
> in order to make the maximum resources available for the next DDW to be
> created.
> 
> This is a requirement for some devices to use DDW, given they only
> allow one DMA window.


Devices never know about these windows; it is purely the PHB's side
of things. A device can access any address on the bus, and the bus
can generate an exception if there is no window behind the address
OR it hits some other device's MMIO. We could actually create a
second window in addition to the first one and allocate bus
addresses from both; we are just simplifying this by merging two
separate non-adjacent windows into one.


> 
> If setting up a new DDW fails anywhere after the removal of this
> default DMA window, it is necessary to restore the default DMA window.
> For this, an implementation of ibm,reset-pe-dma-windows rtas call is
> needed:
> 
> Platforms supporting the DDW option starting with LoPAR level 2.7 implement
> ibm,ddw-extensions. The first extension available (index 2) carries the
> token for ibm,reset-pe-dma-windows rtas call, which is used to restore
> the default DMA window for a device, if it has been deleted.
> 
> It does so by resetting the TCE table allocation for the PE to its
> boot-time value, available in the "ibm,dma-window" device tree node.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 70 ++
>  1 file changed, 61 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index a8840d9e1c35..4fcf00016fb1 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1029,6 +1029,39 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>   return max_addr;
>  }
>  
> +/*
> + * Platforms supporting the DDW option starting with LoPAR level 2.7 
> implement
> + * ibm,ddw-extensions, which carries the rtas token for
> + * ibm,reset-pe-dma-windows.
> + * That rtas-call can be used to restore the default DMA window for the 
> device.
> + */
> +static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> +{
> + int ret;
> + u32 cfg_addr, ddw_ext[DDW_EXT_RESET_DMA_WIN + 1];
> + u64 buid;
> + struct device_node *dn;
> + struct pci_dn *pdn;
> +
> + ret = of_property_read_u32_array(par_dn, "ibm,ddw-extensions",
> +  &ddw_ext[0], DDW_EXT_RESET_DMA_WIN + 1);
> + if (ret)
> + return;
> +
> + dn = pci_device_to_OF_node(dev);
> + pdn = PCI_DN(dn);
> + buid = pdn->phb->buid;
> + cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
> +
> + ret = rtas_call(ddw_ext[DDW_EXT_RESET_DMA_WIN], 3, 1, NULL, cfg_addr,
> + BUID_HI(buid), BUID_LO(buid));
> + if (ret)
> + dev_info(&dev->dev,
> +  "ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
> +  ddw_ext[1], cfg_addr, BUID_HI(buid), BUID_LO(buid),


s/ddw_ext[1]/ddw_ext[DDW_EXT_RESET_DMA_WIN]/


> +  ret);
> +}
> +
>  /*
>   * If the PE supports dynamic dma windows, and there is space for a table
>   * that can map all pages in a linear offset, then setup such a table,
> @@ -1049,8 +1082,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   u64 dma_addr, max_addr;
>   struct device_node *dn;
>   u32 ddw_avail[DDW_APPLICABLE_SIZE];
> +

Unrelated new empty line.


>   struct direct_window *window;
> - struct property *win64;
> + struct property *win64, *default_win = NULL, *ddw_ext = NULL;
>   struct dynamic_dma_window_prop *ddwprop;
>   struct failed_ddw_pdn *fpdn;
>  
> @@ -1085,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   if (ret)
>   goto out_failed;
>  
> -   /*
> + /*
>* Query if there is a second window of size to map the
>* whole partition.  Query returns number of windows, largest
>* block assigned to PE (partition endpoint), and two bitmasks
> @@ -1096,15 +1130,31 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   if (ret != 0)
>   goto out_failed;
>  
> + /*
> +  * If there is no window available, remove the default DMA window,
> +  * if it's present. This will make all the resources available to the
> +  * new DDW window.
> +  * If anything fails after this, we need to restore it, so also check
> +  * for extensions presence.
> +  */
>   if (query.windows_available == 0) {


Does phyp really always advertise 0 windows for these VFs? What is in
the largest_available_block when windows_available==0?


> - /*
> -  * no additional windows are available for this device.
> - 

Re: [PATCH v2 2/6] powerpc/pseries/iommu: Update call to ibm,query-pe-dma-windows

2020-07-01 Thread Alexey Kardashevskiy



On 24/06/2020 16:24, Leonardo Bras wrote:
> From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can make the number of
> outputs from "ibm,query-pe-dma-windows" go from 5 to 6.
> 
> This change of output size is meant to expand the address size of
> largest_available_block PE TCE from 32-bit to 64-bit, which ends up
> shifting page_size and migration_capable.
> 
> This ends up requiring the update of
> ddw_query_response->largest_available_block from u32 to u64, and manually
> assigning the values from the buffer into this struct, according to
> output size.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 57 +-
>  1 file changed, 47 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 68d2aa9c71a8..558e5441c355 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -44,6 +44,10 @@
>  #define DDW_REMOVE_PE_DMA_WIN2
>  #define DDW_APPLICABLE_SIZE  3
>  
> +#define DDW_EXT_SIZE 0
> +#define DDW_EXT_RESET_DMA_WIN1
> +#define DDW_EXT_QUERY_OUT_SIZE   2


#define DDW_EXT_LAST (DDW_EXT_QUERY_OUT_SIZE + 1)
...


> +
>  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>  {
>   struct iommu_table_group *table_group;
> @@ -339,7 +343,7 @@ struct direct_window {
>  /* Dynamic DMA Window support */
>  struct ddw_query_response {
>   u32 windows_available;
> - u32 largest_available_block;
> + u64 largest_available_block;
>   u32 page_size;
>   u32 migration_capable;
>  };
> @@ -875,13 +879,29 @@ static int find_existing_ddw_windows(void)
>  machine_arch_initcall(pseries, find_existing_ddw_windows);
>  
>  static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
> - struct ddw_query_response *query)
> +  struct ddw_query_response *query,
> +  struct device_node *parent)
>  {
>   struct device_node *dn;
>   struct pci_dn *pdn;
> - u32 cfg_addr;
> + u32 cfg_addr, query_out[5], ddw_ext[DDW_EXT_QUERY_OUT_SIZE + 1];


... and use DDW_EXT_LAST here.


>   u64 buid;
> - int ret;
> + int ret, out_sz;
> +
> + /*
> +  * From LoPAR level 2.8, "ibm,ddw-extensions" index 3 can rule how many
> +  * output parameters ibm,query-pe-dma-windows will have, ranging from
> +  * 5 to 6.
> +  */
> +
> + ret = of_property_read_u32_array(parent, "ibm,ddw-extensions",
> +  &ddw_ext[0],
> +  DDW_EXT_QUERY_OUT_SIZE + 1);
> + if (ret && ddw_ext[DDW_EXT_SIZE] > 1 &&

>= DDW_EXT_QUERY_OUT_SIZE ?  Thanks,


> + ddw_ext[DDW_EXT_QUERY_OUT_SIZE] == 1)
> + out_sz = 6;
> + else
> + out_sz = 5;
>  
>   /*
>* Get the config address and phb buid of the PE window.
> @@ -894,11 +914,28 @@ static int query_ddw(struct pci_dev *dev, const u32 
> *ddw_avail,
>   buid = pdn->phb->buid;
>   cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
>  
> - ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
> + ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out,
>   cfg_addr, BUID_HI(buid), BUID_LO(buid));
> - dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
> - " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
> -  BUID_HI(buid), BUID_LO(buid), ret);
> + dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned 
> %d\n",
> +  ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
> +  BUID_LO(buid), ret);
> +
> + switch (out_sz) {
> + case 5:
> + query->windows_available = query_out[0];
> + query->largest_available_block = query_out[1];
> + query->page_size = query_out[2];
> + query->migration_capable = query_out[3];
> + break;
> + case 6:
> + query->windows_available = query_out[0];
> + query->largest_available_block = ((u64)query_out[1] << 32) |
> +  query_out[2];
> + query->page_size = query_out[3];
> + query->migration_capable = query_out[4];
> + break;
> + }
> +
>   return ret;
>  }
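
The 5- vs 6-output marshalling above can be sketched in plain C. The struct layout and the hi/lo word-combining mirror the patch; the function name and user-space setting are illustrative, not the kernel's API:

```c
#include <assert.h>
#include <stdint.h>

struct ddw_query_response {
    uint32_t windows_available;
    uint64_t largest_available_block;
    uint32_t page_size;
    uint32_t migration_capable;
};

/* Unpack the raw RTAS output words according to how many the
 * platform returned: 5 words (32-bit block size) or 6 words
 * (block size split into hi/lo 32-bit halves). */
static void unpack_query(struct ddw_query_response *q,
                         const uint32_t *out, int out_sz)
{
    q->windows_available = out[0];
    if (out_sz == 6) {
        q->largest_available_block =
            ((uint64_t)out[1] << 32) | out[2];
        q->page_size = out[3];
        q->migration_capable = out[4];
    } else {
        q->largest_available_block = out[1];
        q->page_size = out[2];
        q->migration_capable = out[3];
    }
}
```

The same struct then serves both ABI variants, which is why the patch fills it manually instead of casting the output buffer.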
>  
> @@ -1046,7 +1083,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>* of page sizes: supported and supported for migrate-dma.
>*/
>   dn = pci_device_to_OF_node(dev);
> - ret = query_ddw(dev, ddw_avail, &query);
> + ret = query_ddw(dev, ddw_avail, &query, pdn);
>   if (ret != 0)
>   goto out_failed;
>  
> @@ -1074,7 +,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   /* check largest block * page size > max memory hotplug addr */
>   max_addr = 

Re: [PATCH v2 1/6] powerpc/pseries/iommu: Create defines for operations in ibm,ddw-applicable

2020-07-01 Thread Alexey Kardashevskiy



On 24/06/2020 16:24, Leonardo Bras wrote:
> Create defines to help handling ibm,ddw-applicable values, avoiding
> confusion about the index of given operations.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 40 +++---
>  1 file changed, 23 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 6d47b4a3ce39..68d2aa9c71a8 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -39,6 +39,11 @@
>  
>  #include "pseries.h"
>  
> +#define DDW_QUERY_PE_DMA_WIN 0
> +#define DDW_CREATE_PE_DMA_WIN1
> +#define DDW_REMOVE_PE_DMA_WIN2
> +#define DDW_APPLICABLE_SIZE  3


#define DDW_APPLICABLE_SIZE  (DDW_REMOVE_PE_DMA_WIN + 1)

thanks,


> +
>  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>  {
>   struct iommu_table_group *table_group;
> @@ -771,12 +776,12 @@ static void remove_ddw(struct device_node *np, bool 
> remove_prop)
>  {
>   struct dynamic_dma_window_prop *dwp;
>   struct property *win64;
> - u32 ddw_avail[3];
> + u32 ddw_avail[DDW_APPLICABLE_SIZE];
>   u64 liobn;
>   int ret = 0;
>  
>   ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
> -  &ddw_avail[0], 3);
> +  &ddw_avail[0], DDW_APPLICABLE_SIZE);
>  
>   win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
>   if (!win64)
> @@ -798,15 +803,15 @@ static void remove_ddw(struct device_node *np, bool 
> remove_prop)
>   pr_debug("%pOF successfully cleared tces in window.\n",
>np);
>  
> - ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn);
> + ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
>   if (ret)
>   pr_warn("%pOF: failed to remove direct window: rtas returned "
>   "%d to ibm,remove-pe-dma-window(%x) %llx\n",
> - np, ret, ddw_avail[2], liobn);
> + np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
>   else
>   pr_debug("%pOF: successfully removed direct window: rtas 
> returned "
>   "%d to ibm,remove-pe-dma-window(%x) %llx\n",
> - np, ret, ddw_avail[2], liobn);
> + np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
>  
>  delprop:
>   if (remove_prop)
> @@ -889,11 +894,11 @@ static int query_ddw(struct pci_dev *dev, const u32 
> *ddw_avail,
>   buid = pdn->phb->buid;
>   cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
>  
> - ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query,
> -   cfg_addr, BUID_HI(buid), BUID_LO(buid));
> + ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, 5, (u32 *)query,
> + cfg_addr, BUID_HI(buid), BUID_LO(buid));
>   dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
> - " returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid),
> - BUID_LO(buid), ret);
> + " returned %d\n", ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr,
> +  BUID_HI(buid), BUID_LO(buid), ret);
>   return ret;
>  }
>  
> @@ -920,15 +925,16 @@ static int create_ddw(struct pci_dev *dev, const u32 
> *ddw_avail,
>  
>   do {
>   /* extra outputs are LIOBN and dma-addr (hi, lo) */
> - ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create,
> - cfg_addr, BUID_HI(buid), BUID_LO(buid),
> - page_shift, window_shift);
> + ret = rtas_call(ddw_avail[DDW_CREATE_PE_DMA_WIN], 5, 4,
> + (u32 *)create, cfg_addr, BUID_HI(buid),
> + BUID_LO(buid), page_shift, window_shift);
>   } while (rtas_busy_delay(ret));
>   dev_info(&dev->dev,
>   "ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
> - "(liobn = 0x%x starting addr = %x %x)\n", ddw_avail[1],
> -  cfg_addr, BUID_HI(buid), BUID_LO(buid), page_shift,
> -  window_shift, ret, create->liobn, create->addr_hi, 
> create->addr_lo);
> + "(liobn = 0x%x starting addr = %x %x)\n",
> +  ddw_avail[DDW_CREATE_PE_DMA_WIN], cfg_addr, BUID_HI(buid),
> +  BUID_LO(buid), page_shift, window_shift, ret, create->liobn,
> +  create->addr_hi, create->addr_lo);
>  
>   return ret;
>  }
> @@ -996,7 +1002,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   int page_shift;
>   u64 dma_addr, max_addr;
>   struct device_node *dn;
> - u32 ddw_avail[3];
> + u32 ddw_avail[DDW_APPLICABLE_SIZE];
>   struct direct_window *window;
>   struct property *win64;
>   struct dynamic_dma_window_prop *ddwprop;
> @@ -1029,7 +1035,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> 
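
The RTAS calls in this patch address the PE through a config address packed from the bus number and devfn, plus a 64-bit BUID split into 32-bit halves. A small user-space sketch of that packing — the macros follow the kernel's BUID_HI/BUID_LO usage, while the helper name and test values are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit bridge unit ID into the two 32-bit RTAS inputs. */
#define BUID_HI(buid) ((uint32_t)((buid) >> 32))
#define BUID_LO(buid) ((uint32_t)(buid))

/* Pack the PE config address the ibm,*-pe-dma-window RTAS calls
 * expect: bus number in bits 23..16, devfn in bits 15..8. */
static uint32_t pe_cfg_addr(uint32_t busno, uint32_t devfn)
{
    return (busno << 16) | (devfn << 8);
}
```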

Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-07-01 Thread Alexey Kardashevskiy



On 24/06/2020 16:24, Leonardo Bras wrote:
> As of today, if a DDW is created and can't map the whole partition, it's
> removed and the default DMA window "ibm,dma-window" is used instead.
> 
> Usually this DDW is bigger than the default DMA window, so it would be
> better to make use of it instead.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 28 +-
>  1 file changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 4fcf00016fb1..2d217cda4075 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -685,7 +685,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
> *bus)
>   struct iommu_table *tbl;
>   struct device_node *dn, *pdn;
>   struct pci_dn *ppci;
> - const __be32 *dma_window = NULL;
> + const __be32 *dma_window = NULL, *alt_dma_window = NULL;
>  
>   dn = pci_bus_to_OF_node(bus);
>  
> @@ -699,8 +699,13 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
> *bus)
>   break;
>   }
>  
> + /* If there is a DDW available, use it instead */
> + alt_dma_window = of_get_property(pdn, DIRECT64_PROPNAME, NULL);


It is not necessarily "direct" anymore, as the name suggests, so you
may want to change that. DMA64_PROPNAME, maybe. Thanks,


> + if (alt_dma_window)
> + dma_window = alt_dma_window;
> +
>   if (dma_window == NULL) {
> - pr_debug("  no ibm,dma-window property !\n");
> + pr_debug("  no ibm,dma-window nor 
> linux,direct64-ddr-window-info property !\n");
>   return;
>   }
>  
> @@ -1166,16 +1171,19 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
> query.page_size);
>   goto out_failed;
>   }
> +
>   /* verify the window * number of ptes will map the partition */
> - /* check largest block * page size > max memory hotplug addr */
>   max_addr = ddw_memory_hotplug_max();
>   if (query.largest_available_block < (max_addr >> page_shift)) {
> - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
> -   "%llu-sized pages\n", max_addr,  
> query.largest_available_block,
> -   1ULL << page_shift);
> - goto out_failed;
> + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu 
> %llu-sized pages\n",
> + max_addr, query.largest_available_block,
> + 1ULL << page_shift);
> +
> + len = order_base_2(query.largest_available_block << page_shift);
> + } else {
> + len = order_base_2(max_addr);
>   }
> - len = order_base_2(max_addr);
> +
>   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
>   if (!win64) {
>   dev_info(&dev->dev,
> @@ -1229,7 +1237,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   list_add(&window->list, &direct_window_list);
>   spin_unlock(&direct_window_list_lock);
>  
> - dma_addr = be64_to_cpu(ddwprop->dma_base);
> + /* Only returns the dma_addr if DDW maps the whole partition */
> + if (len == order_base_2(max_addr))
> + dma_addr = be64_to_cpu(ddwprop->dma_base);
>   goto out_unlock;
>  
>  out_free_window:
> 

-- 
Alexey
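
The window length discussed in this patch is computed as a power-of-two order via order_base_2(), i.e. a ceiling log2 of the address span. A user-space sketch of that computation — the helper name is illustrative (the kernel's order_base_2() is a macro), and the caller is assumed to pass a value >= 1:

```c
#include <assert.h>
#include <stdint.h>

/* Ceiling log2: the smallest n with (1ULL << n) >= x, matching how
 * order_base_2() sizes the DDW in enable_ddw().  Assumes x >= 1. */
static unsigned int window_order(uint64_t x)
{
    unsigned int n = 0;

    while (n < 63 && (1ULL << n) < x)
        n++;
    return n;
}
```

Because the order rounds up, a window sized with it always covers the requested span (max_addr, or the largest available block in this patch).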


[powerpc:fixes-test] BUILD SUCCESS 19ab500edb5d6020010caba48ce3b4ce4182ab63

2020-07-01 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
fixes-test
branch HEAD: 19ab500edb5d6020010caba48ce3b4ce4182ab63  powerpc/mm/pkeys: Make 
pkey access check work on execute_only_key

elapsed time: 2625m

configs tested: 163
configs skipped: 11

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm  allyesconfig
arm  allmodconfig
arm   allnoconfig
arm64allyesconfig
arm64   defconfig
arm64allmodconfig
arm64 allnoconfig
mips tb0287_defconfig
powerpc  mpc866_ads_defconfig
mips  pistachio_defconfig
sparc64  allyesconfig
arm s3c6400_defconfig
mips  malta_kvm_defconfig
powerpcmvme5100_defconfig
arm s3c2410_defconfig
microblaze  defconfig
mips   mtx1_defconfig
arm orion5x_defconfig
mipsbcm63xx_defconfig
riscvnommu_k210_defconfig
i386 alldefconfig
mipsvocore2_defconfig
um i386_defconfig
mipsmaltaup_defconfig
m68k  multi_defconfig
m68kdefconfig
sh  rts7751r2d1_defconfig
arm   viper_defconfig
powerpc  ppc44x_defconfig
sh   se7724_defconfig
mips  pic32mzda_defconfig
armpleb_defconfig
arm ezx_defconfig
xtensa   common_defconfig
mips   capcella_defconfig
arm lpc32xx_defconfig
arm   versatile_defconfig
xtensa  iss_defconfig
shsh7785lcr_defconfig
mips   sb1250_swarm_defconfig
mips   gcw0_defconfig
mips decstation_defconfig
sparcallyesconfig
arcnsim_700_defconfig
mips   rbtx49xx_defconfig
i386  allnoconfig
i386 allyesconfig
i386defconfig
i386  debian-10.3
ia64 allmodconfig
ia64defconfig
ia64  allnoconfig
ia64 allyesconfig
m68k allmodconfig
m68k  allnoconfig
m68k   sun3_defconfig
m68k allyesconfig
nios2   defconfig
nios2allyesconfig
openriscdefconfig
c6x  allyesconfig
c6x   allnoconfig
openrisc allyesconfig
nds32   defconfig
nds32 allnoconfig
csky allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
h8300allmodconfig
xtensa  defconfig
arc defconfig
sh   allmodconfig
shallnoconfig
microblazeallnoconfig
arc  allyesconfig
mips allyesconfig
mips  allnoconfig
mips allmodconfig
pariscallnoconfig
parisc  defconfig
parisc   allyesconfig
parisc   allmodconfig
powerpc defconfig
powerpc  allyesconfig
powerpc  rhel-kconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a001-20200630
i386 randconfig-a003-20200630
i386 randconfig-a002-20200630
i386 randconfig-a004-20200630
i386 randconfig-a005-20200630
i386 randconfig-a006-20200630
i386 randconfig-a002-20200701
i386 randconfig-a001-20200701
i386 randconfig-a006-20200701
i386

Re: [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00

2020-07-01 Thread Alexey Kardashevskiy



On 27/06/2020 03:46, Leonardo Bras wrote:
> On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote:
>> As of today, enable_ddw() will return a non-null DMA address if the
>> created DDW maps the whole partition. If the address is valid,
>> iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled.
>>
>> This can cause some trouble if the DDW happens to start at 0x00.
>>
>> Instead of checking if the address is non-null, check directly if
>> the DDW maps the whole partition, so it can bypass iommu.
>>
>> Signed-off-by: Leonardo Bras 
> 
> This patch has a bug in it. I will rework it soon.

I'd rather suggest this:

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20180725095032.2196-2-...@ozlabs.ru/

Although it does not look like you are actually going to have windows
starting at 0. Thanks,

> Please keep reviewing patches 1-5.
> 
> Best regards,
> Leonardo
> 

-- 
Alexey


Re: [PATCH 01/11] kexec_file: allow archs to handle special regions while locating memory hole

2020-07-01 Thread Dave Young
On 06/29/20 at 05:26pm, Hari Bathini wrote:
> Hi Petr,
> 
> On 29/06/20 5:09 pm, Petr Tesarik wrote:
> > Hi Hari,
> > 
> > is there any good reason to add two more functions with a very similar
> > name to an existing function? AFAICS all you need is a way to call a
> > PPC64-specific function from within kexec_add_buffer (PATCH 4/11), so
> > you could add something like this:
> > 
> > int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
> > {
> > return 0;
> > }
> > 
> > Call this function from kexec_add_buffer where appropriate and then
> > override it for PPC64 (it roughly corresponds to your
> > kexec_locate_mem_hole_ppc64() from PATCH 4/11).
> > 
> > FWIW it would make it easier for me to follow the resulting code.
> 
> Right, Petr.
> 
> I was trying out a few things before I ended up with what I sent here.
> But yeah, I did realize arch_kexec_locate_mem_hole() would have been
> better after sending out v1. Will take care of that in v2.

Another way is to use an arch-private function to locate the memory
hole, then set kbuf->mem and call kexec_add_buffer; it will then skip
the common locate-hole function.

But other than that I have some confusion about those excluded ranges.
I've replied with a question to patch 4.

Thanks
Dave
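
Petr's suggestion quoted above relies on the kernel's weak-symbol convention: generic code supplies a default that an architecture may override with a strong definition in its own translation unit. A minimal sketch of the pattern (GCC/Clang attribute syntax; the struct fields here are placeholders, not the real struct kexec_buf):

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for the real struct kexec_buf. */
struct kexec_buf {
    unsigned long mem;
    size_t memsz;
};

/*
 * Generic default: no arch-specific constraints, return 0 and let
 * the common locate-mem-hole logic run.  An architecture overrides
 * this by providing a non-weak definition of the same symbol in its
 * own translation unit; the linker then picks the strong one.
 */
int __attribute__((weak)) arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
{
    (void)kbuf;
    return 0;
}
```

kexec_add_buffer would call this hook unconditionally; only PPC64 would supply a strong override that accounts for its excluded ranges.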



Re: [PATCH 04/11] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-01 Thread Dave Young
Hi Hari,
On 06/27/20 at 12:35am, Hari Bathini wrote:
> crashkernel region could have an overlap with special memory regions
> like  opal, rtas, tce-table & such. These regions are referred to as
> exclude memory ranges. Setup this ranges during image probe in order
> to avoid them while finding the buffer for different kdump segments.
> Implement kexec_locate_mem_hole_ppc64() that locates a memory hole
> accounting for these ranges. Also, override arch_kexec_add_buffer()
> to locate a memory hole & later call __kexec_add_buffer() function
> with kbuf->mem set to skip the generic locate memory hole lookup.
> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/include/asm/crashdump-ppc64.h |   10 +
>  arch/powerpc/include/asm/kexec.h   |7 -
>  arch/powerpc/kexec/elf_64.c|7 +
>  arch/powerpc/kexec/file_load_64.c  |  292 
> 
>  4 files changed, 312 insertions(+), 4 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h
> 
[snip]
>  /**
> + * get_exclude_memory_ranges - Get exclude memory ranges. This list includes
> + * regions like opal/rtas, tce-table, initrd,
> + * kernel, htab which should be avoided while
> + * setting up kexec load segments.
> + * @mem_ranges: Range list to add the memory ranges to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
> +{
> + int ret;
> +
> + ret = add_tce_mem_ranges(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_initrd_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_htab_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_kernel_mem_range(mem_ranges);
> + if (ret)
> + goto out;
> +
> + ret = add_rtas_mem_range(mem_ranges, false);
> + if (ret)
> + goto out;
> +
> + ret = add_opal_mem_range(mem_ranges, false);
> + if (ret)
> + goto out;
> +
> + ret = add_reserved_ranges(mem_ranges);
> + if (ret)
> + goto out;
> +
> + /* exclude memory ranges should be sorted for easy lookup */
> + sort_memory_ranges(*mem_ranges);
> +out:
> + if (ret)
> + pr_err("Failed to setup exclude memory ranges\n");
> + return ret;
> +}

I'm confused about the "overlap with crashkernel memory": does that
mean memory used by the normal kernel could end up inside the
crashkernel reserved memory range?  If so, why not just skip those
areas while doing the crashkernel reservation?

Thanks
Dave
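
The patch's comment notes that the exclude ranges are sorted "for easy lookup": with sorted, non-overlapping ranges, each candidate buffer placement can be tested in O(log n). A hedged sketch of such a lookup — the type and names are illustrative, not the kernel's crash_mem:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct mem_range {
    uint64_t start, end;    /* inclusive bounds */
};

/* Binary-search the sorted exclude list for any range that overlaps
 * [base, base + size - 1].  Sorting the list once makes each
 * candidate check O(log n) instead of O(n). */
static bool overlaps_excluded(const struct mem_range *ranges, int n,
                              uint64_t base, uint64_t size)
{
    uint64_t end = base + size - 1;
    int lo = 0, hi = n - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        if (ranges[mid].end < base)
            lo = mid + 1;
        else if (ranges[mid].start > end)
            hi = mid - 1;
        else
            return true;    /* overlap found */
    }
    return false;
}
```

A locate-mem-hole implementation would walk candidate addresses and skip any for which this check returns true.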



[PATCH v7 6/7] powerpc/pmem: Avoid the barrier in flush routines

2020-07-01 Thread Aneesh Kumar K.V
nvdimm expects the flush routines to just mark the cache lines clean.
The barrier that makes the stores globally visible is done in
nvdimm_flush().

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/lib/pmem.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
index 5a61aaeb6930..21210fa676e5 100644
--- a/arch/powerpc/lib/pmem.c
+++ b/arch/powerpc/lib/pmem.c
@@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long start, unsigned long stop)
 
for (i = 0; i < size >> shift; i++, addr += bytes)
asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
-
-
-   asm volatile(PPC_PHWSYNC ::: "memory");
 }
 
 static inline void __flush_pmem_range(unsigned long start, unsigned long stop)
@@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long start, unsigned long stop)
 
for (i = 0; i < size >> shift; i++, addr += bytes)
asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
-
-
-   asm volatile(PPC_PHWSYNC ::: "memory");
 }
 
 static inline void clean_pmem_range(unsigned long start, unsigned long stop)
-- 
2.26.2



Re: [PATCH v3] ASoC: fsl_asrc: Add an option to select internal ratio mode

2020-07-01 Thread Nicolin Chen
On Tue, Jun 30, 2020 at 09:56:07PM +0800, Shengjiu Wang wrote:
> The ASRC not only supports ideal ratio mode, but also supports
> internal ratio mode.
> 
> For internal ratio mode, the clock source rate must be divisible
> with no remainder by the sample rate, otherwise there is sound
> distortion.
> 
> Add function fsl_asrc_select_clk() to find proper clock source for
> internal ratio mode, if the clock source is available then internal
> ratio mode will be selected.
> 
> With this change, the ideal ratio mode is no longer the only option for the user.
> 
> Signed-off-by: Shengjiu Wang 

Reviewed-by: Nicolin Chen 
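The divisibility requirement described in the commit message can be sketched
as a small helper. This is a hypothetical stand-alone illustration, not the
actual fsl_asrc_select_clk() implementation (function and parameter names are
invented here):

```c
/*
 * Internal ratio mode requires the clock source rate to be an exact
 * multiple of the sample rate; otherwise the output is distorted and
 * the driver must fall back to ideal ratio mode.  Hypothetical sketch.
 */
static int select_internal_ratio_clk(const unsigned int *clk_rates,
				     int nr_clks, unsigned int sample_rate)
{
	int i;

	for (i = 0; i < nr_clks; i++) {
		if (clk_rates[i] % sample_rate == 0)
			return i;	/* usable for internal ratio mode */
	}
	return -1;	/* no exact divisor: stay with ideal ratio mode */
}
```

For the common audio master clocks, 24.576 MHz divides evenly for the 48 kHz
family while 22.5792 MHz divides evenly for the 44.1 kHz family.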


Re: [PATCH v2 4/4] mm/vmalloc: Hugepage vmalloc mappings

2020-07-01 Thread Zefan Li
>  static void *__vmalloc_node(unsigned long size, unsigned long align,
> - gfp_t gfp_mask, pgprot_t prot,
> - int node, const void *caller);
> + gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
> + int node, const void *caller);
>  static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> -  pgprot_t prot, int node)
> +  pgprot_t prot, unsigned int page_shift,
> +  int node)
>  {
>   struct page **pages;
> + unsigned long addr = (unsigned long)area->addr;
> + unsigned long size = get_vm_area_size(area);
> + unsigned int page_order = page_shift - PAGE_SHIFT;
>   unsigned int nr_pages, array_size, i;
>   const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>   const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
>   const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
> - 0 :
> - __GFP_HIGHMEM;
> + 0 : __GFP_HIGHMEM;
>  
> - nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
> + nr_pages = size >> page_shift;

While trying out this patchset, we encountered a BUG_ON in account_kernel_stack()
in kernel/fork.c.

BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);

which obviously should be updated accordingly.

>   array_size = (nr_pages * sizeof(struct page *));
>  
>   /* Please note that the recursion is strictly bounded. */
>   if (array_size > PAGE_SIZE) {
>   pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
> - PAGE_KERNEL, node, area->caller);
> + PAGE_KERNEL, 0, node, area->caller);
>   } else {
>   pages = kmalloc_node(array_size, nested_gfp, node);
>   }



[PATCH v7 5/7] powerpc/pmem: Update ppc64 to use the new barrier instruction.

2020-07-01 Thread Aneesh Kumar K.V
pmem on POWER10 can now use phwsync instead of hwsync to ensure
all previous writes are architecturally visible for the platform
buffer flush.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/barrier.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index 123adcefd40f..35c1b8f3aa68 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -7,6 +7,10 @@
 
 #include 
 
+#ifndef __ASSEMBLY__
+#include 
+#endif
+
 /*
  * Memory barrier.
  * The sync instruction guarantees that all memory accesses initiated
@@ -97,6 +101,15 @@ do { \
 #define barrier_nospec()
 #endif /* CONFIG_PPC_BARRIER_NOSPEC */
 
+/*
+ * pmem_wmb() ensures that all stores for which the modifications
+ * are written to persistent storage by preceding dcbfps/dcbstps
+ * instructions have updated persistent storage before any data
+ * access or data transfer caused by subsequent instructions is
+ * initiated.
+ */
+#define pmem_wmb() __asm__ __volatile__(PPC_PHWSYNC ::: "memory")
+
 #include 
 
 #endif /* _ASM_POWERPC_BARRIER_H */
-- 
2.26.2



[PATCH v7 7/7] powerpc/pmem: Initialize pmem device on newer hardware

2020-07-01 Thread Aneesh Kumar K.V
With the kernel now supporting the new pmem flush/sync instructions, we can
enable the kernel to initialize the device. On P10 these devices
appear with a new compatible string. For PAPR devices we have

compatible   "ibm,pmemory-v2"

and for OF pmem devices we have

compatible   "pmem-region-v2"

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 1 +
 drivers/nvdimm/of_pmem.c  | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09f..66c19c0fe566 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -876,6 +876,7 @@ static int papr_scm_remove(struct platform_device *pdev)
 
 static const struct of_device_id papr_scm_match[] = {
{ .compatible = "ibm,pmemory" },
+   { .compatible = "ibm,pmemory-v2" },
{ },
 };
 
diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
index 6826a274a1f1..10dbdcdfb9ce 100644
--- a/drivers/nvdimm/of_pmem.c
+++ b/drivers/nvdimm/of_pmem.c
@@ -90,6 +90,7 @@ static int of_pmem_region_remove(struct platform_device *pdev)
 
 static const struct of_device_id of_pmem_region_match[] = {
{ .compatible = "pmem-region" },
+   { .compatible = "pmem-region-v2" },
{ },
 };
 
-- 
2.26.2



[PATCH v7 4/7] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-07-01 Thread Aneesh Kumar K.V
Architectures like ppc64 provide persistent memory specific barriers
that will ensure that all stores for which the modifications are
written to persistent storage by preceding dcbfps and dcbstps
instructions have updated persistent storage before any data
access or data transfer caused by subsequent instructions is initiated.
This is in addition to the ordering done by wmb().

Update the nvdimm core such that an architecture can use barriers other
than wmb() to ensure all previous writes are architecturally visible for
the platform buffer flush.

Reviewed-by: Dan Williams 
Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/memory-barriers.txt | 14 ++
 drivers/md/dm-writecache.c|  2 +-
 drivers/nvdimm/region_devs.c  |  8 
 include/asm-generic/barrier.h | 10 ++
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index eaabc3134294..ff07cd3b2f82 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1935,6 +1935,20 @@ There are some more advanced barrier functions:
  relaxed I/O accessors and the Documentation/DMA-API.txt file for more
  information on consistent memory.
 
+ (*) pmem_wmb();
+
+ This is for use with persistent memory to ensure that stores for which
+ modifications are written to persistent storage reached a platform
+ durability domain.
+
+ For example, after a non-temporal write to a pmem region, we use pmem_wmb()
+ to ensure that stores have reached a platform durability domain. This ensures
+ that stores have updated persistent storage before any data access or
+ data transfer caused by subsequent instructions is initiated. This is
+ in addition to the ordering done by wmb().
+
+ For loads from persistent memory, existing read memory barriers are sufficient
+ to ensure read ordering.
 
 ===
 IMPLICIT KERNEL MEMORY BARRIERS
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 74f3c506f084..00534fa4a384 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -536,7 +536,7 @@ static void ssd_commit_superblock(struct dm_writecache *wc)
static void writecache_commit_flushed(struct dm_writecache *wc, bool wait_for_ios)
 {
if (WC_MODE_PMEM(wc))
-   wmb();
+   pmem_wmb();
else
ssd_commit_flushed(wc, wait_for_ios);
 }
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 4502f9c4708d..c3237c2b03a6 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1206,13 +1206,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
 
/*
-* The first wmb() is needed to 'sfence' all previous writes
-* such that they are architecturally visible for the platform
-* buffer flush.  Note that we've already arranged for pmem
+* The pmem_wmb() is needed to 'sfence' all
+* previous writes such that they are architecturally visible for
+* the platform buffer flush. Note that we've already arranged for pmem
 * writes to avoid the cache via memcpy_flushcache().  The final
 * wmb() ensures ordering for the NVDIMM flush write.
 */
-   wmb();
+   pmem_wmb();
for (i = 0; i < nd_region->ndr_mappings; i++)
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 2eacaf7d62f6..b589bb216ee5 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -257,5 +257,15 @@ do { \
 })
 #endif
 
+/*
+ * pmem_wmb() ensures that all stores for which the modifications
+ * are written to persistent storage by preceding instructions have
+ * updated persistent storage before any data access or data transfer
+ * caused by subsequent instructions is initiated.
+ */
+#ifndef pmem_wmb
+#define pmem_wmb() wmb()
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
-- 
2.26.2
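The ordering contract the patch documents — any number of cache-line cleans,
then exactly one pmem_wmb() fence, then the flush-hint (WPQ) write — can be
modeled in userspace with an event trace. This is a pure simulation of the
shape of generic_nvdimm_flush() (all sim_* names are invented; nothing here
touches real persistent memory or emits real barriers):

```c
/*
 * Simulated event trace standing in for the real flush sequence:
 *   1 = cache-line clean (as done by memcpy_flushcache)
 *   2 = pmem_wmb() fence
 *   3 = flush-hint (WPQ) write
 */
static int trace[8];
static int trace_n;

static void sim_clean_lines(void) { trace[trace_n++] = 1; }
static void sim_pmem_wmb(void)    { trace[trace_n++] = 2; }
static void sim_wpq_write(void)   { trace[trace_n++] = 3; }

/*
 * Mirrors the shape of generic_nvdimm_flush(): any number of cleans,
 * then a single fence, then the flush-hint write.
 */
static void sim_generic_nvdimm_flush(void)
{
	sim_clean_lines();
	sim_clean_lines();
	sim_pmem_wmb();		/* one barrier covers all preceding cleans */
	sim_wpq_write();
}
```

The design point the patch makes is exactly this amortization: the fence is
issued once per flush request, not once per cleaned cache line, which is why
moving it out of the per-range loops (patch 6/7) is safe.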



[PATCH v7 3/7] powerpc/pmem: Add flush routines using new pmem store and sync instruction

2020-07-01 Thread Aneesh Kumar K.V
Start using the dcbstps; phwsync; sequence for flushing persistent memory
ranges. The new instructions are implemented as variants of dcbf and hwsync,
and on P8 and P9 they will be executed as those instructions. We avoid using
them on older hardware; this helps avoid hard-to-debug bugs.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/cacheflush.h |  1 +
 arch/powerpc/lib/pmem.c   | 50 ---
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index de600b915a3c..54764c6e922d 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /*
diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
index 0666a8d29596..5a61aaeb6930 100644
--- a/arch/powerpc/lib/pmem.c
+++ b/arch/powerpc/lib/pmem.c
@@ -9,20 +9,62 @@
 
 #include 
 
+static inline void __clean_pmem_range(unsigned long start, unsigned long stop)
+{
+   unsigned long shift = l1_dcache_shift();
+   unsigned long bytes = l1_dcache_bytes();
+   void *addr = (void *)(start & ~(bytes - 1));
+   unsigned long size = stop - (unsigned long)addr + (bytes - 1);
+   unsigned long i;
+
+   for (i = 0; i < size >> shift; i++, addr += bytes)
+   asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
+
+
+   asm volatile(PPC_PHWSYNC ::: "memory");
+}
+
+static inline void __flush_pmem_range(unsigned long start, unsigned long stop)
+{
+   unsigned long shift = l1_dcache_shift();
+   unsigned long bytes = l1_dcache_bytes();
+   void *addr = (void *)(start & ~(bytes - 1));
+   unsigned long size = stop - (unsigned long)addr + (bytes - 1);
+   unsigned long i;
+
+   for (i = 0; i < size >> shift; i++, addr += bytes)
+   asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
+
+
+   asm volatile(PPC_PHWSYNC ::: "memory");
+}
+
+static inline void clean_pmem_range(unsigned long start, unsigned long stop)
+{
+   if (cpu_has_feature(CPU_FTR_ARCH_207S))
+   return __clean_pmem_range(start, stop);
+}
+
+static inline void flush_pmem_range(unsigned long start, unsigned long stop)
+{
+   if (cpu_has_feature(CPU_FTR_ARCH_207S))
+   return __flush_pmem_range(start, stop);
+}
+
 /*
  * CONFIG_ARCH_HAS_PMEM_API symbols
  */
 void arch_wb_cache_pmem(void *addr, size_t size)
 {
unsigned long start = (unsigned long) addr;
-   flush_dcache_range(start, start + size);
+   clean_pmem_range(start, start + size);
 }
 EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
 
 void arch_invalidate_pmem(void *addr, size_t size)
 {
unsigned long start = (unsigned long) addr;
-   flush_dcache_range(start, start + size);
+   flush_pmem_range(start, start + size);
 }
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 
@@ -35,7 +77,7 @@ long __copy_from_user_flushcache(void *dest, const void __user *src,
unsigned long copied, start = (unsigned long) dest;
 
copied = __copy_from_user(dest, src, size);
-   flush_dcache_range(start, start + size);
+   clean_pmem_range(start, start + size);
 
return copied;
 }
@@ -45,7 +87,7 @@ void *memcpy_flushcache(void *dest, const void *src, size_t size)
unsigned long start = (unsigned long) dest;
 
memcpy(dest, src, size);
-   flush_dcache_range(start, start + size);
+   clean_pmem_range(start, start + size);
 
return dest;
 }
-- 
2.26.2
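The loop bounds in __clean_pmem_range()/__flush_pmem_range() round the start
down to a cache-line boundary and the length up, so partially covered lines at
either end are still flushed. A stand-alone sketch of that arithmetic (the
128-byte line size is an assumption standing in for l1_dcache_bytes() on
typical ppc64 parts; the helper name is invented):

```c
/*
 * Mirror of the loop-bound arithmetic in __clean_pmem_range():
 * round start down to a cache line, round the residual length up,
 * and the shifted size gives the number of dcbstps iterations.
 * 128-byte L1 line assumed; standalone sketch.
 */
static unsigned long pmem_clean_iterations(unsigned long start,
					   unsigned long stop)
{
	unsigned long shift = 7;		/* log2(128) */
	unsigned long bytes = 1UL << shift;	/* l1_dcache_bytes() */
	unsigned long addr = start & ~(bytes - 1);
	unsigned long size = stop - addr + (bytes - 1);

	return size >> shift;	/* one dcbstps per covered cache line */
}
```

For example, a range starting 5 bytes into one line and ending 5 bytes into a
third still touches three lines, and the rounding guarantees all three are
cleaned.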


