[PATCH v9 8/8] KVM: PPC: Ultravisor: Add PPC_UV config option
From: Anshuman Khandual

CONFIG_PPC_UV adds support for the ultravisor.

Signed-off-by: Anshuman Khandual
Signed-off-by: Bharata B Rao
Signed-off-by: Ram Pai
[ Update config help and commit message ]
Signed-off-by: Claudio Carvalho
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/Kconfig | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..044838794112 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -448,6 +448,23 @@ config PPC_TRANSACTIONAL_MEM
 	help
 	  Support user-mode Transactional Memory on POWERPC.
 
+config PPC_UV
+	bool "Ultravisor support"
+	depends on KVM_BOOK3S_HV_POSSIBLE
+	select ZONE_DEVICE
+	select DEV_PAGEMAP_OPS
+	select DEVICE_PRIVATE
+	select MEMORY_HOTPLUG
+	select MEMORY_HOTREMOVE
+	default n
+	help
+	  This option paravirtualizes the kernel to run on POWER platforms
+	  that support the Protected Execution Facility (PEF). On such
+	  platforms, the ultravisor firmware runs at a privilege level
+	  above the hypervisor.
+
+	  If unsure, say "N".
+
 config LD_HEAD_STUB_CATCH
 	bool "Reserve 256 bytes to cope with linker stubs in HEAD text" if EXPERT
 	depends on PPC64
-- 
2.21.0
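The stubs that make this a safe default are part of the same series: with
CONFIG_PPC_UV off, every secure-guest entry point collapses to an inline
no-op, so KVM builds and behaves exactly as before. A minimal sketch of
the pattern (the real declarations live in kvm_book3s_uvmem.h and
kvm_host.h, patches 2/8 and 6/8):

	#ifdef CONFIG_PPC_UV
	int kvmppc_uvmem_init(void);
	void kvmppc_uvmem_free(void);
	#else
	/* PPC_UV=n: secure-guest support is compiled out entirely. */
	static inline int kvmppc_uvmem_init(void)
	{
		return 0;
	}
	static inline void kvmppc_uvmem_free(void) { }
	#endif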
[PATCH v9 7/8] KVM: PPC: Support reset of secure guest
Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
This ioctl will be issued by QEMU during reset and includes the
following steps:

- Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
- Unpin the VPA pages so that they can be migrated back to the secure
  side when the guest becomes secure again. This is required because
  pinned pages can't be migrated.
- Reinitialize the guest's partition-scoped page tables. These are
  freed when the guest becomes secure (H_SVM_INIT_DONE).
- Release all device pages of the secure guest.

After these steps, the guest is ready to issue the UV_ESM call once
again to switch to secure mode.

Signed-off-by: Bharata B Rao
Signed-off-by: Sukadev Bhattiprolu
	[Implementation of uv_svm_terminate() and its call from guest
	 shutdown path]
Signed-off-by: Ram Pai
	[Unpinning of VPA pages]
---
 Documentation/virt/kvm/api.txt              | 19 ++++++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  7 ++
 arch/powerpc/include/asm/kvm_ppc.h          |  2 +
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h       |  5 ++
 arch/powerpc/kvm/book3s_hv.c                | 74 +++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 60 ++++++++++++++++
 arch/powerpc/kvm/powerpc.c                  | 12 ++++
 include/uapi/linux/kvm.h                    |  1 +
 9 files changed, 181 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 2d067767b617..8e7a02e547e9 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -4111,6 +4111,25 @@ Valid values for 'action':
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+4.121 KVM_PPC_SVM_OFF
+
+Capability: basic
+Architectures: powerpc
+Type: vm ioctl
+Parameters: none
+Returns: 0 on successful completion,
+Errors:
+  EINVAL: if ultravisor failed to terminate the secure guest
+  ENOMEM: if hypervisor failed to allocate new radix page tables for guest
+
+This ioctl is used to turn off the secure mode of the guest or transition
+the guest from secure mode to normal mode. This is invoked when the guest
+is reset. This has no effect if called for a normal guest.
+
+This ioctl issues an ultravisor call to terminate the secure guest,
+unpins the VPA pages, reinitializes guest's partition scoped page
+tables and releases all the device pages that are used to track the
+secure pages by hypervisor.
+
 5. The kvm_run structure

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index fc924ef00b91..6b8cc8edd0ab 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -13,6 +13,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 				    unsigned long page_shift);
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+				    struct kvm_memslots *slots);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -37,5 +39,10 @@ static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
 {
 	return H_UNSUPPORTED;
 }
+
+static inline void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+						  struct kvm_memslots *slots)
+{
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2484e6a8f5ca..e4093d067354 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -177,6 +177,7 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
 extern int kvmppc_switch_mmu_to_hpt(struct kvm *kvm);
 extern int kvmppc_switch_mmu_to_radix(struct kvm *kvm);
 extern void kvmppc_setup_partition_table(struct kvm *kvm);
+extern int kvmppc_reinit_partition_table(struct kvm *kvm);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 		struct kvm_create_spapr_tce_64 *args);
@@ -321,6 +322,7 @@ struct kvmppc_ops {
 			      int size);
 	int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
 			      int size);
+	int (*svm_off)(struct kvm *kvm);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index cf200d4ce703..3a27a0c0be05 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -30,5 +30,6 @@
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
 #define UV_PAGE_INVAL			0xF138
+#define UV_SVM_TERMINATE		0xF13C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
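Since KVM_PPC_SVM_OFF takes no parameters, the userspace side reduces to
a bare ioctl on the VM file descriptor. A hedged sketch of how a VMM such
as QEMU might invoke it on machine reset (vm_fd and the reset-hook name
are assumptions for illustration, not QEMU code):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Called on machine reset; harmless no-op for normal guests. */
	static int kvm_svm_off(int vm_fd)
	{
		/* EINVAL: UV couldn't terminate the secure guest;
		 * ENOMEM: new radix page tables couldn't be allocated. */
		return ioctl(vm_fd, KVM_PPC_SVM_OFF, 0);
	}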
[PATCH v9 6/8] KVM: PPC: Radix changes for secure guest
- After the guest becomes secure, when we handle a page fault of a page
  belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
- Whenever a page is unmapped on the HV side, inform UV via
  UV_PAGE_INVAL.
- Ensure that all routines that walk the secondary page tables of the
  guest don't do so in the case of a secure VM: for a secure guest, the
  active secondary page tables are in secure memory, and the secondary
  page tables in HV are freed when the guest becomes secure.

Signed-off-by: Bharata B Rao
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/kvm_host.h       | 12 ++++++++++++
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h     |  5 +++++
 arch/powerpc/kvm/book3s_64_mmu_radix.c    | 22 ++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv_uvmem.c        | 20 ++++++++++++++++++++
 5 files changed, 60 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 726d35eb3bfe..c0c6603ddd6b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -877,6 +877,8 @@ static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 #ifdef CONFIG_PPC_UV
 int kvmppc_uvmem_init(void);
 void kvmppc_uvmem_free(void);
+bool kvmppc_is_guest_secure(struct kvm *kvm);
+int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -884,6 +886,16 @@ static inline int kvmppc_uvmem_init(void)
 }
 
 static inline void kvmppc_uvmem_free(void) {}
+
+static inline bool kvmppc_is_guest_secure(struct kvm *kvm)
+{
+	return false;
+}
+
+static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa)
+{
+	return -EFAULT;
+}
 #endif /* CONFIG_PPC_UV */
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index 46b1ee381695..cf200d4ce703 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -29,5 +29,6 @@
 #define UV_UNREGISTER_MEM_SLOT		0xF124
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
+#define UV_PAGE_INVAL			0xF138
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index 719c0c3930b9..b333241bbe4c 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -57,4 +57,9 @@ static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
 	return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
 }
 
+static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
+{
+	return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
+}
+
 #endif	/* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..93ad34e63045 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -19,6 +19,8 @@
 #include
 #include
 #include
+#include <asm/ultravisor.h>
+#include <asm/kvm_book3s_uvmem.h>
 
 /*
  * Supported radix tree geometry.
@@ -915,6 +917,9 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	if (!(dsisr & DSISR_PRTABLE_FAULT))
 		gpa |= ea & 0xfff;
 
+	if (kvmppc_is_guest_secure(kvm))
+		return kvmppc_send_page_to_uv(kvm, gpa & PAGE_MASK);
+
 	/* Get the corresponding memslot */
 	memslot = gfn_to_memslot(kvm, gfn);
@@ -972,6 +977,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
 
+	if (kvmppc_is_guest_secure(kvm)) {
+		uv_page_inval(kvm->arch.lpid, gpa, PAGE_SIZE);
+		return 0;
+	}
+
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep))
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
@@ -989,6 +999,9 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	int ref = 0;
 	unsigned long old, *rmapp;
 
+	if (kvmppc_is_guest_secure(kvm))
+		return ref;
+
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
 		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1026,9 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned int shift;
 	int ref = 0;
 
+	if (kvmppc_is_guest_secure(kvm))
+		return ref;
+
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep))
 		ref = 1;
@@ -1030,6 +1046,9 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 	int ret = 0;
 	unsigned long old, *rmapp;
 
+
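The book3s_hv_uvmem.c half of this patch is not shown above. Going only
by the commit message (fault on a secure guest's page means push that
page to the UV via UV_PAGE_IN), kvmppc_send_page_to_uv() plausibly looks
something like the sketch below; treat the details as illustrative
rather than the literal v9 code:

	int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa)
	{
		unsigned long gfn = gpa >> PAGE_SHIFT;
		unsigned long pfn;

		pfn = gfn_to_pfn(kvm, gfn);
		if (is_error_noslot_pfn(pfn))
			return -EFAULT;

		/* Hand the backing page over to the ultravisor. */
		if (uv_page_in(kvm->arch.lpid, pfn << PAGE_SHIFT, gpa, 0,
			       PAGE_SHIFT)) {
			kvm_release_pfn_clean(pfn);
			return -EFAULT;
		}

		kvm_release_pfn_clean(pfn);
		return RESUME_GUEST;
	}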
[PATCH v9 5/8] KVM: PPC: Handle memory plug/unplug to secure VM
Register the new memslot with UV during plug and unregister the memslot
during unplug.

Signed-off-by: Bharata B Rao
Acked-by: Paul Mackerras
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h     |  5 +++++
 arch/powerpc/kvm/book3s_hv.c              | 21 +++++++++++++++++++++
 3 files changed, 27 insertions(+)

diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index c578d9b13a56..46b1ee381695 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -26,6 +26,7 @@
 #define UV_WRITE_PATE			0xF104
 #define UV_RETURN			0xF11C
 #define UV_REGISTER_MEM_SLOT		0xF120
+#define UV_UNREGISTER_MEM_SLOT		0xF124
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index 58ccf5e2d6bb..719c0c3930b9 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -52,4 +52,9 @@ static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
 			    size, flags, slotid);
 }
 
+static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
+{
+	return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
+}
+
 #endif	/* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3ba27fed3018..c5320cc0a534 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -74,6 +74,7 @@
 #include
 #include
 #include
+#include <asm/ultravisor.h>
 
 #include "book3s.h"
 
@@ -4517,6 +4518,26 @@ static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
 	if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
 	    ((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
 		kvmppc_radix_flush_memslot(kvm, old);
+	/*
+	 * If UV hasn't yet called H_SVM_INIT_START, don't register memslots.
+	 */
+	if (!kvm->arch.secure_guest)
+		return;
+
+	switch (change) {
+	case KVM_MR_CREATE:
+		uv_register_mem_slot(kvm->arch.lpid,
+				     new->base_gfn << PAGE_SHIFT,
+				     new->npages * PAGE_SIZE,
+				     0, new->id);
+		break;
+	case KVM_MR_DELETE:
+		uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+		break;
+	default:
+		/* TODO: Handle KVM_MR_MOVE */
+		break;
+	}
 }
 
 /*
-- 
2.21.0
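The switch leaves KVM_MR_MOVE as a TODO. Purely as an illustration (not
part of the patch, and untested), a move could presumably be handled as
an unregister of the old slot followed by a re-register of the new one:

	case KVM_MR_MOVE:
		/* Hypothetical: not in this patch. */
		uv_unregister_mem_slot(kvm->arch.lpid, old->id);
		uv_register_mem_slot(kvm->arch.lpid,
				     new->base_gfn << PAGE_SHIFT,
				     new->npages * PAGE_SIZE,
				     0, new->id);
		break;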
[PATCH v9 4/8] KVM: PPC: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls
H_SVM_INIT_START: Initiate securing a VM
H_SVM_INIT_DONE: Conclude securing a VM

As part of H_SVM_INIT_START, register all existing memslots with the UV.
H_SVM_INIT_DONE call by UV informs HV that transition of the guest to
secure mode is complete.

These two states (transition to secure mode STARTED and transition to
secure mode COMPLETED) are recorded in kvm->arch.secure_guest. Setting
these states will cause the assembly code that enters the guest to call
the UV_RETURN ucall instead of trying to enter the guest directly.

Signed-off-by: Bharata B Rao
Acked-by: Paul Mackerras
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/hvcall.h           |  2 ++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 12 ++++++++
 arch/powerpc/include/asm/kvm_host.h         |  4 +++
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h       |  7 +++++
 arch/powerpc/kvm/book3s_hv.c                |  7 +++++
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 34 +++++++++++++++++++++
 7 files changed, 67 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 4e98dd992bd1..13bd870609c3 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -348,6 +348,8 @@
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN		0xEF00
 #define H_SVM_PAGE_OUT		0xEF04
+#define H_SVM_INIT_START	0xEF08
+#define H_SVM_INIT_DONE		0xEF0C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR	1
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 9603c2b48d67..fc924ef00b91 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -11,6 +11,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 				    unsigned long gra,
 				    unsigned long flags,
 				    unsigned long page_shift);
+unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
+unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -25,5 +27,15 @@ kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
 {
 	return H_UNSUPPORTED;
 }
+
+static inline unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+	return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
+{
+	return H_UNSUPPORTED;
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index a2e7502346a3..726d35eb3bfe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -281,6 +281,10 @@ struct kvm_hpt_info {
 
 struct kvm_resize_hpt;
 
+/* Flag values for kvm_arch.secure_guest */
+#define KVMPPC_SECURE_INIT_START 0x1 /* H_SVM_INIT_START has been called */
+#define KVMPPC_SECURE_INIT_DONE  0x2 /* H_SVM_INIT_DONE completed */
+
 struct kvm_arch {
 	unsigned int lpid;
 	unsigned int smt_mode;		/* # vcpus per virtual core */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index 1cd1f595fd81..c578d9b13a56 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,6 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE			0xF104
 #define UV_RETURN			0xF11C
+#define UV_REGISTER_MEM_SLOT		0xF120
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index 0fc4a974b2e8..58ccf5e2d6bb 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -45,4 +45,11 @@ static inline int uv_page_out(u64 lpid, u64 dst_ra, u64 src_gpa, u64 flags,
 			    page_shift);
 }
 
+static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
+				       u64 flags, u64 slotid)
+{
+	return ucall_norets(UV_REGISTER_MEM_SLOT, lpid, start_gpa,
+			    size, flags, slotid);
+}
+
 #endif	/* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ef532cce85f9..3ba27fed3018 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1089,6 +1089,13 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 					kvmppc_get_gpr(vcpu, 5),
 					kvmppc_get_gpr(vcpu, 6));
 		break;
+	case H_SVM_INIT_START:
+		ret = kvmppc_h_svm_init_start(vcpu->kvm);
+		break;
+
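The book3s_hv_uvmem.c hunk of this patch is truncated above. From the
commit message alone ("register all existing memslots with the UV", then
record the state), kvmppc_h_svm_init_start() has to look roughly like
the sketch below; the exact locking and error handling in v9 may differ:

	unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
	{
		struct kvm_memslots *slots;
		struct kvm_memory_slot *memslot;
		unsigned long ret = H_SUCCESS;
		int srcu_idx;

		srcu_idx = srcu_read_lock(&kvm->srcu);
		slots = kvm_memslots(kvm);
		kvm_for_each_memslot(memslot, slots) {
			if (uv_register_mem_slot(kvm->arch.lpid,
						 memslot->base_gfn << PAGE_SHIFT,
						 memslot->npages * PAGE_SIZE,
						 0, memslot->id)) {
				ret = H_PARAMETER;
				goto out;
			}
		}
		/* From here on, guest entry goes through UV_RETURN. */
		kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;
	out:
		srcu_read_unlock(&kvm->srcu, srcu_idx);
		return ret;
	}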
[PATCH v9 3/8] KVM: PPC: Shared pages support for secure guests
A secure guest will share some of its pages with the hypervisor (e.g.
virtio bounce buffers). Support sharing of pages between the hypervisor
and the ultravisor.

A shared page is reachable via both HV and UV side page tables. Once a
secure page is converted to a shared page, the device page that
represents the secure page is unmapped from the HV side page tables.

Signed-off-by: Bharata B Rao
---
 arch/powerpc/include/asm/hvcall.h  |  3 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c | 86 ++++++++++++++++++++++++++++--
 2 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 2595d0144958..4e98dd992bd1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,9 @@
 #define H_TLB_INVALIDATE	0xF808
 #define H_COPY_TOFROM_GUEST	0xF80C
 
+/* Flags for H_SVM_PAGE_IN */
+#define H_PAGE_IN_SHARED	0x1
+
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN		0xEF00
 #define H_SVM_PAGE_OUT		0xEF04
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 312f0fedde0b..5e5b5a3e9eec 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -19,7 +19,10 @@
  * available in the platform for running secure guests is hotplugged.
  * Whenever a page belonging to the guest becomes secure, a page from this
  * private device memory is used to represent and track that secure page
- * on the HV side.
+ * on the HV side. Some pages (like virtio buffers, VPA pages etc) are
+ * shared between UV and HV. However such pages aren't represented by
+ * device private memory and mappings to shared memory exist in both
+ * UV and HV page tables.
  *
  * For each page that gets moved into secure memory, a device PFN is used
  * on the HV side and migration PTE corresponding to that PFN would be
@@ -80,6 +83,7 @@ struct kvmppc_uvmem_page_pvt {
 	unsigned long *rmap;
 	struct kvm *kvm;
 	unsigned long gpa;
+	bool skip_page_out;
 };
 
 /*
@@ -190,8 +194,70 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
 	return ret;
 }
 
+/*
+ * Shares the page with HV, thus making it a normal page.
+ *
+ * - If the page is already secure, then provision a new page and share
+ * - If the page is a normal page, share the existing page
+ *
+ * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
+ * to unmap the device page from QEMU's page tables.
+ */
+static unsigned long
+kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+{
+
+	int ret = H_PARAMETER;
+	struct page *uvmem_page;
+	struct kvmppc_uvmem_page_pvt *pvt;
+	unsigned long pfn;
+	unsigned long *rmap;
+	struct kvm_memory_slot *slot;
+	unsigned long gfn = gpa >> page_shift;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&kvm->srcu);
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!slot)
+		goto out;
+
+	rmap = &slot->arch.rmap[gfn - slot->base_gfn];
+	mutex_lock(&kvm->arch.uvmem_lock);
+	if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
+		uvmem_page = pfn_to_page(*rmap & ~KVMPPC_RMAP_UVMEM_PFN);
+		pvt = uvmem_page->zone_device_data;
+		pvt->skip_page_out = true;
+	}
+
+retry:
+	mutex_unlock(&kvm->arch.uvmem_lock);
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (is_error_noslot_pfn(pfn))
+		goto out;
+
+	mutex_lock(&kvm->arch.uvmem_lock);
+	if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
+		uvmem_page = pfn_to_page(*rmap & ~KVMPPC_RMAP_UVMEM_PFN);
+		pvt = uvmem_page->zone_device_data;
+		pvt->skip_page_out = true;
+		kvm_release_pfn_clean(pfn);
+		goto retry;
+	}
+
+	if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, page_shift))
+		ret = H_SUCCESS;
+	kvm_release_pfn_clean(pfn);
+	mutex_unlock(&kvm->arch.uvmem_lock);
+out:
+	srcu_read_unlock(&kvm->srcu, srcu_idx);
+	return ret;
+}
+
 /*
  * H_SVM_PAGE_IN: Move page from normal memory to secure memory.
+ *
+ * H_PAGE_IN_SHARED flag makes the page shared, which means that the
+ * same memory is visible from both UV and HV.
  */
 unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
@@ -208,9 +274,12 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 	if (page_shift != PAGE_SHIFT)
 		return H_P3;
 
-	if (flags)
+	if (flags & ~H_PAGE_IN_SHARED)
 		return H_P2;
 
+	if (flags & H_PAGE_IN_SHARED)
+		return kvmppc_share_page(kvm, gpa, page_shift);
+
 	ret = H_PARAMETER;
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	down_read(&kvm->mm->mmap_sem);
@@ -292,8 +361,17 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start,
 	pvt = spage->zone_device_data;
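The kvmppc_svm_page_out() hunk is cut off right after pvt is fetched,
but the purpose of the new skip_page_out flag follows from the code
above: once a page has been shared, its secure-side copy is stale, so
there is nothing to pull back from the UV on page-out. The consuming
side is presumably along these lines (a sketch, not the literal hunk):

	/* In the page-out path, after pvt = spage->zone_device_data: */
	if (!pvt->skip_page_out)
		ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
				  pvt->gpa, 0, page_shift);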
[PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory
Manage migration of pages between normal and secure memory of a secure
guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to secure page
H_SVM_PAGE_OUT: Move the content of a secure page to normal page

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from this
private device memory is used to represent and track that secure page
on the HV side. The movement of pages between normal and secure memory
is done via migrate_vma_pages() using UV_PAGE_IN and UV_PAGE_OUT ucalls.

Signed-off-by: Bharata B Rao
---
 arch/powerpc/include/asm/hvcall.h           |   4 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  29 ++
 arch/powerpc/include/asm/kvm_host.h         |  13 +
 arch/powerpc/include/asm/ultravisor-api.h   |   2 +
 arch/powerpc/include/asm/ultravisor.h       |  14 +
 arch/powerpc/kvm/Makefile                   |   3 +
 arch/powerpc/kvm/book3s_hv.c                |  20 +
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 481 ++++++++++++++++++++
 8 files changed, 566 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 2023e327..2595d0144958 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,10 @@
 #define H_TLB_INVALIDATE	0xF808
 #define H_COPY_TOFROM_GUEST	0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN		0xEF00
+#define H_SVM_PAGE_OUT		0xEF04
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR	1
 #define H_SET_MODE_RESOURCE_SET_DAWR	2
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
new file mode 100644
index ..9603c2b48d67
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __POWERPC_KVM_PPC_HMM_H__
+#define __POWERPC_KVM_PPC_HMM_H__
+
+#ifdef CONFIG_PPC_UV
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+				   unsigned long gra,
+				   unsigned long flags,
+				   unsigned long page_shift);
+unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+				    unsigned long gra,
+				    unsigned long flags,
+				    unsigned long page_shift);
+#else
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
+		     unsigned long flags, unsigned long page_shift)
+{
+	return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
+		      unsigned long flags, unsigned long page_shift)
+{
+	return H_UNSUPPORTED;
+}
+#endif /* CONFIG_PPC_UV */
+#endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 81cd221ccc04..a2e7502346a3 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -336,6 +336,7 @@ struct kvm_arch {
 #endif
 	struct kvmppc_ops *kvm_ops;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	struct mutex uvmem_lock;
 	struct mutex mmu_setup_lock;	/* nests inside vcpu mutexes */
 	u64 l1_ptcr;
 	int max_nested_lpid;
@@ -869,4 +870,16 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
+#ifdef CONFIG_PPC_UV
+int kvmppc_uvmem_init(void);
+void kvmppc_uvmem_free(void);
+#else
+static inline int kvmppc_uvmem_init(void)
+{
+	return 0;
+}
+
+static inline void kvmppc_uvmem_free(void) {}
+#endif /* CONFIG_PPC_UV */
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index 6a0f9c74f959..1cd1f595fd81 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,5 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE			0xF104
 #define UV_RETURN			0xF11C
+#define UV_PAGE_IN			0xF128
+#define UV_PAGE_OUT			0xF12C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index d7aa97aa7834..0fc4a974b2e8 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -31,4 +31,18 @@ static inline int uv_register_pate(u64 lpid, u64 dw0, u64 dw1)
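The ultravisor.h hunk stops at the @@ header, but the two wrappers it
adds can be inferred from the uv_page_out() signature quoted in patch
4/8 above and from the ucall_norets() pattern used by every other
wrapper in this file. A sketch:

	static inline int uv_page_in(u64 lpid, u64 src_ra, u64 dst_gpa,
				     u64 flags, u64 page_shift)
	{
		return ucall_norets(UV_PAGE_IN, lpid, src_ra, dst_gpa,
				    flags, page_shift);
	}

	static inline int uv_page_out(u64 lpid, u64 dst_ra, u64 src_gpa,
				      u64 flags, u64 page_shift)
	{
		return ucall_norets(UV_PAGE_OUT, lpid, dst_ra, src_gpa,
				    flags, page_shift);
	}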
[PATCH v9 1/8] KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot
From: Suraj Jitindar Singh

The rmap array in the guest memslot is an array of size number of guest
pages, allocated at memslot creation time. Each rmap entry in this array
is used to store information about the guest page to which it
corresponds. For example, for an HPT guest it is used to store a lock
bit, RC bits, a present bit and the index of an HPT entry in the guest
HPT which maps this page. For a radix guest which is running nested
guests it is used to store a pointer to a linked list of nested rmap
entries which store the nested guest physical address which maps this
guest address and for which there is a pte in the shadow page table.

As there are currently two uses for the rmap array, and the potential
for this to expand to more in the future, define a type field (being
the top 8 bits of the rmap entry) to be used to define the type of the
rmap entry which is currently present, and define two values for this
field for the two current uses of the rmap array.

Since the nested case uses the rmap entry to store a pointer, define
this type as having the two high bits set, as is expected for a
pointer. Define the HPT entry type as having bit 56 set (bit 7 in IBM
bit ordering).

Signed-off-by: Suraj Jitindar Singh
Signed-off-by: Paul Mackerras
Signed-off-by: Bharata B Rao
	[Added rmap type KVMPPC_RMAP_UVMEM_PFN]
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/kvm_host.h | 28 ++++++++++++++++++++++-----
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 4bb552d639b8..81cd221ccc04 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -232,11 +232,31 @@ struct revmap_entry {
 };
 
 /*
- * We use the top bit of each memslot->arch.rmap entry as a lock bit,
- * and bit 32 as a present flag. The bottom 32 bits are the
- * index in the guest HPT of a HPTE that points to the page.
+ * The rmap array of size number of guest pages is allocated for each memslot.
+ * This array is used to store usage specific information about the guest page.
+ * Below are the encodings of the various possible usage types.
  */
-#define KVMPPC_RMAP_LOCK_BIT	63
+/* Free bits which can be used to define a new usage */
+#define KVMPPC_RMAP_TYPE_MASK	0xff00000000000000
+#define KVMPPC_RMAP_NESTED	0xc000000000000000	/* Nested rmap array */
+#define KVMPPC_RMAP_HPT		0x0100000000000000	/* HPT guest */
+#define KVMPPC_RMAP_UVMEM_PFN	0x0200000000000000	/* Secure GPA */
+
+static inline unsigned long kvmppc_rmap_type(unsigned long *rmap)
+{
+	return (*rmap & KVMPPC_RMAP_TYPE_MASK);
+}
+
+/*
+ * rmap usage definition for a hash page table (hpt) guest:
+ * 0x0000080000000000	Lock bit
+ * 0x0000018000000000	RC bits
+ * 0x0000000100000000	Present bit
+ * 0x00000000ffffffff	HPT index bits
+ * The bottom 32 bits are the index in the guest HPT of a HPTE that points to
+ * the page.
+ */
+#define KVMPPC_RMAP_LOCK_BIT	43
 #define KVMPPC_RMAP_RC_SHIFT	32
 #define KVMPPC_RMAP_REFERENCED	(HPTE_R_R << KVMPPC_RMAP_RC_SHIFT)
 #define KVMPPC_RMAP_PRESENT	0x100000000ul
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..7186c65c61c9 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -99,7 +99,7 @@ void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 	} else {
 		rev->forw = rev->back = pte_index;
 		*rmap = (*rmap & ~KVMPPC_RMAP_INDEX) |
-			pte_index | KVMPPC_RMAP_PRESENT;
+			pte_index | KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_HPT;
 	}
 	unlock_rmap(rmap);
 }
-- 
2.21.0
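To make the encoding concrete: storing a secure-page device PFN in an
rmap entry and recovering it later reduces to masking against the type
bits. A sketch using the new helper (the actual store/lookup sites
arrive in later patches of the secure-guest series):

	/* Tag this gfn's rmap entry as holding a device PFN. */
	*rmap = uvmem_pfn | KVMPPC_RMAP_UVMEM_PFN;

	/* ...and on lookup, check the type before trusting the payload: */
	if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN)
		uvmem_pfn = *rmap & ~KVMPPC_RMAP_UVMEM_PFN;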
[PATCH v9 0/8] KVM: PPC: Driver to manage pages of secure guest
[The main change in this version is the introduction of new locking
 to prevent concurrent page-in and page-out calls. More details about
 this are present in patch 2/8]

Hi,

A pseries guest can be run as a secure guest on Ultravisor-enabled
POWER platforms. On such platforms, this driver will be used to manage
the movement of guest pages between the normal memory managed by the
hypervisor (HV) and secure memory managed by the Ultravisor (UV).

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from this
private device memory is used to represent and track that secure page
on the HV side. The movement of pages between normal and secure memory
is done via migrate_vma_pages(). The reverse movement is driven via
pagemap_ops.migrate_to_ram().

The page-in or page-out requests from UV will come to HV as hcalls and
HV will call back into UV via uvcalls to satisfy these page requests.

These patches are against hmm.git
(https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm)
plus Claudio Carvalho's base ultravisor enablement patches that are
present in Michael Ellerman's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=topic/ppc-kvm)

These patches along with Claudio's above patches are required to run
secure pseries guests on KVM. This patchset is based on hmm.git because
hmm.git has the migrate_vma cleanup and not-device memremap_pages
patchsets that are required by this patchset.

Changes in v9
=============
- Prevent concurrent page-in and page-out calls.
- Ensure device PFNs are allocated for zero-pages that are sent to UV.
- Failure to migrate a page during page-in will now return error via
  hcall.
- Address review comments by Suka
- Misc cleanups

v8: https://lore.kernel.org/linux-mm/20190910082946.7849-2-bhar...@linux.ibm.com/T/

Anshuman Khandual (1):
  KVM: PPC: Ultravisor: Add PPC_UV config option

Bharata B Rao (6):
  KVM: PPC: Move pages between normal and secure memory
  KVM: PPC: Shared pages support for secure guests
  KVM: PPC: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls
  KVM: PPC: Handle memory plug/unplug to secure VM
  KVM: PPC: Radix changes for secure guest
  KVM: PPC: Support reset of secure guest

Suraj Jitindar Singh (1):
  KVM: PPC: Book3S HV: Define usage types for rmap array in guest
    memslot

 Documentation/virt/kvm/api.txt              |  19 +
 arch/powerpc/Kconfig                        |  17 +
 arch/powerpc/include/asm/hvcall.h           |   9 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  48 ++
 arch/powerpc/include/asm/kvm_host.h         |  57 +-
 arch/powerpc/include/asm/kvm_ppc.h          |   2 +
 arch/powerpc/include/asm/ultravisor-api.h   |   6 +
 arch/powerpc/include/asm/ultravisor.h       |  36 ++
 arch/powerpc/kvm/Makefile                   |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c      |  22 +
 arch/powerpc/kvm/book3s_hv.c                | 122 ++++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c         |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 673 ++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                  |  12 +
 include/uapi/linux/kvm.h                    |   1 +
 15 files changed, 1024 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

-- 
2.21.0
RE: [v3,3/3] Documentation: dt: binding: fsl: Add 'fsl,ippdexpcr-alt-addr' property
> > > > > > > > > The 'fsl,ippdexpcr-alt-addr' property is used to handle an
> > > > > > > > > erratum A-008646 on LS1021A
> > > > > > > > >
> > > > > > > > > Signed-off-by: Biwen Li
> > > > > > > > > ---
> > > > > > > > > Change in v3:
> > > > > > > > > 	- rename property name
> > > > > > > > > 	  fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr
> > > > > > > > >
> > > > > > > > > Change in v2:
> > > > > > > > > 	- update desc of the property 'fsl,rcpm-scfg'
> > > > > > > > >
> > > > > > > > >  Documentation/devicetree/bindings/soc/fsl/rcpm.txt | 14 ++++++++++++++
> > > > > > > > >  1 file changed, 14 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > index 5a33619d881d..157dcf6da17c 100644
> > > > > > > > > --- a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > +++ b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > @@ -34,6 +34,11 @@ Chassis Version		Example Chips
> > > > > > > > >  Optional properties:
> > > > > > > > >   - little-endian : RCPM register block is Little Endian. Without it
> > > > > > > > >     RCPM will be Big Endian (default case).
> > > > > > > > > + - fsl,ippdexpcr-alt-addr : Must add the property for SoC LS1021A,
> > > > > > > >
> > > > > > > > You probably should mention this is related to a hardware issue on
> > > > > > > > LS1021a and only needed on LS1021a.
> > > > > > >
> > > > > > > Okay, got it, thanks, I will add this in v4.
> > > > > > >
> > > > > > > > > +   Must include n + 1 entries (n = #fsl,rcpm-wakeup-cells, such as:
> > > > > > > > > +   #fsl,rcpm-wakeup-cells equal to 2, then must include 2 + 1 entries).
> > > > > > > >
> > > > > > > > #fsl,rcpm-wakeup-cells is the number of IPPDEXPCR registers on an SoC.
> > > > > > > > However you are defining an offset to scfg registers here. Why are
> > > > > > > > these two related? The length here should actually be related to the
> > > > > > > > #address-cells of the soc/. But since this is only needed for LS1021,
> > > > > > > > you can just make it 3.
> > > > > > >
> > > > > > > I need to set the value of the IPPDEXPCR registers from the ftm_alarm0
> > > > > > > device node (fsl,rcpm-wakeup = <&rcpm 0x0 0x2000>; 0x0 is a value for
> > > > > > > IPPDEXPCR0, 0x2000 is a value for IPPDEXPCR1). But because of the
> > > > > > > hardware issue on LS1021A, I need to store the value of the IPPDEXPCR
> > > > > > > registers to an alt address. So I define an offset to the scfg
> > > > > > > registers; the RCPM driver then gets an absolute address from the
> > > > > > > offset and writes the value of the IPPDEXPCR registers to these
> > > > > > > absolute addresses (backing up the value of the IPPDEXPCR registers).
> > > > > >
> > > > > > I understand what you are trying to do. The problem is that the new
> > > > > > fsl,ippdexpcr-alt-addr property contains a phandle and an offset. The
> > > > > > size of it shouldn't be related to #fsl,rcpm-wakeup-cells.
> > > > >
> > > > > Maybe like this: fsl,ippdexpcr-alt-addr = <&scfg 0x51c>; /* SCFG_SPARECR8 */
> > > >
> > > > No. The #address-cells for the soc/ is 2, so the offset to scfg should
> > > > be 0x0 0x51c. The total size should be 3, but it shouldn't be coming
> > > > from #fsl,rcpm-wakeup-cells like you mentioned in the binding.
> > >
> > > Oh, I got it. You want fsl,ippdexpcr-alt-addr to be relative to
> > > #address-cells instead of #fsl,rcpm-wakeup-cells.
> >
> > Yes.
> >
> > Regards,
> > Leo

I got an example from drivers/pci/controller/dwc/pci-layerscape.c and
arch/arm/boot/dts/ls1021a.dtsi as follows:

	fsl,pcie-scfg = <&scfg 0>;	/* 0 is an index */

In my fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x51c>, it means that 0x0 is
an alt offset address for IPPDEXPCR0 and 0x51c is an alt offset address
for IPPDEXPCR1, instead of 0x0 and 0x51c composing one alt address for
SCFG_SPARECR8. Maybe I need to write it as:

	fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x0 0x0 0x51c>;

where the first two cells (0x0 0x0) compose an alt address for
IPPDEXPCR0 and the last two (0x0 0x51c) compose an alt address for
IPPDEXPCR1.

Best Regards,
Biwen Li

> > > > > > > > > +   The first entry must be a link to the SCFG device node.
> > > > > > > > > +   The non-first entry must be offset of registers of SCFG.
> > > > > > > > >
> > > > > > > > > Example:
> > > > > > > > > The RCPM node for T4240:
> > > > > > > > > @@ -43,6 +48,15 @@ The RCPM node for T4240:
> > > > > > > > > 		#fsl,rcpm-wakeup-cells = <2>;
> > > > > > > > > 	};
> > > > > > > > >
> > > > > > > > > +The RCPM node for LS1021A:
> > > > > > > > > +	rcpm: rcpm@1ee2140 {
> > > > > > > > > +
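Pulling the thread together, the two candidate shapes for the LS1021A
example node are sketched below. This is a reconstruction of the
discussion, not an agreed binding -- the phandle target and the cell
layout are exactly what is still under debate (node contents abridged):

	/* Leo's suggestion: one alternate address, offset in 2 cells */
	rcpm: rcpm@1ee2140 {
		...
		#fsl,rcpm-wakeup-cells = <2>;
		fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x51c>; /* SCFG_SPARECR8 */
	};

	/* Biwen's counter-proposal: one 2-cell alt address per IPPDEXPCR */
	fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x0 0x0 0x51c>;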
[PATCH v4 5/5] Powerpc/Watchpoint: Support for 8xx in ptrace-hwbreak.c selftest
On the 8xx, signals are generated after executing the instruction, so
there is no need to manually single-step on 8xx.

Signed-off-by: Ravi Bangoria
---
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 26 ++++++++++++++-----
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index 654131591fca..58505277346d 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -22,6 +22,11 @@
 #include
 #include "ptrace.h"
 
+#define SPRN_PVR	0x11F
+#define PVR_8xx		0x00500000
+
+bool is_8xx;
+
 /*
  * Use volatile on all global var so that compiler doesn't
  * optimise their load/stores. Otherwise selftest can fail.
@@ -205,13 +210,15 @@ static void check_success(pid_t child_pid, const char *name, const char *type,
 
 	printf("%s, %s, len: %d: Ok\n", name, type, len);
 
-	/*
-	 * For ptrace registered watchpoint, signal is generated
-	 * before executing load/store. Singlestep the instruction
-	 * and then continue the test.
-	 */
-	ptrace(PTRACE_SINGLESTEP, child_pid, NULL, 0);
-	wait(NULL);
+	if (!is_8xx) {
+		/*
+		 * For ptrace registered watchpoint, signal is generated
+		 * before executing load/store. Singlestep the instruction
+		 * and then continue the test.
+		 */
+		ptrace(PTRACE_SINGLESTEP, child_pid, NULL, 0);
+		wait(NULL);
+	}
 }
 
 static void ptrace_set_debugreg(pid_t child_pid, unsigned long wp_addr)
@@ -489,5 +496,10 @@ static int ptrace_hwbreak(void)
 
 int main(int argc, char **argv, char **envp)
 {
+	int pvr = 0;
+	asm __volatile__ ("mfspr %0,%1" : "=r"(pvr) : "i"(SPRN_PVR));
+	if (pvr == PVR_8xx)
+		is_8xx = true;
+
 	return test_harness(ptrace_hwbreak, "ptrace-hwbreak");
 }
-- 
2.21.0
[PATCH v4 4/5] Powerpc/Watchpoint: Add dar outside test in perf-hwbreak.c selftest
So far, we used to ignore the exception if the DAR points outside of the
user-specified range. But now we ignore it only if the actual load/store
range does not overlap with the user-specified range. Include selftests
for the same:

  # ./tools/testing/selftests/powerpc/ptrace/perf-hwbreak
  ...
  TESTED: No overlap
  TESTED: Partial overlap
  TESTED: Partial overlap
  TESTED: No overlap
  TESTED: Full overlap
  success: perf_hwbreak

Signed-off-by: Ravi Bangoria
---
 .../selftests/powerpc/ptrace/perf-hwbreak.c   | 111 +++++++++++++++++-
 1 file changed, 110 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
index 200337daec42..389c545675c6 100644
--- a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
@@ -148,6 +148,113 @@ static int runtestsingle(int readwriteflag, int exclude_user, int arraytest)
 	return 0;
 }
 
+static int runtest_dar_outside(void)
+{
+	volatile char target[8];
+	volatile __u16 temp16;
+	volatile __u64 temp64;
+	struct perf_event_attr attr;
+	int break_fd;
+	unsigned long long breaks;
+	int fail = 0;
+	size_t res;
+
+	/* setup counters */
+	memset(&attr, 0, sizeof(attr));
+	attr.disabled = 1;
+	attr.type = PERF_TYPE_BREAKPOINT;
+	attr.exclude_kernel = 1;
+	attr.exclude_hv = 1;
+	attr.exclude_guest = 1;
+	attr.bp_type = HW_BREAKPOINT_RW;
+	/* watch middle half of target array */
+	attr.bp_addr = (__u64)(target + 2);
+	attr.bp_len = 4;
+	break_fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+	if (break_fd < 0) {
+		perror("sys_perf_event_open");
+		exit(1);
+	}
+
+	/* Shouldn't hit. */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)target);
+	*((__u16 *)target) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 0) {
+		printf("TESTED: No overlap\n");
+	} else {
+		printf("FAILED: No overlap: %lld != 0\n", breaks);
+		fail = 1;
+	}
+
+	/* Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)(target + 1));
+	*((__u16 *)(target + 1)) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 2) {
+		printf("TESTED: Partial overlap\n");
+	} else {
+		printf("FAILED: Partial overlap: %lld != 2\n", breaks);
+		fail = 1;
+	}
+
+	/* Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)(target + 5));
+	*((__u16 *)(target + 5)) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 2) {
+		printf("TESTED: Partial overlap\n");
+	} else {
+		printf("FAILED: Partial overlap: %lld != 2\n", breaks);
+		fail = 1;
+	}
+
+	/* Shouldn't Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)(target + 6));
+	*((__u16 *)(target + 6)) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 0) {
+		printf("TESTED: No overlap\n");
+	} else {
+		printf("FAILED: No overlap: %lld != 0\n", breaks);
+		fail = 1;
+	}
+
+	/* Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp64 = *((__u64 *)target);
+	*((__u64 *)target) = temp64;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 2) {
+		printf("TESTED: Full overlap\n");
+	} else {
+		printf("FAILED: Full overlap: %lld != 2\n", breaks);
+		fail = 1;
+	}
+
+	close(break_fd);
+	return fail;
+}
+
 static int runtest(void)
 {
 	int rwflag;
@@ -172,7 +279,9 @@ static int runtest(void)
 			return ret;
 		}
 	}
-	return 0;
+
+	ret = runtest_dar_outside();
+	return ret;
 }
-- 
2.21.0
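All five expected counts follow from one closed-interval overlap check
between the watched bytes (target+2 .. target+5) and the accessed bytes;
this mirrors dar_range_overlaps() from patch 2/5:

	/* Byte ranges [a, a+alen) and [b, b+blen) overlap iff: */
	static bool ranges_overlap(unsigned long a, int alen,
				   unsigned long b, int blen)
	{
		return (a <= b + blen - 1) && (b <= a + alen - 1);
	}

	/* e.g. the 2-byte access at target+1 touches bytes 1..2, which
	 * overlaps 2..5 -> counted twice (one load + one store); the
	 * access at target+6 touches 6..7, no overlap -> not counted. */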
[PATCH v4 3/5] Powerpc/Watchpoint: Rewrite ptrace-hwbreak.c selftest
The ptrace-hwbreak.c selftest is logically broken. On powerpc, when a
watchpoint is created with ptrace, signals are generated before
executing the instruction, and the user has to manually single-step the
instruction with the watchpoint disabled -- which the selftest never
does, so it keeps getting the signal at the same instruction. If we fix
that, the selftest fails because the logical connection between the
tracer (parent) and the tracee (child) is also broken. Rewrite the
selftest and add new tests for unaligned access.

With patch:
  $ ./tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak
  test: ptrace-hwbreak
  tags: git_version:powerpc-5.3-4-224-g218b868240c7-dirty
  PTRACE_SET_DEBUGREG, WO, len: 1: Ok
  PTRACE_SET_DEBUGREG, WO, len: 2: Ok
  PTRACE_SET_DEBUGREG, WO, len: 4: Ok
  PTRACE_SET_DEBUGREG, WO, len: 8: Ok
  PTRACE_SET_DEBUGREG, RO, len: 1: Ok
  PTRACE_SET_DEBUGREG, RO, len: 2: Ok
  PTRACE_SET_DEBUGREG, RO, len: 4: Ok
  PTRACE_SET_DEBUGREG, RO, len: 8: Ok
  PTRACE_SET_DEBUGREG, RW, len: 1: Ok
  PTRACE_SET_DEBUGREG, RW, len: 2: Ok
  PTRACE_SET_DEBUGREG, RW, len: 4: Ok
  PTRACE_SET_DEBUGREG, RW, len: 8: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, WO, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RO, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RW, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, WO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, WO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, RO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, DAR OUTSIDE, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, DAWR_MAX_LEN, RW, len: 512: Ok
  success: ptrace-hwbreak

Signed-off-by: Ravi Bangoria
---
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 571 +++++++++++-------
 1 file changed, 361 insertions(+), 210 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index 3066d310f32b..654131591fca 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -22,318 +22,469 @@
 #include
 #include "ptrace.h"
 
-/* Breakpoint access modes */
-enum {
-	BP_X = 1,
-	BP_RW = 2,
-	BP_W = 4,
-};
-
-static pid_t child_pid;
-static struct ppc_debug_info dbginfo;
-
-static void get_dbginfo(void)
-{
-	int ret;
-
-	ret = ptrace(PPC_PTRACE_GETHWDBGINFO, child_pid, NULL, &dbginfo);
-	if (ret) {
-		perror("Can't get breakpoint info\n");
-		exit(-1);
-	}
-}
-
-static bool hwbreak_present(void)
-{
-	return (dbginfo.num_data_bps != 0);
-}
+/*
+ * Use volatile on all global var so that compiler doesn't
+ * optimise their load/stores. Otherwise selftest can fail.
+ */
+static volatile __u64 glvar;
 
-static bool dawr_present(void)
-{
-	return !!(dbginfo.features & PPC_DEBUG_FEATURE_DATA_BP_DAWR);
-}
+#define DAWR_MAX_LEN 512
+static volatile __u8 big_var[DAWR_MAX_LEN] __attribute__((aligned(512)));
 
-static void set_breakpoint_addr(void *addr)
-{
-	int ret;
+#define A_LEN 6
+#define B_LEN 6
+struct gstruct {
+	__u8 a[A_LEN]; /* double word aligned */
+	__u8 b[B_LEN]; /* double word unaligned */
+};
+static volatile struct gstruct gstruct __attribute__((aligned(512)));
 
-	ret = ptrace(PTRACE_SET_DEBUGREG, child_pid, 0, addr);
-	if (ret) {
-		perror("Can't set breakpoint addr\n");
-		exit(-1);
-	}
-}
 
-static int set_hwbreakpoint_addr(void *addr, int range)
+static void get_dbginfo(pid_t child_pid, struct ppc_debug_info *dbginfo)
 {
-	int ret;
-
-	struct ppc_hw_breakpoint info;
-
-	info.version = 1;
-	info.trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
-	info.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
-	if (range > 0)
-		info.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
-	info.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
-	info.addr = (__u64)addr;
-	info.addr2 = (__u64)addr + range;
-	info.condition_value = 0;
-
-	ret = ptrace(PPC_PTRACE_SETHWDEBUG, child_pid, 0, &info);
-	if (ret < 0) {
-		perror("Can't set breakpoint\n");
+	if (ptrace(PPC_PTRACE_GETHWDBGINFO, child_pid, NULL, dbginfo)) {
+		perror("Can't get breakpoint info");
 		exit(-1);
 	}
-	return ret;
 }
 
-static int del_hwbreakpoint_addr(int watchpoint_handle)
+static bool dawr_present(struct ppc_debug_info *dbginfo)
 {
-	int ret;
-
-	ret = ptrace(PPC_PTRACE_DELHWDEBUG, child_pid, 0, watchpoint_handle);
-	if (ret < 0) {
-		perror("Can't delete hw breakpoint\n");
-		exit(-1);
-	}
-	return ret;
+	return !!(dbginfo->features & PPC_DEBUG_FEATURE_DATA_BP_DAWR);
 }
[PATCH v4 2/5] Powerpc/Watchpoint: Don't ignore extraneous exceptions blindly
On powerpc, the watchpoint match range is doubleword granular. On a
watchpoint hit, DAR is set to the first byte of overlap between the
actual access and the watched range. Thus it's quite possible that DAR
does not point inside the user-specified range.

For example, say the user creates a watchpoint with address range 0x1004
to 0x1007. So hw would be configured to watch from 0x1000 to 0x1007. If
there is a 4-byte access from 0x1002 to 0x1005, DAR will point to 0x1002
and thus the interrupt handler considers it extraneous, but it's
actually not, because part of the access belongs to what the user has
asked for.

Instead of blindly ignoring the exception, get the actual address range
by analysing the instruction, and ignore the exception only if the
actual range does not overlap with the user-specified range.

Note: The behaviour is unchanged for 8xx.

Signed-off-by: Ravi Bangoria
---
 arch/powerpc/kernel/hw_breakpoint.c | 52 +++++++++++++++++------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index 5a2d8c306c40..c04a345e2cc2 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -179,33 +179,49 @@ void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs)
 	tsk->thread.last_hit_ubp = NULL;
 }
 
-static bool is_larx_stcx_instr(struct pt_regs *regs, unsigned int instr)
+static bool dar_within_range(unsigned long dar, struct arch_hw_breakpoint *info)
 {
-	int ret, type;
-	struct instruction_op op;
+	return ((info->address <= dar) && (dar - info->address < info->len));
+}
 
-	ret = analyse_instr(&op, regs, instr);
-	type = GETTYPE(op.type);
-	return (!ret && (type == LARX || type == STCX));
+static bool
+dar_range_overlaps(unsigned long dar, int size, struct arch_hw_breakpoint *info)
+{
+	return ((dar <= info->address + info->len - 1) &&
+		(dar + size - 1 >= info->address));
 }
 
 /*
  * Handle debug exception notifications.
  */
 static bool stepping_handler(struct pt_regs *regs, struct perf_event *bp,
-			     unsigned long addr)
+			     struct arch_hw_breakpoint *info)
 {
 	unsigned int instr = 0;
+	int ret, type, size;
+	struct instruction_op op;
+	unsigned long addr = info->address;
 
 	if (__get_user_inatomic(instr, (unsigned int *)regs->nip))
 		goto fail;
 
-	if (is_larx_stcx_instr(regs, instr)) {
+	ret = analyse_instr(&op, regs, instr);
+	type = GETTYPE(op.type);
+	size = GETSIZE(op.type);
+
+	if (!ret && (type == LARX || type == STCX)) {
 		printk_ratelimited("Breakpoint hit on instruction that can't be emulated."
 				   " Breakpoint at 0x%lx will be disabled.\n", addr);
 		goto disable;
 	}
 
+	/*
+	 * If it's extraneous event, we still need to emulate/single-
+	 * step the instruction, but we don't generate an event.
+	 */
+	if (size && !dar_range_overlaps(regs->dar, size, info))
+		info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
+
 	/* Do not emulate user-space instructions, instead single-step them */
 	if (user_mode(regs)) {
 		current->thread.last_hit_ubp = bp;
@@ -237,7 +253,6 @@ int hw_breakpoint_handler(struct die_args *args)
 	struct perf_event *bp;
 	struct pt_regs *regs = args->regs;
 	struct arch_hw_breakpoint *info;
-	unsigned long dar = regs->dar;
 
 	/* Disable breakpoints during exception handling */
 	hw_breakpoint_disable();
@@ -269,19 +284,14 @@ int hw_breakpoint_handler(struct die_args *args)
 		goto out;
 	}
 
-	/*
-	 * Verify if dar lies within the address range occupied by the symbol
-	 * being watched to filter extraneous exceptions. If it doesn't,
-	 * we still need to single-step the instruction, but we don't
-	 * generate an event.
-	 */
 	info->type &= ~HW_BRK_TYPE_EXTRANEOUS_IRQ;
-	if (!((bp->attr.bp_addr <= dar) &&
-	      (dar - bp->attr.bp_addr < bp->attr.bp_len)))
-		info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
-
-	if (!IS_ENABLED(CONFIG_PPC_8xx) && !stepping_handler(regs, bp, info->address))
-		goto out;
+	if (IS_ENABLED(CONFIG_PPC_8xx)) {
+		if (!dar_within_range(regs->dar, info))
+			info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
+	} else {
+		if (!stepping_handler(regs, bp, info))
+			goto out;
+	}
 
 	/*
	 * As a policy, the callback is invoked in a 'trigger-after-execute'
-- 
2.21.0
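Plugging the commit-message numbers into the new predicate makes the fix
concrete:

	/* User range: 0x1004..0x1007 (info->address = 0x1004, len = 4).
	 * Access:     4 bytes at DAR = 0x1002, i.e. 0x1002..0x1005.
	 *
	 *   dar <= info->address + info->len - 1  ->  0x1002 <= 0x1007  true
	 *   dar + size - 1 >= info->address       ->  0x1005 >= 0x1004  true
	 *
	 * The ranges overlap, so the event is delivered. The old DAR-only
	 * check (is 0x1004 <= 0x1002? no) would have marked it extraneous.
	 */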
[PATCH v4 1/5] Powerpc/Watchpoint: Fix length calculation for unaligned target
The watchpoint match range is always doubleword (8 bytes) aligned on
powerpc. If the given range crosses a doubleword boundary, we need to
increase the length such that the next doubleword also gets covered.
For example:

          address        len = 6 bytes
                |=========.
      |------------v--|------v--------|
      | | | | | | | | | | | | | | | | |
      |---------------|---------------|
       <---8 bytes--->

In such a case, the current code configures the hw as:
  start_addr = address & ~HW_BREAKPOINT_ALIGN
  len = 8 bytes

and thus a read/write in the last 4 bytes of the given range is ignored.
Fix this by including the next doubleword in the length. Plus, fix the
ptrace code which is messing up address/len.

Signed-off-by: Ravi Bangoria
---
 arch/powerpc/include/asm/debug.h         |  1 +
 arch/powerpc/include/asm/hw_breakpoint.h |  9 +++--
 arch/powerpc/kernel/dawr.c               |  6 ++--
 arch/powerpc/kernel/hw_breakpoint.c      | 24 +++++--------
 arch/powerpc/kernel/process.c            | 46 ++++++++++++++++++++++++
 arch/powerpc/kernel/ptrace.c             | 37 ++++++++++++++-----
 arch/powerpc/xmon/xmon.c                 |  3 +-
 7 files changed, 83 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 7756026b95ca..9c1b4aaa374b 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -45,6 +45,7 @@ static inline int debugger_break_match(struct pt_regs *regs) { return 0; }
 static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
+int hw_breakpoint_validate_len(struct arch_hw_breakpoint *hw);
 void __set_breakpoint(struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
index 67e2da195eae..27ac6f5d2891 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -14,6 +14,7 @@ struct arch_hw_breakpoint {
 	unsigned long	address;
 	u16		type;
 	u16		len;    /* length of the target data symbol */
+	u16		hw_len; /* length programmed in hw */
 };
 
 /* Note: Don't change the the first 6 bits below as they are in the same order
@@ -33,6 +34,11 @@ struct arch_hw_breakpoint {
 #define HW_BRK_TYPE_PRIV_ALL	(HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \
 				 HW_BRK_TYPE_HYP)
 
+#define HW_BREAKPOINT_ALIGN 0x7
+
+#define DABR_MAX_LEN	8
+#define DAWR_MAX_LEN	512
+
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #include
 #include
@@ -44,8 +50,6 @@ struct pmu;
 struct perf_sample_data;
 struct task_struct;
 
-#define HW_BREAKPOINT_ALIGN 0x7
-
 extern int hw_breakpoint_slots(int type);
 extern int arch_bp_generic_fields(int type, int *gen_bp_type);
 extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw);
@@ -70,6 +74,7 @@ static inline void hw_breakpoint_disable(void)
 	brk.address = 0;
 	brk.type = 0;
 	brk.len = 0;
+	brk.hw_len = 0;
 	if (ppc_breakpoint_available())
 		__set_breakpoint(&brk);
 }
diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c
index 5f66b95b6858..8531623aa9b2 100644
--- a/arch/powerpc/kernel/dawr.c
+++ b/arch/powerpc/kernel/dawr.c
@@ -30,10 +30,10 @@ int set_dawr(struct arch_hw_breakpoint *brk)
 	 * DAWR length is stored in field MDR bits 48:53.  Matches range in
 	 * doublewords (64 bits) baised by -1 eg. 0b000000=1DW and
 	 * 0b111111=64DW.
-	 * brk->len is in bytes.
+	 * brk->hw_len is in bytes.
 	 * This aligns up to double word size, shifts and does the bias.
 	 */
-	mrd = ((brk->len + 7) >> 3) - 1;
+	mrd = ((brk->hw_len + 7) >> 3) - 1;
 	dawrx |= (mrd & 0x3f) << (63 - 53);
 
 	if (ppc_md.set_dawr)
@@ -54,7 +54,7 @@ static ssize_t dawr_write_file_bool(struct file *file,
 				    const char __user *user_buf,
 				    size_t count, loff_t *ppos)
 {
-	struct arch_hw_breakpoint null_brk = {0, 0, 0};
+	struct arch_hw_breakpoint null_brk = {0, 0, 0, 0};
 	size_t rc;
 
 	/* Send error to user if they hypervisor won't allow us to write DAWR */
diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index 1007ec36b4cb..5a2d8c306c40 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -133,9 +133,9 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
 			     const struct perf_event_attr *attr,
 			     struct arch_hw_breakpoint *hw)
 {
-	int ret = -EINVAL, length_max;
+	int ret = -EINVAL;
 
-	if (!bp)
+	if (!bp || !attr->bp_len)
 		return ret;
 
 	hw->type = HW_BRK_TYPE_TRANSLATE;
@@ -155,26 +155,10 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
 	hw->address =
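In code form, the fix is an outward rounding of the requested range to
whole doublewords before programming the hardware. A sketch of the
arithmetic (the patch's own hunk is truncated above, so this is
illustrative rather than the literal code):

	/* e.g. addr = 0x1004, len = 6: bytes 0x1004..0x1009 */
	start = addr & ~(unsigned long)HW_BREAKPOINT_ALIGN;	/* 0x1000 */
	end = (addr + len + HW_BREAKPOINT_ALIGN) &
	      ~(unsigned long)HW_BREAKPOINT_ALIGN;		/* 0x1010 */
	hw_len = end - start;	/* 16: both doublewords are watched */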
[PATCH v4 0/5] Powerpc/Watchpoint: Few important fixes
v3: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-July/193339.html

v3->v4:
 - Instead of considering the exception as extraneous when the DAR is
   outside of the user-specified range, analyse the instruction and
   check for overlap between the user-specified range and the actual
   load/store range.
 - Add a selftest for the same in perf-hwbreak.c
 - Make the ptrace-hwbreak.c selftest more strict by checking the
   address in check_success.
 - Support for 8xx in the ptrace-hwbreak.c selftest (build-tested only)
 - Rebase to powerpc/next

@Christophe, can you please check patch 5? I've just build-tested it
with ep88xc_defconfig.

Ravi Bangoria (5):
  Powerpc/Watchpoint: Fix length calculation for unaligned target
  Powerpc/Watchpoint: Don't ignore extraneous exceptions blindly
  Powerpc/Watchpoint: Rewrite ptrace-hwbreak.c selftest
  Powerpc/Watchpoint: Add dar outside test in perf-hwbreak.c selftest
  Powerpc/Watchpoint: Support for 8xx in ptrace-hwbreak.c selftest

 arch/powerpc/include/asm/debug.h              |   1 +
 arch/powerpc/include/asm/hw_breakpoint.h      |   9 +-
 arch/powerpc/kernel/dawr.c                    |   6 +-
 arch/powerpc/kernel/hw_breakpoint.c           |  76 ++-
 arch/powerpc/kernel/process.c                 |  46 ++
 arch/powerpc/kernel/ptrace.c                  |  37 +-
 arch/powerpc/xmon/xmon.c                      |   3 +-
 .../selftests/powerpc/ptrace/perf-hwbreak.c   | 111 +++-
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 579 ++++++++++++------
 9 files changed, 595 insertions(+), 273 deletions(-)

-- 
2.21.0
RE: [v3,3/3] Documentation: dt: binding: fsl: Add 'fsl,ippdexpcr-alt-addr' property
> > > > > > > > > > > > > > > > > > The 'fsl,ippdexpcr-alt-addr' property is used to handle an > > > > > > > > errata > > > > > > > > A-008646 on LS1021A > > > > > > > > > > > > > > > > Signed-off-by: Biwen Li > > > > > > > > --- > > > > > > > > Change in v3: > > > > > > > > - rename property name > > > > > > > > fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr > > > > > > > > > > > > > > > > Change in v2: > > > > > > > > - update desc of the property 'fsl,rcpm-scfg' > > > > > > > > > > > > > > > > Documentation/devicetree/bindings/soc/fsl/rcpm.txt | 14 > > > > > > > > ++ > > > > > > > > 1 file changed, 14 insertions(+) > > > > > > > > > > > > > > > > diff --git > > > > > > > > a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > index 5a33619d881d..157dcf6da17c 100644 > > > > > > > > --- a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > +++ b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > @@ -34,6 +34,11 @@ Chassis Version Example > > Chips > > > > > > > > Optional properties: > > > > > > > > - little-endian : RCPM register block is Little Endian. > > > > > > > > Without it > > > RCPM > > > > > > > > will be Big Endian (default case). > > > > > > > > + - fsl,ippdexpcr-alt-addr : Must add the property for SoC > > > > > > > > + LS1021A, > > > > > > > > > > > > > > You probably should mention this is related to a hardware > > > > > > > issue on LS1021a and only needed on LS1021a. > > > > > > Okay, got it, thanks, I will add this in v4. > > > > > > > > > > > > > > > + Must include n + 1 entries (n = #fsl,rcpm-wakeup-cells, such > as: > > > > > > > > + #fsl,rcpm-wakeup-cells equal to 2, then must include 2 > > > > > > > > + + > > > > > > > > + 1 > > > entries). > > > > > > > > > > > > > > #fsl,rcpm-wakeup-cells is the number of IPPDEXPCR registers > > > > > > > on an > > > SoC. > > > > > > > However you are defining an offset to scfg registers here. > > > > > > > Why these two are related? The length here should actually > > > > > > > be related to the #address-cells of the soc/. But since > > > > > > > this is only needed for LS1021, you can > > > > > > just make it 3. > > > > > > I need set the value of IPPDEXPCR resgiters from ftm_alarm0 > > > > > > device node(fsl,rcpm-wakeup = < 0x0 0x2000>; > > > > > > 0x0 is a value for IPPDEXPCR0, 0x2000 is a value for > > > IPPDEXPCR1). > > > > > > But because of the hardware issue on LS1021A, I need store the > > > > > > value of IPPDEXPCR registers to an alt address. So I defining > > > > > > an offset to scfg registers, then RCPM driver get an abosolute > > > > > > address from offset, RCPM driver write the value of IPPDEXPCR > > > > > > registers to these abosolute addresses(backup the value of > > > > > > IPPDEXPCR > > > registers). > > > > > > > > > > I understand what you are trying to do. The problem is that the > > > > > new fsl,ippdexpcr-alt-addr property contains a phandle and an offset. > > > > > The size of it shouldn't be related to #fsl,rcpm-wakeup-cells. > > > > You maybe like this: fsl,ippdexpcr-alt-addr = < 0x51c>;/* > > > > SCFG_SPARECR8 */ > > > > > > No. The #address-cell for the soc/ is 2, so the offset to scfg > > > should be 0x0 0x51c. The total size should be 3, but it shouldn't > > > be coming from #fsl,rcpm-wakeup-cells like you mentioned in the binding. > > Oh, I got it. You want that fsl,ippdexpcr-alt-add is relative with > > #address-cells instead of #fsl,rcpm-wakeup-cells. > > Yes. 
I got an example from drivers/pci/controller/dwc/pci-layerscape.c and arch/arm/boot/dts/ls1021a.dtsi as follows: fsl,pcie-scfg = < 0>, 0 is an index In my fsl,ippdexpcr-alt-addr = < 0x0 0x51c>, It means that 0x0 is an alt offset address for IPPDEXPCR0, 0x51c is an alt offset address For IPPDEXPCR1 instead of 0x0 and 0x51c compose to an alt address of SCFG_SPARECR8. > > Regards, > Leo > > > > > > > > > > > > > > > > > > > > > > > + The first entry must be a link to the SCFG device node. > > > > > > > > + The non-first entry must be offset of registers of SCFG. > > > > > > > > > > > > > > > > Example: > > > > > > > > The RCPM node for T4240: > > > > > > > > @@ -43,6 +48,15 @@ The RCPM node for T4240: > > > > > > > > #fsl,rcpm-wakeup-cells = <2>; > > > > > > > > }; > > > > > > > > > > > > > > > > +The RCPM node for LS1021A: > > > > > > > > + rcpm: rcpm@1ee2140 { > > > > > > > > + compatible = "fsl,ls1021a-rcpm", > > > > > > > > "fsl,qoriq-rcpm- > > > > 2.1+"; > > > > > > > > + reg = <0x0 0x1ee2140 0x0 0x8>; > > > > > > > > + #fsl,rcpm-wakeup-cells = <2>; > > > > > > > > + fsl,ippdexpcr-alt-addr = < 0x0 0x51c>; /* > > > > > > > > SCFG_SPARECR8 */ > > > > > > > > + }; > > > > > > > > + > > > > > > > > + > > > > > > > > * Freescale RCPM Wakeup Source Device Tree Bindings >
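As agreed in the thread, the property ends up as one phandle cell plus a two-cell offset (since #address-cells of soc/ is 2), e.g. fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x51c>;. Below is a hypothetical sketch, not taken from the actual patch set, of how an RCPM driver could resolve such a property; the function and variable names are made up:

	#include <linux/of.h>
	#include <linux/of_address.h>

	static void __iomem *rcpm_map_ippdexpcr_alt(struct device_node *rcpm_np)
	{
		struct of_phandle_args args;
		void __iomem *scfg_base;
		u64 offset;

		/* One phandle followed by two offset cells. */
		if (of_parse_phandle_with_fixed_args(rcpm_np,
						     "fsl,ippdexpcr-alt-addr",
						     2, 0, &args))
			return NULL;

		offset = ((u64)args.args[0] << 32) | args.args[1];

		/* Map the SCFG block and point at the backup register
		 * (SCFG_SPARECR8 in the example above). */
		scfg_base = of_iomap(args.np, 0);
		of_node_put(args.np);
		return scfg_base ? scfg_base + offset : NULL;
	}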
Re: [PATCH V3 0/2] mm/debug: Add tests for architecture exported page table helpers
On 09/24/2019 06:01 PM, Mike Rapoport wrote: > On Tue, Sep 24, 2019 at 02:51:01PM +0300, Kirill A. Shutemov wrote: >> On Fri, Sep 20, 2019 at 12:03:21PM +0530, Anshuman Khandual wrote: >>> This series adds a test validation for architecture exported page table >>> helpers. Patch in the series adds basic transformation tests at various >>> levels of the page table. Before that it exports gigantic page allocation >>> function from HugeTLB. >>> >>> This test was originally suggested by Catalin during arm64 THP migration >>> RFC discussion earlier. Going forward it can include more specific tests >>> with respect to various generic MM functions like THP, HugeTLB etc and >>> platform specific tests. >>> >>> https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/ >>> >>> Testing: >>> >>> Successfully build and boot tested on both arm64 and x86 platforms without >>> any test failing. Only build tested on some other platforms. Build failed >>> on some platforms (known) in pud_clear_tests() as there were no available >>> __pgd() definitions. >>> >>> - ARM32 >>> - IA64 >> >> Hm. Grep shows __pgd() definitions for both of them. Is it for specific >> config? > > For ARM32 it's defined only for 3-level page tables, i.e. with LPAE on. > For IA64 it's defined for !STRICT_MM_TYPECHECKS which is even not a config > option, but a define in arch/ia64/include/asm/page.h Right. So where do we go from here? We will need help from platform folks to fix this unless it's trivial. I did propose this on the last thread (v2): would it be a better idea to restrict DEBUG_ARCH_PGTABLE_TEST to architectures which have fixed all pending issues, whether build or run time? Though enabling all platforms where the test at least builds might make more sense, we might have to just exclude arm32 and ia64 for now. Then run-time problems can be fixed later, platform by platform. Any thoughts? BTW, the test is known to run successfully on arm64, x86 and ppc32 platforms. Gerald has been trying to get it working on s390. In the meantime, if there are other volunteers to test this on ppc64, sparc, riscv, mips, m68k etc. platforms, it would be really helpful. - Anshuman
Re: [PATCH V4 4/4] ASoC: fsl_asrc: Fix error with S24_3LE format bitstream in i.MX8
Hi > On Tue, Sep 24, 2019 at 06:52:35PM +0800, Shengjiu Wang wrote: > > There is error "aplay: pcm_write:2023: write error: Input/output error" > > on i.MX8QM/i.MX8QXP platform for S24_3LE format. > > > > In i.MX8QM/i.MX8QXP, the DMA is EDMA, which don't support 24bit > > sample, but we didn't add any constraint, that cause issues. > > > > So we need to query the caps of dma, then update the hw parameters > > according to the caps. > > > > Signed-off-by: Shengjiu Wang > > --- > > sound/soc/fsl/fsl_asrc.c | 4 +-- > > sound/soc/fsl/fsl_asrc.h | 3 ++ > > sound/soc/fsl/fsl_asrc_dma.c | 59 > > +++- > > 3 files changed, 56 insertions(+), 10 deletions(-) > > > > @@ -270,12 +268,17 @@ static int fsl_asrc_dma_hw_free(struct > > snd_pcm_substream *substream) > > > > static int fsl_asrc_dma_startup(struct snd_pcm_substream *substream) > > { > > + bool tx = substream->stream == SNDRV_PCM_STREAM_PLAYBACK; > > struct snd_soc_pcm_runtime *rtd = substream->private_data; > > struct snd_pcm_runtime *runtime = substream->runtime; > > struct snd_soc_component *component = > snd_soc_rtdcom_lookup(rtd, > > DRV_NAME); > > + struct snd_dmaengine_dai_dma_data *dma_data; > > struct device *dev = component->dev; > > struct fsl_asrc *asrc_priv = dev_get_drvdata(dev); > > struct fsl_asrc_pair *pair; > > + struct dma_chan *tmp_chan = NULL; > > + u8 dir = tx ? OUT : IN; > > + int ret = 0; > > > > pair = kzalloc(sizeof(struct fsl_asrc_pair), GFP_KERNEL); > > Sorry, I didn't catch it previously. We would need to release this memory > also for all error-out paths, as the code doesn't have any error-out routine, > prior to applying this change. > > > if (!pair) > > @@ -285,11 +288,51 @@ static int fsl_asrc_dma_startup(struct > > snd_pcm_substream *substream) > > > + /* Request a dummy pair, which will be released later. > > + * Request pair function needs channel num as input, for this > > + * dummy pair, we just request "1" channel temporary. > > + */ > > "temporary" => "temporarily" > > > + ret = fsl_asrc_request_pair(1, pair); > > + if (ret < 0) { > > + dev_err(dev, "failed to request asrc pair\n"); > > + return ret; > > + } > > + > > + /* Request a dummy dma channel, which will be release later. */ > > "release" => "released" Ok, will update them. Best regards Wang shengjiu
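The leak Nicolin points out above can be seen in a condensed sketch of the startup hook (shape and surrounding locals such as dev taken from the quoted patch; the label name is illustrative): once the pair is kzalloc'ed, every error return needs to free it:

	static int fsl_asrc_dma_startup(struct snd_pcm_substream *substream)
	{
		struct fsl_asrc_pair *pair;
		int ret;

		pair = kzalloc(sizeof(*pair), GFP_KERNEL);
		if (!pair)
			return -ENOMEM;

		ret = fsl_asrc_request_pair(1, pair);
		if (ret < 0) {
			dev_err(dev, "failed to request asrc pair\n");
			goto err_free_pair;	/* don't leak the pair */
		}
		/* ... request the dummy DMA channel, query its caps, etc.,
		 * with each failure path jumping to a label that unwinds ... */
		return 0;

	err_free_pair:
		kfree(pair);
		return ret;
	}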
Re: [PATCH 00/11] opencapi: enable card reset and link retraining
On 24/09/2019 16:39, Frederic Barrat wrote: > > > Le 24/09/2019 à 06:24, Alexey Kardashevskiy a écrit : >> Hi Fred, >> >> what is this made against of? It does not apply on the master. Thanks, > > It applies on v5.3. And I can see there's a conflict with the current > state in the merge window. I'll resubmit. Not really necessary, just mention the exact sha1 or tag in the cover letter next time. Thanks, > > Fred > > > >> On 10/09/2019 01:45, Frederic Barrat wrote: >>> This is the linux part of the work to use the PCI hotplug framework to >>> control an opencapi card so that it can be reset and re-read after >>> flashing a new FPGA image. I had posted it earlier as an RFC and this >>> version is mostly similar, with just some minor editing. >>> >>> It needs support in skiboot: >>> http://patchwork.ozlabs.org/project/skiboot/list/?series=129724 >>> On an old skiboot, it will do nothing. >>> >>> A virtual PCI slot is created for the opencapi adapter, and its state >>> can be controlled through the pnv-php hotplug driver: >>> >>> echo 0|1 > /sys/bus/pci/slots/OPENCAPI-<...>/power >>> >>> Note that the power to the card is not really turned off, as the card >>> needs to stay on to be flashed with a new image. Instead the card is >>> in reset. >>> >>> The first part of the series mostly deals with the pci/ioda state, as >>> the opencapi devices can now go away and the state needs to be cleaned >>> up. >>> >>> The second part is modifications to the PCI hotplug driver on powernv, >>> so that a virtual slot is created for the opencapi adapters found in >>> the device tree. >>> >>> >>> >>> Frederic Barrat (11): >>> powerpc/powernv/ioda: Fix ref count for devices with their own PE >>> powerpc/powernv/ioda: Protect PE list >>> powerpc/powernv/ioda: set up PE on opencapi device when enabling >>> powerpc/powernv/ioda: Release opencapi device >>> powerpc/powernv/ioda: Find opencapi slot for a device node >>> pci/hotplug/pnv-php: Remove erroneous warning >>> pci/hotplug/pnv-php: Improve error msg on power state change failure >>> pci/hotplug/pnv-php: Register opencapi slots >>> pci/hotplug/pnv-php: Relax check when disabling slot >>> pci/hotplug/pnv-php: Wrap warnings in macro >>> ocxl: Add PCI hotplug dependency to Kconfig >>> >>> arch/powerpc/include/asm/pnv-pci.h | 1 + >>> arch/powerpc/platforms/powernv/pci-ioda.c | 107 ++ >>> arch/powerpc/platforms/powernv/pci.c | 10 +- >>> drivers/misc/ocxl/Kconfig | 1 + >>> drivers/pci/hotplug/pnv_php.c | 82 ++--- >>> 5 files changed, 125 insertions(+), 76 deletions(-) >>> >> > -- Alexey
Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
On 9/24/19 2:24 PM, Leonardo Bras wrote: > If a process (qemu) with a lot of CPUs (128) try to munmap() a large > chunk of memory (496GB) mapped with THP, it takes an average of 275 > seconds, which can cause a lot of problems to the load (in qemu case, > the guest will lock for this time). > > Trying to find the source of this bug, I found out most of this time is > spent on serialize_against_pte_lookup(). This function will take a lot > of time in smp_call_function_many() if there is more than a couple CPUs > running the user process. Since it has to happen to all THP mapped, it > will take a very long time for large amounts of memory. > > By the docs, serialize_against_pte_lookup() is needed in order to avoid > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless > pagetable walk, to happen concurrently with THP splitting/collapsing. > > It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], > after interrupts are re-enabled. > Since, interrupts are (usually) disabled during lockless pagetable > walk, and serialize_against_pte_lookup will only return after > interrupts are enabled, it is protected. > > So, by what I could understand, if there is no lockless pagetable walk > running, there is no need to call serialize_against_pte_lookup(). > > So, to avoid the cost of running serialize_against_pte_lookup(), I > propose a counter that keeps track of how many find_current_mm_pte() > are currently running, and if there is none, just skip > smp_call_function_many(). > > The related functions are: > start_lockless_pgtbl_walk(mm) > Insert before starting any lockless pgtable walk > end_lockless_pgtbl_walk(mm) > Insert after the end of any lockless pgtable walk > (Mostly after the ptep is last used) > running_lockless_pgtbl_walk(mm) > Returns the number of lockless pgtable walks running > > On my workload (qemu), I could see munmap's time reduction from 275 > seconds to 418ms. > > Changes since v2: > Rebased to v5.3 > Adds support on __get_user_pages_fast > Adds usage decription to *_lockless_pgtbl_walk() > Better style to dummy functions > Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839 Hi Leonardo, Thanks for adding linux-mm to CC for this next round of reviews. For the benefit of any new reviewers, I'd like to add that there are some issues that were discovered while reviewing the v2 patchset, that are not (yet) addressed in this v3 series. Since those issues are not listed in the cover letter above, I'll list them here: 1. The locking model requires a combination of disabling interrupts and atomic counting and memory barriers, but a) some memory barriers are missing (start/end_lockless_pgtbl_walk), and b) some cases (patch #8) fail to disable interrupts ...so the synchronization appears to be inadequate. (And if it *is* adequate, then definitely we need the next item, to explain it.) 2. Documentation of the synchronization/locking model needs to exist, once we figure out the exact details of (1). 3. Related to (1), I've asked to change things so that interrupt controls and atomic inc/dec are in the same start/end calls--assuming, of course, that the caller can tolerate that. 4. Please see the v2 series for any other details I've missed. 
thanks, -- John Hubbard NVIDIA > > Changes since v1: > Isolated atomic operations in functions *_lockless_pgtbl_walk() > Fixed behavior of decrementing before last ptep was used > Link: http://patchwork.ozlabs.org/patch/1163093/ > > Leonardo Bras (11): > powerpc/mm: Adds counting method to monitor lockless pgtable walks > asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable > walks > mm/gup: Applies counting method to monitor gup_pgd_range > powerpc/mce_power: Applies counting method to monitor lockless pgtbl > walks > powerpc/perf: Applies counting method to monitor lockless pgtbl walks > powerpc/mm/book3s64/hash: Applies counting method to monitor lockless > pgtbl walks > powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl > walks > powerpc/kvm/book3s_hv: Applies counting method to monitor lockless > pgtbl walks > powerpc/kvm/book3s_64: Applies counting method to monitor lockless > pgtbl walks > powerpc/book3s_64: Enables counting method to monitor lockless pgtbl > walk > powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing > > arch/powerpc/include/asm/book3s/64/mmu.h | 3 ++ > arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++ > arch/powerpc/kernel/mce_power.c | 13 +-- > arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 + > arch/powerpc/kvm/book3s_64_mmu_radix.c | 20 +- > arch/powerpc/kvm/book3s_64_vio_hv.c | 4 ++ > arch/powerpc/kvm/book3s_hv_nested.c | 8 > arch/powerpc/kvm/book3s_hv_rm_mmu.c | 9 - > arch/powerpc/kvm/e500_mmu_host.c | 4 ++ >
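For reviewers new to the thread, here is the gist of the counting scheme under discussion, as a sketch rather than the actual patches. The helper names come from the cover letter; where the counter lives and the exact barrier placement are assumptions here, and are precisely what items (1)-(3) above are about:

	/* Reader side: brackets an irq-disabled lockless walk. */
	static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		atomic_inc(&mm->lockless_pgtbl_walk_count);
		/* Pairs with the smp_mb() on the writer side, so the writer
		 * observes the incremented count before we dereference any
		 * page table entries. */
		smp_mb__after_atomic();
	}

	static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		smp_mb__before_atomic();
		atomic_dec(&mm->lockless_pgtbl_walk_count);
	}

	/* Writer side (THP split/collapse), cf. patch 11/11 below: skip the
	 * expensive IPIs when no walker is registered. */
	void serialize_against_pte_lookup(struct mm_struct *mm)
	{
		smp_mb();
		if (running_lockless_pgtbl_walk(mm))	/* atomic_read() */
			smp_call_function_many(mm_cpumask(mm),
					       do_nothing, NULL, 1);
	}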
Re: [PATCH V4 3/4] ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_refine_runtime_hwparams
On Tue, Sep 24, 2019 at 06:52:34PM +0800, Shengjiu Wang wrote: > When setting the runtime hardware parameters, we may need to query > the capability of DMA to complete the parameters. > > This patch extracts this operation from the > dmaengine_pcm_set_runtime_hwparams function into a separate function, > snd_dmaengine_pcm_refine_runtime_hwparams, so that other components > which need this feature can call it. > > Signed-off-by: Shengjiu Wang Looks good to me. Reviewed-by: Nicolin Chen
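As a usage sketch, a component needing the refined parameters would call the new helper from its startup hook. The signature below is inferred from this patch and worth double-checking against the final version:

	struct snd_dmaengine_dai_dma_data *dma_data;
	struct snd_pcm_hardware hw = { 0 };
	struct dma_chan *chan;
	int ret;

	/* ... fill dma_data from the DAI and request chan ... */
	ret = snd_dmaengine_pcm_refine_runtime_hwparams(substream, dma_data,
							&hw, chan);
	if (ret < 0)
		return ret;

	/* Apply the DMA-constrained hardware description to the runtime. */
	snd_soc_set_runtime_hwparams(substream, &hw);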
[PATCH v3 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
Skip the slow part of serialize_against_pte_lookup() if there is no lockless pagetable walk running. Signed-off-by: Leonardo Bras --- arch/powerpc/mm/book3s64/pgtable.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 0b86884a8097..e2aa6572f03f 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -95,7 +95,8 @@ static void do_nothing(void *unused) void serialize_against_pte_lookup(struct mm_struct *mm) { smp_mb(); - smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1); + if (running_lockless_pgtbl_walk(mm)) + smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1); } /* -- 2.20.1
[PATCH v3 09/11] powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all book3s_64-related functions that do lockless pagetable walks. Signed-off-by: Leonardo Bras --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++ arch/powerpc/kvm/book3s_64_mmu_radix.c | 20 ++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 4 3 files changed, 24 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 9a75f0e1933b..fcd3dad1297f 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, * We need to protect against page table destruction * hugepage split and collapse. */ + start_lockless_pgtbl_walk(kvm->mm); local_irq_save(flags); ptep = find_current_mm_pte(current->mm->pgd, hva, NULL, NULL); @@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, write_ok = 1; } local_irq_restore(flags); + end_lockless_pgtbl_walk(kvm->mm); } } diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 2d415c36a61d..d46f8258d8d6 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -741,6 +741,7 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing, unsigned long pgflags; unsigned int shift; pte_t *ptep; + bool ret = false; /* * Need to set an R or C bit in the 2nd-level tables; @@ -755,12 +756,14 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing, * We can do this without disabling irq because the Linux MM * subsystem doesn't do THP splits and collapses on this tree. */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep) && (!writing || pte_write(*ptep))) { kvmppc_radix_update_pte(kvm, ptep, 0, pgflags, gpa, shift); - return true; + ret = true; } - return false; + end_lockless_pgtbl_walk(kvm->mm); + return ret; } int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, @@ -813,6 +816,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, * Read the PTE from the process' radix tree and use that * so we get the shift and attribute bits.
*/ + start_lockless_pgtbl_walk(kvm->mm); local_irq_disable(); ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift); /* @@ -821,12 +825,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, */ if (!ptep) { local_irq_enable(); + end_lockless_pgtbl_walk(kvm->mm); if (page) put_page(page); return RESUME_GUEST; } pte = *ptep; local_irq_enable(); + end_lockless_pgtbl_walk(kvm->mm); /* If we're logging dirty pages, always map single pages */ large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES); @@ -972,10 +978,12 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, unsigned long gpa = gfn << PAGE_SHIFT; unsigned int shift; + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep)) kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot, kvm->arch.lpid); + end_lockless_pgtbl_walk(kvm->mm); return 0; } @@ -989,6 +997,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, int ref = 0; unsigned long old, *rmapp; + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep) && pte_young(*ptep)) { old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0, @@ -1001,6 +1010,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, 1UL << shift); ref = 1; } + end_lockless_pgtbl_walk(kvm->mm); return ref; } @@ -1013,9 +1023,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, unsigned int shift; int ref = 0; + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep) &&
[PATCH v3 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk
Enables the count-based monitoring method for lockless pagetable walks on PowerPC book3s_64. Other architectures/platforms fall back to the generic dummy functions. Signed-off-by: Leonardo Bras --- arch/powerpc/include/asm/book3s/64/pgtable.h | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8308f32e9782..eb9b26a4a483 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1370,5 +1370,10 @@ static inline bool pgd_is_leaf(pgd_t pgd) return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE)); } +#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER +void start_lockless_pgtbl_walk(struct mm_struct *mm); +void end_lockless_pgtbl_walk(struct mm_struct *mm); +int running_lockless_pgtbl_walk(struct mm_struct *mm); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */ -- 2.20.1
[PATCH v3 07/11] powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on kvmppc_e500_shadow_map(). Signed-off-by: Leonardo Bras --- arch/powerpc/kvm/e500_mmu_host.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 321db0fdb9db..a1b8bfe20bc8 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -473,6 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, * We are holding kvm->mmu_lock so a notifier invalidate * can't run hence pfn won't change. */ + start_lockless_pgtbl_walk(kvm->mm); local_irq_save(flags); ptep = find_linux_pte(pgdir, hva, NULL, NULL); if (ptep) { @@ -484,12 +485,15 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, local_irq_restore(flags); } else { local_irq_restore(flags); + end_lockless_pgtbl_walk(kvm->mm); pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n", __func__, (long)gfn, pfn); ret = -EINVAL; goto out; } } + end_lockless_pgtbl_walk(kvm->mm); + kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg); kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize, -- 2.20.1
[PATCH v3 08/11] powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all book3s_hv related functions that do lockless pagetable walks. Signed-off-by: Leonardo Bras --- arch/powerpc/kvm/book3s_hv_nested.c | 8 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 9 - 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c index 735e0ac6f5b2..ed68e57af3a3 100644 --- a/arch/powerpc/kvm/book3s_hv_nested.c +++ b/arch/powerpc/kvm/book3s_hv_nested.c @@ -804,6 +804,7 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap, return; /* Find the pte */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift); /* * If the pte is present and the pfn is still the same, update the pte. @@ -815,6 +816,7 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap, __radix_pte_update(ptep, clr, set); kvmppc_radix_tlbie_page(kvm, gpa, shift, lpid); } + end_lockless_pgtbl_walk(kvm->mm); } /* @@ -854,10 +856,12 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap, return; /* Find and invalidate the pte */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift); /* Don't spuriously invalidate ptes if the pfn has changed */ if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa)) kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid); + end_lockless_pgtbl_walk(kvm->mm); } static void kvmhv_remove_nest_rmap_list(struct kvm *kvm, unsigned long *rmapp, @@ -921,6 +925,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu, int shift; spin_lock(&kvm->mmu_lock); + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift); if (!shift) shift = PAGE_SHIFT; @@ -928,6 +933,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu, kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid); ret = true; } + end_lockless_pgtbl_walk(kvm->mm); spin_unlock(&kvm->mmu_lock); if (shift_ret) @@ -1362,11 +1368,13 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run, /* See if can find translation in our partition scoped tables for L1 */ pte = __pte(0); spin_lock(&kvm->mmu_lock); + start_lockless_pgtbl_walk(kvm->mm); pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (!shift) shift = PAGE_SHIFT; if (pte_p) pte = *pte_p; + end_lockless_pgtbl_walk(kvm->mm); spin_unlock(&kvm->mmu_lock); if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) { diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 63e0ce91e29d..53ca67492211 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -258,6 +258,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, * If called in real mode we have MSR_EE = 0. Otherwise * we disable irq above.
*/ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(pgdir, hva, NULL, &hpage_shift); if (ptep) { pte_t pte; @@ -311,6 +312,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G); ptel |= HPTE_R_M; } + end_lockless_pgtbl_walk(kvm->mm); /* Find and lock the HPTEG slot to use */ do_insert: @@ -886,10 +888,15 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa, hva = __gfn_to_hva_memslot(memslot, gfn); /* Try to find the host pte for that virtual address */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift); - if (!ptep) + if (!ptep) { + end_lockless_pgtbl_walk(kvm->mm); return H_TOO_HARD; + } pte = kvmppc_read_update_linux_pte(ptep, writing); + end_lockless_pgtbl_walk(kvm->mm); + if (!pte_present(pte)) return H_TOO_HARD; -- 2.20.1
[PATCH v3 05/11] powerpc/perf: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on read_user_stack_slow. Signed-off-by: Leonardo Bras --- arch/powerpc/perf/callchain.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c index c84bbd4298a0..9d76194a2a8f 100644 --- a/arch/powerpc/perf/callchain.c +++ b/arch/powerpc/perf/callchain.c @@ -113,16 +113,18 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb) int ret = -EFAULT; pgd_t *pgdir; pte_t *ptep, pte; + struct mm_struct *mm = current->mm; unsigned shift; unsigned long addr = (unsigned long) ptr; unsigned long offset; unsigned long pfn, flags; void *kaddr; - pgdir = current->mm->pgd; + pgdir = mm->pgd; if (!pgdir) return -EFAULT; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); ptep = find_current_mm_pte(pgdir, addr, NULL, &shift); if (!ptep) @@ -146,6 +148,7 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb) ret = 0; err_out: local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); return ret; } -- 2.20.1
[PATCH v3 06/11] powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all hash-related functions that do lockless pagetable walks. Signed-off-by: Leonardo Bras --- arch/powerpc/mm/book3s64/hash_tlb.c | 2 ++ arch/powerpc/mm/book3s64/hash_utils.c | 7 +++ 2 files changed, 9 insertions(+) diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c index 4a70d8dd39cd..5e5213c3f7c4 100644 --- a/arch/powerpc/mm/book3s64/hash_tlb.c +++ b/arch/powerpc/mm/book3s64/hash_tlb.c @@ -209,6 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start, * to being hashed). This is not the most performance oriented * way to do things but is fine for our needs here. */ + start_lockless_pgtbl_walk(mm); local_irq_save(flags); arch_enter_lazy_mmu_mode(); for (; start < end; start += PAGE_SIZE) { @@ -230,6 +231,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start, } arch_leave_lazy_mmu_mode(); local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); } void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index b8ad14bb1170..299946cedc3a 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -1322,6 +1322,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea, #endif /* CONFIG_PPC_64K_PAGES */ /* Get PTE and page size from page tables */ + start_lockless_pgtbl_walk(mm); ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift); if (ptep == NULL || !pte_present(*ptep)) { DBG_LOW(" no PTE !\n"); @@ -1438,6 +1439,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea, DBG_LOW(" -> rc=%d\n", rc); bail: + end_lockless_pgtbl_walk(mm); exception_exit(prev_state); return rc; } @@ -1547,10 +1549,12 @@ void hash_preload(struct mm_struct *mm, unsigned long ea, vsid = get_user_vsid(&mm->context, ea, ssize); if (!vsid) return; + /* * Hash doesn't like irqs. Walking linux page table with irq disabled * saves us from holding multiple locks. */ + start_lockless_pgtbl_walk(mm); local_irq_save(flags); /* @@ -1597,6 +1601,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea, pte_val(*ptep)); out_exit: local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); } #ifdef CONFIG_PPC_MEM_KEYS @@ -1613,11 +1618,13 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address) if (!mm || !mm->pgd) return 0; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); ptep = find_linux_pte(mm->pgd, address, NULL, NULL); if (ptep) pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep))); local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); return pkey; } -- 2.20.1
[PATCH v3 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
As described, gup_pgd_range is a lockless pagetable walk. So, in order to monitor against THP split/collapse with the counting method, it's necessary to bound it with {start,end}_lockless_pgtbl_walk. Dummy functions are provided for the other case, so this adds no overhead on archs that don't use this method. Signed-off-by: Leonardo Bras --- mm/gup.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index 98f13ab37bac..eabd6fd15cf8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end) int __get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages) { + struct mm_struct *mm; unsigned long len, end; unsigned long flags; int nr = 0; @@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { + mm = current->mm; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr); local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); } return nr; @@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages) { unsigned long addr, len, end; + struct mm_struct *mm; int nr = 0, ret = 0; if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM))) @@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages, if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { + mm = current->mm; + start_lockless_pgtbl_walk(mm); local_irq_disable(); gup_pgd_range(addr, end, gup_flags, pages, &nr); local_irq_enable(); + end_lockless_pgtbl_walk(mm); ret = nr; } -- 2.20.1
[PATCH v3 02/11] asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks
There is a need to monitor lockless pagetable walks, in order to avoid doing THP splitting/collapsing during them. Some methods rely on local_irq_{save,restore}, but that can be slow in cases where a lot of cpus are used by the process. In order to speed up these cases, I propose a refcount-based approach that counts the number of lockless pagetable walks happening on the process. Given that there are lockless pagetable walks in generic code, it's necessary to create dummy functions for archs that won't use the approach. Signed-off-by: Leonardo Bras --- include/asm-generic/pgtable.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 75d9d68a6de7..0831475e72d3 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -1172,6 +1172,21 @@ static inline bool arch_has_pfn_modify_check(void) #endif #endif +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER +static inline void start_lockless_pgtbl_walk(struct mm_struct *mm) +{ +} + +static inline void end_lockless_pgtbl_walk(struct mm_struct *mm) +{ +} + +static inline int running_lockless_pgtbl_walk(struct mm_struct *mm) +{ + return 0; +} +#endif + /* * On some architectures it depends on the mm if the p4d/pud or pmd * layer of the page table hierarchy is folded or not. -- 2.20.1
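An arch opts in to the real counting by defining the guard macro in its own pgtable header, so these generic dummies drop out, and by providing the three functions itself. A minimal sketch of that opt-in follows (illustrative only; the actual powerpc wiring is done by patch 10/11, which is not quoted in this excerpt):

/* arch/<arch>/include/asm/pgtable.h -- sketch of the opt-in, not a real diff */
#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
void start_lockless_pgtbl_walk(struct mm_struct *mm);
void end_lockless_pgtbl_walk(struct mm_struct *mm);
int running_lockless_pgtbl_walk(struct mm_struct *mm);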
[PATCH v3 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on addr_to_pfn(). Signed-off-by: Leonardo Bras --- arch/powerpc/kernel/mce_power.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c index a814d2dfb5b0..0f2f87da4cd1 100644 --- a/arch/powerpc/kernel/mce_power.c +++ b/arch/powerpc/kernel/mce_power.c @@ -27,6 +27,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr) { pte_t *ptep; unsigned long flags; + unsigned long pfn; struct mm_struct *mm; if (user_mode(regs)) @@ -34,15 +35,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr) else mm = &init_mm; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); if (mm == current->mm) ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL); else ptep = find_init_mm_pte(addr, NULL); - local_irq_restore(flags); + if (!ptep || pte_special(*ptep)) - return ULONG_MAX; - return pte_pfn(*ptep); + pfn = ULONG_MAX; + else + pfn = pte_pfn(*ptep); + + local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); + return pfn; } /* flush SLBs and reload */ -- 2.20.1
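Note what the reordering above buys: the old code read *ptep after local_irq_restore(), i.e. after the walker had stopped blocking a concurrent THP split/collapse, while the new code computes pfn while still protected. Condensed into a kernel-style sketch of the hunk above (the pte_special() check is elided for brevity):

	start_lockless_pgtbl_walk(mm);
	local_irq_save(flags);
	ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
	pfn = ptep ? pte_pfn(*ptep) : ULONG_MAX;	/* last *ptep use */
	local_irq_restore(flags);			/* only after that */
	end_lockless_pgtbl_walk(mm);
	return pfn;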
[PATCH v3 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
It's necessary to monitor lockless pagetable walks, in order to avoid doing THP splitting/collapsing during them. Some methods rely on local_irq_{save,restore}, but that can be slow in cases where a lot of cpus are used by the process.

In order to speed up some cases, I propose a refcount-based approach that counts the number of lockless pagetable walks happening on the process. This method does not exclude the current irq-oriented method; it works as a complement to skip unnecessary waiting.

start_lockless_pgtbl_walk(mm): insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm): insert after the end of any lockless pgtable walk (mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm): returns the number of lockless pgtable walks running

Signed-off-by: Leonardo Bras --- arch/powerpc/include/asm/book3s/64/mmu.h | 3 ++ arch/powerpc/mm/book3s64/mmu_context.c | 1 + arch/powerpc/mm/book3s64/pgtable.c | 37 3 files changed, 41 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index 23b83d3593e2..13b006e7dde4 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -116,6 +116,9 @@ typedef struct { /* Number of users of the external (Nest) MMU */ atomic_t copros; + /* Number of running instances of lockless pagetable walk */ + atomic_t lockless_pgtbl_walk_count; + struct hash_mm_context *hash_context; unsigned long vdso_base; diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c index 2d0cb5ba9a47..3dd01c0ca5be 100644 --- a/arch/powerpc/mm/book3s64/mmu_context.c +++ b/arch/powerpc/mm/book3s64/mmu_context.c @@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) #endif atomic_set(&mm->context.active_cpus, 0); atomic_set(&mm->context.copros, 0); + atomic_set(&mm->context.lockless_pgtbl_walk_count, 0); return 0; } diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 7d0e0d0d22c4..0b86884a8097 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -98,6 +98,43 @@ void serialize_against_pte_lookup(struct mm_struct *mm) smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1); } +/* + * Counting method to monitor lockless pagetable walks: + * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the + * number of lockless pgtable walks happening, and + * running_lockless_pgtbl_walk to return this value. + */ + +/* start_lockless_pgtbl_walk: Must be inserted before a function call that does + * lockless pagetable walks, such as __find_linux_pte() + */ +void start_lockless_pgtbl_walk(struct mm_struct *mm) +{ + atomic_inc(&mm->context.lockless_pgtbl_walk_count); +} +EXPORT_SYMBOL(start_lockless_pgtbl_walk); + +/* + * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer + * returned by a lockless pagetable walk, such as __find_linux_pte() + */ +void end_lockless_pgtbl_walk(struct mm_struct *mm) +{ + atomic_dec(&mm->context.lockless_pgtbl_walk_count); +} +EXPORT_SYMBOL(end_lockless_pgtbl_walk); + +/* + * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks + * currently running. If it returns 0, there is no running pagetable walk, and + * THP split/collapse can be safely done. This can be used to avoid more + * expensive approaches like serialize_against_pte_lookup() + */ +int running_lockless_pgtbl_walk(struct mm_struct *mm) +{ + return atomic_read(&mm->context.lockless_pgtbl_walk_count); +} + /* * We use this to invalidate a pmdp entry before switching from a * hugepte to regular pmd entry. -- 2.20.1
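The protocol is small enough to model in isolation. Below is a self-contained userspace sketch of the same idea, using C11 atomics in place of the kernel's atomic_t (the names and the single-threaded "walk" are illustrative only, not kernel code):

#include <stdatomic.h>
#include <stdio.h>

static atomic_int walk_count;	/* models mm->context.lockless_pgtbl_walk_count */

static void start_walk(void)    { atomic_fetch_add(&walk_count, 1); }
static void end_walk(void)      { atomic_fetch_sub(&walk_count, 1); }
static int  running_walks(void) { return atomic_load(&walk_count); }

int main(void)
{
	start_walk();
	/* ...the lockless pagetable walk would happen here, irqs off... */
	printf("walks running: %d\n", running_walks());
	end_walk();

	/* split/collapse side: skip expensive serialization when idle */
	if (!running_walks())
		printf("no walkers: serialization could be skipped\n");
	return 0;
}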
[PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large chunk of memory (496GB) mapped with THP, it takes an average of 275 seconds, which can cause a lot of problems for the workload (in the qemu case, the guest will lock up for this time).

Trying to find the source of this bug, I found out that most of this time is spent on serialize_against_pte_lookup(). This function will take a lot of time in smp_call_function_many() if there is more than a couple of CPUs running the user process, and since it has to happen for every THP mapped, it will take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to prevent pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless pagetable walk, from happening concurrently with THP splitting/collapsing. It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], which completes only after interrupts are enabled on each of them. Since interrupts are (usually) disabled during a lockless pagetable walk, and serialize_against_pte_lookup() will only return after interrupts are enabled, the walk is protected.

So, by what I could understand, if there is no lockless pagetable walk running, there is no need to call serialize_against_pte_lookup(). To avoid that cost, I propose a counter that keeps track of how many find_current_mm_pte() are currently running, and if there are none, just skip smp_call_function_many().

The related functions are:
start_lockless_pgtbl_walk(mm): insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm): insert after the end of any lockless pgtable walk (mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm): returns the number of lockless pgtable walks running

On my workload (qemu), I could see munmap's time drop from 275 seconds to 418ms.
Changes since v2:
- Rebased on top of v5.3
- Adds support on __get_user_pages_fast
- Adds usage description to *_lockless_pgtbl_walk()
- Better style for the dummy functions
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
- Isolated atomic operations in functions *_lockless_pgtbl_walk()
- Fixed behavior of decrementing before last ptep was used
Link: http://patchwork.ozlabs.org/patch/1163093/

Leonardo Bras (11):
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
  powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/mmu.h     | 3 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++
 arch/powerpc/kernel/mce_power.c              | 13 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c          | 2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c          | 4 ++
 arch/powerpc/kvm/book3s_hv_nested.c          | 8
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          | 9 -
 arch/powerpc/kvm/e500_mmu_host.c             | 4 ++
 arch/powerpc/mm/book3s64/hash_tlb.c          | 2 +
 arch/powerpc/mm/book3s64/hash_utils.c        | 7
 arch/powerpc/mm/book3s64/mmu_context.c       | 1 +
 arch/powerpc/mm/book3s64/pgtable.c           | 40 +++-
 arch/powerpc/perf/callchain.c                | 5 ++-
 include/asm-generic/pgtable.h                | 15
 mm/gup.c                                     | 8
 16 files changed, 138 insertions(+), 8 deletions(-)
-- 2.20.1
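Patch 11/11, which consumes the counter, is not quoted in this excerpt. From the description above, its effect on serialize_against_pte_lookup() would be an early exit of roughly this shape (a guess at the form, not the actual diff; whether the counter read needs an extra memory barrier is exactly what the v2 thread below is still discussing):

void serialize_against_pte_lookup(struct mm_struct *mm)
{
	smp_mb();
	if (!running_lockless_pgtbl_walk(mm))
		return;	/* no lockless walker: skip the IPI broadcast */
	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}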
Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
John Hubbard writes:
>> Is that what you meant?
>
> Yes.

I am still trying to understand this issue. I am also analyzing some cases where interrupt disable is not done before the lockless pagetable walk (patch 3 discussion). But given that I forgot to add the mm mailing list before, I think it would be wiser to send a v3 and gather feedback while I keep trying to understand how it works, and whether it needs an additional memory barrier here.

Thanks!
Leonardo Bras
Re: [PATCH v2 00/11] Introduces new count-based method for monitoring lockless pagetable walks
John Hubbard writes:
> Also, which tree do these patches apply to, please?

I will send a v3 that applies directly over v5.3, and make sure to include the mm mailing list.

Thanks!
Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
On Tue, 24 Sep 2019 15:33:28 +1000 Paul Mackerras wrote: > On Mon, Sep 23, 2019 at 05:43:48PM +0200, Greg Kurz wrote: > > We currently prevent userspace to connect a new vCPU if we already have > > one with the same vCPU id. This is good but unfortunately not enough, > > because VP ids derive from the packed vCPU ids, and kvmppc_pack_vcpu_id() > > can return colliding values. For examples, 348 stays unchanged since it > > is < KVM_MAX_VCPUS, but it is also the packed value of 2392 when the > > guest's core stride is 8. Nothing currently prevents userspace to connect > > vCPUs with forged ids, that end up being associated to the same VP. This > > confuses the irq layer and likely crashes the kernel: > > > > [96631.670454] genirq: Flags mismatch irq 4161. 0001 (kvm-1-2392) vs. > > 0001 (kvm-1-348) > > Have you seen a host kernel crash? Yes I have. [29191.162740] genirq: Flags mismatch irq 199. 0001 (kvm-2-2392) vs. 0001 (kvm-2-348) [29191.162849] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38 [29191.162966] Call Trace: [29191.163002] [c03f7f9937e0] [c0c0110c] dump_stack+0xb0/0xf4 (unreliable) [29191.163090] [c03f7f993820] [c01cb480] __setup_irq+0xa70/0xad0 [29191.163180] [c03f7f9938d0] [c01cb75c] request_threaded_irq+0x13c/0x260 [29191.163290] [c03f7f993940] [c0080d44e7ac] kvmppc_xive_attach_escalation+0x104/0x270 [kvm] [29191.163396] [c03f7f9939d0] [c0080d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm] [29191.163504] [c03f7f993ac0] [c0080d28] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm] [29191.163616] [c03f7f993b90] [c0080d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm] [29191.163695] [c03f7f993d00] [c04840f0] do_vfs_ioctl+0xe0/0xc30 [29191.163806] [c03f7f993db0] [c0484d44] ksys_ioctl+0x104/0x120 [29191.163889] [c03f7f993e00] [c0484d88] sys_ioctl+0x28/0x80 [29191.163962] [c03f7f993e20] [c000b278] system_call+0x5c/0x68 [29191.164035] xive-kvm: Failed to request escalation interrupt for queue 0 of VCPU 2392 [29191.164152] [ cut here ] [29191.164229] remove_proc_entry: removing non-empty directory 'irq/199', leaking at least 'kvm-2-348' [29191.164343] WARNING: CPU: 24 PID: 88176 at /home/greg/Work/linux/kernel-kvm-ppc/fs/proc/generic.c:684 remove_proc_entry+0x1ec/0x200 [29191.164501] Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm] [29191.165450] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38 [29191.165568] NIP: c053b0cc LR: c053b0c8 CTR: c00ba3b0 [29191.165644] REGS: c03f7f9934b0 TRAP: 0700 Not tainted (5.3.0-xive-nr-servers-5.3-gku+) [29191.165741] MSR: 90029033 CR: 48228222 XER: 2004 [29191.165939] CFAR: c0131a50 IRQMASK: 0 [29191.165939] GPR00: c053b0c8 c03f7f993740 c15ec500 0057 [29191.165939] GPR04: 0001 49fb98484262 
1bcf [29191.165939] GPR08: 0007 0007 0001 90001033 [29191.165939] GPR12: 8000 c03eb800 00012f4ce5a1 [29191.165939] GPR16: 00012ef5a0c8 00012f113bb0 [29191.165939] GPR20: 00012f45d918 c03f863758b0 c03f86375870 0006 [29191.165939] GPR24: c03f86375a30 0007 c0002039373d9020 c14c4a48 [29191.165939] GPR28: 0001 c03fe62a4f6b c00020394b2e9fab c03fe62a4ec0 [29191.166755] NIP [c053b0cc] remove_proc_entry+0x1ec/0x200 [29191.166803] LR [c053b0c8] remove_proc_entry+0x1e8/0x200 [29191.166874] Call Trace: [29191.166908] [c03f7f993740] [c053b0c8] remove_proc_entry+0x1e8/0x200 (unreliable) [29191.167022] [c03f7f9937e0] [c01d3654] unregister_irq_proc+0x114/0x150 [29191.167106] [c03f7f993880] [c01c6284] free_desc+0x54/0xb0 [29191.167175] [c03f7f9938c0] [c01c65ec] irq_free_descs+0xac/0x100 [29191.167256] [c03f7f993910] [c01d1ff8]
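The collision itself is pure arithmetic and easy to check outside the kernel. The standalone model below is patterned on kvmppc_pack_vcpu_id(); the constants and the block-offset table are assumptions for illustration, not a verbatim copy of the KVM code:

#include <stdio.h>

#define KVM_MAX_VCPUS	2048
#define MAX_SMT_THREADS	8

/* pack a (possibly sparse) vCPU id into the dense VP id space */
static unsigned int pack_vcpu_id(unsigned int id, int stride)
{
	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

	return (id % KVM_MAX_VCPUS) + block_offsets[block];
}

int main(void)
{
	/* guest core stride 8, as in the commit message example */
	printf("348  -> %u\n", pack_vcpu_id(348, 8));	/* prints 348 */
	printf("2392 -> %u\n", pack_vcpu_id(2392, 8));	/* also 348 */
	return 0;
}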
Re: [PATCH] powerpc/book3s64: Export has_transparent_hugepage() related functions.
On Mon, Sep 23, 2019 at 9:25 PM Aneesh Kumar K.V wrote:
>
> In a later patch, we want to use hash_transparent_hugepage() in a kernel
> module. Export two related functions.

Looks good, thanks.
[PATCH AUTOSEL 4.4 13/14] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 9cc976ff7fecc..88fcf6a95fa67 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -369,6 +369,9 @@ static void pseries_lpar_idle(void) * low power mode by cedeing processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.4 12/14] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index a44f1755dc4bf..536718ed033fc 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1465,6 +1465,10 @@ machine_check_handle_early: RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.4 10/14] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their driver's handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 9837c98caabe9..045038469295d 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -675,6 +675,10 @@ static bool eeh_handle_normal_event(struct eeh_pe *pe) pr_warn("EEH: This PCI device has failed %d times in the last hour\n", pe->freeze_count); + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -840,7 +844,8 @@ static bool eeh_handle_normal_event(struct eeh_pe *pe) static void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -919,6 +924,10 @@ static void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS); bus = eeh_pe_bus_get(phb_pe); -- 2.20.1
[PATCH AUTOSEL 4.4 08/14] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There's a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index c773396d0969b..8d30a425a88ab 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -206,7 +207,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -282,8 +287,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 4.4 07/14] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index f4c7467f74655..b73ab8a7ebc3f 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -60,8 +60,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
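The pattern GCC trips over reduces to a small standalone file; whether the warning actually fires depends on GCC version and inlining, so the following is only an illustration of the shape, not a guaranteed reproducer:

#include <stdio.h>

static int op_inuser(int fail, int *oval)
{
	int oldval = 42;
	int ret = fail ? -14 : 0;

	if (!ret)		/* old form: store only on success */
		*oval = oldval;	/* the fix stores unconditionally */
	return ret;
}

int main(void)
{
	int oldval;	/* uninitialized, like 'oldval' in do_futex() */
	int ret = op_inuser(0, &oldval);

	/* GCC cannot always prove oldval is read only when ret == 0 */
	if (!ret)
		printf("oldval == %d\n", oldval);
	return ret;
}

Making the store unconditional is safe because, as the changelog notes, *oval is a local variable in the caller and the error path ignores it anyway.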
[PATCH AUTOSEL 4.4 06/14] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 5a753fae8265a..0c42e872d548b 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -857,15 +857,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -954,6 +956,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -985,6 +989,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 4.9 16/19] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index adb09ab87f7c0..30782859d8980 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -298,6 +298,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.9 15/19] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 92474227262b4..0c8b966e80702 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -467,6 +467,10 @@ EXC_COMMON_BEGIN(machine_check_handle_early) RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.9 11/19] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There's a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 3784a7abfcc80..74791e8382d22 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -206,7 +207,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -282,8 +287,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 4.9 10/19] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index f4c7467f74655..b73ab8a7ebc3f 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -60,8 +60,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
[PATCH AUTOSEL 4.9 09/19] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 6a3e5de544ce2..a309a7a29cc60 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -874,15 +874,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -971,6 +973,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1002,6 +1006,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 4.14 23/28] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 43cde6c602795..cdc53fd905977 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -464,6 +464,10 @@ EXC_COMMON_BEGIN(machine_check_handle_early) RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.14 24/28] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 6a0ad56e89b93..7a9945b350536 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -307,6 +307,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.14 18/28] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There's a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 4addc552eb33d..9739a055e5f7b 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -208,7 +209,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -317,8 +322,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 4.14 17/28] powerpc/64s/radix: Remove redundant pfn_pte bitop, add VM_BUG_ON
From: Nicholas Piggin [ Upstream commit 6bb25170d7a44ef0ed9677814600f0785e7421d1 ] pfn_pte is never given a pte above the addressable physical memory limit, so the masking is redundant. In case of a software bug, it is not obviously better to silently truncate the pfn than to corrupt the pte (either one will result in memory corruption or crashes), so there is no reason to add this to the fast path. Add VM_BUG_ON to catch cases where the pfn is invalid. These would catch the create_section_mapping bug fixed by a previous commit. [16885.256466] [ cut here ] [16885.256492] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! cpu 0x0: Vector: 700 (Program Check) at [c000ee0a36d0] pc: c0080738: __map_kernel_page+0x248/0x6f0 lr: c0080ac0: __map_kernel_page+0x5d0/0x6f0 sp: c000ee0a3960 msr: 90029033 current = 0xc000ec63b400 paca= 0xc17f irqmask: 0x03 irq_happened: 0x01 pid = 85, comm = sh kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Linux version 5.3.0-rc1-1-g0fe93e5f3394 enter ? for help [c000ee0a3a00] c0d37378 create_physical_mapping+0x260/0x360 [c000ee0a3b10] c0d370bc create_section_mapping+0x1c/0x3c [c000ee0a3b30] c0071f54 arch_add_memory+0x74/0x130 Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-5-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 4dd13b503dbbd..5f37d8c3b4798 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -555,8 +555,10 @@ static inline int pte_present(pte_t pte) */ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) { - return __pte((((pte_basic_t)(pfn) << PAGE_SHIFT) & PTE_RPN_MASK) | - pgprot_val(pgprot)); + VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); + VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); + + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); } static inline unsigned long pte_pfn(pte_t pte) -- 2.20.1
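The two VM_BUG_ON conditions are plain bit arithmetic and can be sanity-checked in isolation. In this standalone sketch, PAGE_SHIFT and PTE_RPN_MASK are illustrative stand-ins (64K pages, a 52-bit real-address field), not the kernel's definitions:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	16
#define PTE_RPN_MASK	((((uint64_t)1 << 52) - 1) & ~(((uint64_t)1 << PAGE_SHIFT) - 1))

static int pfn_fits_pte(uint64_t pfn)
{
	if (pfn >> (64 - PAGE_SHIFT))
		return 0;	/* shifting would overflow 64 bits */
	if ((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK)
		return 0;	/* falls outside the pte's RPN field */
	return 1;
}

int main(void)
{
	printf("pfn 0x1000 : %s\n", pfn_fits_pte(0x1000) ? "ok" : "would VM_BUG_ON");
	printf("pfn 1<<50  : %s\n", pfn_fits_pte((uint64_t)1 << 50) ? "ok" : "would VM_BUG_ON");
	return 0;
}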
[PATCH AUTOSEL 4.14 16/28] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index 1a944c18c5390..3c7d859452294 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -59,8 +59,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
[PATCH AUTOSEL 4.14 15/28] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 1643e9e536557..141d192c69538 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -874,15 +874,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -971,6 +973,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1002,6 +1006,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 4.14 14/28] powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL
From: Cédric Le Goater [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ] Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in the OPAL logs and also outputs some of the fields of the internal XIVE structures in Linux. The OPAL calls can only be done on baremetal (PowerNV) and they crash a pseries machine. Fix by checking the hypervisor feature of the CPU. Signed-off-by: Cédric Le Goater Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190814154754.23682-2-...@kaod.org Signed-off-by: Sasha Levin --- arch/powerpc/xmon/xmon.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 6b9038a3e79f0..5a739588aa505 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2438,13 +2438,16 @@ static void dump_pacas(void) static void dump_one_xive(int cpu) { unsigned int hwid = get_hard_smp_processor_id(cpu); + bool hv = cpu_has_feature(CPU_FTR_HVMODE); - opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); - opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); - opal_xive_dump(XIVE_DUMP_TM_OS, hwid); - opal_xive_dump(XIVE_DUMP_TM_USER, hwid); - opal_xive_dump(XIVE_DUMP_VP, hwid); - opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + if (hv) { + opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); + opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); + opal_xive_dump(XIVE_DUMP_TM_OS, hwid); + opal_xive_dump(XIVE_DUMP_TM_USER, hwid); + opal_xive_dump(XIVE_DUMP_VP, hwid); + opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + } if (setjmp(bus_error_jmp) != 0) { catch_memory_errors = 0; -- 2.20.1
[PATCH AUTOSEL 4.19 45/50] powerpc: dump kernel log before carrying out fadump or kdump
From: Ganesh Goudar [ Upstream commit e7ca44ed3ba77fc26cf32650bb71584896662474 ] Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path"), pstore dmesg file is not updated when dump is triggered from HMC. This commit modified system reset (sreset) handler to invoke fadump or kdump (if configured), without pushing dmesg to pstore. This leaves pstore to have old dmesg data which won't be much of a help if kdump fails to capture the dump. This patch fixes that by calling kmsg_dump() before heading to fadump or kdump. Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path") Reviewed-by: Mahesh Salgaonkar Reviewed-by: Nicholas Piggin Signed-off-by: Ganesh Goudar Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190904075949.15607-1-ganes...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/traps.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 02fe6d0201741..d5f351f02c153 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -399,6 +399,7 @@ void system_reset_exception(struct pt_regs *regs) if (debugger(regs)) goto out; + kmsg_dump(KMSG_DUMP_OOPS); /* * A system reset is a request to dump, so we always send * it through the crashdump code (if fadump or kdump are -- 2.20.1
[PATCH AUTOSEL 4.19 41/50] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index ba1791fd3234d..67f49159ea708 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -325,6 +325,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.19 39/50] powerpc/imc: Dont create debugfs files for cpu-less nodes
From: Madhavan Srinivasan [ Upstream commit 41ba17f20ea835c489e77bd54e2da73184e22060 ] Commit <684d984038aa> ('powerpc/powernv: Add debugfs interface for imc-mode and imc') added debugfs interface for the nest imc pmu devices to support changing of different ucode modes. Primarily adding this capability for debug. But when doing so, the code did not consider the case of cpu-less nodes. So when reading the _cmd_ or _mode_ file of a cpu-less node will create this crash. Faulting instruction address: 0xc00d0d58 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19 NIP: c00d0d58 LR: c049aa18 CTR:c00d0d50 REGS: c00020194548f9e0 TRAP: 0300 Not tainted (5.2.0-rc6-next-20190627+) MSR: 90009033 CR:28022822 XER: CFAR: c049aa14 DAR: 0003fc08 DSISR:4000 IRQMASK: 0 ... NIP imc_mem_get+0x8/0x20 LR simple_attr_read+0x118/0x170 Call Trace: simple_attr_read+0x70/0x170 (unreliable) debugfs_attr_read+0x6c/0xb0 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 Patch fixes the issue with a more robust check for vbase to NULL. Before patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0imc_cmd_251 imc_cmd_253 imc_cmd_255 imc_mode_0 imc_mode_251 imc_mode_253 imc_mode_255 imc_cmd_250 imc_cmd_252 imc_cmd_254 imc_cmd_8imc_mode_250 imc_mode_252 imc_mode_254 imc_mode_8 After patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_8 imc_mode_0 imc_mode_8 Actual bug here is that, we have two loops with potentially different loop counts. That is, in imc_get_mem_addr_nest(), loop count is obtained from the dt entries. But in case of export_imc_mode_and_cmd(), loop was based on for_each_nid() count. Patch fixes the loop count in latter based on the struct mem_info. Ideally it would be better to have array size in struct imc_pmu. 
Fixes: 684d984038aa ('powerpc/powernv: Add debugfs interface for imc-mode and imc') Reported-by: Qian Cai Suggested-by: Michael Ellerman Signed-off-by: Madhavan Srinivasan Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190827101635.6942-1-ma...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/powernv/opal-imc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 828f6656f8f74..649fb268f4461 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -57,9 +57,9 @@ static void export_imc_mode_and_cmd(struct device_node *node, struct imc_pmu *pmu_ptr) { static u64 loc, *imc_mode_addr, *imc_cmd_addr; - int chip = 0, nid; char mode[16], cmd[16]; u32 cb_offset; + struct imc_mem_info *ptr = pmu_ptr->mem_info; imc_debugfs_parent = debugfs_create_dir("imc", powerpc_debugfs_root); @@ -73,20 +73,20 @@ static void export_imc_mode_and_cmd(struct device_node *node, if (of_property_read_u32(node, "cb_offset", &cb_offset)) cb_offset = IMC_CNTL_BLK_OFFSET; - for_each_node(nid) { - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; + while (ptr->vbase != NULL) { + loc = (u64)(ptr->vbase) + cb_offset; imc_mode_addr = (u64 *)(loc + IMC_CNTL_BLK_MODE_OFFSET); - sprintf(mode, "imc_mode_%d", nid); + sprintf(mode, "imc_mode_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(mode, 0600, imc_debugfs_parent, imc_mode_addr)) goto err; imc_cmd_addr = (u64 *)(loc + IMC_CNTL_BLK_CMD_OFFSET); - sprintf(cmd, "imc_cmd_%d", nid); + sprintf(cmd, "imc_cmd_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(cmd, 0600, imc_debugfs_parent, imc_cmd_addr)) goto err; - chip++; + ptr++; } return; -- 2.20.1
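The shape of the fix above, walking a NULL-terminated array supplied by the device tree instead of iterating every possible node id, can be modelled in plain C. The sketch below is illustrative only; the struct and function names are stand-ins for the kernel's imc types, not the real API:

#include <stdio.h>

/* Hypothetical model of struct imc_mem_info: one entry per populated
 * node, terminated by an entry whose vbase is NULL. */
struct mem_info_model {
    unsigned int id;    /* node id the counter region belongs to */
    void *vbase;        /* NULL marks the end of the array */
};

static void export_files(const struct mem_info_model *ptr)
{
    /* Visit exactly the entries the device tree provided, instead of
     * for_each_node(), which also visits cpu-less nodes that have no
     * counter memory at all. */
    while (ptr->vbase != NULL) {
        printf("imc_mode_%u -> %p\n", ptr->id, ptr->vbase);
        printf("imc_cmd_%u  -> %p\n", ptr->id, ptr->vbase);
        ptr++;
    }
}

int main(void)
{
    int buf0, buf8;
    /* Two populated nodes (0 and 8), NULL-terminated like mem_info. */
    struct mem_info_model info[] = {
        { 0, &buf0 }, { 8, &buf8 }, { 0, NULL },
    };

    export_files(info);
    return 0;
}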
[PATCH AUTOSEL 4.19 37/50] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 06cc77813dbb7..90af86f143a91 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -520,6 +520,10 @@ EXC_COMMON_BEGIN(machine_check_handle_early) RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.19 29/50] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their driver's handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 67619b4b3f96c..110eba400de7c 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -811,6 +811,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe) pr_warn("EEH: This PCI device has failed %d times in the last hour and will be permanently disabled after %d failures.\n", pe->freeze_count, eeh_max_freezes); + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -1004,7 +1008,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe) */ void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -1075,6 +1080,10 @@ void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS); eeh_set_channel_state(pe, pci_channel_io_perm_failure); -- 2.20.1
[PATCH AUTOSEL 4.19 27/50] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There are a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 7b60fcf04dc47..e4ea713833832 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -12,6 +12,7 @@ #include #include #include +#include <linux/sched.h> #include #include #include @@ -209,7 +210,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -318,8 +323,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
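The placement of the yields is the whole fix: one at the bottom of each platform-bounded loop level. A minimal userspace model follows, with cond_resched() stubbed out; in the kernel it comes from <linux/sched.h>, which is why the hunk above adds that include. The loop bounds here are made up:

#include <stdio.h>

/* Stub: in the kernel this yields the cpu if a reschedule is due. */
static void cond_resched(void) { }

int main(void)
{
    int outer = 4, inner = 1000;    /* platform-chosen, possibly huge */

    for (int i = 0; i < outer; i++) {
        for (int j = 0; j < inner; j++) {
            /* ... process one property/node update ... */
            cond_resched();    /* bound the time between yields */
        }
        cond_resched();        /* also yield once per outer pass */
    }
    printf("done\n");
    return 0;
}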
[PATCH AUTOSEL 4.19 26/50] powerpc/64s/radix: Remove redundant pfn_pte bitop, add VM_BUG_ON
From: Nicholas Piggin [ Upstream commit 6bb25170d7a44ef0ed9677814600f0785e7421d1 ] pfn_pte is never given a pte above the addressable physical memory limit, so the masking is redundant. In case of a software bug, it is not obviously better to silently truncate the pfn than to corrupt the pte (either one will result in memory corruption or crashes), so there is no reason to add this to the fast path. Add VM_BUG_ON to catch cases where the pfn is invalid. These would catch the create_section_mapping bug fixed by a previous commit. [16885.256466] [ cut here ] [16885.256492] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! cpu 0x0: Vector: 700 (Program Check) at [c000ee0a36d0] pc: c0080738: __map_kernel_page+0x248/0x6f0 lr: c0080ac0: __map_kernel_page+0x5d0/0x6f0 sp: c000ee0a3960 msr: 90029033 current = 0xc000ec63b400 paca= 0xc17f irqmask: 0x03 irq_happened: 0x01 pid = 85, comm = sh kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Linux version 5.3.0-rc1-1-g0fe93e5f3394 enter ? for help [c000ee0a3a00] c0d37378 create_physical_mapping+0x260/0x360 [c000ee0a3b10] c0d370bc create_section_mapping+0x1c/0x3c [c000ee0a3b30] c0071f54 arch_add_memory+0x74/0x130 Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-5-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 855dbae6d351d..a717640b7eda4 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -629,8 +629,10 @@ static inline bool pte_access_permitted(pte_t pte, bool write) */ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) { - return __pte((((pte_basic_t)(pfn) << PAGE_SHIFT) & PTE_RPN_MASK) | - pgprot_val(pgprot)); + VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); + VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); + + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); } static inline unsigned long pte_pfn(pte_t pte) -- 2.20.1
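The behavioural difference is easy to demonstrate outside the kernel. In this model the constants are illustrative stand-ins (64K pages, an RPN field in bits 16-51), not the real book3s/64 layout:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT   16
#define PTE_RPN_MASK 0x000fffffffff0000ULL

static uint64_t pfn_pte_old(uint64_t pfn, uint64_t prot)
{
    /* Old behaviour: out-of-range pfn bits are silently masked off,
     * producing a pte that points at the *wrong* physical page. */
    return ((pfn << PAGE_SHIFT) & PTE_RPN_MASK) | prot;
}

static uint64_t pfn_pte_new(uint64_t pfn, uint64_t prot)
{
    /* New behaviour: a bogus pfn trips an assertion instead. */
    assert(!(pfn >> (64 - PAGE_SHIFT)));
    assert(!((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK));
    return (pfn << PAGE_SHIFT) | prot;
}

int main(void)
{
    uint64_t bad_pfn = 1ULL << 40;    /* beyond the modeled RPN field */

    printf("old: %#llx (pfn silently truncated to zero)\n",
           (unsigned long long)pfn_pte_old(bad_pfn, 0x3));
    printf("new: the same call would abort via assert\n");
    (void)pfn_pte_new;    /* calling it with bad_pfn aborts */
    return 0;
}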
[PATCH AUTOSEL 4.19 25/50] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index 94542776a62d6..2a7b01f97a56b 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -59,8 +59,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
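The warning is reproducible in a few lines of standalone C: GCC cannot prove the conditional store always happens before the read, even though the caller only reads on success. The fix mirrors the patch, storing unconditionally, since the value is harmless on the error path. Names here are illustrative, not the kernel's:

#include <stdio.h>

static int op_conditional(int *oval, int fail)
{
    int ret = fail ? -1 : 0;
    int oldval = 42;

    if (!ret)          /* old code: store only on success */
        *oval = oldval;
    return ret;
}

static int op_unconditional(int *oval, int fail)
{
    int ret = fail ? -1 : 0;
    int oldval = 42;

    *oval = oldval;    /* fixed code: always store; the caller
                        * ignores the value on the error path */
    return ret;
}

int main(void)
{
    int oldval;

    /* The caller's pattern that trips -Wmaybe-uninitialized with the
     * conditional variant: read oldval only when ret is 0. */
    if (!op_unconditional(&oldval, 0))
        printf("oldval = %d\n", oldval);
    (void)op_conditional;
    return 0;
}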
[PATCH AUTOSEL 4.19 24/50] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu 7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 8afd146bc9c70..9e41a9de43235 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -875,15 +875,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -972,6 +974,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1003,6 +1007,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
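The four-step interleaving above can be reduced to a toy model: cpu_up()/cpu_down() touch only the low-level hotplug state, while the driver-core calls keep both that and dev->offline in sync. This is a deliberately simplified sketch, not the kernel's actual data structures:

#include <stdbool.h>
#include <stdio.h>

struct cpu_dev {
    bool cpuhp_online;    /* low-level hotplug state */
    bool dev_offline;     /* what the driver core believes */
};

static void cpu_up(struct cpu_dev *c)   { c->cpuhp_online = true; }
static void cpu_down(struct cpu_dev *c) { c->cpuhp_online = false; }

/* The driver-core entry points update both pieces of state. */
static void device_online(struct cpu_dev *c)
{
    cpu_up(c);
    c->dev_offline = false;
}

static void device_offline(struct cpu_dev *c)
{
    cpu_down(c);
    c->dev_offline = true;
}

int main(void)
{
    struct cpu_dev cpu7 = { .cpuhp_online = false, .dev_offline = true };

    cpu_up(&cpu7);          /* 2. migration onlines it directly   */
    device_online(&cpu7);   /* 3. sysfs online "succeeds"         */
    cpu_down(&cpu7);        /* 4. migration restores it directly  */

    /* The driver core now thinks the cpu device is online while the
     * hotplug state says offline, so further sysfs online requests
     * are silently ignored. */
    printf("cpuhp_online=%d dev_offline=%d\n",
           cpu7.cpuhp_online, cpu7.dev_offline);
    (void)device_offline;
    return 0;
}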
[PATCH AUTOSEL 4.19 23/50] powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL
From: Cédric Le Goater [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ] Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in the OPAL logs and also outputs some of the fields of the internal XIVE structures in Linux. The OPAL calls can only be done on baremetal (PowerNV) and they crash a pseries machine. Fix by checking the hypervisor feature of the CPU. Signed-off-by: Cédric Le Goater Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190814154754.23682-2-...@kaod.org Signed-off-by: Sasha Levin --- arch/powerpc/xmon/xmon.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 74cfc1be04d6e..bb5db7bfd8539 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2497,13 +2497,16 @@ static void dump_pacas(void) static void dump_one_xive(int cpu) { unsigned int hwid = get_hard_smp_processor_id(cpu); + bool hv = cpu_has_feature(CPU_FTR_HVMODE); - opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); - opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); - opal_xive_dump(XIVE_DUMP_TM_OS, hwid); - opal_xive_dump(XIVE_DUMP_TM_USER, hwid); - opal_xive_dump(XIVE_DUMP_VP, hwid); - opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + if (hv) { + opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); + opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); + opal_xive_dump(XIVE_DUMP_TM_OS, hwid); + opal_xive_dump(XIVE_DUMP_TM_USER, hwid); + opal_xive_dump(XIVE_DUMP_VP, hwid); + opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + } if (setjmp(bus_error_jmp) != 0) { catch_memory_errors = 0; -- 2.20.1
[PATCH AUTOSEL 4.19 17/50] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window
From: Alexey Kardashevskiy [ Upstream commit c37c792dec0929dbb6360a609fb00fa20bb16fc2 ] We allocate only the first level of multilevel TCE tables for KVM already (alloc_userspace_copy==true), and the rest is allocated on demand. This is not enabled though for bare metal. This removes the KVM limitation (implicit, via the alloc_userspace_copy parameter) and always allocates just the first level. The on-demand allocation of missing levels is already implemented. As from now on DMA map might happen with disabled interrupts, this allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1]. In practice just a single page is allocated there so chances for failure are quite low. To save time when creating a new clean table, this skips non-allocated indirect TCE entries in pnv_tce_free just like we already do in the VFIO IOMMU TCE driver. This changes the default level number from 1 to 2 to reduce the amount of memory required for the default 32bit DMA window at the boot time. The default window size is up to 2GB which requires 4MB of TCEs which is unlikely to be used entirely or at all as most devices these days are 64bit capable so by switching to 2 levels by default we save 4032KB of RAM per a device. While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace can trigger this path via VFIO, see the failure and try creating a table again with different parameters which might succeed. [1]: === BUG: sleeping function called from invalid context at mm/page_alloc.c:4596 in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1 2 locks held by scsi_eh_1/1038: #0: 5efd659a (>eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80 #1: 06cf56a6 (&(>lock)->rlock){}, at: ata_exec_internal_sg+0xb0/0x5c0 irq event stamp: 500 hardirqs last enabled at (499): [] _raw_spin_unlock_irqrestore+0x94/0xd0 hardirqs last disabled at (500): [] _raw_spin_lock_irqsave+0x44/0x120 softirqs last enabled at (0): [] copy_process.isra.4.part.5+0x640/0x1a80 softirqs last disabled at (0): [<>] 0x0 CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Call Trace: [c03d064cef50] [c0c8e6c4] dump_stack+0xe8/0x164 (unreliable) [c03d064cefa0] [c014ed78] ___might_sleep+0x2f8/0x310 [c03d064cf020] [c03ca084] __alloc_pages_nodemask+0x2a4/0x1560 [c03d064cf220] [c00c2530] pnv_alloc_tce_level.isra.0+0x90/0x130 [c03d064cf290] [c00c2888] pnv_tce+0x128/0x3b0 [c03d064cf360] [c00c2c00] pnv_tce_build+0xb0/0xf0 [c03d064cf3c0] [c00bbd9c] pnv_ioda2_tce_build+0x3c/0xb0 [c03d064cf400] [c004cfe0] ppc_iommu_map_sg+0x210/0x550 [c03d064cf510] [c004b7a4] dma_iommu_map_sg+0x74/0xb0 [c03d064cf530] [c0863944] ata_qc_issue+0x134/0x470 [c03d064cf5b0] [c0863ec4] ata_exec_internal_sg+0x244/0x5c0 [c03d064cf700] [c08642d0] ata_exec_internal+0x90/0xe0 [c03d064cf780] [c08650ac] ata_dev_read_id+0x2ec/0x640 [c03d064cf8d0] [c0878e28] ata_eh_recover+0x948/0x16d0 [c03d064cfa10] [c087d760] sata_pmp_error_handler+0x480/0xbf0 [c03d064cfbc0] [c0884624] ahci_error_handler+0x74/0xe0 [c03d064cfbf0] [c0879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0 [c03d064cfca0] [c087a544] ata_scsi_error+0xb4/0x100 [c03d064cfd00] [c0802450] scsi_error_handler+0x120/0x510 [c03d064cfdb0] [c0140c48] kthread+0x1b8/0x1c0 [c03d064cfe20] [c000bd8c] ret_from_kernel_thread+0x5c/0x70 ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) irq event stamp: 2305 hardirqs last enabled at (2305): [] fast_exc_return_irq+0x28/0x34 hardirqs last disabled at (2303): [] __do_softirq+0x4a0/0x654 WARNING: possible irq lock inversion dependency detected 
5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: GW softirqs last enabled at (2304): [] __do_softirq+0x524/0x654 softirqs last disabled at (2297): [] irq_exit+0x128/0x180 swapper/0/0 just changed the state of lock: 06cf56a6 (&(>lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120 but this lock took another, HARDIRQ-unsafe lock in the past: (fs_reclaim){+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0CPU1 lock(fs_reclaim); local_irq_disable(); lock(&(>lock)->rlock); lock(fs_reclaim); lock(&(>lock)->rlock); *** DEADLOCK *** no locks held by swapper/0/0. the shortest dependencies between 2nd lock and 1st lock: -> (fs_reclaim){+.+.} ops: 167579 { HARDIRQ-ON-W at: lock_acquire+0xf8/0x2a0
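Setting the lockdep splat aside, the core idea of the patch, allocating only the top table level up front, populating lower levels on first use, and skipping never-touched levels when freeing, is captured by this toy two-level table. Sizes and names are made up for illustration and do not reflect real TCE geometry:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define L1_ENTRIES 8
#define L2_ENTRIES 512

static uint64_t *l1[L1_ENTRIES];    /* NULL = level never allocated */

static void tce_set(size_t idx, uint64_t val)
{
    size_t i = idx / L2_ENTRIES, j = idx % L2_ENTRIES;

    if (!l1[i]) {    /* allocate the missing level on demand */
        l1[i] = calloc(L2_ENTRIES, sizeof(uint64_t));
        if (!l1[i])
            exit(1);
    }
    l1[i][j] = val;
}

static void tce_free_all(void)
{
    for (size_t i = 0; i < L1_ENTRIES; i++) {
        if (!l1[i])    /* skip levels that were never touched */
            continue;
        free(l1[i]);
        l1[i] = NULL;
    }
}

int main(void)
{
    tce_set(5, 0xdeadbeef);    /* touches exactly one 2nd-level page */
    tce_free_all();
    printf("only 1 of %d second-level pages was ever allocated\n",
           L1_ENTRIES);
    return 0;
}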
[PATCH AUTOSEL 4.19 11/50] PCI: rpaphp: Avoid a sometimes-uninitialized warning
From: Nathan Chancellor [ Upstream commit 0df3e42167caaf9f8c7b64de3da40a459979afe8 ] When building with -Wsometimes-uninitialized, clang warns: drivers/pci/hotplug/rpaphp_core.c:243:14: warning: variable 'fndit' is used uninitialized whenever 'for' loop exits because its condition is false [-Wsometimes-uninitialized] for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:256:6: note: uninitialized use occurs here if (fndit) ^ drivers/pci/hotplug/rpaphp_core.c:243:14: note: remove the condition if it is always true for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:233:14: note: initialize the variable 'fndit' to silence this warning int j, fndit; ^ = 0 fndit is only used to gate a sprintf call, which can be moved into the loop to simplify the code and eliminate the local variable, which will fix this warning. Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info property") Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Acked-by: Tyrel Datwyler Acked-by: Joel Savitz Signed-off-by: Michael Ellerman Link: https://github.com/ClangBuiltLinux/linux/issues/504 Link: https://lore.kernel.org/r/20190603221157.58502-1-natechancel...@gmail.com Signed-off-by: Sasha Levin --- drivers/pci/hotplug/rpaphp_core.c | 18 +++--- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c index 857c358b727b8..cc860c5f7d26f 100644 --- a/drivers/pci/hotplug/rpaphp_core.c +++ b/drivers/pci/hotplug/rpaphp_core.c @@ -230,7 +230,7 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, struct of_drc_info drc; const __be32 *value; char cell_drc_name[MAX_DRC_NAME_LEN]; - int j, fndit; + int j; info = of_find_property(dn->parent, "ibm,drc-info", NULL); if (info == NULL) @@ -245,17 +245,13 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, /* Should now know end of current entry */ - if (my_index > drc.last_drc_index) - continue; - - fndit = 1; - break; + /* Found it */ + if (my_index <= drc.last_drc_index) { + sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, + my_index); + break; + } } - /* Found it */ - - if (fndit) - sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, - my_index); if (((drc_name == NULL) || (drc_name && !strcmp(drc_name, cell_drc_name))) && -- 2.20.1
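The transformation generalizes: whenever a flag set inside a loop only gates work done after the loop, doing the work at the match site removes both the flag and the warning. A standalone model of the before/after, where find_old intentionally preserves the bug and the table values are arbitrary:

#include <stdio.h>

static const int table[] = { 10, 20, 30 };

static void find_old(int key, char *out)
{
    int j, found;               /* 'found' may stay uninitialized... */

    for (j = 0; j < 3; j++) {
        if (key > table[j])
            continue;
        found = 1;
        break;
    }
    if (found)                  /* ...yet is read here */
        sprintf(out, "entry%d", j);
}

static void find_new(int key, char *out)
{
    for (int j = 0; j < 3; j++) {
        if (key <= table[j]) {  /* do the work at the match site */
            sprintf(out, "entry%d", j);
            break;
        }
    }
}

int main(void)
{
    char buf[16] = "";

    find_new(15, buf);
    printf("%s\n", buf);
    (void)find_old;    /* kept only to show the warned-about shape */
    return 0;
}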
[PATCH AUTOSEL 5.2 63/70] powerpc: dump kernel log before carrying out fadump or kdump
From: Ganesh Goudar [ Upstream commit e7ca44ed3ba77fc26cf32650bb71584896662474 ] Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path"), the pstore dmesg file is not updated when a dump is triggered from the HMC. That commit modified the system reset (sreset) handler to invoke fadump or kdump (if configured), without pushing dmesg to pstore. This leaves pstore with old dmesg data, which won't be much help if kdump fails to capture the dump. This patch fixes that by calling kmsg_dump() before heading to fadump or kdump. Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path") Reviewed-by: Mahesh Salgaonkar Reviewed-by: Nicholas Piggin Signed-off-by: Ganesh Goudar Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190904075949.15607-1-ganes...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/traps.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 47df30982de1b..c8ea3a253b815 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -472,6 +472,7 @@ void system_reset_exception(struct pt_regs *regs) if (debugger(regs)) goto out; + kmsg_dump(KMSG_DUMP_OOPS); /* * A system reset is a request to dump, so we always send * it through the crashdump code (if fadump or kdump are -- 2.20.1
[PATCH AUTOSEL 5.2 55/70] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 8fa012a65a712..cc682759feae8 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -344,6 +344,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 5.2 53/70] powerpc/imc: Dont create debugfs files for cpu-less nodes
From: Madhavan Srinivasan [ Upstream commit 41ba17f20ea835c489e77bd54e2da73184e22060 ] Commit <684d984038aa> ('powerpc/powernv: Add debugfs interface for imc-mode and imc') added a debugfs interface for the nest imc pmu devices to support changing of different ucode modes, primarily adding this capability for debug. But when doing so, the code did not consider the case of cpu-less nodes, so reading the _cmd_ or _mode_ file of a cpu-less node triggers this crash. Faulting instruction address: 0xc00d0d58 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19 NIP: c00d0d58 LR: c049aa18 CTR: c00d0d50 REGS: c00020194548f9e0 TRAP: 0300 Not tainted (5.2.0-rc6-next-20190627+) MSR: 90009033 CR: 28022822 XER: CFAR: c049aa14 DAR: 0003fc08 DSISR: 4000 IRQMASK: 0 ... NIP imc_mem_get+0x8/0x20 LR simple_attr_read+0x118/0x170 Call Trace: simple_attr_read+0x70/0x170 (unreliable) debugfs_attr_read+0x6c/0xb0 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 The patch fixes the issue with a more robust NULL check on vbase. Before the patch, ls output for the debugfs imc directory: # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_251 imc_cmd_253 imc_cmd_255 imc_mode_0 imc_mode_251 imc_mode_253 imc_mode_255 imc_cmd_250 imc_cmd_252 imc_cmd_254 imc_cmd_8 imc_mode_250 imc_mode_252 imc_mode_254 imc_mode_8 After the patch: # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_8 imc_mode_0 imc_mode_8 The actual bug here is that we have two loops with potentially different loop counts: in imc_get_mem_addr_nest() the loop count is obtained from the dt entries, but in export_imc_mode_and_cmd() the loop was based on the for_each_node() count. The patch fixes the loop count in the latter based on struct imc_mem_info. Ideally it would be better to have the array size in struct imc_pmu.
Fixes: 684d984038aa ('powerpc/powernv: Add debugfs interface for imc-mode and imc') Reported-by: Qian Cai Suggested-by: Michael Ellerman Signed-off-by: Madhavan Srinivasan Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190827101635.6942-1-ma...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/powernv/opal-imc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 186109bdd41be..e04b20625cb94 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -53,9 +53,9 @@ static void export_imc_mode_and_cmd(struct device_node *node, struct imc_pmu *pmu_ptr) { static u64 loc, *imc_mode_addr, *imc_cmd_addr; - int chip = 0, nid; char mode[16], cmd[16]; u32 cb_offset; + struct imc_mem_info *ptr = pmu_ptr->mem_info; imc_debugfs_parent = debugfs_create_dir("imc", powerpc_debugfs_root); @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct device_node *node, if (of_property_read_u32(node, "cb_offset", &cb_offset)) cb_offset = IMC_CNTL_BLK_OFFSET; - for_each_node(nid) { - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; + while (ptr->vbase != NULL) { + loc = (u64)(ptr->vbase) + cb_offset; imc_mode_addr = (u64 *)(loc + IMC_CNTL_BLK_MODE_OFFSET); - sprintf(mode, "imc_mode_%d", nid); + sprintf(mode, "imc_mode_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(mode, 0600, imc_debugfs_parent, imc_mode_addr)) goto err; imc_cmd_addr = (u64 *)(loc + IMC_CNTL_BLK_CMD_OFFSET); - sprintf(cmd, "imc_cmd_%d", nid); + sprintf(cmd, "imc_cmd_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(cmd, 0600, imc_debugfs_parent, imc_cmd_addr)) goto err; - chip++; + ptr++; } return; -- 2.20.1
[PATCH AUTOSEL 5.2 52/70] powerpc/eeh: Clean up EEH PEs after recovery finishes
From: Oliver O'Halloran [ Upstream commit 799abe283e5103d48e079149579b4f167c95ea0e ] When the last device in an eeh_pe is removed the eeh_pe structure itself (and any empty parents) are freed since they are no longer needed. This results in a crash when a hotplug driver is involved since the following may occur: 1. Device is surprise removed. 2. Driver performs an MMIO, which fails and queues an eeh_event. 3. Hotplug driver receives a hotplug interrupt and removes any pci_devs that were under the slot. 4. pci_dev is torn down and the eeh_pe is freed. 5. The EEH event handler thread processes the eeh_event and crashes since the eeh_pe pointer in the eeh_event structure is no longer valid. Crashing is generally considered poor form. Instead of doing that, use the fact that PEs are marked as EEH_PE_INVALID to keep them around until the end of the recovery cycle, at which point we can safely prune any empty PEs. Signed-off-by: Oliver O'Halloran Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190903101605.2890-2-ooh...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 36 ++-- arch/powerpc/kernel/eeh_event.c | 8 +++ arch/powerpc/kernel/eeh_pe.c | 23 +++- 3 files changed, 64 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 1fbe541856f5e..fe0c32fb9f96f 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -744,6 +744,33 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus, */ #define MAX_WAIT_FOR_RECOVERY 300 + +/* Walks the PE tree after processing an event to remove any stale PEs. + * + * NB: This needs to be recursive to ensure the leaf PEs get removed + * before their parents do. Although this is possible to do recursively + * we don't since this is easier to read and we need to guarantee + * the leaf nodes will be handled first. + */ +static void eeh_pe_cleanup(struct eeh_pe *pe) +{ + struct eeh_pe *child_pe, *tmp; + + list_for_each_entry_safe(child_pe, tmp, &pe->child_list, child) + eeh_pe_cleanup(child_pe); + + if (pe->state & EEH_PE_KEEP) + return; + + if (!(pe->state & EEH_PE_INVALID)) + return; + + if (list_empty(&pe->edevs) && list_empty(&pe->child_list)) { + list_del(&pe->child); + kfree(pe); + } +} + /** * eeh_handle_normal_event - Handle EEH events on a specific PE * @pe: EEH PE - which should not be used after we return, as it may @@ -782,8 +809,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } - eeh_pe_state_mark(pe, EEH_PE_RECOVERING); - eeh_pe_update_time_stamp(pe); pe->freeze_count++; if (pe->freeze_count > eeh_max_freezes) { @@ -973,6 +998,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } } + + /* +* Clean up any PEs without devices. While marked as EEH_PE_RECOVERING +* we don't want to modify the PE tree structure so we do it here. +*/ + eeh_pe_cleanup(pe); eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true); } @@ -1045,6 +1076,7 @@ void eeh_handle_special_event(void) */ if (rc == EEH_NEXT_ERR_FROZEN_PE || rc == EEH_NEXT_ERR_FENCED_PHB) { + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); eeh_handle_normal_event(pe); } else { pci_lock_rescan_remove(); diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c index 64cfbe41174b2..e36653e5f76b3 100644 --- a/arch/powerpc/kernel/eeh_event.c +++ b/arch/powerpc/kernel/eeh_event.c @@ -121,6 +121,14 @@ int __eeh_send_failure_event(struct eeh_pe *pe) } event->pe = pe; + /* +* Mark the PE as recovering before inserting it in the queue.
+* This prevents the PE from being free()ed by a hotplug driver +* while the PE is sitting in the event queue. +*/ + if (pe) + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); + /* We may or may not be called in an interrupt context */ spin_lock_irqsave(&eeh_eventlist_lock, flags); list_add(&event->list, &eeh_eventlist); diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index 854cef7b18f4d..f0813d50e0b1c 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -491,6 +491,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev) int eeh_rmv_from_parent_pe(struct eeh_dev *edev) { struct eeh_pe *pe, *parent, *child; + bool keep, recover; int cnt; struct pci_dn *pdn = eeh_dev_to_pdn(edev); @@ -516,10 +517,21 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev) */ while (1) { parent =
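The cleanup ordering matters: leaves must be freed before their parents are considered, and a node survives if it is kept, still valid, or non-empty. A compact single-child model of that rule follows; the flag names are shortened stand-ins, and the real code walks a child list rather than a single pointer:

#include <stdio.h>
#include <stdlib.h>

struct pe {
    int invalid;        /* models EEH_PE_INVALID */
    int keep;           /* models EEH_PE_KEEP    */
    int ndevs;          /* models the edevs list */
    struct pe *child;   /* single child, for brevity */
};

static struct pe *pe_cleanup(struct pe *pe)
{
    if (!pe)
        return NULL;

    pe->child = pe_cleanup(pe->child);    /* leaves go first */

    if (pe->keep || !pe->invalid || pe->ndevs || pe->child)
        return pe;                        /* still in use */

    free(pe);
    return NULL;
}

int main(void)
{
    struct pe *leaf = calloc(1, sizeof(*leaf));
    struct pe *root = calloc(1, sizeof(*root));

    leaf->invalid = 1;    /* empty and invalidated during recovery */
    root->invalid = 1;
    root->child = leaf;

    root = pe_cleanup(root);
    printf("root %s\n", root ? "kept" : "freed");
    return 0;
}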
[PATCH AUTOSEL 5.2 50/70] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 6c51aa845bcee..3e564536a237f 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -556,6 +556,10 @@ FTR_SECTION_ELSE ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE) 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP SET_SCRATCH0(r13) /* save r13 */ EXCEPTION_PROLOG_0(PACA_EXMC) -- 2.20.1
[PATCH AUTOSEL 5.2 48/70] selftests/powerpc: Retry on host facility unavailable
From: Gustavo Romero [ Upstream commit 6652bf6408895b09d31fd4128a1589a1a0672823 ] The TM test tm-unavailable must take into account aborts due to the host aborting a transaction because of a facility unavailable exception, just like it already does for aborts on reschedules (TM_CAUSE_KVM_RESCHED). Reported-by: Desnes A. Nunes do Rosario Tested-by: Desnes A. Nunes do Rosario Signed-off-by: Gustavo Romero Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/1566341651-19747-1-git-send-email-grom...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- tools/testing/selftests/powerpc/tm/tm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/powerpc/tm/tm.h b/tools/testing/selftests/powerpc/tm/tm.h index 97f9f491c541a..c402464b038fc 100644 --- a/tools/testing/selftests/powerpc/tm/tm.h +++ b/tools/testing/selftests/powerpc/tm/tm.h @@ -55,7 +55,8 @@ static inline bool failure_is_unavailable(void) static inline bool failure_is_reschedule(void) { if ((failure_code() & TM_CAUSE_RESCHED) == TM_CAUSE_RESCHED || - (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED) + (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED || + (failure_code() & TM_CAUSE_KVM_FAC_UNAV) == TM_CAUSE_KVM_FAC_UNAV) return true; return false; -- 2.20.1
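For reference, the predicate tests each cause with (code & CAUSE) == CAUSE, so multi-bit cause values must match in full rather than merely intersect. A standalone model with placeholder constants, not the real TM_CAUSE_* values:

#include <stdio.h>

#define CAUSE_RESCHED      0x0c
#define CAUSE_KVM_RESCHED  0x14
#define CAUSE_KVM_FAC_UNAV 0x18

static int failure_is_reschedule(unsigned int code)
{
    /* Each cause must match all of its bits, mirroring the test. */
    return (code & CAUSE_RESCHED) == CAUSE_RESCHED ||
           (code & CAUSE_KVM_RESCHED) == CAUSE_KVM_RESCHED ||
           (code & CAUSE_KVM_FAC_UNAV) == CAUSE_KVM_FAC_UNAV;
}

int main(void)
{
    printf("fac_unav treated as retryable: %d\n",
           failure_is_reschedule(CAUSE_KVM_FAC_UNAV));
    return 0;
}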
[PATCH AUTOSEL 5.2 40/70] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their driver's handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 89623962c7275..1fbe541856f5e 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -793,6 +793,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe) result = PCI_ERS_RESULT_DISCONNECT; } + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -981,7 +985,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe) */ void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -1050,6 +1055,10 @@ void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true); eeh_set_channel_state(pe, pci_channel_io_perm_failure); -- 2.20.1
[PATCH AUTOSEL 5.2 38/70] powerpc/perf: fix imc allocation failure handling
From: Nicholas Piggin [ Upstream commit 10c4bd7cd28e77aeb8cfa65b23cb3c632ede2a49 ] The alloc_pages_node return value should be tested for failure before being passed to page_address. Tested-by: Anju T Sudhakar Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-3-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/perf/imc-pmu.c | 29 ++--- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 3bdfc1e320964..2231959c56331 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -570,6 +570,7 @@ static int core_imc_mem_init(int cpu, int size) { int nid, rc = 0, core_id = (cpu / threads_per_core); struct imc_mem_info *mem_info; + struct page *page; /* * alloc_pages_node() will allocate memory for core in the @@ -580,11 +581,12 @@ static int core_imc_mem_init(int cpu, int size) mem_info->id = core_id; /* We need only vbase for core counters */ - mem_info->vbase = page_address(alloc_pages_node(nid, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!mem_info->vbase) + page = alloc_pages_node(nid, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + mem_info->vbase = page_address(page); /* Init the mutex */ core_imc_refc[core_id].id = core_id; @@ -839,15 +841,17 @@ static int thread_imc_mem_alloc(int cpu_id, int size) int nid = cpu_to_node(cpu_id); if (!local_mem) { + struct page *page; /* * This case could happen only once at start, since we dont * free the memory in cpu offline path. */ - local_mem = page_address(alloc_pages_node(nid, + page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(thread_imc_mem, cpu_id) = local_mem; } @@ -1085,11 +1089,14 @@ static int trace_imc_mem_alloc(int cpu_id, int size) int core_id = (cpu_id / threads_per_core); if (!local_mem) { - local_mem = page_address(alloc_pages_node(phys_id, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + struct page *page; + + page = alloc_pages_node(phys_id, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(trace_imc_mem, cpu_id) = local_mem; /* Initialise the counters for trace mode */ -- 2.20.1
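The pattern being removed, feeding an allocator's result straight into an accessor that dereferences it, is worth seeing in isolation. A userspace analog with modelled types, not the kernel's:

#include <stdio.h>
#include <stdlib.h>

struct page_model { void *addr; };

static struct page_model *alloc_pages_model(int fail)
{
    if (fail)
        return NULL;    /* the allocation-failure path */
    struct page_model *p = malloc(sizeof(*p));
    if (p)
        p->addr = p;
    return p;
}

static void *page_address_model(struct page_model *page)
{
    return page->addr;  /* crashes if page == NULL, just like
                         * page_address(alloc_pages_node(...)) would */
}

int main(void)
{
    struct page_model *page;

    /* Failure path: check the page before taking its address. */
    page = alloc_pages_model(1);
    if (!page)
        fprintf(stderr, "allocation failed, handled cleanly\n");

    /* Success path. */
    page = alloc_pages_model(0);
    if (!page)
        return 1;
    printf("vbase = %p\n", page_address_model(page));
    free(page);
    return 0;
}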
[PATCH AUTOSEL 5.2 37/70] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There are a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 50e7aee3c7f37..accb732dcfac7 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -9,6 +9,7 @@ #include #include #include +#include <linux/sched.h> #include #include #include @@ -206,7 +207,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -309,8 +314,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 5.2 36/70] powerpc/64s/radix: Fix memory hotplug section page table creation
From: Nicholas Piggin [ Upstream commit 8f51e3929470942e6a8744061254fdeef646cd36 ] create_physical_mapping expects physical addresses, but creating and splitting these mappings after boot is supplying virtual (effective) addresses. This can be irritated by booting with mem= to limit memory then probing an unused physical memory range: echo > /sys/devices/system/memory/probe This mostly works by accident, firstly because __va(__va(x)) == __va(x) so the virtual address does not get corrupted. Secondly because pfn_pte masks out the upper bits of the pfn beyond the physical address limit, so a pfn constructed with a 0xc000 virtual linear address will be masked back to the correct physical address in the pte. Fixes: 6cc27341b21a8 ("powerpc/mm: add radix__create_section_mapping()") Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-1-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 8deb432c29754..2b6cc823046a3 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -901,7 +901,7 @@ int __meminit radix__create_section_mapping(unsigned long start, unsigned long e return -1; } - return create_physical_mapping(start, end, nid); + return create_physical_mapping(__pa(start), __pa(end), nid); } int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end) -- 2.20.1
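Why the bug "mostly worked" can be shown with two one-line identities. Model the linear map as __va(x) = x | KERN_BASE and __pa(x) = x & ~KERN_BASE, an illustrative simplification of the real powerpc layout:

#include <stdint.h>
#include <stdio.h>

#define KERN_BASE 0xc000000000000000ULL

static uint64_t va(uint64_t x) { return x | KERN_BASE; }
static uint64_t pa(uint64_t x) { return x & ~KERN_BASE; }

int main(void)
{
    uint64_t phys = 0x20000000ULL;

    /* __va() is idempotent, so handing an already-virtual address to
     * code that will apply __va() again does not corrupt it... */
    printf("va(va(p)) == va(p): %d\n", va(va(phys)) == va(phys));

    /* ...and pfn masking used to hide the mistake too. The fix is
     * simply to hand create_physical_mapping() __pa() of the range: */
    printf("pa(va(p)) == p: %d\n", pa(va(phys)) == phys);
    return 0;
}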
[PATCH AUTOSEL 5.2 35/70] powerpc/64s/radix: Remove redundant pfn_pte bitop, add VM_BUG_ON
From: Nicholas Piggin [ Upstream commit 6bb25170d7a44ef0ed9677814600f0785e7421d1 ] pfn_pte is never given a pte above the addressable physical memory limit, so the masking is redundant. In case of a software bug, it is not obviously better to silently truncate the pfn than to corrupt the pte (either one will result in memory corruption or crashes), so there is no reason to add this to the fast path. Add VM_BUG_ON to catch cases where the pfn is invalid. These would catch the create_section_mapping bug fixed by a previous commit. [16885.256466] [ cut here ] [16885.256492] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! cpu 0x0: Vector: 700 (Program Check) at [c000ee0a36d0] pc: c0080738: __map_kernel_page+0x248/0x6f0 lr: c0080ac0: __map_kernel_page+0x5d0/0x6f0 sp: c000ee0a3960 msr: 90029033 current = 0xc000ec63b400 paca= 0xc17f irqmask: 0x03 irq_happened: 0x01 pid = 85, comm = sh kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Linux version 5.3.0-rc1-1-g0fe93e5f3394 enter ? for help [c000ee0a3a00] c0d37378 create_physical_mapping+0x260/0x360 [c000ee0a3b10] c0d370bc create_section_mapping+0x1c/0x3c [c000ee0a3b30] c0071f54 arch_add_memory+0x74/0x130 Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-5-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index ccf00a8b98c6a..c470f6370dccb 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -602,8 +602,10 @@ static inline bool pte_access_permitted(pte_t pte, bool write) */ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) { - return __pte((((pte_basic_t)(pfn) << PAGE_SHIFT) & PTE_RPN_MASK) | - pgprot_val(pgprot)); + VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); + VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); + + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); } static inline unsigned long pte_pfn(pte_t pte) -- 2.20.1
[PATCH AUTOSEL 5.2 34/70] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index 3a6aa57b9d901..eea28ca679dbb 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -60,8 +60,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; prevent_write_to_user(uaddr, sizeof(*uaddr)); return ret; -- 2.20.1
[PATCH AUTOSEL 5.2 33/70] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu 7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index fff2eb22427d0..65cd96c3b1d60 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -871,15 +871,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -968,6 +970,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1007,6 +1011,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 5.2 32/70] powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL
From: Cédric Le Goater [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ] Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in the OPAL logs and also outputs some of the fields of the internal XIVE structures in Linux. The OPAL calls can only be done on baremetal (PowerNV) and they crash a pseries machine. Fix by checking the hypervisor feature of the CPU. Signed-off-by: Cédric Le Goater Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190814154754.23682-2-...@kaod.org Signed-off-by: Sasha Levin --- arch/powerpc/xmon/xmon.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 4a721fd624069..e15ccf19c1533 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2532,13 +2532,16 @@ static void dump_pacas(void) static void dump_one_xive(int cpu) { unsigned int hwid = get_hard_smp_processor_id(cpu); - - opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); - opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); - opal_xive_dump(XIVE_DUMP_TM_OS, hwid); - opal_xive_dump(XIVE_DUMP_TM_USER, hwid); - opal_xive_dump(XIVE_DUMP_VP, hwid); - opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + bool hv = cpu_has_feature(CPU_FTR_HVMODE); + + if (hv) { + opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); + opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); + opal_xive_dump(XIVE_DUMP_TM_OS, hwid); + opal_xive_dump(XIVE_DUMP_TM_USER, hwid); + opal_xive_dump(XIVE_DUMP_VP, hwid); + opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + } if (setjmp(bus_error_jmp) != 0) { catch_memory_errors = 0; -- 2.20.1
[PATCH AUTOSEL 5.2 25/70] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window
From: Alexey Kardashevskiy [ Upstream commit c37c792dec0929dbb6360a609fb00fa20bb16fc2 ] We allocate only the first level of multilevel TCE tables for KVM already (alloc_userspace_copy==true), and the rest is allocated on demand. This is not enabled though for bare metal. This removes the KVM limitation (implicit, via the alloc_userspace_copy parameter) and always allocates just the first level. The on-demand allocation of missing levels is already implemented. As from now on DMA map might happen with disabled interrupts, this allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1]. In practice just a single page is allocated there so chances for failure are quite low. To save time when creating a new clean table, this skips non-allocated indirect TCE entries in pnv_tce_free just like we already do in the VFIO IOMMU TCE driver. This changes the default level number from 1 to 2 to reduce the amount of memory required for the default 32bit DMA window at the boot time. The default window size is up to 2GB which requires 4MB of TCEs which is unlikely to be used entirely or at all as most devices these days are 64bit capable so by switching to 2 levels by default we save 4032KB of RAM per a device. While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace can trigger this path via VFIO, see the failure and try creating a table again with different parameters which might succeed. [1]: === BUG: sleeping function called from invalid context at mm/page_alloc.c:4596 in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1 2 locks held by scsi_eh_1/1038: #0: 5efd659a (>eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80 #1: 06cf56a6 (&(>lock)->rlock){}, at: ata_exec_internal_sg+0xb0/0x5c0 irq event stamp: 500 hardirqs last enabled at (499): [] _raw_spin_unlock_irqrestore+0x94/0xd0 hardirqs last disabled at (500): [] _raw_spin_lock_irqsave+0x44/0x120 softirqs last enabled at (0): [] copy_process.isra.4.part.5+0x640/0x1a80 softirqs last disabled at (0): [<>] 0x0 CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Call Trace: [c03d064cef50] [c0c8e6c4] dump_stack+0xe8/0x164 (unreliable) [c03d064cefa0] [c014ed78] ___might_sleep+0x2f8/0x310 [c03d064cf020] [c03ca084] __alloc_pages_nodemask+0x2a4/0x1560 [c03d064cf220] [c00c2530] pnv_alloc_tce_level.isra.0+0x90/0x130 [c03d064cf290] [c00c2888] pnv_tce+0x128/0x3b0 [c03d064cf360] [c00c2c00] pnv_tce_build+0xb0/0xf0 [c03d064cf3c0] [c00bbd9c] pnv_ioda2_tce_build+0x3c/0xb0 [c03d064cf400] [c004cfe0] ppc_iommu_map_sg+0x210/0x550 [c03d064cf510] [c004b7a4] dma_iommu_map_sg+0x74/0xb0 [c03d064cf530] [c0863944] ata_qc_issue+0x134/0x470 [c03d064cf5b0] [c0863ec4] ata_exec_internal_sg+0x244/0x5c0 [c03d064cf700] [c08642d0] ata_exec_internal+0x90/0xe0 [c03d064cf780] [c08650ac] ata_dev_read_id+0x2ec/0x640 [c03d064cf8d0] [c0878e28] ata_eh_recover+0x948/0x16d0 [c03d064cfa10] [c087d760] sata_pmp_error_handler+0x480/0xbf0 [c03d064cfbc0] [c0884624] ahci_error_handler+0x74/0xe0 [c03d064cfbf0] [c0879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0 [c03d064cfca0] [c087a544] ata_scsi_error+0xb4/0x100 [c03d064cfd00] [c0802450] scsi_error_handler+0x120/0x510 [c03d064cfdb0] [c0140c48] kthread+0x1b8/0x1c0 [c03d064cfe20] [c000bd8c] ret_from_kernel_thread+0x5c/0x70 ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) irq event stamp: 2305 hardirqs last enabled at (2305): [] fast_exc_return_irq+0x28/0x34 hardirqs last disabled at (2303): [] __do_softirq+0x4a0/0x654 WARNING: possible irq lock inversion dependency detected 
5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: G W softirqs last enabled at (2304): [] __do_softirq+0x524/0x654 softirqs last disabled at (2297): [] irq_exit+0x128/0x180 swapper/0/0 just changed the state of lock: 06cf56a6 (&(&host->lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120 but this lock took another, HARDIRQ-unsafe lock in the past: (fs_reclaim){+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 lock(fs_reclaim); local_irq_disable(); lock(&(&host->lock)->rlock); lock(fs_reclaim); lock(&(&host->lock)->rlock); *** DEADLOCK *** no locks held by swapper/0/0. the shortest dependencies between 2nd lock and 1st lock: -> (fs_reclaim){+.+.} ops: 167579 { HARDIRQ-ON-W at: lock_acquire+0xf8/0x2a0
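To make the on-demand scheme concrete, here is a minimal standalone C sketch of the idea, not the kernel code: only the first table level exists up front, and a missing second-level page is allocated the first time an entry in its range is mapped. The names and sizes are illustrative; in the kernel it is this late allocation that must use GFP_ATOMIC | __GFP_NOWARN, since the DMA map path may run with interrupts disabled.

    /* Standalone sketch of on-demand second-level TCE allocation. */
    #include <stdint.h>
    #include <stdlib.h>

    #define LEVEL_ENTRIES 512    /* entries per table level (illustrative) */

    struct tce_table {
        uint64_t *level1[LEVEL_ENTRIES];  /* only this array exists at creation */
    };

    static uint64_t *tce_slot(struct tce_table *tbl, size_t idx)
    {
        size_t l1 = idx / LEVEL_ENTRIES, l2 = idx % LEVEL_ENTRIES;

        if (!tbl->level1[l1]) {
            /* kernel equivalent: alloc_pages_node(nid, GFP_ATOMIC | __GFP_NOWARN, 0) */
            tbl->level1[l1] = calloc(LEVEL_ENTRIES, sizeof(uint64_t));
            if (!tbl->level1[l1])
                return NULL;     /* caller fails the DMA map */
        }
        return &tbl->level1[l1][l2];
    }

    int main(void)
    {
        struct tce_table tbl = { 0 };
        uint64_t *slot = tce_slot(&tbl, 12345);

        if (slot)
            *slot = 0xdeadbeef;  /* "map" a page */
        /* Freeing skips never-allocated levels, as pnv_tce_free now does. */
        for (size_t i = 0; i < LEVEL_ENTRIES; i++)
            free(tbl.level1[i]);
        return 0;
    }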
[PATCH AUTOSEL 5.2 17/70] PCI: rpaphp: Avoid a sometimes-uninitialized warning
From: Nathan Chancellor [ Upstream commit 0df3e42167caaf9f8c7b64de3da40a459979afe8 ] When building with -Wsometimes-uninitialized, clang warns: drivers/pci/hotplug/rpaphp_core.c:243:14: warning: variable 'fndit' is used uninitialized whenever 'for' loop exits because its condition is false [-Wsometimes-uninitialized] for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:256:6: note: uninitialized use occurs here if (fndit) ^ drivers/pci/hotplug/rpaphp_core.c:243:14: note: remove the condition if it is always true for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:233:14: note: initialize the variable 'fndit' to silence this warning int j, fndit; ^ = 0 fndit is only used to gate a sprintf call; moving that call into the loop simplifies the code, eliminates the local variable, and fixes the warning. Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info property") Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Acked-by: Tyrel Datwyler Acked-by: Joel Savitz Signed-off-by: Michael Ellerman Link: https://github.com/ClangBuiltLinux/linux/issues/504 Link: https://lore.kernel.org/r/20190603221157.58502-1-natechancel...@gmail.com Signed-off-by: Sasha Levin --- drivers/pci/hotplug/rpaphp_core.c | 18 +++--- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c index bcd5d357ca238..c3899ee1db995 100644 --- a/drivers/pci/hotplug/rpaphp_core.c +++ b/drivers/pci/hotplug/rpaphp_core.c @@ -230,7 +230,7 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, struct of_drc_info drc; const __be32 *value; char cell_drc_name[MAX_DRC_NAME_LEN]; - int j, fndit; + int j; info = of_find_property(dn->parent, "ibm,drc-info", NULL); if (info == NULL) @@ -245,17 +245,13 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, /* Should now know end of current entry */ - if (my_index > drc.last_drc_index) - continue; - - fndit = 1; - break; + /* Found it */ + if (my_index <= drc.last_drc_index) { + sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, + my_index); + break; + } } - /* Found it */ - - if (fndit) - sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, - my_index); if (((drc_name == NULL) || (drc_name && !strcmp(drc_name, cell_drc_name))) && -- 2.20.1
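The warning pattern, and the shape of the fix, can be reproduced in a few lines of standalone C (illustrative code, not the driver): a flag assigned only inside a loop is read after it, so clang cannot prove it is initialized on the no-match path; moving the guarded work into the loop removes the flag entirely.

    #include <stdio.h>

    static void find_flag(const int *v, int n, int key)
    {
        int i, found;            /* left uninitialized if the loop finds nothing */

        for (i = 0; i < n; i++) {
            if (v[i] == key) {
                found = 1;
                break;
            }
        }
        if (found)               /* clang: 'found' may be used uninitialized */
            printf("hit at %d\n", i);
    }

    static void find_fixed(const int *v, int n, int key)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (v[i] == key) {
                printf("hit at %d\n", i);  /* work moved into the loop */
                break;
            }
        }
    }

    int main(void)
    {
        int v[] = { 1, 2, 3 };

        find_flag(v, 3, 2);      /* key is present, so no runtime misuse here */
        find_fixed(v, 3, 9);
        return 0;
    }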
[PATCH AUTOSEL 5.3 80/87] powerpc: dump kernel log before carrying out fadump or kdump
From: Ganesh Goudar [ Upstream commit e7ca44ed3ba77fc26cf32650bb71584896662474 ] Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path"), the pstore dmesg file is not updated when a dump is triggered from the HMC. That commit modified the system reset (sreset) handler to invoke fadump or kdump (if configured) without pushing dmesg to pstore. This leaves pstore holding stale dmesg data, which won't be much help if kdump fails to capture the dump. This patch fixes that by calling kmsg_dump() before heading to fadump or kdump. Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path") Reviewed-by: Mahesh Salgaonkar Reviewed-by: Nicholas Piggin Signed-off-by: Ganesh Goudar Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190904075949.15607-1-ganes...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/traps.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 11caa0291254e..82f43535e6867 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -472,6 +472,7 @@ void system_reset_exception(struct pt_regs *regs) if (debugger(regs)) goto out; + kmsg_dump(KMSG_DUMP_OOPS); /* * A system reset is a request to dump, so we always send * it through the crashdump code (if fadump or kdump are -- 2.20.1
[PATCH AUTOSEL 5.3 71/87] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index f5940cc71c372..63462e96cf0ee 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -316,6 +316,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
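The contract being honored here can be shown with a toy userspace model, not the kernel code: if an interrupt is already pending when we are about to idle, entering the low-power wait could sleep through it, so the caller must check the prepare routine's result and bail out. prep_for_idle() and the pending flag below are illustrative stand-ins.

    #include <stdbool.h>
    #include <stdio.h>

    static bool irq_pending;    /* stand-in for the lazy-irq pending state */

    /* models prep_irq_for_idle(): false means "do not enter idle" */
    static bool prep_for_idle(void)
    {
        if (irq_pending)
            return false;       /* caller returns and the IRQ gets serviced */
        return true;            /* the real code also hard-enables MSR[EE] here */
    }

    static void lpar_idle(void)
    {
        if (!prep_for_idle())
            return;                     /* honor the result, as the fix does */
        puts("cede to hypervisor");     /* models the H_CEDE hcall */
    }

    int main(void)
    {
        irq_pending = true;
        lpar_idle();            /* prints nothing: we must not cede */
        irq_pending = false;
        lpar_idle();            /* safe to cede now */
        return 0;
    }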
[PATCH AUTOSEL 5.3 69/87] powerpc/imc: Dont create debugfs files for cpu-less nodes
From: Madhavan Srinivasan [ Upstream commit 41ba17f20ea835c489e77bd54e2da73184e22060 ] Commit 684d984038aa ("powerpc/powernv: Add debugfs interface for imc-mode and imc") added a debugfs interface for the nest imc pmu devices to support changing between different ucode modes, primarily for debug. But when doing so, the code did not consider the case of cpu-less nodes, so reading the _cmd_ or _mode_ file of a cpu-less node creates this crash. Faulting instruction address: 0xc00d0d58 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19 NIP: c00d0d58 LR: c049aa18 CTR:c00d0d50 REGS: c00020194548f9e0 TRAP: 0300 Not tainted (5.2.0-rc6-next-20190627+) MSR: 90009033 CR:28022822 XER: CFAR: c049aa14 DAR: 0003fc08 DSISR:4000 IRQMASK: 0 ... NIP imc_mem_get+0x8/0x20 LR simple_attr_read+0x118/0x170 Call Trace: simple_attr_read+0x70/0x170 (unreliable) debugfs_attr_read+0x6c/0xb0 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 The patch fixes the issue with a more robust NULL check on vbase. Before the patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0    imc_cmd_251 imc_cmd_253 imc_cmd_255 imc_mode_0 imc_mode_251 imc_mode_253 imc_mode_255 imc_cmd_250 imc_cmd_252 imc_cmd_254 imc_cmd_8    imc_mode_250 imc_mode_252 imc_mode_254 imc_mode_8 After the patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_8 imc_mode_0 imc_mode_8 The actual bug here is that we have two loops with potentially different loop counts. That is, in imc_get_mem_addr_nest() the loop count is obtained from the dt entries, but in export_imc_mode_and_cmd() the loop was based on the for_each_node() count. The patch bases the loop count in the latter on the struct imc_mem_info array instead. Ideally it would be better to have the array size in struct imc_pmu.
Fixes: 684d984038aa ('powerpc/powernv: Add debugfs interface for imc-mode and imc') Reported-by: Qian Cai Suggested-by: Michael Ellerman Signed-off-by: Madhavan Srinivasan Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190827101635.6942-1-ma...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/powernv/opal-imc.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 186109bdd41be..e04b20625cb94 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -53,9 +53,9 @@ static void export_imc_mode_and_cmd(struct device_node *node, struct imc_pmu *pmu_ptr) { static u64 loc, *imc_mode_addr, *imc_cmd_addr; - int chip = 0, nid; char mode[16], cmd[16]; u32 cb_offset; + struct imc_mem_info *ptr = pmu_ptr->mem_info; imc_debugfs_parent = debugfs_create_dir("imc", powerpc_debugfs_root); @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct device_node *node, if (of_property_read_u32(node, "cb_offset", &cb_offset)) cb_offset = IMC_CNTL_BLK_OFFSET; - for_each_node(nid) { - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; + while (ptr->vbase != NULL) { + loc = (u64)(ptr->vbase) + cb_offset; imc_mode_addr = (u64 *)(loc + IMC_CNTL_BLK_MODE_OFFSET); - sprintf(mode, "imc_mode_%d", nid); + sprintf(mode, "imc_mode_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(mode, 0600, imc_debugfs_parent, imc_mode_addr)) goto err; imc_cmd_addr = (u64 *)(loc + IMC_CNTL_BLK_CMD_OFFSET); - sprintf(cmd, "imc_cmd_%d", nid); + sprintf(cmd, "imc_cmd_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(cmd, 0600, imc_debugfs_parent, imc_cmd_addr)) goto err; - chip++; + ptr++; } return; -- 2.20.1
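The essence of the fix is to drive the loop off the data that was actually allocated, a sentinel-terminated array, rather than off an unrelated count such as the number of possible nodes. A standalone sketch with illustrative types and values:

    #include <stdio.h>

    struct mem_info {
        int id;
        void *vbase;    /* a NULL vbase terminates the array */
    };

    int main(void)
    {
        int dummy[2];
        struct mem_info info[] = {
            { 0, &dummy[0] },
            { 8, &dummy[1] },
            { 0, NULL },    /* sentinel: no more chips */
        };

        /* iterate exactly the entries that exist, as the fixed loop does */
        for (struct mem_info *ptr = info; ptr->vbase != NULL; ptr++)
            printf("imc_mode_%d / imc_cmd_%d\n", ptr->id, ptr->id);
        return 0;
    }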
[PATCH AUTOSEL 5.3 68/87] powerpc/eeh: Clean up EEH PEs after recovery finishes
From: Oliver O'Halloran [ Upstream commit 799abe283e5103d48e079149579b4f167c95ea0e ] When the last device in an eeh_pe is removed the eeh_pe structure itself (and any empty parents) are freed since they are no longer needed. This results in a crash when a hotplug driver is involved since the following may occur: 1. Device is surprise removed. 2. Driver performs an MMIO, which fails and queues an eeh_event. 3. Hotplug driver receives a hotplug interrupt and removes any pci_devs that were under the slot. 4. pci_dev is torn down and the eeh_pe is freed. 5. The EEH event handler thread processes the eeh_event and crashes since the eeh_pe pointer in the eeh_event structure is no longer valid. Crashing is generally considered poor form. Instead of doing that, use the fact PEs are marked as EEH_PE_INVALID to keep them around until the end of the recovery cycle, at which point we can safely prune any empty PEs. Signed-off-by: Oliver O'Halloran Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190903101605.2890-2-ooh...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 36 ++-- arch/powerpc/kernel/eeh_event.c | 8 +++ arch/powerpc/kernel/eeh_pe.c | 23 +++- 3 files changed, 64 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 1fbe541856f5e..fe0c32fb9f96f 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -744,6 +744,33 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus, */ #define MAX_WAIT_FOR_RECOVERY 300 + +/* Walks the PE tree after processing an event to remove any stale PEs. + * + * NB: This needs to be recursive to ensure the leaf PEs get removed + * before their parents do. Although this is possible to do iteratively, + * we don't, since this is easier to read and we need to guarantee + * the leaf nodes will be handled first. + */ +static void eeh_pe_cleanup(struct eeh_pe *pe) +{ + struct eeh_pe *child_pe, *tmp; + + list_for_each_entry_safe(child_pe, tmp, &pe->child_list, child) + eeh_pe_cleanup(child_pe); + + if (pe->state & EEH_PE_KEEP) + return; + + if (!(pe->state & EEH_PE_INVALID)) + return; + + if (list_empty(&pe->edevs) && list_empty(&pe->child_list)) { + list_del(&pe->child); + kfree(pe); + } +} + /** * eeh_handle_normal_event - Handle EEH events on a specific PE * @pe: EEH PE - which should not be used after we return, as it may @@ -782,8 +809,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } - eeh_pe_state_mark(pe, EEH_PE_RECOVERING); - eeh_pe_update_time_stamp(pe); pe->freeze_count++; if (pe->freeze_count > eeh_max_freezes) { @@ -973,6 +998,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } } + + /* +* Clean up any PEs without devices. While marked as EEH_PE_RECOVERING +* we don't want to modify the PE tree structure so we do it here. +*/ + eeh_pe_cleanup(pe); eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true); } @@ -1045,6 +1076,7 @@ void eeh_handle_special_event(void) */ if (rc == EEH_NEXT_ERR_FROZEN_PE || rc == EEH_NEXT_ERR_FENCED_PHB) { + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); eeh_handle_normal_event(pe); } else { pci_lock_rescan_remove(); diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c index 64cfbe41174b2..e36653e5f76b3 100644 --- a/arch/powerpc/kernel/eeh_event.c +++ b/arch/powerpc/kernel/eeh_event.c @@ -121,6 +121,14 @@ int __eeh_send_failure_event(struct eeh_pe *pe) } event->pe = pe; + /* +* Mark the PE as recovering before inserting it in the queue.
+* This prevents the PE from being free()ed by a hotplug driver +* while the PE is sitting in the event queue. +*/ + if (pe) + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); + /* We may or may not be called in an interrupt context */ spin_lock_irqsave(&eeh_eventlist_lock, flags); list_add(&event->list, &eeh_eventlist); diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index 854cef7b18f4d..f0813d50e0b1c 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -491,6 +491,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev) int eeh_rmv_from_parent_pe(struct eeh_dev *edev) { struct eeh_pe *pe, *parent, *child; + bool keep, recover; int cnt; struct pci_dn *pdn = eeh_dev_to_pdn(edev); @@ -516,10 +517,21 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev) */ while (1) { parent =
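The leaf-first pruning that eeh_pe_cleanup() performs is a post-order tree walk. A standalone sketch of the idea, simplified to one child per PE and with a single keep flag standing in for the EEH_PE_KEEP/EEH_PE_INVALID checks (illustrative only):

    #include <stdlib.h>

    struct pe {
        struct pe *child;   /* simplified: at most one child per PE */
        int keep;           /* models a PE that still has devices, or EEH_PE_KEEP */
    };

    /* post-order walk: prune leaves before their parents, as eeh_pe_cleanup() must */
    static struct pe *pe_cleanup(struct pe *pe)
    {
        if (!pe)
            return NULL;
        pe->child = pe_cleanup(pe->child);
        if (pe->keep || pe->child)
            return pe;      /* still in use: leave it in the tree */
        free(pe);
        return NULL;
    }

    int main(void)
    {
        struct pe *root = calloc(1, sizeof(*root));
        struct pe *mid = calloc(1, sizeof(*mid));
        struct pe *leaf = calloc(1, sizeof(*leaf));

        if (!root || !mid || !leaf)
            return 1;
        root->keep = 1;     /* e.g. the PHB PE is always kept */
        root->child = mid;
        mid->child = leaf;  /* mid and leaf are empty: both get pruned */
        root = pe_cleanup(root);
        free(root);
        return 0;
    }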
[PATCH AUTOSEL 5.3 66/87] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 6ba3cc2ef8abc..36c8a3652cf3a 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1211,6 +1211,10 @@ FTR_SECTION_ELSE ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE) 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP EXCEPTION_PROLOG_0 PACA_EXMC b machine_check_pSeries_0 -- 2.20.1
[PATCH AUTOSEL 5.3 63/87] selftests/powerpc: Retry on host facility unavailable
From: Gustavo Romero [ Upstream commit 6652bf6408895b09d31fd4128a1589a1a0672823 ] TM test tm-unavailable must take into account aborts due to the host aborting a transaction because of a facility unavailable exception, just like it already does for aborts on reschedules (TM_CAUSE_KVM_RESCHED). Reported-by: Desnes A. Nunes do Rosario Tested-by: Desnes A. Nunes do Rosario Signed-off-by: Gustavo Romero Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/1566341651-19747-1-git-send-email-grom...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- tools/testing/selftests/powerpc/tm/tm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/powerpc/tm/tm.h b/tools/testing/selftests/powerpc/tm/tm.h index 97f9f491c541a..c402464b038fc 100644 --- a/tools/testing/selftests/powerpc/tm/tm.h +++ b/tools/testing/selftests/powerpc/tm/tm.h @@ -55,7 +55,8 @@ static inline bool failure_is_unavailable(void) static inline bool failure_is_reschedule(void) { if ((failure_code() & TM_CAUSE_RESCHED) == TM_CAUSE_RESCHED || - (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED) + (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED || + (failure_code() & TM_CAUSE_KVM_FAC_UNAV) == TM_CAUSE_KVM_FAC_UNAV) return true; return false; -- 2.20.1
[PATCH AUTOSEL 5.3 51/87] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't been, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their drivers' handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 89623962c7275..1fbe541856f5e 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -793,6 +793,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe) result = PCI_ERS_RESULT_DISCONNECT; } + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -981,7 +985,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe) */ void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -1050,6 +1055,10 @@ void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true); eeh_set_channel_state(pe, pci_channel_io_perm_failure); -- 2.20.1
[PATCH AUTOSEL 5.3 49/87] powerpc/perf: fix imc allocation failure handling
From: Nicholas Piggin [ Upstream commit 10c4bd7cd28e77aeb8cfa65b23cb3c632ede2a49 ] The alloc_pages_node return value should be tested for failure before being passed to page_address. Tested-by: Anju T Sudhakar Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-3-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/perf/imc-pmu.c | 29 ++--- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index dea243185ea4b..cb50a9e1fd2d7 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -577,6 +577,7 @@ static int core_imc_mem_init(int cpu, int size) { int nid, rc = 0, core_id = (cpu / threads_per_core); struct imc_mem_info *mem_info; + struct page *page; /* * alloc_pages_node() will allocate memory for core in the @@ -587,11 +588,12 @@ static int core_imc_mem_init(int cpu, int size) mem_info->id = core_id; /* We need only vbase for core counters */ - mem_info->vbase = page_address(alloc_pages_node(nid, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!mem_info->vbase) + page = alloc_pages_node(nid, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + mem_info->vbase = page_address(page); /* Init the mutex */ core_imc_refc[core_id].id = core_id; @@ -849,15 +851,17 @@ static int thread_imc_mem_alloc(int cpu_id, int size) int nid = cpu_to_node(cpu_id); if (!local_mem) { + struct page *page; /* * This case could happen only once at start, since we dont * free the memory in cpu offline path. */ - local_mem = page_address(alloc_pages_node(nid, + page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(thread_imc_mem, cpu_id) = local_mem; } @@ -1095,11 +1099,14 @@ static int trace_imc_mem_alloc(int cpu_id, int size) int core_id = (cpu_id / threads_per_core); if (!local_mem) { - local_mem = page_address(alloc_pages_node(phys_id, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + struct page *page; + + page = alloc_pages_node(phys_id, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(trace_imc_mem, cpu_id) = local_mem; /* Initialise the counters for trace mode */ -- 2.20.1
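The general pattern being fixed: test the allocation itself before deriving the mapped address from it, because page_address() assumes a valid struct page. A standalone sketch with stand-ins for alloc_pages_node()/page_address() (the fake_* names are illustrative, not kernel APIs):

    #include <stdio.h>
    #include <stdlib.h>

    struct page { void *virt; };

    static struct page *fake_alloc_pages(int order)
    {
        struct page *p = malloc(sizeof(*p));

        if (p) {
            p->virt = calloc(1u << order, 4096);
            if (!p->virt) {
                free(p);
                return NULL;
            }
        }
        return p;
    }

    static void *fake_page_address(const struct page *p)
    {
        return p->virt;   /* dereferences p: must never be called with NULL */
    }

    static void *mem_init(int order)
    {
        struct page *page = fake_alloc_pages(order);

        if (!page)                          /* check the allocation first... */
            return NULL;
        return fake_page_address(page);     /* ...then derive the address */
    }

    int main(void)
    {
        void *vbase = mem_init(0);

        printf("vbase=%p\n", vbase);
        return 0;
    }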
[PATCH AUTOSEL 5.3 48/87] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There are a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index fe812bebdf5e6..b571285f6c149 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -9,6 +9,7 @@ #include #include #include +#include <linux/sched.h> #include #include #include @@ -207,7 +208,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -310,8 +315,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
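As a loose userspace analogy, with sched_yield() standing in for the kernel's cond_resched() and the loop bound illustrative, the fix amounts to yielding inside a loop whose iteration count the platform, not Linux, chooses:

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        long updates = 100000;  /* bound chosen by the "platform", not by us */

        for (long i = 0; i < updates; i++) {
            /* ... process one device tree update ... */
            sched_yield();      /* kernel equivalent: cond_resched(); */
        }
        puts("finished without monopolizing the cpu");
        return 0;
    }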