[PATCH v9 8/8] KVM: PPC: Ultravisor: Add PPC_UV config option
From: Anshuman Khandual

CONFIG_PPC_UV adds support for the ultravisor.

Signed-off-by: Anshuman Khandual
Signed-off-by: Bharata B Rao
Signed-off-by: Ram Pai
[ Update config help and commit message ]
Signed-off-by: Claudio Carvalho
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/Kconfig | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..044838794112 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -448,6 +448,23 @@ config PPC_TRANSACTIONAL_MEM
 	help
 	  Support user-mode Transactional Memory on POWERPC.
 
+config PPC_UV
+	bool "Ultravisor support"
+	depends on KVM_BOOK3S_HV_POSSIBLE
+	select ZONE_DEVICE
+	select DEV_PAGEMAP_OPS
+	select DEVICE_PRIVATE
+	select MEMORY_HOTPLUG
+	select MEMORY_HOTREMOVE
+	default n
+	help
+	  This option paravirtualizes the kernel to run on POWER platforms
+	  that support the Protected Execution Facility (PEF). On such
+	  platforms, the ultravisor firmware runs at a privilege level
+	  above the hypervisor.
+
+	  If unsure, say "N".
+
 config LD_HEAD_STUB_CATCH
 	bool "Reserve 256 bytes to cope with linker stubs in HEAD text" if EXPERT
 	depends on PPC64
-- 
2.21.0
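The stubs that make this a safe default are part of the same series: with
CONFIG_PPC_UV off, every secure-guest entry point collapses to an inline
no-op, so KVM builds and behaves exactly as before. A minimal sketch of
the pattern (the real declarations live in kvm_book3s_uvmem.h and
kvm_host.h, patches 2/8 and 6/8):

	#ifdef CONFIG_PPC_UV
	int kvmppc_uvmem_init(void);
	void kvmppc_uvmem_free(void);
	#else
	/* PPC_UV=n: secure-guest support is compiled out entirely. */
	static inline int kvmppc_uvmem_init(void)
	{
		return 0;
	}
	static inline void kvmppc_uvmem_free(void) { }
	#endif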
[PATCH v9 7/8] KVM: PPC: Support reset of secure guest
Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
This ioctl will be issued by QEMU during reset and includes the
following steps:

- Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
- Unpin the VPA pages so that they can be migrated back to the secure
  side when the guest becomes secure again. This is required because
  pinned pages can't be migrated.
- Reinitialize the guest's partition-scoped page tables. These are
  freed when the guest becomes secure (H_SVM_INIT_DONE).
- Release all device pages of the secure guest.

After these steps, the guest is ready to issue the UV_ESM call once
again to switch to secure mode.

Signed-off-by: Bharata B Rao
Signed-off-by: Sukadev Bhattiprolu
	[Implementation of uv_svm_terminate() and its call from guest
	 shutdown path]
Signed-off-by: Ram Pai
	[Unpinning of VPA pages]
---
 Documentation/virt/kvm/api.txt              | 19 ++++++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  7 ++
 arch/powerpc/include/asm/kvm_ppc.h          |  2 +
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h       |  5 ++
 arch/powerpc/kvm/book3s_hv.c                | 74 +++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 60 ++++++++++++++++
 arch/powerpc/kvm/powerpc.c                  | 12 ++++
 include/uapi/linux/kvm.h                    |  1 +
 9 files changed, 181 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 2d067767b617..8e7a02e547e9 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -4111,6 +4111,25 @@ Valid values for 'action':
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+4.121 KVM_PPC_SVM_OFF
+
+Capability: basic
+Architectures: powerpc
+Type: vm ioctl
+Parameters: none
+Returns: 0 on successful completion,
+Errors:
+  EINVAL: if ultravisor failed to terminate the secure guest
+  ENOMEM: if hypervisor failed to allocate new radix page tables for guest
+
+This ioctl is used to turn off the secure mode of the guest or transition
+the guest from secure mode to normal mode. This is invoked when the guest
+is reset. This has no effect if called for a normal guest.
+
+This ioctl issues an ultravisor call to terminate the secure guest,
+unpins the VPA pages, reinitializes guest's partition scoped page
+tables and releases all the device pages that are used to track the
+secure pages by hypervisor.
+
 5. The kvm_run structure

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index fc924ef00b91..6b8cc8edd0ab 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -13,6 +13,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 				    unsigned long page_shift);
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+				    struct kvm_memslots *slots);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -37,5 +39,10 @@ static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
 {
 	return H_UNSUPPORTED;
 }
+
+static inline void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+						  struct kvm_memslots *slots)
+{
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2484e6a8f5ca..e4093d067354 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -177,6 +177,7 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
 extern int kvmppc_switch_mmu_to_hpt(struct kvm *kvm);
 extern int kvmppc_switch_mmu_to_radix(struct kvm *kvm);
 extern void kvmppc_setup_partition_table(struct kvm *kvm);
+extern int kvmppc_reinit_partition_table(struct kvm *kvm);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 		struct kvm_create_spapr_tce_64 *args);
@@ -321,6 +322,7 @@ struct kvmppc_ops {
 			      int size);
 	int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
 			      int size);
+	int (*svm_off)(struct kvm *kvm);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index cf200d4ce703..3a27a0c0be05 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -30,5 +30,6 @@
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
 #define UV_PAGE_INVAL			0xF138
+#define UV_SVM_TERMINATE		0xF13C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
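Since KVM_PPC_SVM_OFF takes no parameters, the userspace side reduces to
a bare ioctl on the VM file descriptor. A hedged sketch of how a VMM such
as QEMU might invoke it on machine reset (vm_fd and the reset-hook name
are assumptions for illustration, not QEMU code):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Called on machine reset; harmless no-op for normal guests. */
	static int kvm_svm_off(int vm_fd)
	{
		/* EINVAL: UV couldn't terminate the secure guest;
		 * ENOMEM: new radix page tables couldn't be allocated. */
		return ioctl(vm_fd, KVM_PPC_SVM_OFF, 0);
	}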
[PATCH v9 6/8] KVM: PPC: Radix changes for secure guest
- After the guest becomes secure, when we handle a page fault of a page
  belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
- Whenever a page is unmapped on the HV side, inform UV via
  UV_PAGE_INVAL.
- Ensure that all routines that walk the secondary page tables of the
  guest don't do so in the case of a secure VM: for a secure guest, the
  active secondary page tables are in secure memory, and the secondary
  page tables in HV are freed when the guest becomes secure.

Signed-off-by: Bharata B Rao
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/kvm_host.h       | 12 ++++++++++++
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h     |  5 +++++
 arch/powerpc/kvm/book3s_64_mmu_radix.c    | 22 ++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv_uvmem.c        | 20 ++++++++++++++++++++
 5 files changed, 60 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 726d35eb3bfe..c0c6603ddd6b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -877,6 +877,8 @@ static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 #ifdef CONFIG_PPC_UV
 int kvmppc_uvmem_init(void);
 void kvmppc_uvmem_free(void);
+bool kvmppc_is_guest_secure(struct kvm *kvm);
+int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -884,6 +886,16 @@ static inline int kvmppc_uvmem_init(void)
 }
 
 static inline void kvmppc_uvmem_free(void) {}
+
+static inline bool kvmppc_is_guest_secure(struct kvm *kvm)
+{
+	return false;
+}
+
+static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa)
+{
+	return -EFAULT;
+}
 #endif /* CONFIG_PPC_UV */
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index 46b1ee381695..cf200d4ce703 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -29,5 +29,6 @@
 #define UV_UNREGISTER_MEM_SLOT		0xF124
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
+#define UV_PAGE_INVAL			0xF138
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index 719c0c3930b9..b333241bbe4c 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -57,4 +57,9 @@ static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
 	return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
 }
 
+static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
+{
+	return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
+}
+
 #endif	/* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..93ad34e63045 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -19,6 +19,8 @@
 #include
 #include
 #include
+#include <asm/ultravisor.h>
+#include <asm/kvm_book3s_uvmem.h>
 
 /*
  * Supported radix tree geometry.
@@ -915,6 +917,9 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	if (!(dsisr & DSISR_PRTABLE_FAULT))
 		gpa |= ea & 0xfff;
 
+	if (kvmppc_is_guest_secure(kvm))
+		return kvmppc_send_page_to_uv(kvm, gpa & PAGE_MASK);
+
 	/* Get the corresponding memslot */
 	memslot = gfn_to_memslot(kvm, gfn);
@@ -972,6 +977,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
 
+	if (kvmppc_is_guest_secure(kvm)) {
+		uv_page_inval(kvm->arch.lpid, gpa, PAGE_SIZE);
+		return 0;
+	}
+
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep))
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
@@ -989,6 +999,9 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	int ref = 0;
 	unsigned long old, *rmapp;
 
+	if (kvmppc_is_guest_secure(kvm))
+		return ref;
+
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
 		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1026,9 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned int shift;
 	int ref = 0;
 
+	if (kvmppc_is_guest_secure(kvm))
+		return ref;
+
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep))
 		ref = 1;
@@ -1030,6 +1046,9 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 	int ret = 0;
 	unsigned long old, *rmapp;
 
+
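The book3s_hv_uvmem.c half of this patch is not shown above. Going only
by the commit message (fault on a secure guest's page means push that
page to the UV via UV_PAGE_IN), kvmppc_send_page_to_uv() plausibly looks
something like the sketch below; treat the details as illustrative
rather than the literal v9 code:

	int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa)
	{
		unsigned long gfn = gpa >> PAGE_SHIFT;
		unsigned long pfn;

		pfn = gfn_to_pfn(kvm, gfn);
		if (is_error_noslot_pfn(pfn))
			return -EFAULT;

		/* Hand the backing page over to the ultravisor. */
		if (uv_page_in(kvm->arch.lpid, pfn << PAGE_SHIFT, gpa, 0,
			       PAGE_SHIFT)) {
			kvm_release_pfn_clean(pfn);
			return -EFAULT;
		}

		kvm_release_pfn_clean(pfn);
		return RESUME_GUEST;
	}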
[PATCH v9 5/8] KVM: PPC: Handle memory plug/unplug to secure VM
Register the new memslot with UV during plug and unregister the memslot
during unplug.

Signed-off-by: Bharata B Rao
Acked-by: Paul Mackerras
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h     |  5 +++++
 arch/powerpc/kvm/book3s_hv.c              | 21 +++++++++++++++++++++
 3 files changed, 27 insertions(+)

diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index c578d9b13a56..46b1ee381695 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -26,6 +26,7 @@
 #define UV_WRITE_PATE			0xF104
 #define UV_RETURN			0xF11C
 #define UV_REGISTER_MEM_SLOT		0xF120
+#define UV_UNREGISTER_MEM_SLOT		0xF124
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index 58ccf5e2d6bb..719c0c3930b9 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -52,4 +52,9 @@ static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
 			    size, flags, slotid);
 }
 
+static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
+{
+	return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
+}
+
 #endif	/* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3ba27fed3018..c5320cc0a534 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -74,6 +74,7 @@
 #include
 #include
 #include
+#include <asm/ultravisor.h>
 
 #include "book3s.h"
 
@@ -4517,6 +4518,26 @@ static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
 	if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
 	    ((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
 		kvmppc_radix_flush_memslot(kvm, old);
+	/*
+	 * If UV hasn't yet called H_SVM_INIT_START, don't register memslots.
+	 */
+	if (!kvm->arch.secure_guest)
+		return;
+
+	switch (change) {
+	case KVM_MR_CREATE:
+		uv_register_mem_slot(kvm->arch.lpid,
+				     new->base_gfn << PAGE_SHIFT,
+				     new->npages * PAGE_SIZE,
+				     0, new->id);
+		break;
+	case KVM_MR_DELETE:
+		uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+		break;
+	default:
+		/* TODO: Handle KVM_MR_MOVE */
+		break;
+	}
 }
 
 /*
-- 
2.21.0
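The switch leaves KVM_MR_MOVE as a TODO. Purely as an illustration (not
part of the patch, and untested), a move could presumably be handled as
an unregister of the old slot followed by a re-register of the new one:

	case KVM_MR_MOVE:
		/* Hypothetical: not in this patch. */
		uv_unregister_mem_slot(kvm->arch.lpid, old->id);
		uv_register_mem_slot(kvm->arch.lpid,
				     new->base_gfn << PAGE_SHIFT,
				     new->npages * PAGE_SIZE,
				     0, new->id);
		break;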
[PATCH v9 4/8] KVM: PPC: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls
H_SVM_INIT_START: Initiate securing a VM
H_SVM_INIT_DONE: Conclude securing a VM

As part of H_SVM_INIT_START, register all existing memslots with the UV.
H_SVM_INIT_DONE call by UV informs HV that transition of the guest to
secure mode is complete.

These two states (transition to secure mode STARTED and transition to
secure mode COMPLETED) are recorded in kvm->arch.secure_guest. Setting
these states will cause the assembly code that enters the guest to call
the UV_RETURN ucall instead of trying to enter the guest directly.

Signed-off-by: Bharata B Rao
Acked-by: Paul Mackerras
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/hvcall.h           |  2 ++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 12 ++++++++
 arch/powerpc/include/asm/kvm_host.h         |  4 +++
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h       |  7 +++++
 arch/powerpc/kvm/book3s_hv.c                |  7 +++++
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 34 +++++++++++++++++++++
 7 files changed, 67 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 4e98dd992bd1..13bd870609c3 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -348,6 +348,8 @@
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN		0xEF00
 #define H_SVM_PAGE_OUT		0xEF04
+#define H_SVM_INIT_START	0xEF08
+#define H_SVM_INIT_DONE		0xEF0C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR	1
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 9603c2b48d67..fc924ef00b91 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -11,6 +11,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 				    unsigned long gra,
 				    unsigned long flags,
 				    unsigned long page_shift);
+unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
+unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -25,5 +27,15 @@ kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
 {
 	return H_UNSUPPORTED;
 }
+
+static inline unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+	return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
+{
+	return H_UNSUPPORTED;
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index a2e7502346a3..726d35eb3bfe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -281,6 +281,10 @@ struct kvm_hpt_info {
 
 struct kvm_resize_hpt;
 
+/* Flag values for kvm_arch.secure_guest */
+#define KVMPPC_SECURE_INIT_START 0x1 /* H_SVM_INIT_START has been called */
+#define KVMPPC_SECURE_INIT_DONE  0x2 /* H_SVM_INIT_DONE completed */
+
 struct kvm_arch {
 	unsigned int lpid;
 	unsigned int smt_mode;		/* # vcpus per virtual core */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index 1cd1f595fd81..c578d9b13a56 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,6 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE			0xF104
 #define UV_RETURN			0xF11C
+#define UV_REGISTER_MEM_SLOT		0xF120
 #define UV_PAGE_IN			0xF128
 #define UV_PAGE_OUT			0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index 0fc4a974b2e8..58ccf5e2d6bb 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -45,4 +45,11 @@ static inline int uv_page_out(u64 lpid, u64 dst_ra, u64 src_gpa, u64 flags,
 			    page_shift);
 }
 
+static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
+				       u64 flags, u64 slotid)
+{
+	return ucall_norets(UV_REGISTER_MEM_SLOT, lpid, start_gpa,
+			    size, flags, slotid);
+}
+
 #endif	/* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ef532cce85f9..3ba27fed3018 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1089,6 +1089,13 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 					kvmppc_get_gpr(vcpu, 5),
 					kvmppc_get_gpr(vcpu, 6));
 		break;
+	case H_SVM_INIT_START:
+		ret = kvmppc_h_svm_init_start(vcpu->kvm);
+		break;
+
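The book3s_hv_uvmem.c hunk of this patch is truncated above. From the
commit message alone ("register all existing memslots with the UV", then
record the state), kvmppc_h_svm_init_start() has to look roughly like
the sketch below; the exact locking and error handling in v9 may differ:

	unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
	{
		struct kvm_memslots *slots;
		struct kvm_memory_slot *memslot;
		unsigned long ret = H_SUCCESS;
		int srcu_idx;

		srcu_idx = srcu_read_lock(&kvm->srcu);
		slots = kvm_memslots(kvm);
		kvm_for_each_memslot(memslot, slots) {
			if (uv_register_mem_slot(kvm->arch.lpid,
						 memslot->base_gfn << PAGE_SHIFT,
						 memslot->npages * PAGE_SIZE,
						 0, memslot->id)) {
				ret = H_PARAMETER;
				goto out;
			}
		}
		/* From here on, guest entry goes through UV_RETURN. */
		kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;
	out:
		srcu_read_unlock(&kvm->srcu, srcu_idx);
		return ret;
	}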
[PATCH v9 3/8] KVM: PPC: Shared pages support for secure guests
A secure guest will share some of its pages with the hypervisor (e.g.
virtio bounce buffers). Support sharing of pages between the hypervisor
and the ultravisor.

A shared page is reachable via both HV and UV side page tables. Once a
secure page is converted to a shared page, the device page that
represents the secure page is unmapped from the HV side page tables.

Signed-off-by: Bharata B Rao
---
 arch/powerpc/include/asm/hvcall.h  |  3 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c | 86 ++++++++++++++++++++++++++++--
 2 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 2595d0144958..4e98dd992bd1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,9 @@
 #define H_TLB_INVALIDATE	0xF808
 #define H_COPY_TOFROM_GUEST	0xF80C
 
+/* Flags for H_SVM_PAGE_IN */
+#define H_PAGE_IN_SHARED	0x1
+
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN		0xEF00
 #define H_SVM_PAGE_OUT		0xEF04
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 312f0fedde0b..5e5b5a3e9eec 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -19,7 +19,10 @@
  * available in the platform for running secure guests is hotplugged.
  * Whenever a page belonging to the guest becomes secure, a page from this
  * private device memory is used to represent and track that secure page
- * on the HV side.
+ * on the HV side. Some pages (like virtio buffers, VPA pages etc) are
+ * shared between UV and HV. However such pages aren't represented by
+ * device private memory and mappings to shared memory exist in both
+ * UV and HV page tables.
  *
  * For each page that gets moved into secure memory, a device PFN is used
  * on the HV side and migration PTE corresponding to that PFN would be
@@ -80,6 +83,7 @@ struct kvmppc_uvmem_page_pvt {
 	unsigned long *rmap;
 	struct kvm *kvm;
 	unsigned long gpa;
+	bool skip_page_out;
 };
 
 /*
@@ -190,8 +194,70 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
 	return ret;
 }
 
+/*
+ * Shares the page with HV, thus making it a normal page.
+ *
+ * - If the page is already secure, then provision a new page and share
+ * - If the page is a normal page, share the existing page
+ *
+ * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
+ * to unmap the device page from QEMU's page tables.
+ */
+static unsigned long
+kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+{
+
+	int ret = H_PARAMETER;
+	struct page *uvmem_page;
+	struct kvmppc_uvmem_page_pvt *pvt;
+	unsigned long pfn;
+	unsigned long *rmap;
+	struct kvm_memory_slot *slot;
+	unsigned long gfn = gpa >> page_shift;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&kvm->srcu);
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!slot)
+		goto out;
+
+	rmap = &slot->arch.rmap[gfn - slot->base_gfn];
+	mutex_lock(&kvm->arch.uvmem_lock);
+	if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
+		uvmem_page = pfn_to_page(*rmap & ~KVMPPC_RMAP_UVMEM_PFN);
+		pvt = uvmem_page->zone_device_data;
+		pvt->skip_page_out = true;
+	}
+
+retry:
+	mutex_unlock(&kvm->arch.uvmem_lock);
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (is_error_noslot_pfn(pfn))
+		goto out;
+
+	mutex_lock(&kvm->arch.uvmem_lock);
+	if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
+		uvmem_page = pfn_to_page(*rmap & ~KVMPPC_RMAP_UVMEM_PFN);
+		pvt = uvmem_page->zone_device_data;
+		pvt->skip_page_out = true;
+		kvm_release_pfn_clean(pfn);
+		goto retry;
+	}
+
+	if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, page_shift))
+		ret = H_SUCCESS;
+	kvm_release_pfn_clean(pfn);
+	mutex_unlock(&kvm->arch.uvmem_lock);
+out:
+	srcu_read_unlock(&kvm->srcu, srcu_idx);
+	return ret;
+}
+
 /*
  * H_SVM_PAGE_IN: Move page from normal memory to secure memory.
+ *
+ * H_PAGE_IN_SHARED flag makes the page shared, which means that the
+ * same memory is visible from both UV and HV.
  */
 unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
@@ -208,9 +274,12 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 	if (page_shift != PAGE_SHIFT)
 		return H_P3;
 
-	if (flags)
+	if (flags & ~H_PAGE_IN_SHARED)
 		return H_P2;
 
+	if (flags & H_PAGE_IN_SHARED)
+		return kvmppc_share_page(kvm, gpa, page_shift);
+
 	ret = H_PARAMETER;
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	down_read(&kvm->mm->mmap_sem);
@@ -292,8 +361,17 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start,
 	pvt = spage->zone_device_data;
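The kvmppc_svm_page_out() hunk is cut off right after pvt is fetched,
but the purpose of the new skip_page_out flag follows from the code
above: once a page has been shared, its secure-side copy is stale, so
there is nothing to pull back from the UV on page-out. The consuming
side is presumably along these lines (a sketch, not the literal hunk):

	/* In the page-out path, after pvt = spage->zone_device_data: */
	if (!pvt->skip_page_out)
		ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
				  pvt->gpa, 0, page_shift);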
[PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory
Manage migration of pages between normal and secure memory of a secure
guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to secure page
H_SVM_PAGE_OUT: Move the content of a secure page to normal page

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from this
private device memory is used to represent and track that secure page
on the HV side. The movement of pages between normal and secure memory
is done via migrate_vma_pages() using UV_PAGE_IN and UV_PAGE_OUT ucalls.

Signed-off-by: Bharata B Rao
---
 arch/powerpc/include/asm/hvcall.h           |   4 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  29 ++
 arch/powerpc/include/asm/kvm_host.h         |  13 +
 arch/powerpc/include/asm/ultravisor-api.h   |   2 +
 arch/powerpc/include/asm/ultravisor.h       |  14 +
 arch/powerpc/kvm/Makefile                   |   3 +
 arch/powerpc/kvm/book3s_hv.c                |  20 +
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 481 ++++++++++++++++++++
 8 files changed, 566 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 2023e327..2595d0144958 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,10 @@
 #define H_TLB_INVALIDATE	0xF808
 #define H_COPY_TOFROM_GUEST	0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN		0xEF00
+#define H_SVM_PAGE_OUT		0xEF04
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR	1
 #define H_SET_MODE_RESOURCE_SET_DAWR	2
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
new file mode 100644
index ..9603c2b48d67
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __POWERPC_KVM_PPC_HMM_H__
+#define __POWERPC_KVM_PPC_HMM_H__
+
+#ifdef CONFIG_PPC_UV
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+				   unsigned long gra,
+				   unsigned long flags,
+				   unsigned long page_shift);
+unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+				    unsigned long gra,
+				    unsigned long flags,
+				    unsigned long page_shift);
+#else
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
+		     unsigned long flags, unsigned long page_shift)
+{
+	return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
+		      unsigned long flags, unsigned long page_shift)
+{
+	return H_UNSUPPORTED;
+}
+#endif /* CONFIG_PPC_UV */
+#endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 81cd221ccc04..a2e7502346a3 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -336,6 +336,7 @@ struct kvm_arch {
 #endif
 	struct kvmppc_ops *kvm_ops;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	struct mutex uvmem_lock;
 	struct mutex mmu_setup_lock;	/* nests inside vcpu mutexes */
 	u64 l1_ptcr;
 	int max_nested_lpid;
@@ -869,4 +870,16 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
+#ifdef CONFIG_PPC_UV
+int kvmppc_uvmem_init(void);
+void kvmppc_uvmem_free(void);
+#else
+static inline int kvmppc_uvmem_init(void)
+{
+	return 0;
+}
+
+static inline void kvmppc_uvmem_free(void) {}
+#endif /* CONFIG_PPC_UV */
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h b/arch/powerpc/include/asm/ultravisor-api.h
index 6a0f9c74f959..1cd1f595fd81 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,5 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE			0xF104
 #define UV_RETURN			0xF11C
+#define UV_PAGE_IN			0xF128
+#define UV_PAGE_OUT			0xF12C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h b/arch/powerpc/include/asm/ultravisor.h
index d7aa97aa7834..0fc4a974b2e8 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -31,4 +31,18 @@ static inline int uv_register_pate(u64 lpid, u64 dw0, u64 dw1)
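The ultravisor.h hunk stops at the @@ header, but the two wrappers it
adds can be inferred from the uv_page_out() signature quoted in patch
4/8 above and from the ucall_norets() pattern used by every other
wrapper in this file. A sketch:

	static inline int uv_page_in(u64 lpid, u64 src_ra, u64 dst_gpa,
				     u64 flags, u64 page_shift)
	{
		return ucall_norets(UV_PAGE_IN, lpid, src_ra, dst_gpa,
				    flags, page_shift);
	}

	static inline int uv_page_out(u64 lpid, u64 dst_ra, u64 src_gpa,
				      u64 flags, u64 page_shift)
	{
		return ucall_norets(UV_PAGE_OUT, lpid, dst_ra, src_gpa,
				    flags, page_shift);
	}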
[PATCH v9 1/8] KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot
From: Suraj Jitindar Singh

The rmap array in the guest memslot is an array of size number of guest
pages, allocated at memslot creation time. Each rmap entry in this array
is used to store information about the guest page to which it
corresponds. For example, for an HPT guest it is used to store a lock
bit, RC bits, a present bit and the index of an HPT entry in the guest
HPT which maps this page. For a radix guest which is running nested
guests it is used to store a pointer to a linked list of nested rmap
entries which store the nested guest physical address which maps this
guest address and for which there is a pte in the shadow page table.

As there are currently two uses for the rmap array, and the potential
for this to expand to more in the future, define a type field (being
the top 8 bits of the rmap entry) to be used to define the type of the
rmap entry which is currently present, and define two values for this
field for the two current uses of the rmap array.

Since the nested case uses the rmap entry to store a pointer, define
this type as having the two high bits set, as is expected for a
pointer. Define the HPT entry type as having bit 56 set (bit 7 in IBM
bit ordering).

Signed-off-by: Suraj Jitindar Singh
Signed-off-by: Paul Mackerras
Signed-off-by: Bharata B Rao
	[Added rmap type KVMPPC_RMAP_UVMEM_PFN]
Reviewed-by: Sukadev Bhattiprolu
---
 arch/powerpc/include/asm/kvm_host.h | 28 ++++++++++++++++++++++-----
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 4bb552d639b8..81cd221ccc04 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -232,11 +232,31 @@ struct revmap_entry {
 };
 
 /*
- * We use the top bit of each memslot->arch.rmap entry as a lock bit,
- * and bit 32 as a present flag. The bottom 32 bits are the
- * index in the guest HPT of a HPTE that points to the page.
+ * The rmap array of size number of guest pages is allocated for each memslot.
+ * This array is used to store usage specific information about the guest page.
+ * Below are the encodings of the various possible usage types.
  */
-#define KVMPPC_RMAP_LOCK_BIT	63
+/* Free bits which can be used to define a new usage */
+#define KVMPPC_RMAP_TYPE_MASK	0xff00000000000000
+#define KVMPPC_RMAP_NESTED	0xc000000000000000	/* Nested rmap array */
+#define KVMPPC_RMAP_HPT		0x0100000000000000	/* HPT guest */
+#define KVMPPC_RMAP_UVMEM_PFN	0x0200000000000000	/* Secure GPA */
+
+static inline unsigned long kvmppc_rmap_type(unsigned long *rmap)
+{
+	return (*rmap & KVMPPC_RMAP_TYPE_MASK);
+}
+
+/*
+ * rmap usage definition for a hash page table (hpt) guest:
+ * 0x0000080000000000	Lock bit
+ * 0x0000018000000000	RC bits
+ * 0x0000000100000000	Present bit
+ * 0x00000000ffffffff	HPT index bits
+ * The bottom 32 bits are the index in the guest HPT of a HPTE that points to
+ * the page.
+ */
+#define KVMPPC_RMAP_LOCK_BIT	43
 #define KVMPPC_RMAP_RC_SHIFT	32
 #define KVMPPC_RMAP_REFERENCED	(HPTE_R_R << KVMPPC_RMAP_RC_SHIFT)
 #define KVMPPC_RMAP_PRESENT	0x100000000ul
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..7186c65c61c9 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -99,7 +99,7 @@ void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 	} else {
 		rev->forw = rev->back = pte_index;
 		*rmap = (*rmap & ~KVMPPC_RMAP_INDEX) |
-			pte_index | KVMPPC_RMAP_PRESENT;
+			pte_index | KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_HPT;
 	}
 	unlock_rmap(rmap);
 }
-- 
2.21.0
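To make the encoding concrete: storing a secure-page device PFN in an
rmap entry and recovering it later reduces to masking against the type
bits. A sketch using the new helper (the actual store/lookup sites
arrive in later patches of the secure-guest series):

	/* Tag this gfn's rmap entry as holding a device PFN. */
	*rmap = uvmem_pfn | KVMPPC_RMAP_UVMEM_PFN;

	/* ...and on lookup, check the type before trusting the payload: */
	if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN)
		uvmem_pfn = *rmap & ~KVMPPC_RMAP_UVMEM_PFN;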
[PATCH v9 0/8] KVM: PPC: Driver to manage pages of secure guest
[The main change in this version is the introduction of new locking
 to prevent concurrent page-in and page-out calls. More details about
 this are present in patch 2/8]

Hi,

A pseries guest can be run as a secure guest on Ultravisor-enabled
POWER platforms. On such platforms, this driver will be used to manage
the movement of guest pages between the normal memory managed by the
hypervisor (HV) and secure memory managed by the Ultravisor (UV).

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from this
private device memory is used to represent and track that secure page
on the HV side. The movement of pages between normal and secure memory
is done via migrate_vma_pages(). The reverse movement is driven via
pagemap_ops.migrate_to_ram().

The page-in or page-out requests from UV will come to HV as hcalls and
HV will call back into UV via uvcalls to satisfy these page requests.

These patches are against hmm.git
(https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm)
plus Claudio Carvalho's base ultravisor enablement patches that are
present in Michael Ellerman's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=topic/ppc-kvm)

These patches along with Claudio's above patches are required to run
secure pseries guests on KVM. This patchset is based on hmm.git because
hmm.git has the migrate_vma cleanup and not-device memremap_pages
patchsets that are required by this patchset.

Changes in v9
=============
- Prevent concurrent page-in and page-out calls.
- Ensure device PFNs are allocated for zero-pages that are sent to UV.
- Failure to migrate a page during page-in will now return error via
  hcall.
- Address review comments by Suka
- Misc cleanups

v8: https://lore.kernel.org/linux-mm/20190910082946.7849-2-bhar...@linux.ibm.com/T/

Anshuman Khandual (1):
  KVM: PPC: Ultravisor: Add PPC_UV config option

Bharata B Rao (6):
  KVM: PPC: Move pages between normal and secure memory
  KVM: PPC: Shared pages support for secure guests
  KVM: PPC: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls
  KVM: PPC: Handle memory plug/unplug to secure VM
  KVM: PPC: Radix changes for secure guest
  KVM: PPC: Support reset of secure guest

Suraj Jitindar Singh (1):
  KVM: PPC: Book3S HV: Define usage types for rmap array in guest
    memslot

 Documentation/virt/kvm/api.txt              |  19 +
 arch/powerpc/Kconfig                        |  17 +
 arch/powerpc/include/asm/hvcall.h           |   9 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  48 ++
 arch/powerpc/include/asm/kvm_host.h         |  57 +-
 arch/powerpc/include/asm/kvm_ppc.h          |   2 +
 arch/powerpc/include/asm/ultravisor-api.h   |   6 +
 arch/powerpc/include/asm/ultravisor.h       |  36 ++
 arch/powerpc/kvm/Makefile                   |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c      |  22 +
 arch/powerpc/kvm/book3s_hv.c                | 122 ++++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c         |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 673 ++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                  |  12 +
 include/uapi/linux/kvm.h                    |   1 +
 15 files changed, 1024 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

-- 
2.21.0
RE: [v3,3/3] Documentation: dt: binding: fsl: Add 'fsl,ippdexpcr-alt-addr' property
> > > > > > > > > The 'fsl,ippdexpcr-alt-addr' property is used to handle an
> > > > > > > > > erratum A-008646 on LS1021A
> > > > > > > > >
> > > > > > > > > Signed-off-by: Biwen Li
> > > > > > > > > ---
> > > > > > > > > Change in v3:
> > > > > > > > > 	- rename property name
> > > > > > > > > 	  fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr
> > > > > > > > >
> > > > > > > > > Change in v2:
> > > > > > > > > 	- update desc of the property 'fsl,rcpm-scfg'
> > > > > > > > >
> > > > > > > > >  Documentation/devicetree/bindings/soc/fsl/rcpm.txt | 14 ++++++++++++++
> > > > > > > > >  1 file changed, 14 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > index 5a33619d881d..157dcf6da17c 100644
> > > > > > > > > --- a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > +++ b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
> > > > > > > > > @@ -34,6 +34,11 @@ Chassis Version		Example Chips
> > > > > > > > >  Optional properties:
> > > > > > > > >   - little-endian : RCPM register block is Little Endian. Without it
> > > > > > > > >     RCPM will be Big Endian (default case).
> > > > > > > > > + - fsl,ippdexpcr-alt-addr : Must add the property for SoC LS1021A,
> > > > > > > >
> > > > > > > > You probably should mention this is related to a hardware issue on
> > > > > > > > LS1021a and only needed on LS1021a.
> > > > > > >
> > > > > > > Okay, got it, thanks, I will add this in v4.
> > > > > > >
> > > > > > > > > +   Must include n + 1 entries (n = #fsl,rcpm-wakeup-cells, such as:
> > > > > > > > > +   #fsl,rcpm-wakeup-cells equal to 2, then must include 2 + 1 entries).
> > > > > > > >
> > > > > > > > #fsl,rcpm-wakeup-cells is the number of IPPDEXPCR registers on an SoC.
> > > > > > > > However you are defining an offset to scfg registers here. Why are
> > > > > > > > these two related? The length here should actually be related to the
> > > > > > > > #address-cells of the soc/. But since this is only needed for LS1021,
> > > > > > > > you can just make it 3.
> > > > > > >
> > > > > > > I need to set the value of the IPPDEXPCR registers from the ftm_alarm0
> > > > > > > device node (fsl,rcpm-wakeup = <&rcpm 0x0 0x2000>; 0x0 is a value for
> > > > > > > IPPDEXPCR0, 0x2000 is a value for IPPDEXPCR1). But because of the
> > > > > > > hardware issue on LS1021A, I need to store the value of the IPPDEXPCR
> > > > > > > registers to an alt address. So I define an offset to the scfg
> > > > > > > registers; the RCPM driver then gets an absolute address from the
> > > > > > > offset and writes the value of the IPPDEXPCR registers to these
> > > > > > > absolute addresses (backing up the value of the IPPDEXPCR registers).
> > > > > >
> > > > > > I understand what you are trying to do. The problem is that the new
> > > > > > fsl,ippdexpcr-alt-addr property contains a phandle and an offset. The
> > > > > > size of it shouldn't be related to #fsl,rcpm-wakeup-cells.
> > > > >
> > > > > Maybe like this: fsl,ippdexpcr-alt-addr = <&scfg 0x51c>; /* SCFG_SPARECR8 */
> > > >
> > > > No. The #address-cells for the soc/ is 2, so the offset to scfg should
> > > > be 0x0 0x51c. The total size should be 3, but it shouldn't be coming
> > > > from #fsl,rcpm-wakeup-cells like you mentioned in the binding.
> > >
> > > Oh, I got it. You want fsl,ippdexpcr-alt-addr to be relative to
> > > #address-cells instead of #fsl,rcpm-wakeup-cells.
> >
> > Yes.
> >
> > Regards,
> > Leo

I got an example from drivers/pci/controller/dwc/pci-layerscape.c and
arch/arm/boot/dts/ls1021a.dtsi as follows:

	fsl,pcie-scfg = <&scfg 0>;	/* 0 is an index */

In my fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x51c>, it means that 0x0 is
an alt offset address for IPPDEXPCR0 and 0x51c is an alt offset address
for IPPDEXPCR1, instead of 0x0 and 0x51c composing one alt address for
SCFG_SPARECR8. Maybe I need to write it as:

	fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x0 0x0 0x51c>;

where the first two cells (0x0 0x0) compose an alt address for
IPPDEXPCR0 and the last two (0x0 0x51c) compose an alt address for
IPPDEXPCR1.

Best Regards,
Biwen Li

> > > > > > > > > +   The first entry must be a link to the SCFG device node.
> > > > > > > > > +   The non-first entry must be offset of registers of SCFG.
> > > > > > > > >
> > > > > > > > > Example:
> > > > > > > > > The RCPM node for T4240:
> > > > > > > > > @@ -43,6 +48,15 @@ The RCPM node for T4240:
> > > > > > > > > 		#fsl,rcpm-wakeup-cells = <2>;
> > > > > > > > > 	};
> > > > > > > > >
> > > > > > > > > +The RCPM node for LS1021A:
> > > > > > > > > +	rcpm: rcpm@1ee2140 {
> > > > > > > > > +
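Pulling the thread together, the two candidate shapes for the LS1021A
example node are sketched below. This is a reconstruction of the
discussion, not an agreed binding -- the phandle target and the cell
layout are exactly what is still under debate (node contents abridged):

	/* Leo's suggestion: one alternate address, offset in 2 cells */
	rcpm: rcpm@1ee2140 {
		...
		#fsl,rcpm-wakeup-cells = <2>;
		fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x51c>; /* SCFG_SPARECR8 */
	};

	/* Biwen's counter-proposal: one 2-cell alt address per IPPDEXPCR */
	fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x0 0x0 0x51c>;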
[PATCH v4 5/5] Powerpc/Watchpoint: Support for 8xx in ptrace-hwbreak.c selftest
On the 8xx, signals are generated after executing the instruction, so
there is no need to manually single-step on 8xx.

Signed-off-by: Ravi Bangoria
---
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 26 ++++++++++++++-----
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index 654131591fca..58505277346d 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -22,6 +22,11 @@
 #include
 #include "ptrace.h"
 
+#define SPRN_PVR	0x11F
+#define PVR_8xx		0x00500000
+
+bool is_8xx;
+
 /*
  * Use volatile on all global var so that compiler doesn't
  * optimise their load/stores. Otherwise selftest can fail.
@@ -205,13 +210,15 @@ static void check_success(pid_t child_pid, const char *name, const char *type,
 
 	printf("%s, %s, len: %d: Ok\n", name, type, len);
 
-	/*
-	 * For ptrace registered watchpoint, signal is generated
-	 * before executing load/store. Singlestep the instruction
-	 * and then continue the test.
-	 */
-	ptrace(PTRACE_SINGLESTEP, child_pid, NULL, 0);
-	wait(NULL);
+	if (!is_8xx) {
+		/*
+		 * For ptrace registered watchpoint, signal is generated
+		 * before executing load/store. Singlestep the instruction
+		 * and then continue the test.
+		 */
+		ptrace(PTRACE_SINGLESTEP, child_pid, NULL, 0);
+		wait(NULL);
+	}
 }
 
 static void ptrace_set_debugreg(pid_t child_pid, unsigned long wp_addr)
@@ -489,5 +496,10 @@ static int ptrace_hwbreak(void)
 
 int main(int argc, char **argv, char **envp)
 {
+	int pvr = 0;
+	asm __volatile__ ("mfspr %0,%1" : "=r"(pvr) : "i"(SPRN_PVR));
+	if (pvr == PVR_8xx)
+		is_8xx = true;
+
 	return test_harness(ptrace_hwbreak, "ptrace-hwbreak");
 }
-- 
2.21.0
[PATCH v4 4/5] Powerpc/Watchpoint: Add dar outside test in perf-hwbreak.c selftest
So far, we used to ignore the exception if the DAR points outside of the
user-specified range. But now we ignore it only if the actual load/store
range does not overlap with the user-specified range. Include selftests
for the same:

  # ./tools/testing/selftests/powerpc/ptrace/perf-hwbreak
  ...
  TESTED: No overlap
  TESTED: Partial overlap
  TESTED: Partial overlap
  TESTED: No overlap
  TESTED: Full overlap
  success: perf_hwbreak

Signed-off-by: Ravi Bangoria
---
 .../selftests/powerpc/ptrace/perf-hwbreak.c   | 111 +++++++++++++++++-
 1 file changed, 110 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
index 200337daec42..389c545675c6 100644
--- a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
@@ -148,6 +148,113 @@ static int runtestsingle(int readwriteflag, int exclude_user, int arraytest)
 	return 0;
 }
 
+static int runtest_dar_outside(void)
+{
+	volatile char target[8];
+	volatile __u16 temp16;
+	volatile __u64 temp64;
+	struct perf_event_attr attr;
+	int break_fd;
+	unsigned long long breaks;
+	int fail = 0;
+	size_t res;
+
+	/* setup counters */
+	memset(&attr, 0, sizeof(attr));
+	attr.disabled = 1;
+	attr.type = PERF_TYPE_BREAKPOINT;
+	attr.exclude_kernel = 1;
+	attr.exclude_hv = 1;
+	attr.exclude_guest = 1;
+	attr.bp_type = HW_BREAKPOINT_RW;
+	/* watch middle half of target array */
+	attr.bp_addr = (__u64)(target + 2);
+	attr.bp_len = 4;
+	break_fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+	if (break_fd < 0) {
+		perror("sys_perf_event_open");
+		exit(1);
+	}
+
+	/* Shouldn't hit. */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)target);
+	*((__u16 *)target) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 0) {
+		printf("TESTED: No overlap\n");
+	} else {
+		printf("FAILED: No overlap: %lld != 0\n", breaks);
+		fail = 1;
+	}
+
+	/* Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)(target + 1));
+	*((__u16 *)(target + 1)) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 2) {
+		printf("TESTED: Partial overlap\n");
+	} else {
+		printf("FAILED: Partial overlap: %lld != 2\n", breaks);
+		fail = 1;
+	}
+
+	/* Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)(target + 5));
+	*((__u16 *)(target + 5)) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 2) {
+		printf("TESTED: Partial overlap\n");
+	} else {
+		printf("FAILED: Partial overlap: %lld != 2\n", breaks);
+		fail = 1;
+	}
+
+	/* Shouldn't Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp16 = *((__u16 *)(target + 6));
+	*((__u16 *)(target + 6)) = temp16;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 0) {
+		printf("TESTED: No overlap\n");
+	} else {
+		printf("FAILED: No overlap: %lld != 0\n", breaks);
+		fail = 1;
+	}
+
+	/* Hit */
+	ioctl(break_fd, PERF_EVENT_IOC_RESET);
+	ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+	temp64 = *((__u64 *)target);
+	*((__u64 *)target) = temp64;
+	ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+	res = read(break_fd, &breaks, sizeof(unsigned long long));
+	assert(res == sizeof(unsigned long long));
+	if (breaks == 2) {
+		printf("TESTED: Full overlap\n");
+	} else {
+		printf("FAILED: Full overlap: %lld != 2\n", breaks);
+		fail = 1;
+	}
+
+	close(break_fd);
+	return fail;
+}
+
 static int runtest(void)
 {
 	int rwflag;
@@ -172,7 +279,9 @@ static int runtest(void)
 			return ret;
 		}
 	}
-	return 0;
+
+	ret = runtest_dar_outside();
+	return ret;
 }
-- 
2.21.0
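All five expected counts follow from one closed-interval overlap check
between the watched bytes (target+2 .. target+5) and the accessed bytes;
this mirrors dar_range_overlaps() from patch 2/5:

	/* Byte ranges [a, a+alen) and [b, b+blen) overlap iff: */
	static bool ranges_overlap(unsigned long a, int alen,
				   unsigned long b, int blen)
	{
		return (a <= b + blen - 1) && (b <= a + alen - 1);
	}

	/* e.g. the 2-byte access at target+1 touches bytes 1..2, which
	 * overlaps 2..5 -> counted twice (one load + one store); the
	 * access at target+6 touches 6..7, no overlap -> not counted. */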
[PATCH v4 3/5] Powerpc/Watchpoint: Rewrite ptrace-hwbreak.c selftest
The ptrace-hwbreak.c selftest is logically broken. On powerpc, when a
watchpoint is created with ptrace, signals are generated before
executing the instruction, and the user has to manually single-step the
instruction with the watchpoint disabled -- which the selftest never
does, so it keeps getting the signal at the same instruction. If we fix
that, the selftest fails because the logical connection between the
tracer (parent) and the tracee (child) is also broken. Rewrite the
selftest and add new tests for unaligned access.

With patch:
  $ ./tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak
  test: ptrace-hwbreak
  tags: git_version:powerpc-5.3-4-224-g218b868240c7-dirty
  PTRACE_SET_DEBUGREG, WO, len: 1: Ok
  PTRACE_SET_DEBUGREG, WO, len: 2: Ok
  PTRACE_SET_DEBUGREG, WO, len: 4: Ok
  PTRACE_SET_DEBUGREG, WO, len: 8: Ok
  PTRACE_SET_DEBUGREG, RO, len: 1: Ok
  PTRACE_SET_DEBUGREG, RO, len: 2: Ok
  PTRACE_SET_DEBUGREG, RO, len: 4: Ok
  PTRACE_SET_DEBUGREG, RO, len: 8: Ok
  PTRACE_SET_DEBUGREG, RW, len: 1: Ok
  PTRACE_SET_DEBUGREG, RW, len: 2: Ok
  PTRACE_SET_DEBUGREG, RW, len: 4: Ok
  PTRACE_SET_DEBUGREG, RW, len: 8: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, WO, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RO, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RW, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, WO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, WO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, RO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, DAR OUTSIDE, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, DAWR_MAX_LEN, RW, len: 512: Ok
  success: ptrace-hwbreak

Signed-off-by: Ravi Bangoria
---
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 571 +++++++++++-------
 1 file changed, 361 insertions(+), 210 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index 3066d310f32b..654131591fca 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -22,318 +22,469 @@
 #include
 #include "ptrace.h"
 
-/* Breakpoint access modes */
-enum {
-	BP_X = 1,
-	BP_RW = 2,
-	BP_W = 4,
-};
-
-static pid_t child_pid;
-static struct ppc_debug_info dbginfo;
-
-static void get_dbginfo(void)
-{
-	int ret;
-
-	ret = ptrace(PPC_PTRACE_GETHWDBGINFO, child_pid, NULL, &dbginfo);
-	if (ret) {
-		perror("Can't get breakpoint info\n");
-		exit(-1);
-	}
-}
-
-static bool hwbreak_present(void)
-{
-	return (dbginfo.num_data_bps != 0);
-}
+/*
+ * Use volatile on all global var so that compiler doesn't
+ * optimise their load/stores. Otherwise selftest can fail.
+ */
+static volatile __u64 glvar;
 
-static bool dawr_present(void)
-{
-	return !!(dbginfo.features & PPC_DEBUG_FEATURE_DATA_BP_DAWR);
-}
+#define DAWR_MAX_LEN 512
+static volatile __u8 big_var[DAWR_MAX_LEN] __attribute__((aligned(512)));
 
-static void set_breakpoint_addr(void *addr)
-{
-	int ret;
+#define A_LEN 6
+#define B_LEN 6
+struct gstruct {
+	__u8 a[A_LEN]; /* double word aligned */
+	__u8 b[B_LEN]; /* double word unaligned */
+};
+static volatile struct gstruct gstruct __attribute__((aligned(512)));
 
-	ret = ptrace(PTRACE_SET_DEBUGREG, child_pid, 0, addr);
-	if (ret) {
-		perror("Can't set breakpoint addr\n");
-		exit(-1);
-	}
-}
 
-static int set_hwbreakpoint_addr(void *addr, int range)
+static void get_dbginfo(pid_t child_pid, struct ppc_debug_info *dbginfo)
 {
-	int ret;
-
-	struct ppc_hw_breakpoint info;
-
-	info.version = 1;
-	info.trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
-	info.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
-	if (range > 0)
-		info.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
-	info.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
-	info.addr = (__u64)addr;
-	info.addr2 = (__u64)addr + range;
-	info.condition_value = 0;
-
-	ret = ptrace(PPC_PTRACE_SETHWDEBUG, child_pid, 0, &info);
-	if (ret < 0) {
-		perror("Can't set breakpoint\n");
+	if (ptrace(PPC_PTRACE_GETHWDBGINFO, child_pid, NULL, dbginfo)) {
+		perror("Can't get breakpoint info");
 		exit(-1);
 	}
-	return ret;
 }
 
-static int del_hwbreakpoint_addr(int watchpoint_handle)
+static bool dawr_present(struct ppc_debug_info *dbginfo)
 {
-	int ret;
-
-	ret = ptrace(PPC_PTRACE_DELHWDEBUG, child_pid, 0, watchpoint_handle);
-	if (ret < 0) {
-		perror("Can't delete hw breakpoint\n");
-		exit(-1);
-	}
-	return ret;
+	return !!(dbginfo->features & PPC_DEBUG_FEATURE_DATA_BP_DAWR);
 }
[PATCH v4 2/5] Powerpc/Watchpoint: Don't ignore extraneous exceptions blindly
On powerpc, the watchpoint match range is doubleword granular. On a
watchpoint hit, DAR is set to the first byte of overlap between the
actual access and the watched range. Thus it's quite possible that DAR
does not point inside the user-specified range.

For example, say the user creates a watchpoint with address range 0x1004
to 0x1007. So hw would be configured to watch from 0x1000 to 0x1007. If
there is a 4-byte access from 0x1002 to 0x1005, DAR will point to 0x1002
and thus the interrupt handler considers it extraneous, but it's
actually not, because part of the access belongs to what the user has
asked for.

Instead of blindly ignoring the exception, get the actual address range
by analysing the instruction, and ignore the exception only if the
actual range does not overlap with the user-specified range.

Note: The behaviour is unchanged for 8xx.

Signed-off-by: Ravi Bangoria
---
 arch/powerpc/kernel/hw_breakpoint.c | 52 +++++++++++++++++------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index 5a2d8c306c40..c04a345e2cc2 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -179,33 +179,49 @@ void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs)
 	tsk->thread.last_hit_ubp = NULL;
 }
 
-static bool is_larx_stcx_instr(struct pt_regs *regs, unsigned int instr)
+static bool dar_within_range(unsigned long dar, struct arch_hw_breakpoint *info)
 {
-	int ret, type;
-	struct instruction_op op;
+	return ((info->address <= dar) && (dar - info->address < info->len));
+}
 
-	ret = analyse_instr(&op, regs, instr);
-	type = GETTYPE(op.type);
-	return (!ret && (type == LARX || type == STCX));
+static bool
+dar_range_overlaps(unsigned long dar, int size, struct arch_hw_breakpoint *info)
+{
+	return ((dar <= info->address + info->len - 1) &&
+		(dar + size - 1 >= info->address));
 }
 
 /*
  * Handle debug exception notifications.
  */
 static bool stepping_handler(struct pt_regs *regs, struct perf_event *bp,
-			     unsigned long addr)
+			     struct arch_hw_breakpoint *info)
 {
 	unsigned int instr = 0;
+	int ret, type, size;
+	struct instruction_op op;
+	unsigned long addr = info->address;
 
 	if (__get_user_inatomic(instr, (unsigned int *)regs->nip))
 		goto fail;
 
-	if (is_larx_stcx_instr(regs, instr)) {
+	ret = analyse_instr(&op, regs, instr);
+	type = GETTYPE(op.type);
+	size = GETSIZE(op.type);
+
+	if (!ret && (type == LARX || type == STCX)) {
 		printk_ratelimited("Breakpoint hit on instruction that can't be emulated."
 				   " Breakpoint at 0x%lx will be disabled.\n", addr);
 		goto disable;
 	}
 
+	/*
+	 * If it's extraneous event, we still need to emulate/single-
+	 * step the instruction, but we don't generate an event.
+	 */
+	if (size && !dar_range_overlaps(regs->dar, size, info))
+		info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
+
 	/* Do not emulate user-space instructions, instead single-step them */
 	if (user_mode(regs)) {
 		current->thread.last_hit_ubp = bp;
@@ -237,7 +253,6 @@ int hw_breakpoint_handler(struct die_args *args)
 	struct perf_event *bp;
 	struct pt_regs *regs = args->regs;
 	struct arch_hw_breakpoint *info;
-	unsigned long dar = regs->dar;
 
 	/* Disable breakpoints during exception handling */
 	hw_breakpoint_disable();
@@ -269,19 +284,14 @@ int hw_breakpoint_handler(struct die_args *args)
 		goto out;
 	}
 
-	/*
-	 * Verify if dar lies within the address range occupied by the symbol
-	 * being watched to filter extraneous exceptions. If it doesn't,
-	 * we still need to single-step the instruction, but we don't
-	 * generate an event.
-	 */
 	info->type &= ~HW_BRK_TYPE_EXTRANEOUS_IRQ;
-	if (!((bp->attr.bp_addr <= dar) &&
-	      (dar - bp->attr.bp_addr < bp->attr.bp_len)))
-		info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
-
-	if (!IS_ENABLED(CONFIG_PPC_8xx) && !stepping_handler(regs, bp, info->address))
-		goto out;
+	if (IS_ENABLED(CONFIG_PPC_8xx)) {
+		if (!dar_within_range(regs->dar, info))
+			info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
+	} else {
+		if (!stepping_handler(regs, bp, info))
+			goto out;
+	}
 
 	/*
	 * As a policy, the callback is invoked in a 'trigger-after-execute'
-- 
2.21.0
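Plugging the commit-message numbers into the new predicate makes the fix
concrete:

	/* User range: 0x1004..0x1007 (info->address = 0x1004, len = 4).
	 * Access:     4 bytes at DAR = 0x1002, i.e. 0x1002..0x1005.
	 *
	 *   dar <= info->address + info->len - 1  ->  0x1002 <= 0x1007  true
	 *   dar + size - 1 >= info->address       ->  0x1005 >= 0x1004  true
	 *
	 * The ranges overlap, so the event is delivered. The old DAR-only
	 * check (is 0x1004 <= 0x1002? no) would have marked it extraneous.
	 */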
[PATCH v4 1/5] Powerpc/Watchpoint: Fix length calculation for unaligned target
The watchpoint match range is always doubleword (8 bytes) aligned on
powerpc. If the given range crosses a doubleword boundary, we need to
increase the length such that the next doubleword also gets covered.
For example:

          address        len = 6 bytes
                |=========.
      |------------v--|------v--------|
      | | | | | | | | | | | | | | | | |
      |---------------|---------------|
       <---8 bytes--->

In such a case, the current code configures the hw as:
  start_addr = address & ~HW_BREAKPOINT_ALIGN
  len = 8 bytes

and thus a read/write in the last 4 bytes of the given range is ignored.
Fix this by including the next doubleword in the length. Plus, fix the
ptrace code which is messing up address/len.

Signed-off-by: Ravi Bangoria
---
 arch/powerpc/include/asm/debug.h         |  1 +
 arch/powerpc/include/asm/hw_breakpoint.h |  9 +++--
 arch/powerpc/kernel/dawr.c               |  6 ++--
 arch/powerpc/kernel/hw_breakpoint.c      | 24 +++++--------
 arch/powerpc/kernel/process.c            | 46 ++++++++++++++++++++++++
 arch/powerpc/kernel/ptrace.c             | 37 ++++++++++++++-----
 arch/powerpc/xmon/xmon.c                 |  3 +-
 7 files changed, 83 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 7756026b95ca..9c1b4aaa374b 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -45,6 +45,7 @@ static inline int debugger_break_match(struct pt_regs *regs) { return 0; }
 static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
+int hw_breakpoint_validate_len(struct arch_hw_breakpoint *hw);
 void __set_breakpoint(struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
index 67e2da195eae..27ac6f5d2891 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -14,6 +14,7 @@ struct arch_hw_breakpoint {
 	unsigned long	address;
 	u16		type;
 	u16		len;    /* length of the target data symbol */
+	u16		hw_len; /* length programmed in hw */
 };
 
 /* Note: Don't change the the first 6 bits below as they are in the same order
@@ -33,6 +34,11 @@ struct arch_hw_breakpoint {
 #define HW_BRK_TYPE_PRIV_ALL	(HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \
 				 HW_BRK_TYPE_HYP)
 
+#define HW_BREAKPOINT_ALIGN 0x7
+
+#define DABR_MAX_LEN	8
+#define DAWR_MAX_LEN	512
+
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #include
 #include
@@ -44,8 +50,6 @@ struct pmu;
 struct perf_sample_data;
 struct task_struct;
 
-#define HW_BREAKPOINT_ALIGN 0x7
-
 extern int hw_breakpoint_slots(int type);
 extern int arch_bp_generic_fields(int type, int *gen_bp_type);
 extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw);
@@ -70,6 +74,7 @@ static inline void hw_breakpoint_disable(void)
 	brk.address = 0;
 	brk.type = 0;
 	brk.len = 0;
+	brk.hw_len = 0;
 	if (ppc_breakpoint_available())
 		__set_breakpoint(&brk);
 }
diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c
index 5f66b95b6858..8531623aa9b2 100644
--- a/arch/powerpc/kernel/dawr.c
+++ b/arch/powerpc/kernel/dawr.c
@@ -30,10 +30,10 @@ int set_dawr(struct arch_hw_breakpoint *brk)
 	 * DAWR length is stored in field MDR bits 48:53.  Matches range in
 	 * doublewords (64 bits) baised by -1 eg. 0b000000=1DW and
 	 * 0b111111=64DW.
-	 * brk->len is in bytes.
+	 * brk->hw_len is in bytes.
 	 * This aligns up to double word size, shifts and does the bias.
 	 */
-	mrd = ((brk->len + 7) >> 3) - 1;
+	mrd = ((brk->hw_len + 7) >> 3) - 1;
 	dawrx |= (mrd & 0x3f) << (63 - 53);
 
 	if (ppc_md.set_dawr)
@@ -54,7 +54,7 @@ static ssize_t dawr_write_file_bool(struct file *file,
 				    const char __user *user_buf,
 				    size_t count, loff_t *ppos)
 {
-	struct arch_hw_breakpoint null_brk = {0, 0, 0};
+	struct arch_hw_breakpoint null_brk = {0, 0, 0, 0};
 	size_t rc;
 
 	/* Send error to user if they hypervisor won't allow us to write DAWR */
diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index 1007ec36b4cb..5a2d8c306c40 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -133,9 +133,9 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
 			     const struct perf_event_attr *attr,
 			     struct arch_hw_breakpoint *hw)
 {
-	int ret = -EINVAL, length_max;
+	int ret = -EINVAL;
 
-	if (!bp)
+	if (!bp || !attr->bp_len)
 		return ret;
 
 	hw->type = HW_BRK_TYPE_TRANSLATE;
@@ -155,26 +155,10 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
 	hw->address =
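In code form, the fix is an outward rounding of the requested range to
whole doublewords before programming the hardware. A sketch of the
arithmetic (the patch's own hunk is truncated above, so this is
illustrative rather than the literal code):

	/* e.g. addr = 0x1004, len = 6: bytes 0x1004..0x1009 */
	start = addr & ~(unsigned long)HW_BREAKPOINT_ALIGN;	/* 0x1000 */
	end = (addr + len + HW_BREAKPOINT_ALIGN) &
	      ~(unsigned long)HW_BREAKPOINT_ALIGN;		/* 0x1010 */
	hw_len = end - start;	/* 16: both doublewords are watched */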
[PATCH v4 0/5] Powerpc/Watchpoint: Few important fixes
v3: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-July/193339.html

v3->v4:
 - Instead of considering the exception as extraneous when the DAR is
   outside of the user-specified range, analyse the instruction and
   check for overlap between the user-specified range and the actual
   load/store range.
 - Add a selftest for the same in perf-hwbreak.c
 - Make the ptrace-hwbreak.c selftest more strict by checking the
   address in check_success.
 - Support for 8xx in the ptrace-hwbreak.c selftest (build-tested only)
 - Rebase to powerpc/next

@Christophe, can you please check patch 5? I've just build-tested it
with ep88xc_defconfig.

Ravi Bangoria (5):
  Powerpc/Watchpoint: Fix length calculation for unaligned target
  Powerpc/Watchpoint: Don't ignore extraneous exceptions blindly
  Powerpc/Watchpoint: Rewrite ptrace-hwbreak.c selftest
  Powerpc/Watchpoint: Add dar outside test in perf-hwbreak.c selftest
  Powerpc/Watchpoint: Support for 8xx in ptrace-hwbreak.c selftest

 arch/powerpc/include/asm/debug.h              |   1 +
 arch/powerpc/include/asm/hw_breakpoint.h      |   9 +-
 arch/powerpc/kernel/dawr.c                    |   6 +-
 arch/powerpc/kernel/hw_breakpoint.c           |  76 ++-
 arch/powerpc/kernel/process.c                 |  46 ++
 arch/powerpc/kernel/ptrace.c                  |  37 +-
 arch/powerpc/xmon/xmon.c                      |   3 +-
 .../selftests/powerpc/ptrace/perf-hwbreak.c   | 111 +++-
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 579 ++++++++++++------
 9 files changed, 595 insertions(+), 273 deletions(-)

-- 
2.21.0
RE: [v3,3/3] Documentation: dt: binding: fsl: Add 'fsl,ippdexpcr-alt-addr' property
> > > > > > > > > > > > > > > > > > The 'fsl,ippdexpcr-alt-addr' property is used to handle an > > > > > > > > errata > > > > > > > > A-008646 on LS1021A > > > > > > > > > > > > > > > > Signed-off-by: Biwen Li > > > > > > > > --- > > > > > > > > Change in v3: > > > > > > > > - rename property name > > > > > > > > fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr > > > > > > > > > > > > > > > > Change in v2: > > > > > > > > - update desc of the property 'fsl,rcpm-scfg' > > > > > > > > > > > > > > > > Documentation/devicetree/bindings/soc/fsl/rcpm.txt | 14 > > > > > > > > ++ > > > > > > > > 1 file changed, 14 insertions(+) > > > > > > > > > > > > > > > > diff --git > > > > > > > > a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > index 5a33619d881d..157dcf6da17c 100644 > > > > > > > > --- a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > +++ b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt > > > > > > > > @@ -34,6 +34,11 @@ Chassis Version Example > > Chips > > > > > > > > Optional properties: > > > > > > > > - little-endian : RCPM register block is Little Endian. > > > > > > > > Without it > > > RCPM > > > > > > > > will be Big Endian (default case). > > > > > > > > + - fsl,ippdexpcr-alt-addr : Must add the property for SoC > > > > > > > > + LS1021A, > > > > > > > > > > > > > > You probably should mention this is related to a hardware > > > > > > > issue on LS1021a and only needed on LS1021a. > > > > > > Okay, got it, thanks, I will add this in v4. > > > > > > > > > > > > > > > + Must include n + 1 entries (n = #fsl,rcpm-wakeup-cells, such > as: > > > > > > > > + #fsl,rcpm-wakeup-cells equal to 2, then must include 2 > > > > > > > > + + > > > > > > > > + 1 > > > entries). > > > > > > > > > > > > > > #fsl,rcpm-wakeup-cells is the number of IPPDEXPCR registers > > > > > > > on an > > > SoC. > > > > > > > However you are defining an offset to scfg registers here. > > > > > > > Why these two are related? The length here should actually > > > > > > > be related to the #address-cells of the soc/. But since > > > > > > > this is only needed for LS1021, you can > > > > > > just make it 3. > > > > > > I need set the value of IPPDEXPCR resgiters from ftm_alarm0 > > > > > > device node(fsl,rcpm-wakeup = < 0x0 0x2000>; > > > > > > 0x0 is a value for IPPDEXPCR0, 0x2000 is a value for > > > IPPDEXPCR1). > > > > > > But because of the hardware issue on LS1021A, I need store the > > > > > > value of IPPDEXPCR registers to an alt address. So I defining > > > > > > an offset to scfg registers, then RCPM driver get an abosolute > > > > > > address from offset, RCPM driver write the value of IPPDEXPCR > > > > > > registers to these abosolute addresses(backup the value of > > > > > > IPPDEXPCR > > > registers). > > > > > > > > > > I understand what you are trying to do. The problem is that the > > > > > new fsl,ippdexpcr-alt-addr property contains a phandle and an offset. > > > > > The size of it shouldn't be related to #fsl,rcpm-wakeup-cells. > > > > You maybe like this: fsl,ippdexpcr-alt-addr = < 0x51c>;/* > > > > SCFG_SPARECR8 */ > > > > > > No. The #address-cell for the soc/ is 2, so the offset to scfg > > > should be 0x0 0x51c. The total size should be 3, but it shouldn't > > > be coming from #fsl,rcpm-wakeup-cells like you mentioned in the binding. > > Oh, I got it. You want that fsl,ippdexpcr-alt-add is relative with > > #address-cells instead of #fsl,rcpm-wakeup-cells. > > Yes. 
I got an example from drivers/pci/controller/dwc/pci-layerscape.c and arch/arm/boot/dts/ls1021a.dtsi as follows: fsl,pcie-scfg = < 0>, 0 is an index In my fsl,ippdexpcr-alt-addr = < 0x0 0x51c>, It means that 0x0 is an alt offset address for IPPDEXPCR0, 0x51c is an alt offset address For IPPDEXPCR1 instead of 0x0 and 0x51c compose to an alt address of SCFG_SPARECR8. > > Regards, > Leo > > > > > > > > > > > > > > > > > > > > > > > + The first entry must be a link to the SCFG device node. > > > > > > > > + The non-first entry must be offset of registers of SCFG. > > > > > > > > > > > > > > > > Example: > > > > > > > > The RCPM node for T4240: > > > > > > > > @@ -43,6 +48,15 @@ The RCPM node for T4240: > > > > > > > > #fsl,rcpm-wakeup-cells = <2>; > > > > > > > > }; > > > > > > > > > > > > > > > > +The RCPM node for LS1021A: > > > > > > > > + rcpm: rcpm@1ee2140 { > > > > > > > > + compatible = "fsl,ls1021a-rcpm", > > > > > > > > "fsl,qoriq-rcpm- > > > > 2.1+"; > > > > > > > > + reg = <0x0 0x1ee2140 0x0 0x8>; > > > > > > > > + #fsl,rcpm-wakeup-cells = <2>; > > > > > > > > + fsl,ippdexpcr-alt-addr = < 0x0 0x51c>; /* > > > > > > > > SCFG_SPARECR8 */ > > > > > > > > + }; > > > > > > > > + > > > > > > > > + > > > > > > > > * Freescale RCPM Wakeup Source Device Tree Bindings >
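As agreed in the thread, the property ends up as one phandle cell plus a two-cell offset (since #address-cells of soc/ is 2), e.g. fsl,ippdexpcr-alt-addr = <&scfg 0x0 0x51c>;. Below is a hypothetical sketch, not taken from the actual patch set, of how an RCPM driver could resolve such a property; the function and variable names are made up:

	#include <linux/of.h>
	#include <linux/of_address.h>

	static void __iomem *rcpm_map_ippdexpcr_alt(struct device_node *rcpm_np)
	{
		struct of_phandle_args args;
		void __iomem *scfg_base;
		u64 offset;

		/* One phandle followed by two offset cells. */
		if (of_parse_phandle_with_fixed_args(rcpm_np,
						     "fsl,ippdexpcr-alt-addr",
						     2, 0, &args))
			return NULL;

		offset = ((u64)args.args[0] << 32) | args.args[1];

		/* Map the SCFG block and point at the backup register
		 * (SCFG_SPARECR8 in the example above). */
		scfg_base = of_iomap(args.np, 0);
		of_node_put(args.np);
		return scfg_base ? scfg_base + offset : NULL;
	}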
Re: [PATCH V3 0/2] mm/debug: Add tests for architecture exported page table helpers
On 09/24/2019 06:01 PM, Mike Rapoport wrote: > On Tue, Sep 24, 2019 at 02:51:01PM +0300, Kirill A. Shutemov wrote: >> On Fri, Sep 20, 2019 at 12:03:21PM +0530, Anshuman Khandual wrote: >>> This series adds a test validation for architecture exported page table >>> helpers. Patch in the series adds basic transformation tests at various >>> levels of the page table. Before that it exports gigantic page allocation >>> function from HugeTLB. >>> >>> This test was originally suggested by Catalin during arm64 THP migration >>> RFC discussion earlier. Going forward it can include more specific tests >>> with respect to various generic MM functions like THP, HugeTLB etc and >>> platform specific tests. >>> >>> https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/ >>> >>> Testing: >>> >>> Successfully build and boot tested on both arm64 and x86 platforms without >>> any test failing. Only build tested on some other platforms. Build failed >>> on some platforms (known) in pud_clear_tests() as there were no available >>> __pgd() definitions. >>> >>> - ARM32 >>> - IA64 >> >> Hm. Grep shows __pgd() definitions for both of them. Is it for specific >> config? > > For ARM32 it's defined only for 3-level page tables, i.e. with LPAE on. > For IA64 it's defined for !STRICT_MM_TYPECHECKS which is even not a config > option, but a define in arch/ia64/include/asm/page.h Right. So where do we go from here? We will need help from platform folks to fix this unless it's trivial. I did propose this on the last thread (v2): would it be a better idea to restrict DEBUG_ARCH_PGTABLE_TEST to architectures which have fixed all pending issues, whether build or run time? Though enabling all platforms where the test at least builds might make more sense, we might have to just exclude arm32 and ia64 for now. Then run-time problems can be fixed later, platform by platform. Any thoughts? BTW, the test is known to run successfully on arm64, x86 and ppc32 platforms. Gerald has been trying to get it working on s390. In the meantime, if there are other volunteers to test this on ppc64, sparc, riscv, mips, m68k etc. platforms, it would be really helpful. - Anshuman
Re: [PATCH V4 4/4] ASoC: fsl_asrc: Fix error with S24_3LE format bitstream in i.MX8
Hi > On Tue, Sep 24, 2019 at 06:52:35PM +0800, Shengjiu Wang wrote: > > There is error "aplay: pcm_write:2023: write error: Input/output error" > > on i.MX8QM/i.MX8QXP platform for S24_3LE format. > > > > In i.MX8QM/i.MX8QXP, the DMA is EDMA, which don't support 24bit > > sample, but we didn't add any constraint, that cause issues. > > > > So we need to query the caps of dma, then update the hw parameters > > according to the caps. > > > > Signed-off-by: Shengjiu Wang > > --- > > sound/soc/fsl/fsl_asrc.c | 4 +-- > > sound/soc/fsl/fsl_asrc.h | 3 ++ > > sound/soc/fsl/fsl_asrc_dma.c | 59 > > +++- > > 3 files changed, 56 insertions(+), 10 deletions(-) > > > > @@ -270,12 +268,17 @@ static int fsl_asrc_dma_hw_free(struct > > snd_pcm_substream *substream) > > > > static int fsl_asrc_dma_startup(struct snd_pcm_substream *substream) > > { > > + bool tx = substream->stream == SNDRV_PCM_STREAM_PLAYBACK; > > struct snd_soc_pcm_runtime *rtd = substream->private_data; > > struct snd_pcm_runtime *runtime = substream->runtime; > > struct snd_soc_component *component = > snd_soc_rtdcom_lookup(rtd, > > DRV_NAME); > > + struct snd_dmaengine_dai_dma_data *dma_data; > > struct device *dev = component->dev; > > struct fsl_asrc *asrc_priv = dev_get_drvdata(dev); > > struct fsl_asrc_pair *pair; > > + struct dma_chan *tmp_chan = NULL; > > + u8 dir = tx ? OUT : IN; > > + int ret = 0; > > > > pair = kzalloc(sizeof(struct fsl_asrc_pair), GFP_KERNEL); > > Sorry, I didn't catch it previously. We would need to release this memory > also for all error-out paths, as the code doesn't have any error-out routine, > prior to applying this change. > > > if (!pair) > > @@ -285,11 +288,51 @@ static int fsl_asrc_dma_startup(struct > > snd_pcm_substream *substream) > > > + /* Request a dummy pair, which will be released later. > > + * Request pair function needs channel num as input, for this > > + * dummy pair, we just request "1" channel temporary. > > + */ > > "temporary" => "temporarily" > > > + ret = fsl_asrc_request_pair(1, pair); > > + if (ret < 0) { > > + dev_err(dev, "failed to request asrc pair\n"); > > + return ret; > > + } > > + > > + /* Request a dummy dma channel, which will be release later. */ > > "release" => "released" Ok, will update them. Best regards Wang shengjiu
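The leak Nicolin points out above can be seen in a condensed sketch of the startup hook (shape and surrounding locals such as dev taken from the quoted patch; the label name is illustrative): once the pair is kzalloc'ed, every error return needs to free it:

	static int fsl_asrc_dma_startup(struct snd_pcm_substream *substream)
	{
		struct fsl_asrc_pair *pair;
		int ret;

		pair = kzalloc(sizeof(*pair), GFP_KERNEL);
		if (!pair)
			return -ENOMEM;

		ret = fsl_asrc_request_pair(1, pair);
		if (ret < 0) {
			dev_err(dev, "failed to request asrc pair\n");
			goto err_free_pair;	/* don't leak the pair */
		}
		/* ... request the dummy DMA channel, query its caps, etc.,
		 * with each failure path jumping to a label that unwinds ... */
		return 0;

	err_free_pair:
		kfree(pair);
		return ret;
	}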
Re: [PATCH 00/11] opencapi: enable card reset and link retraining
On 24/09/2019 16:39, Frederic Barrat wrote: > > > Le 24/09/2019 à 06:24, Alexey Kardashevskiy a écrit : >> Hi Fred, >> >> what is this made against of? It does not apply on the master. Thanks, > > It applies on v5.3. And I can see there's a conflict with the current > state in the merge window. I'll resubmit. Not really necessary, just mention the exact sha1 or tag in the cover letter next time. Thanks, > > Fred > > > >> On 10/09/2019 01:45, Frederic Barrat wrote: >>> This is the linux part of the work to use the PCI hotplug framework to >>> control an opencapi card so that it can be reset and re-read after >>> flashing a new FPGA image. I had posted it earlier as an RFC and this >>> version is mostly similar, with just some minor editing. >>> >>> It needs support in skiboot: >>> http://patchwork.ozlabs.org/project/skiboot/list/?series=129724 >>> On an old skiboot, it will do nothing. >>> >>> A virtual PCI slot is created for the opencapi adapter, and its state >>> can be controlled through the pnv-php hotplug driver: >>> >>> echo 0|1 > /sys/bus/pci/slots/OPENCAPI-<...>/power >>> >>> Note that the power to the card is not really turned off, as the card >>> needs to stay on to be flashed with a new image. Instead the card is >>> in reset. >>> >>> The first part of the series mostly deals with the pci/ioda state, as >>> the opencapi devices can now go away and the state needs to be cleaned >>> up. >>> >>> The second part is modifications to the PCI hotplug driver on powernv, >>> so that a virtual slot is created for the opencapi adapters found in >>> the device tree. >>> >>> >>> >>> Frederic Barrat (11): >>> powerpc/powernv/ioda: Fix ref count for devices with their own PE >>> powerpc/powernv/ioda: Protect PE list >>> powerpc/powernv/ioda: set up PE on opencapi device when enabling >>> powerpc/powernv/ioda: Release opencapi device >>> powerpc/powernv/ioda: Find opencapi slot for a device node >>> pci/hotplug/pnv-php: Remove erroneous warning >>> pci/hotplug/pnv-php: Improve error msg on power state change failure >>> pci/hotplug/pnv-php: Register opencapi slots >>> pci/hotplug/pnv-php: Relax check when disabling slot >>> pci/hotplug/pnv-php: Wrap warnings in macro >>> ocxl: Add PCI hotplug dependency to Kconfig >>> >>> arch/powerpc/include/asm/pnv-pci.h | 1 + >>> arch/powerpc/platforms/powernv/pci-ioda.c | 107 ++ >>> arch/powerpc/platforms/powernv/pci.c | 10 +- >>> drivers/misc/ocxl/Kconfig | 1 + >>> drivers/pci/hotplug/pnv_php.c | 82 ++--- >>> 5 files changed, 125 insertions(+), 76 deletions(-) >>> >> > -- Alexey
Re: [PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
On 9/24/19 2:24 PM, Leonardo Bras wrote: > If a process (qemu) with a lot of CPUs (128) try to munmap() a large > chunk of memory (496GB) mapped with THP, it takes an average of 275 > seconds, which can cause a lot of problems to the load (in qemu case, > the guest will lock for this time). > > Trying to find the source of this bug, I found out most of this time is > spent on serialize_against_pte_lookup(). This function will take a lot > of time in smp_call_function_many() if there is more than a couple CPUs > running the user process. Since it has to happen to all THP mapped, it > will take a very long time for large amounts of memory. > > By the docs, serialize_against_pte_lookup() is needed in order to avoid > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless > pagetable walk, to happen concurrently with THP splitting/collapsing. > > It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], > after interrupts are re-enabled. > Since, interrupts are (usually) disabled during lockless pagetable > walk, and serialize_against_pte_lookup will only return after > interrupts are enabled, it is protected. > > So, by what I could understand, if there is no lockless pagetable walk > running, there is no need to call serialize_against_pte_lookup(). > > So, to avoid the cost of running serialize_against_pte_lookup(), I > propose a counter that keeps track of how many find_current_mm_pte() > are currently running, and if there is none, just skip > smp_call_function_many(). > > The related functions are: > start_lockless_pgtbl_walk(mm) > Insert before starting any lockless pgtable walk > end_lockless_pgtbl_walk(mm) > Insert after the end of any lockless pgtable walk > (Mostly after the ptep is last used) > running_lockless_pgtbl_walk(mm) > Returns the number of lockless pgtable walks running > > On my workload (qemu), I could see munmap's time reduction from 275 > seconds to 418ms. > > Changes since v2: > Rebased to v5.3 > Adds support on __get_user_pages_fast > Adds usage decription to *_lockless_pgtbl_walk() > Better style to dummy functions > Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839 Hi Leonardo, Thanks for adding linux-mm to CC for this next round of reviews. For the benefit of any new reviewers, I'd like to add that there are some issues that were discovered while reviewing the v2 patchset, that are not (yet) addressed in this v3 series. Since those issues are not listed in the cover letter above, I'll list them here: 1. The locking model requires a combination of disabling interrupts and atomic counting and memory barriers, but a) some memory barriers are missing (start/end_lockless_pgtbl_walk), and b) some cases (patch #8) fail to disable interrupts ...so the synchronization appears to be inadequate. (And if it *is* adequate, then definitely we need the next item, to explain it.) 2. Documentation of the synchronization/locking model needs to exist, once we figure out the exact details of (1). 3. Related to (1), I've asked to change things so that interrupt controls and atomic inc/dec are in the same start/end calls--assuming, of course, that the caller can tolerate that. 4. Please see the v2 series for any other details I've missed. 
thanks, -- John Hubbard NVIDIA > > Changes since v1: > Isolated atomic operations in functions *_lockless_pgtbl_walk() > Fixed behavior of decrementing before last ptep was used > Link: http://patchwork.ozlabs.org/patch/1163093/ > > Leonardo Bras (11): > powerpc/mm: Adds counting method to monitor lockless pgtable walks > asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable > walks > mm/gup: Applies counting method to monitor gup_pgd_range > powerpc/mce_power: Applies counting method to monitor lockless pgtbl > walks > powerpc/perf: Applies counting method to monitor lockless pgtbl walks > powerpc/mm/book3s64/hash: Applies counting method to monitor lockless > pgtbl walks > powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl > walks > powerpc/kvm/book3s_hv: Applies counting method to monitor lockless > pgtbl walks > powerpc/kvm/book3s_64: Applies counting method to monitor lockless > pgtbl walks > powerpc/book3s_64: Enables counting method to monitor lockless pgtbl > walk > powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing > > arch/powerpc/include/asm/book3s/64/mmu.h | 3 ++ > arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++ > arch/powerpc/kernel/mce_power.c | 13 +-- > arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 + > arch/powerpc/kvm/book3s_64_mmu_radix.c | 20 +- > arch/powerpc/kvm/book3s_64_vio_hv.c | 4 ++ > arch/powerpc/kvm/book3s_hv_nested.c | 8 > arch/powerpc/kvm/book3s_hv_rm_mmu.c | 9 - > arch/powerpc/kvm/e500_mmu_host.c | 4 ++ >
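For reviewers new to the thread, here is the gist of the counting scheme under discussion, as a sketch rather than the actual patches. The helper names come from the cover letter; where the counter lives and the exact barrier placement are assumptions here, and are precisely what items (1)-(3) above are about:

	/* Reader side: brackets an irq-disabled lockless walk. */
	static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		atomic_inc(&mm->lockless_pgtbl_walk_count);
		/* Pairs with the smp_mb() on the writer side, so the writer
		 * observes the incremented count before we dereference any
		 * page table entries. */
		smp_mb__after_atomic();
	}

	static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
	{
		smp_mb__before_atomic();
		atomic_dec(&mm->lockless_pgtbl_walk_count);
	}

	/* Writer side (THP split/collapse), cf. patch 11/11 below: skip the
	 * expensive IPIs when no walker is registered. */
	void serialize_against_pte_lookup(struct mm_struct *mm)
	{
		smp_mb();
		if (running_lockless_pgtbl_walk(mm))	/* atomic_read() */
			smp_call_function_many(mm_cpumask(mm),
					       do_nothing, NULL, 1);
	}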
Re: [PATCH V4 3/4] ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_refine_runtime_hwparams
On Tue, Sep 24, 2019 at 06:52:34PM +0800, Shengjiu Wang wrote: > When setting the runtime hardware parameters, we may need to query > the capability of DMA to complete the parameters. > > This patch extracts this operation from the > dmaengine_pcm_set_runtime_hwparams function into a separate function, > snd_dmaengine_pcm_refine_runtime_hwparams, so that other components > which need this feature can call it. > > Signed-off-by: Shengjiu Wang Looks good to me. Reviewed-by: Nicolin Chen
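As a usage sketch, a component needing the refined parameters would call the new helper from its startup hook. The signature below is inferred from this patch and worth double-checking against the final version:

	struct snd_dmaengine_dai_dma_data *dma_data;
	struct snd_pcm_hardware hw = { 0 };
	struct dma_chan *chan;
	int ret;

	/* ... fill dma_data from the DAI and request chan ... */
	ret = snd_dmaengine_pcm_refine_runtime_hwparams(substream, dma_data,
							&hw, chan);
	if (ret < 0)
		return ret;

	/* Apply the DMA-constrained hardware description to the runtime. */
	snd_soc_set_runtime_hwparams(substream, &hw);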
[PATCH v3 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
Skip the slow part of serialize_against_pte_lookup() if there is no lockless pagetable walk running. Signed-off-by: Leonardo Bras --- arch/powerpc/mm/book3s64/pgtable.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 0b86884a8097..e2aa6572f03f 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -95,7 +95,8 @@ static void do_nothing(void *unused) void serialize_against_pte_lookup(struct mm_struct *mm) { smp_mb(); - smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1); + if (running_lockless_pgtbl_walk(mm)) + smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1); } /* -- 2.20.1
[PATCH v3 09/11] powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all book3s_64-related functions that do lockless pagetable walks. Signed-off-by: Leonardo Bras --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 ++ arch/powerpc/kvm/book3s_64_mmu_radix.c | 20 ++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 4 3 files changed, 24 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 9a75f0e1933b..fcd3dad1297f 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, * We need to protect against page table destruction * hugepage split and collapse. */ + start_lockless_pgtbl_walk(kvm->mm); local_irq_save(flags); ptep = find_current_mm_pte(current->mm->pgd, hva, NULL, NULL); @@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, write_ok = 1; } local_irq_restore(flags); + end_lockless_pgtbl_walk(kvm->mm); } } diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 2d415c36a61d..d46f8258d8d6 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -741,6 +741,7 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing, unsigned long pgflags; unsigned int shift; pte_t *ptep; + bool ret = false; /* * Need to set an R or C bit in the 2nd-level tables; @@ -755,12 +756,14 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing, * We can do this without disabling irq because the Linux MM * subsystem doesn't do THP splits and collapses on this tree. */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep) && (!writing || pte_write(*ptep))) { kvmppc_radix_update_pte(kvm, ptep, 0, pgflags, gpa, shift); - return true; + ret = true; } - return false; + end_lockless_pgtbl_walk(kvm->mm); + return ret; } int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, @@ -813,6 +816,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, * Read the PTE from the process' radix tree and use that * so we get the shift and attribute bits.
*/ + start_lockless_pgtbl_walk(kvm->mm); local_irq_disable(); ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift); /* @@ -821,12 +825,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, */ if (!ptep) { local_irq_enable(); + end_lockless_pgtbl_walk(kvm->mm); if (page) put_page(page); return RESUME_GUEST; } pte = *ptep; local_irq_enable(); + end_lockless_pgtbl_walk(kvm->mm); /* If we're logging dirty pages, always map single pages */ large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES); @@ -972,10 +978,12 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, unsigned long gpa = gfn << PAGE_SHIFT; unsigned int shift; + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep)) kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot, kvm->arch.lpid); + end_lockless_pgtbl_walk(kvm->mm); return 0; } @@ -989,6 +997,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, int ref = 0; unsigned long old, *rmapp; + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep) && pte_young(*ptep)) { old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0, @@ -1001,6 +1010,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, 1UL << shift); ref = 1; } + end_lockless_pgtbl_walk(kvm->mm); return ref; } @@ -1013,9 +1023,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot, unsigned int shift; int ref = 0; + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (ptep && pte_present(*ptep) &&
[PATCH v3 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk
Enables the count-based monitoring method for lockless pagetable walks on PowerPC book3s_64. Other architectures/platforms fall back to the generic dummy functions. Signed-off-by: Leonardo Bras --- arch/powerpc/include/asm/book3s/64/pgtable.h | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8308f32e9782..eb9b26a4a483 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1370,5 +1370,10 @@ static inline bool pgd_is_leaf(pgd_t pgd) return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE)); } +#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER +void start_lockless_pgtbl_walk(struct mm_struct *mm); +void end_lockless_pgtbl_walk(struct mm_struct *mm); +int running_lockless_pgtbl_walk(struct mm_struct *mm); + #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */ -- 2.20.1
[PATCH v3 07/11] powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on kvmppc_e500_shadow_map(). Signed-off-by: Leonardo Bras --- arch/powerpc/kvm/e500_mmu_host.c | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 321db0fdb9db..a1b8bfe20bc8 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -473,6 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, * We are holding kvm->mmu_lock so a notifier invalidate * can't run hence pfn won't change. */ + start_lockless_pgtbl_walk(kvm->mm); local_irq_save(flags); ptep = find_linux_pte(pgdir, hva, NULL, NULL); if (ptep) { @@ -484,12 +485,15 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, local_irq_restore(flags); } else { local_irq_restore(flags); + end_lockless_pgtbl_walk(kvm->mm); pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n", __func__, (long)gfn, pfn); ret = -EINVAL; goto out; } } + end_lockless_pgtbl_walk(kvm->mm); + kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg); kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize, -- 2.20.1
[PATCH v3 08/11] powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all book3s_hv related functions that do lockless pagetable walks. Signed-off-by: Leonardo Bras --- arch/powerpc/kvm/book3s_hv_nested.c | 8 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 9 - 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c index 735e0ac6f5b2..ed68e57af3a3 100644 --- a/arch/powerpc/kvm/book3s_hv_nested.c +++ b/arch/powerpc/kvm/book3s_hv_nested.c @@ -804,6 +804,7 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap, return; /* Find the pte */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift); /* * If the pte is present and the pfn is still the same, update the pte. @@ -815,6 +816,7 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap, __radix_pte_update(ptep, clr, set); kvmppc_radix_tlbie_page(kvm, gpa, shift, lpid); } + end_lockless_pgtbl_walk(kvm->mm); } /* @@ -854,10 +856,12 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap, return; /* Find and invalidate the pte */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift); /* Don't spuriously invalidate ptes if the pfn has changed */ if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa)) kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid); + end_lockless_pgtbl_walk(kvm->mm); } static void kvmhv_remove_nest_rmap_list(struct kvm *kvm, unsigned long *rmapp, @@ -921,6 +925,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu, int shift; spin_lock(&kvm->mmu_lock); + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift); if (!shift) shift = PAGE_SHIFT; @@ -928,6 +933,7 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu, kvmppc_unmap_pte(kvm, ptep, gpa, shift, NULL, gp->shadow_lpid); ret = true; } + end_lockless_pgtbl_walk(kvm->mm); spin_unlock(&kvm->mmu_lock); if (shift_ret) @@ -1362,11 +1368,13 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run, /* See if can find translation in our partition scoped tables for L1 */ pte = __pte(0); spin_lock(&kvm->mmu_lock); + start_lockless_pgtbl_walk(kvm->mm); pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift); if (!shift) shift = PAGE_SHIFT; if (pte_p) pte = *pte_p; + end_lockless_pgtbl_walk(kvm->mm); spin_unlock(&kvm->mmu_lock); if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) { diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 63e0ce91e29d..53ca67492211 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -258,6 +258,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, * If called in real mode we have MSR_EE = 0. Otherwise * we disable irq above.
*/ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(pgdir, hva, NULL, &hpage_shift); if (ptep) { pte_t pte; @@ -311,6 +312,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G); ptel |= HPTE_R_M; } + end_lockless_pgtbl_walk(kvm->mm); /* Find and lock the HPTEG slot to use */ do_insert: @@ -886,10 +888,15 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa, hva = __gfn_to_hva_memslot(memslot, gfn); /* Try to find the host pte for that virtual address */ + start_lockless_pgtbl_walk(kvm->mm); ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift); - if (!ptep) + if (!ptep) { + end_lockless_pgtbl_walk(kvm->mm); return H_TOO_HARD; + } pte = kvmppc_read_update_linux_pte(ptep, writing); + end_lockless_pgtbl_walk(kvm->mm); + if (!pte_present(pte)) return H_TOO_HARD; -- 2.20.1
[PATCH v3 05/11] powerpc/perf: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on read_user_stack_slow. Signed-off-by: Leonardo Bras --- arch/powerpc/perf/callchain.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c index c84bbd4298a0..9d76194a2a8f 100644 --- a/arch/powerpc/perf/callchain.c +++ b/arch/powerpc/perf/callchain.c @@ -113,16 +113,18 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb) int ret = -EFAULT; pgd_t *pgdir; pte_t *ptep, pte; + struct mm_struct *mm = current->mm; unsigned shift; unsigned long addr = (unsigned long) ptr; unsigned long offset; unsigned long pfn, flags; void *kaddr; - pgdir = current->mm->pgd; + pgdir = mm->pgd; if (!pgdir) return -EFAULT; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); ptep = find_current_mm_pte(pgdir, addr, NULL, &shift); if (!ptep) @@ -146,6 +148,7 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb) ret = 0; err_out: local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); return ret; } -- 2.20.1
[PATCH v3 06/11] powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring all hash-related functions that do lockless pagetable walks. Signed-off-by: Leonardo Bras --- arch/powerpc/mm/book3s64/hash_tlb.c | 2 ++ arch/powerpc/mm/book3s64/hash_utils.c | 7 +++ 2 files changed, 9 insertions(+) diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c index 4a70d8dd39cd..5e5213c3f7c4 100644 --- a/arch/powerpc/mm/book3s64/hash_tlb.c +++ b/arch/powerpc/mm/book3s64/hash_tlb.c @@ -209,6 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start, * to being hashed). This is not the most performance oriented * way to do things but is fine for our needs here. */ + start_lockless_pgtbl_walk(mm); local_irq_save(flags); arch_enter_lazy_mmu_mode(); for (; start < end; start += PAGE_SIZE) { @@ -230,6 +231,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start, } arch_leave_lazy_mmu_mode(); local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); } void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index b8ad14bb1170..299946cedc3a 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -1322,6 +1322,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea, #endif /* CONFIG_PPC_64K_PAGES */ /* Get PTE and page size from page tables */ + start_lockless_pgtbl_walk(mm); ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift); if (ptep == NULL || !pte_present(*ptep)) { DBG_LOW(" no PTE !\n"); @@ -1438,6 +1439,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea, DBG_LOW(" -> rc=%d\n", rc); bail: + end_lockless_pgtbl_walk(mm); exception_exit(prev_state); return rc; } @@ -1547,10 +1549,12 @@ void hash_preload(struct mm_struct *mm, unsigned long ea, vsid = get_user_vsid(&mm->context, ea, ssize); if (!vsid) return; + /* * Hash doesn't like irqs. Walking linux page table with irq disabled * saves us from holding multiple locks. */ + start_lockless_pgtbl_walk(mm); local_irq_save(flags); /* @@ -1597,6 +1601,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea, pte_val(*ptep)); out_exit: local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); } #ifdef CONFIG_PPC_MEM_KEYS @@ -1613,11 +1618,13 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address) if (!mm || !mm->pgd) return 0; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); ptep = find_linux_pte(mm->pgd, address, NULL, NULL); if (ptep) pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep))); local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); return pkey; } -- 2.20.1
[PATCH v3 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
As described, gup_pgd_range is a lockless pagetable walk. So, in order to monitor against THP split/collapse with the counting method, it's necessary to bound it with {start,end}_lockless_pgtbl_walk. Dummy functions are provided for the other case, so this adds no overhead on archs that don't use this method. Signed-off-by: Leonardo Bras --- mm/gup.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index 98f13ab37bac..eabd6fd15cf8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end) int __get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages) { + struct mm_struct *mm; unsigned long len, end; unsigned long flags; int nr = 0; @@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { + mm = current->mm; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr); local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); } return nr; @@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages) { unsigned long addr, len, end; + struct mm_struct *mm; int nr = 0, ret = 0; if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM))) @@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages, if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) { + mm = current->mm; + start_lockless_pgtbl_walk(mm); local_irq_disable(); gup_pgd_range(addr, end, gup_flags, pages, &nr); local_irq_enable(); + end_lockless_pgtbl_walk(mm); ret = nr; } -- 2.20.1
[PATCH v3 02/11] asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks
There is a need to monitor lockless pagetable walks, in order to avoid doing THP splitting/collapsing during them. Some methods rely on local_irq_{save,restore}, but that can be slow in cases where a lot of cpus are used by the process. In order to speed up these cases, I propose a refcount-based approach that counts the number of lockless pagetable walks happening on the process. Given that there are lockless pagetable walks in generic code, it's necessary to create dummy functions for archs that won't use the approach. Signed-off-by: Leonardo Bras --- include/asm-generic/pgtable.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 75d9d68a6de7..0831475e72d3 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -1172,6 +1172,21 @@ static inline bool arch_has_pfn_modify_check(void) #endif #endif +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER +static inline void start_lockless_pgtbl_walk(struct mm_struct *mm) +{ +} + +static inline void end_lockless_pgtbl_walk(struct mm_struct *mm) +{ +} + +static inline int running_lockless_pgtbl_walk(struct mm_struct *mm) +{ + return 0; +} +#endif + /* * On some architectures it depends on the mm if the p4d/pud or pmd * layer of the page table hierarchy is folded or not. -- 2.20.1
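An arch opts in to the real counting by defining the guard macro in its own pgtable header, so these generic dummies drop out, and by providing the three functions itself. A minimal sketch of that opt-in follows (illustrative only; the actual powerpc wiring is done by patch 10/11, which is not quoted in this excerpt):

/* arch/<arch>/include/asm/pgtable.h -- sketch of the opt-in, not a real diff */
#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
void start_lockless_pgtbl_walk(struct mm_struct *mm);
void end_lockless_pgtbl_walk(struct mm_struct *mm);
int running_lockless_pgtbl_walk(struct mm_struct *mm);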
[PATCH v3 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
Applies the counting-based method for monitoring lockless pgtable walks on addr_to_pfn(). Signed-off-by: Leonardo Bras --- arch/powerpc/kernel/mce_power.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c index a814d2dfb5b0..0f2f87da4cd1 100644 --- a/arch/powerpc/kernel/mce_power.c +++ b/arch/powerpc/kernel/mce_power.c @@ -27,6 +27,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr) { pte_t *ptep; unsigned long flags; + unsigned long pfn; struct mm_struct *mm; if (user_mode(regs)) @@ -34,15 +35,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr) else mm = &init_mm; + start_lockless_pgtbl_walk(mm); local_irq_save(flags); if (mm == current->mm) ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL); else ptep = find_init_mm_pte(addr, NULL); - local_irq_restore(flags); + if (!ptep || pte_special(*ptep)) - return ULONG_MAX; - return pte_pfn(*ptep); + pfn = ULONG_MAX; + else + pfn = pte_pfn(*ptep); + + local_irq_restore(flags); + end_lockless_pgtbl_walk(mm); + return pfn; } /* flush SLBs and reload */ -- 2.20.1
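Note what the reordering above buys: the old code read *ptep after local_irq_restore(), i.e. after the walker had stopped blocking a concurrent THP split/collapse, while the new code computes pfn while still protected. Condensed into a kernel-style sketch of the hunk above (the pte_special() check is elided for brevity):

	start_lockless_pgtbl_walk(mm);
	local_irq_save(flags);
	ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
	pfn = ptep ? pte_pfn(*ptep) : ULONG_MAX;	/* last *ptep use */
	local_irq_restore(flags);			/* only after that */
	end_lockless_pgtbl_walk(mm);
	return pfn;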
[PATCH v3 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
It's necessary to monitor lockless pagetable walks, in order to avoid doing THP splitting/collapsing during them. Some methods rely on local_irq_{save,restore}, but that can be slow in cases where a lot of cpus are used by the process.

In order to speed up some cases, I propose a refcount-based approach that counts the number of lockless pagetable walks happening on the process. This method does not exclude the current irq-oriented method; it works as a complement to skip unnecessary waiting.

start_lockless_pgtbl_walk(mm): insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm): insert after the end of any lockless pgtable walk (mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm): returns the number of lockless pgtable walks running

Signed-off-by: Leonardo Bras --- arch/powerpc/include/asm/book3s/64/mmu.h | 3 ++ arch/powerpc/mm/book3s64/mmu_context.c | 1 + arch/powerpc/mm/book3s64/pgtable.c | 37 3 files changed, 41 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h index 23b83d3593e2..13b006e7dde4 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu.h +++ b/arch/powerpc/include/asm/book3s/64/mmu.h @@ -116,6 +116,9 @@ typedef struct { /* Number of users of the external (Nest) MMU */ atomic_t copros; + /* Number of running instances of lockless pagetable walk */ + atomic_t lockless_pgtbl_walk_count; + struct hash_mm_context *hash_context; unsigned long vdso_base; diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c index 2d0cb5ba9a47..3dd01c0ca5be 100644 --- a/arch/powerpc/mm/book3s64/mmu_context.c +++ b/arch/powerpc/mm/book3s64/mmu_context.c @@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) #endif atomic_set(&mm->context.active_cpus, 0); atomic_set(&mm->context.copros, 0); + atomic_set(&mm->context.lockless_pgtbl_walk_count, 0); return 0; } diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 7d0e0d0d22c4..0b86884a8097 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -98,6 +98,43 @@ void serialize_against_pte_lookup(struct mm_struct *mm) smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1); } +/* + * Counting method to monitor lockless pagetable walks: + * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the + * number of lockless pgtable walks happening, and + * running_lockless_pgtbl_walk to return this value. + */ + +/* start_lockless_pgtbl_walk: Must be inserted before a function call that does + * lockless pagetable walks, such as __find_linux_pte() + */ +void start_lockless_pgtbl_walk(struct mm_struct *mm) +{ + atomic_inc(&mm->context.lockless_pgtbl_walk_count); +} +EXPORT_SYMBOL(start_lockless_pgtbl_walk); + +/* + * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer + * returned by a lockless pagetable walk, such as __find_linux_pte() + */ +void end_lockless_pgtbl_walk(struct mm_struct *mm) +{ + atomic_dec(&mm->context.lockless_pgtbl_walk_count); +} +EXPORT_SYMBOL(end_lockless_pgtbl_walk); + +/* + * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks + * currently running. If it returns 0, there is no running pagetable walk, and + * THP split/collapse can be safely done. This can be used to avoid more + * expensive approaches like serialize_against_pte_lookup() + */ +int running_lockless_pgtbl_walk(struct mm_struct *mm) +{ + return atomic_read(&mm->context.lockless_pgtbl_walk_count); +} + /* * We use this to invalidate a pmdp entry before switching from a * hugepte to regular pmd entry. -- 2.20.1
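The protocol is small enough to model in isolation. Below is a self-contained userspace sketch of the same idea, using C11 atomics in place of the kernel's atomic_t (the names and the single-threaded "walk" are illustrative only, not kernel code):

#include <stdatomic.h>
#include <stdio.h>

static atomic_int walk_count;	/* models mm->context.lockless_pgtbl_walk_count */

static void start_walk(void)    { atomic_fetch_add(&walk_count, 1); }
static void end_walk(void)      { atomic_fetch_sub(&walk_count, 1); }
static int  running_walks(void) { return atomic_load(&walk_count); }

int main(void)
{
	start_walk();
	/* ...the lockless pagetable walk would happen here, irqs off... */
	printf("walks running: %d\n", running_walks());
	end_walk();

	/* split/collapse side: skip expensive serialization when idle */
	if (!running_walks())
		printf("no walkers: serialization could be skipped\n");
	return 0;
}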
[PATCH v3 00/11] Introduces new count-based method for monitoring lockless pagetable walks
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large chunk of memory (496GB) mapped with THP, it takes an average of 275 seconds, which can cause a lot of problems for the workload (in the qemu case, the guest will lock up for this time).

Trying to find the source of this bug, I found out that most of this time is spent on serialize_against_pte_lookup(). This function will take a lot of time in smp_call_function_many() if there is more than a couple of CPUs running the user process, and since it has to happen for every THP mapped, it will take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to prevent pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless pagetable walk, from happening concurrently with THP splitting/collapsing. It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[], which completes only after interrupts are enabled on each of them. Since interrupts are (usually) disabled during a lockless pagetable walk, and serialize_against_pte_lookup() will only return after interrupts are enabled, the walk is protected.

So, by what I could understand, if there is no lockless pagetable walk running, there is no need to call serialize_against_pte_lookup(). To avoid that cost, I propose a counter that keeps track of how many find_current_mm_pte() are currently running, and if there are none, just skip smp_call_function_many().

The related functions are:
start_lockless_pgtbl_walk(mm): insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm): insert after the end of any lockless pgtable walk (mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm): returns the number of lockless pgtable walks running

On my workload (qemu), I could see munmap's time drop from 275 seconds to 418ms.
Changes since v2:
- Rebased on top of v5.3
- Adds support on __get_user_pages_fast
- Adds usage description to *_lockless_pgtbl_walk()
- Better style for the dummy functions
Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
- Isolated atomic operations in functions *_lockless_pgtbl_walk()
- Fixed behavior of decrementing before last ptep was used
Link: http://patchwork.ozlabs.org/patch/1163093/

Leonardo Bras (11):
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
  powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/mmu.h     | 3 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++
 arch/powerpc/kernel/mce_power.c              | 13 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c          | 2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c          | 4 ++
 arch/powerpc/kvm/book3s_hv_nested.c          | 8
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          | 9 -
 arch/powerpc/kvm/e500_mmu_host.c             | 4 ++
 arch/powerpc/mm/book3s64/hash_tlb.c          | 2 +
 arch/powerpc/mm/book3s64/hash_utils.c        | 7
 arch/powerpc/mm/book3s64/mmu_context.c       | 1 +
 arch/powerpc/mm/book3s64/pgtable.c           | 40 +++-
 arch/powerpc/perf/callchain.c                | 5 ++-
 include/asm-generic/pgtable.h                | 15
 mm/gup.c                                     | 8
 16 files changed, 138 insertions(+), 8 deletions(-)
-- 2.20.1
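Patch 11/11, which consumes the counter, is not quoted in this excerpt. From the description above, its effect on serialize_against_pte_lookup() would be an early exit of roughly this shape (a guess at the form, not the actual diff; whether the counter read needs an extra memory barrier is exactly what the v2 thread below is still discussing):

void serialize_against_pte_lookup(struct mm_struct *mm)
{
	smp_mb();
	if (!running_lockless_pgtbl_walk(mm))
		return;	/* no lockless walker: skip the IPI broadcast */
	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}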
Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
John Hubbard writes:
>> Is that what you meant?
>
> Yes.

I am still trying to understand this issue. I am also analyzing some cases where interrupt disable is not done before the lockless pagetable walk (patch 3 discussion). But given that I forgot to add the mm mailing list before, I think it would be wiser to send a v3 and gather feedback while I keep trying to understand how it works, and whether it needs an additional memory barrier here.

Thanks!
Leonardo Bras
Re: [PATCH v2 00/11] Introduces new count-based method for monitoring lockless pagetable walks
John Hubbard writes:
> Also, which tree do these patches apply to, please?

I will send a v3 that applies directly over v5.3, and make sure to include the mm mailing list.

Thanks!
Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
On Tue, 24 Sep 2019 15:33:28 +1000 Paul Mackerras wrote: > On Mon, Sep 23, 2019 at 05:43:48PM +0200, Greg Kurz wrote: > > We currently prevent userspace to connect a new vCPU if we already have > > one with the same vCPU id. This is good but unfortunately not enough, > > because VP ids derive from the packed vCPU ids, and kvmppc_pack_vcpu_id() > > can return colliding values. For examples, 348 stays unchanged since it > > is < KVM_MAX_VCPUS, but it is also the packed value of 2392 when the > > guest's core stride is 8. Nothing currently prevents userspace to connect > > vCPUs with forged ids, that end up being associated to the same VP. This > > confuses the irq layer and likely crashes the kernel: > > > > [96631.670454] genirq: Flags mismatch irq 4161. 0001 (kvm-1-2392) vs. > > 0001 (kvm-1-348) > > Have you seen a host kernel crash? Yes I have. [29191.162740] genirq: Flags mismatch irq 199. 0001 (kvm-2-2392) vs. 0001 (kvm-2-348) [29191.162849] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38 [29191.162966] Call Trace: [29191.163002] [c03f7f9937e0] [c0c0110c] dump_stack+0xb0/0xf4 (unreliable) [29191.163090] [c03f7f993820] [c01cb480] __setup_irq+0xa70/0xad0 [29191.163180] [c03f7f9938d0] [c01cb75c] request_threaded_irq+0x13c/0x260 [29191.163290] [c03f7f993940] [c0080d44e7ac] kvmppc_xive_attach_escalation+0x104/0x270 [kvm] [29191.163396] [c03f7f9939d0] [c0080d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm] [29191.163504] [c03f7f993ac0] [c0080d28] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm] [29191.163616] [c03f7f993b90] [c0080d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm] [29191.163695] [c03f7f993d00] [c04840f0] do_vfs_ioctl+0xe0/0xc30 [29191.163806] [c03f7f993db0] [c0484d44] ksys_ioctl+0x104/0x120 [29191.163889] [c03f7f993e00] [c0484d88] sys_ioctl+0x28/0x80 [29191.163962] [c03f7f993e20] [c000b278] system_call+0x5c/0x68 [29191.164035] xive-kvm: Failed to request escalation interrupt for queue 0 of VCPU 2392 [29191.164152] [ cut here ] [29191.164229] remove_proc_entry: removing non-empty directory 'irq/199', leaking at least 'kvm-2-348' [29191.164343] WARNING: CPU: 24 PID: 88176 at /home/greg/Work/linux/kernel-kvm-ppc/fs/proc/generic.c:684 remove_proc_entry+0x1ec/0x200 [29191.164501] Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm] [29191.165450] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38 [29191.165568] NIP: c053b0cc LR: c053b0c8 CTR: c00ba3b0 [29191.165644] REGS: c03f7f9934b0 TRAP: 0700 Not tainted (5.3.0-xive-nr-servers-5.3-gku+) [29191.165741] MSR: 90029033 CR: 48228222 XER: 2004 [29191.165939] CFAR: c0131a50 IRQMASK: 0 [29191.165939] GPR00: c053b0c8 c03f7f993740 c15ec500 0057 [29191.165939] GPR04: 0001 49fb98484262 
1bcf [29191.165939] GPR08: 0007 0007 0001 90001033 [29191.165939] GPR12: 8000 c03eb800 00012f4ce5a1 [29191.165939] GPR16: 00012ef5a0c8 00012f113bb0 [29191.165939] GPR20: 00012f45d918 c03f863758b0 c03f86375870 0006 [29191.165939] GPR24: c03f86375a30 0007 c0002039373d9020 c14c4a48 [29191.165939] GPR28: 0001 c03fe62a4f6b c00020394b2e9fab c03fe62a4ec0 [29191.166755] NIP [c053b0cc] remove_proc_entry+0x1ec/0x200 [29191.166803] LR [c053b0c8] remove_proc_entry+0x1e8/0x200 [29191.166874] Call Trace: [29191.166908] [c03f7f993740] [c053b0c8] remove_proc_entry+0x1e8/0x200 (unreliable) [29191.167022] [c03f7f9937e0] [c01d3654] unregister_irq_proc+0x114/0x150 [29191.167106] [c03f7f993880] [c01c6284] free_desc+0x54/0xb0 [29191.167175] [c03f7f9938c0] [c01c65ec] irq_free_descs+0xac/0x100 [29191.167256] [c03f7f993910] [c01d1ff8]
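The collision itself is pure arithmetic and easy to check outside the kernel. The standalone model below is patterned on kvmppc_pack_vcpu_id(); the constants and the block-offset table are assumptions for illustration, not a verbatim copy of the KVM code:

#include <stdio.h>

#define KVM_MAX_VCPUS	2048
#define MAX_SMT_THREADS	8

/* pack a (possibly sparse) vCPU id into the dense VP id space */
static unsigned int pack_vcpu_id(unsigned int id, int stride)
{
	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

	return (id % KVM_MAX_VCPUS) + block_offsets[block];
}

int main(void)
{
	/* guest core stride 8, as in the commit message example */
	printf("348  -> %u\n", pack_vcpu_id(348, 8));	/* prints 348 */
	printf("2392 -> %u\n", pack_vcpu_id(2392, 8));	/* also 348 */
	return 0;
}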
Re: [PATCH] powerpc/book3s64: Export has_transparent_hugepage() related functions.
On Mon, Sep 23, 2019 at 9:25 PM Aneesh Kumar K.V wrote:
>
> In a later patch, we want to use hash_transparent_hugepage() in a kernel
> module. Export two related functions.

Looks good, thanks.
[PATCH AUTOSEL 4.4 13/14] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 9cc976ff7fecc..88fcf6a95fa67 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -369,6 +369,9 @@ static void pseries_lpar_idle(void) * low power mode by cedeing processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.4 12/14] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index a44f1755dc4bf..536718ed033fc 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1465,6 +1465,10 @@ machine_check_handle_early: RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.4 10/14] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their driver's handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 9837c98caabe9..045038469295d 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -675,6 +675,10 @@ static bool eeh_handle_normal_event(struct eeh_pe *pe) pr_warn("EEH: This PCI device has failed %d times in the last hour\n", pe->freeze_count); + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -840,7 +844,8 @@ static bool eeh_handle_normal_event(struct eeh_pe *pe) static void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -919,6 +924,10 @@ static void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS); bus = eeh_pe_bus_get(phb_pe); -- 2.20.1
[PATCH AUTOSEL 4.4 08/14] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There's a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index c773396d0969b..8d30a425a88ab 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -206,7 +207,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -282,8 +287,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 4.4 07/14] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index f4c7467f74655..b73ab8a7ebc3f 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -60,8 +60,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
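The pattern GCC trips over reduces to a small standalone file; whether the warning actually fires depends on GCC version and inlining, so the following is only an illustration of the shape, not a guaranteed reproducer:

#include <stdio.h>

static int op_inuser(int fail, int *oval)
{
	int oldval = 42;
	int ret = fail ? -14 : 0;

	if (!ret)		/* old form: store only on success */
		*oval = oldval;	/* the fix stores unconditionally */
	return ret;
}

int main(void)
{
	int oldval;	/* uninitialized, like 'oldval' in do_futex() */
	int ret = op_inuser(0, &oldval);

	/* GCC cannot always prove oldval is read only when ret == 0 */
	if (!ret)
		printf("oldval == %d\n", oldval);
	return ret;
}

Making the store unconditional is safe because, as the changelog notes, *oval is a local variable in the caller and the error path ignores it anyway.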
[PATCH AUTOSEL 4.4 06/14] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 5a753fae8265a..0c42e872d548b 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -857,15 +857,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -954,6 +956,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -985,6 +989,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 4.9 16/19] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index adb09ab87f7c0..30782859d8980 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -298,6 +298,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.9 15/19] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 92474227262b4..0c8b966e80702 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -467,6 +467,10 @@ EXC_COMMON_BEGIN(machine_check_handle_early) RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.9 11/19] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There's a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 3784a7abfcc80..74791e8382d22 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -206,7 +207,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -282,8 +287,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 4.9 10/19] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index f4c7467f74655..b73ab8a7ebc3f 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -60,8 +60,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
[PATCH AUTOSEL 4.9 09/19] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 6a3e5de544ce2..a309a7a29cc60 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -874,15 +874,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -971,6 +973,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1002,6 +1006,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 4.14 23/28] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 43cde6c602795..cdc53fd905977 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -464,6 +464,10 @@ EXC_COMMON_BEGIN(machine_check_handle_early) RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.14 24/28] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 6a0ad56e89b93..7a9945b350536 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -307,6 +307,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.14 18/28] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There's a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 4addc552eb33d..9739a055e5f7b 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -208,7 +209,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -317,8 +322,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 4.14 17/28] powerpc/64s/radix: Remove redundant pfn_pte bitop, add VM_BUG_ON
From: Nicholas Piggin [ Upstream commit 6bb25170d7a44ef0ed9677814600f0785e7421d1 ] pfn_pte is never given a pte above the addressable physical memory limit, so the masking is redundant. In case of a software bug, it is not obviously better to silently truncate the pfn than to corrupt the pte (either one will result in memory corruption or crashes), so there is no reason to add this to the fast path. Add VM_BUG_ON to catch cases where the pfn is invalid. These would catch the create_section_mapping bug fixed by a previous commit. [16885.256466] [ cut here ] [16885.256492] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! cpu 0x0: Vector: 700 (Program Check) at [c000ee0a36d0] pc: c0080738: __map_kernel_page+0x248/0x6f0 lr: c0080ac0: __map_kernel_page+0x5d0/0x6f0 sp: c000ee0a3960 msr: 90029033 current = 0xc000ec63b400 paca= 0xc17f irqmask: 0x03 irq_happened: 0x01 pid = 85, comm = sh kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Linux version 5.3.0-rc1-1-g0fe93e5f3394 enter ? for help [c000ee0a3a00] c0d37378 create_physical_mapping+0x260/0x360 [c000ee0a3b10] c0d370bc create_section_mapping+0x1c/0x3c [c000ee0a3b30] c0071f54 arch_add_memory+0x74/0x130 Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-5-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 4dd13b503dbbd..5f37d8c3b4798 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -555,8 +555,10 @@ static inline int pte_present(pte_t pte) */ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) { - return __pte((((pte_basic_t)(pfn) << PAGE_SHIFT) & PTE_RPN_MASK) | - pgprot_val(pgprot)); + VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); + VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); + + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); } static inline unsigned long pte_pfn(pte_t pte) -- 2.20.1
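The two VM_BUG_ON conditions are plain bit arithmetic and can be sanity-checked in isolation. In this standalone sketch, PAGE_SHIFT and PTE_RPN_MASK are illustrative stand-ins (64K pages, a 52-bit real-address field), not the kernel's definitions:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	16
#define PTE_RPN_MASK	((((uint64_t)1 << 52) - 1) & ~(((uint64_t)1 << PAGE_SHIFT) - 1))

static int pfn_fits_pte(uint64_t pfn)
{
	if (pfn >> (64 - PAGE_SHIFT))
		return 0;	/* shifting would overflow 64 bits */
	if ((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK)
		return 0;	/* falls outside the pte's RPN field */
	return 1;
}

int main(void)
{
	printf("pfn 0x1000 : %s\n", pfn_fits_pte(0x1000) ? "ok" : "would VM_BUG_ON");
	printf("pfn 1<<50  : %s\n", pfn_fits_pte((uint64_t)1 << 50) ? "ok" : "would VM_BUG_ON");
	return 0;
}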
[PATCH AUTOSEL 4.14 16/28] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index 1a944c18c5390..3c7d859452294 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -59,8 +59,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
[PATCH AUTOSEL 4.14 15/28] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 1643e9e536557..141d192c69538 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -874,15 +874,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -971,6 +973,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1002,6 +1006,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 4.14 14/28] powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL
From: Cédric Le Goater [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ] Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in the OPAL logs and also outputs some of the fields of the internal XIVE structures in Linux. The OPAL calls can only be done on baremetal (PowerNV) and they crash a pseries machine. Fix by checking the hypervisor feature of the CPU. Signed-off-by: Cédric Le Goater Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190814154754.23682-2-...@kaod.org Signed-off-by: Sasha Levin --- arch/powerpc/xmon/xmon.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 6b9038a3e79f0..5a739588aa505 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2438,13 +2438,16 @@ static void dump_pacas(void) static void dump_one_xive(int cpu) { unsigned int hwid = get_hard_smp_processor_id(cpu); + bool hv = cpu_has_feature(CPU_FTR_HVMODE); - opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); - opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); - opal_xive_dump(XIVE_DUMP_TM_OS, hwid); - opal_xive_dump(XIVE_DUMP_TM_USER, hwid); - opal_xive_dump(XIVE_DUMP_VP, hwid); - opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + if (hv) { + opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); + opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); + opal_xive_dump(XIVE_DUMP_TM_OS, hwid); + opal_xive_dump(XIVE_DUMP_TM_USER, hwid); + opal_xive_dump(XIVE_DUMP_VP, hwid); + opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + } if (setjmp(bus_error_jmp) != 0) { catch_memory_errors = 0; -- 2.20.1
[PATCH AUTOSEL 4.19 45/50] powerpc: dump kernel log before carrying out fadump or kdump
From: Ganesh Goudar [ Upstream commit e7ca44ed3ba77fc26cf32650bb71584896662474 ] Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path"), pstore dmesg file is not updated when dump is triggered from HMC. This commit modified system reset (sreset) handler to invoke fadump or kdump (if configured), without pushing dmesg to pstore. This leaves pstore to have old dmesg data which won't be much of a help if kdump fails to capture the dump. This patch fixes that by calling kmsg_dump() before heading to fadump or kdump. Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path") Reviewed-by: Mahesh Salgaonkar Reviewed-by: Nicholas Piggin Signed-off-by: Ganesh Goudar Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190904075949.15607-1-ganes...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/traps.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 02fe6d0201741..d5f351f02c153 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -399,6 +399,7 @@ void system_reset_exception(struct pt_regs *regs) if (debugger(regs)) goto out; + kmsg_dump(KMSG_DUMP_OOPS); /* * A system reset is a request to dump, so we always send * it through the crashdump code (if fadump or kdump are -- 2.20.1
[PATCH AUTOSEL 4.19 41/50] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index ba1791fd3234d..67f49159ea708 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -325,6 +325,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 4.19 39/50] powerpc/imc: Dont create debugfs files for cpu-less nodes
From: Madhavan Srinivasan [ Upstream commit 41ba17f20ea835c489e77bd54e2da73184e22060 ] Commit <684d984038aa> ('powerpc/powernv: Add debugfs interface for imc-mode and imc') added debugfs interface for the nest imc pmu devices to support changing of different ucode modes. Primarily adding this capability for debug. But when doing so, the code did not consider the case of cpu-less nodes. So when reading the _cmd_ or _mode_ file of a cpu-less node will create this crash. Faulting instruction address: 0xc00d0d58 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19 NIP: c00d0d58 LR: c049aa18 CTR:c00d0d50 REGS: c00020194548f9e0 TRAP: 0300 Not tainted (5.2.0-rc6-next-20190627+) MSR: 90009033 CR:28022822 XER: CFAR: c049aa14 DAR: 0003fc08 DSISR:4000 IRQMASK: 0 ... NIP imc_mem_get+0x8/0x20 LR simple_attr_read+0x118/0x170 Call Trace: simple_attr_read+0x70/0x170 (unreliable) debugfs_attr_read+0x6c/0xb0 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 Patch fixes the issue with a more robust check for vbase to NULL. Before patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0imc_cmd_251 imc_cmd_253 imc_cmd_255 imc_mode_0 imc_mode_251 imc_mode_253 imc_mode_255 imc_cmd_250 imc_cmd_252 imc_cmd_254 imc_cmd_8imc_mode_250 imc_mode_252 imc_mode_254 imc_mode_8 After patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_8 imc_mode_0 imc_mode_8 Actual bug here is that, we have two loops with potentially different loop counts. That is, in imc_get_mem_addr_nest(), loop count is obtained from the dt entries. But in case of export_imc_mode_and_cmd(), loop was based on for_each_nid() count. Patch fixes the loop count in latter based on the struct mem_info. Ideally it would be better to have array size in struct imc_pmu. 
Fixes: 684d984038aa ('powerpc/powernv: Add debugfs interface for imc-mode and imc') Reported-by: Qian Cai Suggested-by: Michael Ellerman Signed-off-by: Madhavan Srinivasan Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190827101635.6942-1-ma...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/powernv/opal-imc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 828f6656f8f74..649fb268f4461 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -57,9 +57,9 @@ static void export_imc_mode_and_cmd(struct device_node *node, struct imc_pmu *pmu_ptr) { static u64 loc, *imc_mode_addr, *imc_cmd_addr; - int chip = 0, nid; char mode[16], cmd[16]; u32 cb_offset; + struct imc_mem_info *ptr = pmu_ptr->mem_info; imc_debugfs_parent = debugfs_create_dir("imc", powerpc_debugfs_root); @@ -73,20 +73,20 @@ static void export_imc_mode_and_cmd(struct device_node *node, if (of_property_read_u32(node, "cb_offset", &cb_offset)) cb_offset = IMC_CNTL_BLK_OFFSET; - for_each_node(nid) { - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; + while (ptr->vbase != NULL) { + loc = (u64)(ptr->vbase) + cb_offset; imc_mode_addr = (u64 *)(loc + IMC_CNTL_BLK_MODE_OFFSET); - sprintf(mode, "imc_mode_%d", nid); + sprintf(mode, "imc_mode_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(mode, 0600, imc_debugfs_parent, imc_mode_addr)) goto err; imc_cmd_addr = (u64 *)(loc + IMC_CNTL_BLK_CMD_OFFSET); - sprintf(cmd, "imc_cmd_%d", nid); + sprintf(cmd, "imc_cmd_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(cmd, 0600, imc_debugfs_parent, imc_cmd_addr)) goto err; - chip++; + ptr++; } return; -- 2.20.1
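The shape of the fix above, walking a NULL-terminated array supplied by the device tree instead of iterating every possible node id, can be modelled in plain C. The sketch below is illustrative only; the struct and function names are stand-ins for the kernel's imc types, not the real API:

#include <stdio.h>

/* Hypothetical model of struct imc_mem_info: one entry per populated
 * node, terminated by an entry whose vbase is NULL. */
struct mem_info_model {
    unsigned int id;    /* node id the counter region belongs to */
    void *vbase;        /* NULL marks the end of the array */
};

static void export_files(const struct mem_info_model *ptr)
{
    /* Visit exactly the entries the device tree provided, instead of
     * for_each_node(), which also visits cpu-less nodes that have no
     * counter memory at all. */
    while (ptr->vbase != NULL) {
        printf("imc_mode_%u -> %p\n", ptr->id, ptr->vbase);
        printf("imc_cmd_%u  -> %p\n", ptr->id, ptr->vbase);
        ptr++;
    }
}

int main(void)
{
    int buf0, buf8;
    /* Two populated nodes (0 and 8), NULL-terminated like mem_info. */
    struct mem_info_model info[] = {
        { 0, &buf0 }, { 8, &buf8 }, { 0, NULL },
    };

    export_files(info);
    return 0;
}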
[PATCH AUTOSEL 4.19 37/50] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 06cc77813dbb7..90af86f143a91 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -520,6 +520,10 @@ EXC_COMMON_BEGIN(machine_check_handle_early) RFI_TO_USER_OR_KERNEL 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP b machine_check_pSeries -- 2.20.1
[PATCH AUTOSEL 4.19 29/50] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their driver's handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 67619b4b3f96c..110eba400de7c 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -811,6 +811,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe) pr_warn("EEH: This PCI device has failed %d times in the last hour and will be permanently disabled after %d failures.\n", pe->freeze_count, eeh_max_freezes); + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -1004,7 +1008,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe) */ void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -1075,6 +1080,10 @@ void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS); eeh_set_channel_state(pe, pci_channel_io_perm_failure); -- 2.20.1
[PATCH AUTOSEL 4.19 27/50] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There are a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 7b60fcf04dc47..e4ea713833832 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -12,6 +12,7 @@ #include #include #include +#include <linux/sched.h> #include #include #include @@ -209,7 +210,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -318,8 +323,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
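The placement of the yields is the whole fix: one at the bottom of each platform-bounded loop level. A minimal userspace model follows, with cond_resched() stubbed out; in the kernel it comes from <linux/sched.h>, which is why the hunk above adds that include. The loop bounds here are made up:

#include <stdio.h>

/* Stub: in the kernel this yields the cpu if a reschedule is due. */
static void cond_resched(void) { }

int main(void)
{
    int outer = 4, inner = 1000;    /* platform-chosen, possibly huge */

    for (int i = 0; i < outer; i++) {
        for (int j = 0; j < inner; j++) {
            /* ... process one property/node update ... */
            cond_resched();    /* bound the time between yields */
        }
        cond_resched();        /* also yield once per outer pass */
    }
    printf("done\n");
    return 0;
}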
[PATCH AUTOSEL 4.19 26/50] powerpc/64s/radix: Remove redundant pfn_pte bitop, add VM_BUG_ON
From: Nicholas Piggin [ Upstream commit 6bb25170d7a44ef0ed9677814600f0785e7421d1 ] pfn_pte is never given a pte above the addressable physical memory limit, so the masking is redundant. In case of a software bug, it is not obviously better to silently truncate the pfn than to corrupt the pte (either one will result in memory corruption or crashes), so there is no reason to add this to the fast path. Add VM_BUG_ON to catch cases where the pfn is invalid. These would catch the create_section_mapping bug fixed by a previous commit. [16885.256466] [ cut here ] [16885.256492] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! cpu 0x0: Vector: 700 (Program Check) at [c000ee0a36d0] pc: c0080738: __map_kernel_page+0x248/0x6f0 lr: c0080ac0: __map_kernel_page+0x5d0/0x6f0 sp: c000ee0a3960 msr: 90029033 current = 0xc000ec63b400 paca= 0xc17f irqmask: 0x03 irq_happened: 0x01 pid = 85, comm = sh kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Linux version 5.3.0-rc1-1-g0fe93e5f3394 enter ? for help [c000ee0a3a00] c0d37378 create_physical_mapping+0x260/0x360 [c000ee0a3b10] c0d370bc create_section_mapping+0x1c/0x3c [c000ee0a3b30] c0071f54 arch_add_memory+0x74/0x130 Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-5-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 855dbae6d351d..a717640b7eda4 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -629,8 +629,10 @@ static inline bool pte_access_permitted(pte_t pte, bool write) */ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) { - return __pte((((pte_basic_t)(pfn) << PAGE_SHIFT) & PTE_RPN_MASK) | - pgprot_val(pgprot)); + VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); + VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); + + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); } static inline unsigned long pte_pfn(pte_t pte) -- 2.20.1
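The behavioural difference is easy to demonstrate outside the kernel. In this model the constants are illustrative stand-ins (64K pages, an RPN field in bits 16-51), not the real book3s/64 layout:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT   16
#define PTE_RPN_MASK 0x000fffffffff0000ULL

static uint64_t pfn_pte_old(uint64_t pfn, uint64_t prot)
{
    /* Old behaviour: out-of-range pfn bits are silently masked off,
     * producing a pte that points at the *wrong* physical page. */
    return ((pfn << PAGE_SHIFT) & PTE_RPN_MASK) | prot;
}

static uint64_t pfn_pte_new(uint64_t pfn, uint64_t prot)
{
    /* New behaviour: a bogus pfn trips an assertion instead. */
    assert(!(pfn >> (64 - PAGE_SHIFT)));
    assert(!((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK));
    return (pfn << PAGE_SHIFT) | prot;
}

int main(void)
{
    uint64_t bad_pfn = 1ULL << 40;    /* beyond the modeled RPN field */

    printf("old: %#llx (pfn silently truncated to zero)\n",
           (unsigned long long)pfn_pte_old(bad_pfn, 0x3));
    printf("new: the same call would abort via assert\n");
    (void)pfn_pte_new;    /* calling it with bad_pfn aborts */
    return 0;
}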
[PATCH AUTOSEL 4.19 25/50] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index 94542776a62d6..2a7b01f97a56b 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -59,8 +59,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; return ret; } -- 2.20.1
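The warning is reproducible in a few lines of standalone C: GCC cannot prove the conditional store always happens before the read, even though the caller only reads on success. The fix mirrors the patch, storing unconditionally, since the value is harmless on the error path. Names here are illustrative, not the kernel's:

#include <stdio.h>

static int op_conditional(int *oval, int fail)
{
    int ret = fail ? -1 : 0;
    int oldval = 42;

    if (!ret)          /* old code: store only on success */
        *oval = oldval;
    return ret;
}

static int op_unconditional(int *oval, int fail)
{
    int ret = fail ? -1 : 0;
    int oldval = 42;

    *oval = oldval;    /* fixed code: always store; the caller
                        * ignores the value on the error path */
    return ret;
}

int main(void)
{
    int oldval;

    /* The caller's pattern that trips -Wmaybe-uninitialized with the
     * conditional variant: read oldval only when ret is 0. */
    if (!op_unconditional(&oldval, 0))
        printf("oldval = %d\n", oldval);
    (void)op_conditional;
    return 0;
}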
[PATCH AUTOSEL 4.19 24/50] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu 7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index 8afd146bc9c70..9e41a9de43235 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -875,15 +875,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -972,6 +974,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1003,6 +1007,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
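The four-step interleaving above can be reduced to a toy model: cpu_up()/cpu_down() touch only the low-level hotplug state, while the driver-core calls keep both that and dev->offline in sync. This is a deliberately simplified sketch, not the kernel's actual data structures:

#include <stdbool.h>
#include <stdio.h>

struct cpu_dev {
    bool cpuhp_online;    /* low-level hotplug state */
    bool dev_offline;     /* what the driver core believes */
};

static void cpu_up(struct cpu_dev *c)   { c->cpuhp_online = true; }
static void cpu_down(struct cpu_dev *c) { c->cpuhp_online = false; }

/* The driver-core entry points update both pieces of state. */
static void device_online(struct cpu_dev *c)
{
    cpu_up(c);
    c->dev_offline = false;
}

static void device_offline(struct cpu_dev *c)
{
    cpu_down(c);
    c->dev_offline = true;
}

int main(void)
{
    struct cpu_dev cpu7 = { .cpuhp_online = false, .dev_offline = true };

    cpu_up(&cpu7);          /* 2. migration onlines it directly   */
    device_online(&cpu7);   /* 3. sysfs online "succeeds"         */
    cpu_down(&cpu7);        /* 4. migration restores it directly  */

    /* The driver core now thinks the cpu device is online while the
     * hotplug state says offline, so further sysfs online requests
     * are silently ignored. */
    printf("cpuhp_online=%d dev_offline=%d\n",
           cpu7.cpuhp_online, cpu7.dev_offline);
    (void)device_offline;
    return 0;
}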
[PATCH AUTOSEL 4.19 23/50] powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL
From: Cédric Le Goater [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ] Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in the OPAL logs and also outputs some of the fields of the internal XIVE structures in Linux. The OPAL calls can only be done on baremetal (PowerNV) and they crash a pseries machine. Fix by checking the hypervisor feature of the CPU. Signed-off-by: Cédric Le Goater Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190814154754.23682-2-...@kaod.org Signed-off-by: Sasha Levin --- arch/powerpc/xmon/xmon.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 74cfc1be04d6e..bb5db7bfd8539 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2497,13 +2497,16 @@ static void dump_pacas(void) static void dump_one_xive(int cpu) { unsigned int hwid = get_hard_smp_processor_id(cpu); + bool hv = cpu_has_feature(CPU_FTR_HVMODE); - opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); - opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); - opal_xive_dump(XIVE_DUMP_TM_OS, hwid); - opal_xive_dump(XIVE_DUMP_TM_USER, hwid); - opal_xive_dump(XIVE_DUMP_VP, hwid); - opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + if (hv) { + opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); + opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); + opal_xive_dump(XIVE_DUMP_TM_OS, hwid); + opal_xive_dump(XIVE_DUMP_TM_USER, hwid); + opal_xive_dump(XIVE_DUMP_VP, hwid); + opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + } if (setjmp(bus_error_jmp) != 0) { catch_memory_errors = 0; -- 2.20.1
[PATCH AUTOSEL 4.19 17/50] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window
From: Alexey Kardashevskiy [ Upstream commit c37c792dec0929dbb6360a609fb00fa20bb16fc2 ] We allocate only the first level of multilevel TCE tables for KVM already (alloc_userspace_copy==true), and the rest is allocated on demand. This is not enabled though for bare metal. This removes the KVM limitation (implicit, via the alloc_userspace_copy parameter) and always allocates just the first level. The on-demand allocation of missing levels is already implemented. As from now on DMA map might happen with disabled interrupts, this allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1]. In practice just a single page is allocated there so chances for failure are quite low. To save time when creating a new clean table, this skips non-allocated indirect TCE entries in pnv_tce_free just like we already do in the VFIO IOMMU TCE driver. This changes the default level number from 1 to 2 to reduce the amount of memory required for the default 32bit DMA window at the boot time. The default window size is up to 2GB which requires 4MB of TCEs which is unlikely to be used entirely or at all as most devices these days are 64bit capable so by switching to 2 levels by default we save 4032KB of RAM per a device. While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace can trigger this path via VFIO, see the failure and try creating a table again with different parameters which might succeed. [1]: === BUG: sleeping function called from invalid context at mm/page_alloc.c:4596 in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1 2 locks held by scsi_eh_1/1038: #0: 5efd659a (>eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80 #1: 06cf56a6 (&(>lock)->rlock){}, at: ata_exec_internal_sg+0xb0/0x5c0 irq event stamp: 500 hardirqs last enabled at (499): [] _raw_spin_unlock_irqrestore+0x94/0xd0 hardirqs last disabled at (500): [] _raw_spin_lock_irqsave+0x44/0x120 softirqs last enabled at (0): [] copy_process.isra.4.part.5+0x640/0x1a80 softirqs last disabled at (0): [<>] 0x0 CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Call Trace: [c03d064cef50] [c0c8e6c4] dump_stack+0xe8/0x164 (unreliable) [c03d064cefa0] [c014ed78] ___might_sleep+0x2f8/0x310 [c03d064cf020] [c03ca084] __alloc_pages_nodemask+0x2a4/0x1560 [c03d064cf220] [c00c2530] pnv_alloc_tce_level.isra.0+0x90/0x130 [c03d064cf290] [c00c2888] pnv_tce+0x128/0x3b0 [c03d064cf360] [c00c2c00] pnv_tce_build+0xb0/0xf0 [c03d064cf3c0] [c00bbd9c] pnv_ioda2_tce_build+0x3c/0xb0 [c03d064cf400] [c004cfe0] ppc_iommu_map_sg+0x210/0x550 [c03d064cf510] [c004b7a4] dma_iommu_map_sg+0x74/0xb0 [c03d064cf530] [c0863944] ata_qc_issue+0x134/0x470 [c03d064cf5b0] [c0863ec4] ata_exec_internal_sg+0x244/0x5c0 [c03d064cf700] [c08642d0] ata_exec_internal+0x90/0xe0 [c03d064cf780] [c08650ac] ata_dev_read_id+0x2ec/0x640 [c03d064cf8d0] [c0878e28] ata_eh_recover+0x948/0x16d0 [c03d064cfa10] [c087d760] sata_pmp_error_handler+0x480/0xbf0 [c03d064cfbc0] [c0884624] ahci_error_handler+0x74/0xe0 [c03d064cfbf0] [c0879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0 [c03d064cfca0] [c087a544] ata_scsi_error+0xb4/0x100 [c03d064cfd00] [c0802450] scsi_error_handler+0x120/0x510 [c03d064cfdb0] [c0140c48] kthread+0x1b8/0x1c0 [c03d064cfe20] [c000bd8c] ret_from_kernel_thread+0x5c/0x70 ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) irq event stamp: 2305 hardirqs last enabled at (2305): [] fast_exc_return_irq+0x28/0x34 hardirqs last disabled at (2303): [] __do_softirq+0x4a0/0x654 WARNING: possible irq lock inversion dependency detected 
5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: GW softirqs last enabled at (2304): [] __do_softirq+0x524/0x654 softirqs last disabled at (2297): [] irq_exit+0x128/0x180 swapper/0/0 just changed the state of lock: 06cf56a6 (&(>lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120 but this lock took another, HARDIRQ-unsafe lock in the past: (fs_reclaim){+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0CPU1 lock(fs_reclaim); local_irq_disable(); lock(&(>lock)->rlock); lock(fs_reclaim); lock(&(>lock)->rlock); *** DEADLOCK *** no locks held by swapper/0/0. the shortest dependencies between 2nd lock and 1st lock: -> (fs_reclaim){+.+.} ops: 167579 { HARDIRQ-ON-W at: lock_acquire+0xf8/0x2a0
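Setting the lockdep splat aside, the core idea of the patch, allocating only the top table level up front, populating lower levels on first use, and skipping never-touched levels when freeing, is captured by this toy two-level table. Sizes and names are made up for illustration and do not reflect real TCE geometry:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define L1_ENTRIES 8
#define L2_ENTRIES 512

static uint64_t *l1[L1_ENTRIES];    /* NULL = level never allocated */

static void tce_set(size_t idx, uint64_t val)
{
    size_t i = idx / L2_ENTRIES, j = idx % L2_ENTRIES;

    if (!l1[i]) {    /* allocate the missing level on demand */
        l1[i] = calloc(L2_ENTRIES, sizeof(uint64_t));
        if (!l1[i])
            exit(1);
    }
    l1[i][j] = val;
}

static void tce_free_all(void)
{
    for (size_t i = 0; i < L1_ENTRIES; i++) {
        if (!l1[i])    /* skip levels that were never touched */
            continue;
        free(l1[i]);
        l1[i] = NULL;
    }
}

int main(void)
{
    tce_set(5, 0xdeadbeef);    /* touches exactly one 2nd-level page */
    tce_free_all();
    printf("only 1 of %d second-level pages was ever allocated\n",
           L1_ENTRIES);
    return 0;
}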
[PATCH AUTOSEL 4.19 11/50] PCI: rpaphp: Avoid a sometimes-uninitialized warning
From: Nathan Chancellor [ Upstream commit 0df3e42167caaf9f8c7b64de3da40a459979afe8 ] When building with -Wsometimes-uninitialized, clang warns: drivers/pci/hotplug/rpaphp_core.c:243:14: warning: variable 'fndit' is used uninitialized whenever 'for' loop exits because its condition is false [-Wsometimes-uninitialized] for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:256:6: note: uninitialized use occurs here if (fndit) ^ drivers/pci/hotplug/rpaphp_core.c:243:14: note: remove the condition if it is always true for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:233:14: note: initialize the variable 'fndit' to silence this warning int j, fndit; ^ = 0 fndit is only used to gate a sprintf call, which can be moved into the loop to simplify the code and eliminate the local variable, which will fix this warning. Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info property") Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Acked-by: Tyrel Datwyler Acked-by: Joel Savitz Signed-off-by: Michael Ellerman Link: https://github.com/ClangBuiltLinux/linux/issues/504 Link: https://lore.kernel.org/r/20190603221157.58502-1-natechancel...@gmail.com Signed-off-by: Sasha Levin --- drivers/pci/hotplug/rpaphp_core.c | 18 +++--- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c index 857c358b727b8..cc860c5f7d26f 100644 --- a/drivers/pci/hotplug/rpaphp_core.c +++ b/drivers/pci/hotplug/rpaphp_core.c @@ -230,7 +230,7 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, struct of_drc_info drc; const __be32 *value; char cell_drc_name[MAX_DRC_NAME_LEN]; - int j, fndit; + int j; info = of_find_property(dn->parent, "ibm,drc-info", NULL); if (info == NULL) @@ -245,17 +245,13 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, /* Should now know end of current entry */ - if (my_index > drc.last_drc_index) - continue; - - fndit = 1; - break; + /* Found it */ + if (my_index <= drc.last_drc_index) { + sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, + my_index); + break; + } } - /* Found it */ - - if (fndit) - sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, - my_index); if (((drc_name == NULL) || (drc_name && !strcmp(drc_name, cell_drc_name))) && -- 2.20.1
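The transformation generalizes: whenever a flag set inside a loop only gates work done after the loop, doing the work at the match site removes both the flag and the warning. A standalone model of the before/after, where find_old intentionally preserves the bug and the table values are arbitrary:

#include <stdio.h>

static const int table[] = { 10, 20, 30 };

static void find_old(int key, char *out)
{
    int j, found;               /* 'found' may stay uninitialized... */

    for (j = 0; j < 3; j++) {
        if (key > table[j])
            continue;
        found = 1;
        break;
    }
    if (found)                  /* ...yet is read here */
        sprintf(out, "entry%d", j);
}

static void find_new(int key, char *out)
{
    for (int j = 0; j < 3; j++) {
        if (key <= table[j]) {  /* do the work at the match site */
            sprintf(out, "entry%d", j);
            break;
        }
    }
}

int main(void)
{
    char buf[16] = "";

    find_new(15, buf);
    printf("%s\n", buf);
    (void)find_old;    /* kept only to show the warned-about shape */
    return 0;
}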
[PATCH AUTOSEL 5.2 63/70] powerpc: dump kernel log before carrying out fadump or kdump
From: Ganesh Goudar [ Upstream commit e7ca44ed3ba77fc26cf32650bb71584896662474 ] Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path"), the pstore dmesg file is not updated when a dump is triggered from the HMC. That commit modified the system reset (sreset) handler to invoke fadump or kdump (if configured), without pushing dmesg to pstore. This leaves pstore with old dmesg data, which won't be much help if kdump fails to capture the dump. This patch fixes that by calling kmsg_dump() before heading to fadump or kdump. Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path") Reviewed-by: Mahesh Salgaonkar Reviewed-by: Nicholas Piggin Signed-off-by: Ganesh Goudar Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190904075949.15607-1-ganes...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/traps.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 47df30982de1b..c8ea3a253b815 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -472,6 +472,7 @@ void system_reset_exception(struct pt_regs *regs) if (debugger(regs)) goto out; + kmsg_dump(KMSG_DUMP_OOPS); /* * A system reset is a request to dump, so we always send * it through the crashdump code (if fadump or kdump are -- 2.20.1
[PATCH AUTOSEL 5.2 55/70] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 8fa012a65a712..cc682759feae8 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -344,6 +344,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
[PATCH AUTOSEL 5.2 53/70] powerpc/imc: Dont create debugfs files for cpu-less nodes
From: Madhavan Srinivasan [ Upstream commit 41ba17f20ea835c489e77bd54e2da73184e22060 ] Commit <684d984038aa> ('powerpc/powernv: Add debugfs interface for imc-mode and imc') added a debugfs interface for the nest imc pmu devices to support changing of different ucode modes, primarily adding this capability for debug. But when doing so, the code did not consider the case of cpu-less nodes, so reading the _cmd_ or _mode_ file of a cpu-less node triggers this crash. Faulting instruction address: 0xc00d0d58 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19 NIP: c00d0d58 LR: c049aa18 CTR: c00d0d50 REGS: c00020194548f9e0 TRAP: 0300 Not tainted (5.2.0-rc6-next-20190627+) MSR: 90009033 CR: 28022822 XER: CFAR: c049aa14 DAR: 0003fc08 DSISR: 4000 IRQMASK: 0 ... NIP imc_mem_get+0x8/0x20 LR simple_attr_read+0x118/0x170 Call Trace: simple_attr_read+0x70/0x170 (unreliable) debugfs_attr_read+0x6c/0xb0 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 The patch fixes the issue with a more robust NULL check on vbase. Before the patch, ls output for the debugfs imc directory: # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_251 imc_cmd_253 imc_cmd_255 imc_mode_0 imc_mode_251 imc_mode_253 imc_mode_255 imc_cmd_250 imc_cmd_252 imc_cmd_254 imc_cmd_8 imc_mode_250 imc_mode_252 imc_mode_254 imc_mode_8 After the patch: # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_8 imc_mode_0 imc_mode_8 The actual bug here is that we have two loops with potentially different loop counts: in imc_get_mem_addr_nest() the loop count is obtained from the dt entries, but in export_imc_mode_and_cmd() the loop was based on the for_each_node() count. The patch fixes the loop count in the latter based on struct imc_mem_info. Ideally it would be better to have the array size in struct imc_pmu.
Fixes: 684d984038aa ('powerpc/powernv: Add debugfs interface for imc-mode and imc') Reported-by: Qian Cai Suggested-by: Michael Ellerman Signed-off-by: Madhavan Srinivasan Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190827101635.6942-1-ma...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/powernv/opal-imc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 186109bdd41be..e04b20625cb94 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -53,9 +53,9 @@ static void export_imc_mode_and_cmd(struct device_node *node, struct imc_pmu *pmu_ptr) { static u64 loc, *imc_mode_addr, *imc_cmd_addr; - int chip = 0, nid; char mode[16], cmd[16]; u32 cb_offset; + struct imc_mem_info *ptr = pmu_ptr->mem_info; imc_debugfs_parent = debugfs_create_dir("imc", powerpc_debugfs_root); @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct device_node *node, if (of_property_read_u32(node, "cb_offset", &cb_offset)) cb_offset = IMC_CNTL_BLK_OFFSET; - for_each_node(nid) { - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; + while (ptr->vbase != NULL) { + loc = (u64)(ptr->vbase) + cb_offset; imc_mode_addr = (u64 *)(loc + IMC_CNTL_BLK_MODE_OFFSET); - sprintf(mode, "imc_mode_%d", nid); + sprintf(mode, "imc_mode_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(mode, 0600, imc_debugfs_parent, imc_mode_addr)) goto err; imc_cmd_addr = (u64 *)(loc + IMC_CNTL_BLK_CMD_OFFSET); - sprintf(cmd, "imc_cmd_%d", nid); + sprintf(cmd, "imc_cmd_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(cmd, 0600, imc_debugfs_parent, imc_cmd_addr)) goto err; - chip++; + ptr++; } return; -- 2.20.1
[PATCH AUTOSEL 5.2 52/70] powerpc/eeh: Clean up EEH PEs after recovery finishes
From: Oliver O'Halloran [ Upstream commit 799abe283e5103d48e079149579b4f167c95ea0e ] When the last device in an eeh_pe is removed the eeh_pe structure itself (and any empty parents) are freed since they are no longer needed. This results in a crash when a hotplug driver is involved since the following may occur: 1. Device is surprise removed. 2. Driver performs an MMIO, which fails and queues an eeh_event. 3. Hotplug driver receives a hotplug interrupt and removes any pci_devs that were under the slot. 4. pci_dev is torn down and the eeh_pe is freed. 5. The EEH event handler thread processes the eeh_event and crashes since the eeh_pe pointer in the eeh_event structure is no longer valid. Crashing is generally considered poor form. Instead of doing that, use the fact that PEs are marked as EEH_PE_INVALID to keep them around until the end of the recovery cycle, at which point we can safely prune any empty PEs. Signed-off-by: Oliver O'Halloran Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190903101605.2890-2-ooh...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 36 ++-- arch/powerpc/kernel/eeh_event.c | 8 +++ arch/powerpc/kernel/eeh_pe.c | 23 +++- 3 files changed, 64 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 1fbe541856f5e..fe0c32fb9f96f 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -744,6 +744,33 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus, */ #define MAX_WAIT_FOR_RECOVERY 300 + +/* Walks the PE tree after processing an event to remove any stale PEs. + * + * NB: This needs to be recursive to ensure the leaf PEs get removed + * before their parents do. Although this is possible to do recursively + * we don't since this is easier to read and we need to guarantee + * the leaf nodes will be handled first. + */ +static void eeh_pe_cleanup(struct eeh_pe *pe) +{ + struct eeh_pe *child_pe, *tmp; + + list_for_each_entry_safe(child_pe, tmp, &pe->child_list, child) + eeh_pe_cleanup(child_pe); + + if (pe->state & EEH_PE_KEEP) + return; + + if (!(pe->state & EEH_PE_INVALID)) + return; + + if (list_empty(&pe->edevs) && list_empty(&pe->child_list)) { + list_del(&pe->child); + kfree(pe); + } +} + /** * eeh_handle_normal_event - Handle EEH events on a specific PE * @pe: EEH PE - which should not be used after we return, as it may @@ -782,8 +809,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } - eeh_pe_state_mark(pe, EEH_PE_RECOVERING); - eeh_pe_update_time_stamp(pe); pe->freeze_count++; if (pe->freeze_count > eeh_max_freezes) { @@ -973,6 +998,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } } + + /* +* Clean up any PEs without devices. While marked as EEH_PE_RECOVERING +* we don't want to modify the PE tree structure so we do it here. +*/ + eeh_pe_cleanup(pe); eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true); } @@ -1045,6 +1076,7 @@ void eeh_handle_special_event(void) */ if (rc == EEH_NEXT_ERR_FROZEN_PE || rc == EEH_NEXT_ERR_FENCED_PHB) { + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); eeh_handle_normal_event(pe); } else { pci_lock_rescan_remove(); diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c index 64cfbe41174b2..e36653e5f76b3 100644 --- a/arch/powerpc/kernel/eeh_event.c +++ b/arch/powerpc/kernel/eeh_event.c @@ -121,6 +121,14 @@ int __eeh_send_failure_event(struct eeh_pe *pe) } event->pe = pe; + /* +* Mark the PE as recovering before inserting it in the queue.
+* This prevents the PE from being free()ed by a hotplug driver +* while the PE is sitting in the event queue. +*/ + if (pe) + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); + /* We may or may not be called in an interrupt context */ spin_lock_irqsave(&eeh_eventlist_lock, flags); list_add(&event->list, &eeh_eventlist); diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index 854cef7b18f4d..f0813d50e0b1c 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -491,6 +491,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev) int eeh_rmv_from_parent_pe(struct eeh_dev *edev) { struct eeh_pe *pe, *parent, *child; + bool keep, recover; int cnt; struct pci_dn *pdn = eeh_dev_to_pdn(edev); @@ -516,10 +517,21 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev) */ while (1) { parent =
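The cleanup ordering matters: leaves must be freed before their parents are considered, and a node survives if it is kept, still valid, or non-empty. A compact single-child model of that rule follows; the flag names are shortened stand-ins, and the real code walks a child list rather than a single pointer:

#include <stdio.h>
#include <stdlib.h>

struct pe {
    int invalid;        /* models EEH_PE_INVALID */
    int keep;           /* models EEH_PE_KEEP    */
    int ndevs;          /* models the edevs list */
    struct pe *child;   /* single child, for brevity */
};

static struct pe *pe_cleanup(struct pe *pe)
{
    if (!pe)
        return NULL;

    pe->child = pe_cleanup(pe->child);    /* leaves go first */

    if (pe->keep || !pe->invalid || pe->ndevs || pe->child)
        return pe;                        /* still in use */

    free(pe);
    return NULL;
}

int main(void)
{
    struct pe *leaf = calloc(1, sizeof(*leaf));
    struct pe *root = calloc(1, sizeof(*root));

    leaf->invalid = 1;    /* empty and invalidated during recovery */
    root->invalid = 1;
    root->child = leaf;

    root = pe_cleanup(root);
    printf("root %s\n", root ? "kept" : "freed");
    return 0;
}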
[PATCH AUTOSEL 5.2 50/70] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 6c51aa845bcee..3e564536a237f 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -556,6 +556,10 @@ FTR_SECTION_ELSE ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE) 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP SET_SCRATCH0(r13) /* save r13 */ EXCEPTION_PROLOG_0(PACA_EXMC) -- 2.20.1
[PATCH AUTOSEL 5.2 48/70] selftests/powerpc: Retry on host facility unavailable
From: Gustavo Romero [ Upstream commit 6652bf6408895b09d31fd4128a1589a1a0672823 ] The TM test tm-unavailable must take into account aborts due to the host aborting a transaction because of a facility unavailable exception, just like it already does for aborts on reschedules (TM_CAUSE_KVM_RESCHED). Reported-by: Desnes A. Nunes do Rosario Tested-by: Desnes A. Nunes do Rosario Signed-off-by: Gustavo Romero Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/1566341651-19747-1-git-send-email-grom...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- tools/testing/selftests/powerpc/tm/tm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/powerpc/tm/tm.h b/tools/testing/selftests/powerpc/tm/tm.h index 97f9f491c541a..c402464b038fc 100644 --- a/tools/testing/selftests/powerpc/tm/tm.h +++ b/tools/testing/selftests/powerpc/tm/tm.h @@ -55,7 +55,8 @@ static inline bool failure_is_unavailable(void) static inline bool failure_is_reschedule(void) { if ((failure_code() & TM_CAUSE_RESCHED) == TM_CAUSE_RESCHED || - (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED) + (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED || + (failure_code() & TM_CAUSE_KVM_FAC_UNAV) == TM_CAUSE_KVM_FAC_UNAV) return true; return false; -- 2.20.1
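For reference, the predicate tests each cause with (code & CAUSE) == CAUSE, so multi-bit cause values must match in full rather than merely intersect. A standalone model with placeholder constants, not the real TM_CAUSE_* values:

#include <stdio.h>

#define CAUSE_RESCHED      0x0c
#define CAUSE_KVM_RESCHED  0x14
#define CAUSE_KVM_FAC_UNAV 0x18

static int failure_is_reschedule(unsigned int code)
{
    /* Each cause must match all of its bits, mirroring the test. */
    return (code & CAUSE_RESCHED) == CAUSE_RESCHED ||
           (code & CAUSE_KVM_RESCHED) == CAUSE_KVM_RESCHED ||
           (code & CAUSE_KVM_FAC_UNAV) == CAUSE_KVM_FAC_UNAV;
}

int main(void)
{
    printf("fac_unav treated as retryable: %d\n",
           failure_is_reschedule(CAUSE_KVM_FAC_UNAV));
    return 0;
}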
[PATCH AUTOSEL 5.2 40/70] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their driver's handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 89623962c7275..1fbe541856f5e 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -793,6 +793,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe) result = PCI_ERS_RESULT_DISCONNECT; } + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -981,7 +985,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe) */ void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -1050,6 +1055,10 @@ void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true); eeh_set_channel_state(pe, pci_channel_io_perm_failure); -- 2.20.1
[PATCH AUTOSEL 5.2 38/70] powerpc/perf: fix imc allocation failure handling
From: Nicholas Piggin [ Upstream commit 10c4bd7cd28e77aeb8cfa65b23cb3c632ede2a49 ] The alloc_pages_node return value should be tested for failure before being passed to page_address. Tested-by: Anju T Sudhakar Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-3-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/perf/imc-pmu.c | 29 ++--- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 3bdfc1e320964..2231959c56331 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -570,6 +570,7 @@ static int core_imc_mem_init(int cpu, int size) { int nid, rc = 0, core_id = (cpu / threads_per_core); struct imc_mem_info *mem_info; + struct page *page; /* * alloc_pages_node() will allocate memory for core in the @@ -580,11 +581,12 @@ static int core_imc_mem_init(int cpu, int size) mem_info->id = core_id; /* We need only vbase for core counters */ - mem_info->vbase = page_address(alloc_pages_node(nid, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!mem_info->vbase) + page = alloc_pages_node(nid, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + mem_info->vbase = page_address(page); /* Init the mutex */ core_imc_refc[core_id].id = core_id; @@ -839,15 +841,17 @@ static int thread_imc_mem_alloc(int cpu_id, int size) int nid = cpu_to_node(cpu_id); if (!local_mem) { + struct page *page; /* * This case could happen only once at start, since we dont * free the memory in cpu offline path. */ - local_mem = page_address(alloc_pages_node(nid, + page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(thread_imc_mem, cpu_id) = local_mem; } @@ -1085,11 +1089,14 @@ static int trace_imc_mem_alloc(int cpu_id, int size) int core_id = (cpu_id / threads_per_core); if (!local_mem) { - local_mem = page_address(alloc_pages_node(phys_id, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + struct page *page; + + page = alloc_pages_node(phys_id, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(trace_imc_mem, cpu_id) = local_mem; /* Initialise the counters for trace mode */ -- 2.20.1
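The pattern being removed, feeding an allocator's result straight into an accessor that dereferences it, is worth seeing in isolation. A userspace analog with modelled types, not the kernel's:

#include <stdio.h>
#include <stdlib.h>

struct page_model { void *addr; };

static struct page_model *alloc_pages_model(int fail)
{
    if (fail)
        return NULL;    /* the allocation-failure path */
    struct page_model *p = malloc(sizeof(*p));
    if (p)
        p->addr = p;
    return p;
}

static void *page_address_model(struct page_model *page)
{
    return page->addr;  /* crashes if page == NULL, just like
                         * page_address(alloc_pages_node(...)) would */
}

int main(void)
{
    struct page_model *page;

    /* Failure path: check the page before taking its address. */
    page = alloc_pages_model(1);
    if (!page)
        fprintf(stderr, "allocation failed, handled cleanly\n");

    /* Success path. */
    page = alloc_pages_model(0);
    if (!page)
        return 1;
    printf("vbase = %p\n", page_address_model(page));
    free(page);
    return 0;
}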
[PATCH AUTOSEL 5.2 37/70] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There are a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index 50e7aee3c7f37..accb732dcfac7 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -9,6 +9,7 @@ #include #include #include +#include <linux/sched.h> #include #include #include @@ -206,7 +207,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -309,8 +314,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
[PATCH AUTOSEL 5.2 36/70] powerpc/64s/radix: Fix memory hotplug section page table creation
From: Nicholas Piggin [ Upstream commit 8f51e3929470942e6a8744061254fdeef646cd36 ] create_physical_mapping expects physical addresses, but creating and splitting these mappings after boot is supplying virtual (effective) addresses. This can be irritated by booting with mem= to limit memory then probing an unused physical memory range: echo > /sys/devices/system/memory/probe This mostly works by accident, firstly because __va(__va(x)) == __va(x) so the virtual address does not get corrupted. Secondly because pfn_pte masks out the upper bits of the pfn beyond the physical address limit, so a pfn constructed with a 0xc000 virtual linear address will be masked back to the correct physical address in the pte. Fixes: 6cc27341b21a8 ("powerpc/mm: add radix__create_section_mapping()") Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-1-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 8deb432c29754..2b6cc823046a3 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -901,7 +901,7 @@ int __meminit radix__create_section_mapping(unsigned long start, unsigned long e return -1; } - return create_physical_mapping(start, end, nid); + return create_physical_mapping(__pa(start), __pa(end), nid); } int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end) -- 2.20.1
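Why the bug "mostly worked" can be shown with two one-line identities. Model the linear map as __va(x) = x | KERN_BASE and __pa(x) = x & ~KERN_BASE, an illustrative simplification of the real powerpc layout:

#include <stdint.h>
#include <stdio.h>

#define KERN_BASE 0xc000000000000000ULL

static uint64_t va(uint64_t x) { return x | KERN_BASE; }
static uint64_t pa(uint64_t x) { return x & ~KERN_BASE; }

int main(void)
{
    uint64_t phys = 0x20000000ULL;

    /* __va() is idempotent, so handing an already-virtual address to
     * code that will apply __va() again does not corrupt it... */
    printf("va(va(p)) == va(p): %d\n", va(va(phys)) == va(phys));

    /* ...and pfn masking used to hide the mistake too. The fix is
     * simply to hand create_physical_mapping() __pa() of the range: */
    printf("pa(va(p)) == p: %d\n", pa(va(phys)) == phys);
    return 0;
}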
[PATCH AUTOSEL 5.2 35/70] powerpc/64s/radix: Remove redundant pfn_pte bitop, add VM_BUG_ON
From: Nicholas Piggin [ Upstream commit 6bb25170d7a44ef0ed9677814600f0785e7421d1 ] pfn_pte is never given a pte above the addressable physical memory limit, so the masking is redundant. In case of a software bug, it is not obviously better to silently truncate the pfn than to corrupt the pte (either one will result in memory corruption or crashes), so there is no reason to add this to the fast path. Add VM_BUG_ON to catch cases where the pfn is invalid. These would catch the create_section_mapping bug fixed by a previous commit. [16885.256466] [ cut here ] [16885.256492] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! cpu 0x0: Vector: 700 (Program Check) at [c000ee0a36d0] pc: c0080738: __map_kernel_page+0x248/0x6f0 lr: c0080ac0: __map_kernel_page+0x5d0/0x6f0 sp: c000ee0a3960 msr: 90029033 current = 0xc000ec63b400 paca= 0xc17f irqmask: 0x03 irq_happened: 0x01 pid = 85, comm = sh kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Linux version 5.3.0-rc1-1-g0fe93e5f3394 enter ? for help [c000ee0a3a00] c0d37378 create_physical_mapping+0x260/0x360 [c000ee0a3b10] c0d370bc create_section_mapping+0x1c/0x3c [c000ee0a3b30] c0071f54 arch_add_memory+0x74/0x130 Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-5-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/book3s/64/pgtable.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index ccf00a8b98c6a..c470f6370dccb 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -602,8 +602,10 @@ static inline bool pte_access_permitted(pte_t pte, bool write) */ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot) { - return __pte((((pte_basic_t)(pfn) << PAGE_SHIFT) & PTE_RPN_MASK) | - pgprot_val(pgprot)); + VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); + VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); + + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); } static inline unsigned long pte_pfn(pte_t pte) -- 2.20.1
[PATCH AUTOSEL 5.2 34/70] powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this function
From: Christophe Leroy [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy [mpe: reword change log slightly] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.le...@c-s.fr Signed-off-by: Sasha Levin --- arch/powerpc/include/asm/futex.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index 3a6aa57b9d901..eea28ca679dbb 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -60,8 +60,7 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, pagefault_enable(); - if (!ret) - *oval = oldval; + *oval = oldval; prevent_write_to_user(uaddr, sizeof(*uaddr)); return ret; -- 2.20.1
[PATCH AUTOSEL 5.2 33/70] powerpc/rtas: use device model APIs and serialization during LPM
From: Nathan Lynch [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu 7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch Reviewed-by: Gautham R. Shenoy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-2-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/rtas.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c index fff2eb22427d0..65cd96c3b1d60 100644 --- a/arch/powerpc/kernel/rtas.c +++ b/arch/powerpc/kernel/rtas.c @@ -871,15 +871,17 @@ static int rtas_cpu_state_change_mask(enum rtas_cpu_state state, return 0; for_each_cpu(cpu, cpus) { + struct device *dev = get_cpu_device(cpu); + switch (state) { case DOWN: - cpuret = cpu_down(cpu); + cpuret = device_offline(dev); break; case UP: - cpuret = cpu_up(cpu); + cpuret = device_online(dev); break; } - if (cpuret) { + if (cpuret < 0) { pr_debug("%s: cpu_%s for cpu#%d returned %d.\n", __func__, ((state == UP) ? "up" : "down"), @@ -968,6 +970,8 @@ int rtas_ibm_suspend_me(u64 handle) data.token = rtas_token("ibm,suspend-me"); data.complete = &done; + lock_device_hotplug(); + /* All present CPUs must be online */ cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask); cpuret = rtas_online_cpus_mask(offline_mask); @@ -1007,6 +1011,7 @@ int rtas_ibm_suspend_me(u64 handle) __func__); out: + unlock_device_hotplug(); free_cpumask_var(offline_mask); return atomic_read(&data.error); } -- 2.20.1
[PATCH AUTOSEL 5.2 32/70] powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL
From: Cédric Le Goater [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ] Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in the OPAL logs and also outputs some of the fields of the internal XIVE structures in Linux. The OPAL calls can only be done on baremetal (PowerNV) and they crash a pseries machine. Fix by checking the hypervisor feature of the CPU. Signed-off-by: Cédric Le Goater Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190814154754.23682-2-...@kaod.org Signed-off-by: Sasha Levin --- arch/powerpc/xmon/xmon.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 4a721fd624069..e15ccf19c1533 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2532,13 +2532,16 @@ static void dump_pacas(void) static void dump_one_xive(int cpu) { unsigned int hwid = get_hard_smp_processor_id(cpu); - - opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); - opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); - opal_xive_dump(XIVE_DUMP_TM_OS, hwid); - opal_xive_dump(XIVE_DUMP_TM_USER, hwid); - opal_xive_dump(XIVE_DUMP_VP, hwid); - opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + bool hv = cpu_has_feature(CPU_FTR_HVMODE); + + if (hv) { + opal_xive_dump(XIVE_DUMP_TM_HYP, hwid); + opal_xive_dump(XIVE_DUMP_TM_POOL, hwid); + opal_xive_dump(XIVE_DUMP_TM_OS, hwid); + opal_xive_dump(XIVE_DUMP_TM_USER, hwid); + opal_xive_dump(XIVE_DUMP_VP, hwid); + opal_xive_dump(XIVE_DUMP_EMU_STATE, hwid); + } if (setjmp(bus_error_jmp) != 0) { catch_memory_errors = 0; -- 2.20.1
[PATCH AUTOSEL 5.2 25/70] powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window
From: Alexey Kardashevskiy [ Upstream commit c37c792dec0929dbb6360a609fb00fa20bb16fc2 ] We allocate only the first level of multilevel TCE tables for KVM already (alloc_userspace_copy==true), and the rest is allocated on demand. This is not enabled though for bare metal. This removes the KVM limitation (implicit, via the alloc_userspace_copy parameter) and always allocates just the first level. The on-demand allocation of missing levels is already implemented. As from now on DMA map might happen with disabled interrupts, this allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1]. In practice just a single page is allocated there so chances for failure are quite low. To save time when creating a new clean table, this skips non-allocated indirect TCE entries in pnv_tce_free just like we already do in the VFIO IOMMU TCE driver. This changes the default level number from 1 to 2 to reduce the amount of memory required for the default 32bit DMA window at the boot time. The default window size is up to 2GB which requires 4MB of TCEs which is unlikely to be used entirely or at all as most devices these days are 64bit capable so by switching to 2 levels by default we save 4032KB of RAM per a device. While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace can trigger this path via VFIO, see the failure and try creating a table again with different parameters which might succeed. [1]: === BUG: sleeping function called from invalid context at mm/page_alloc.c:4596 in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1 2 locks held by scsi_eh_1/1038: #0: 5efd659a (>eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80 #1: 06cf56a6 (&(>lock)->rlock){}, at: ata_exec_internal_sg+0xb0/0x5c0 irq event stamp: 500 hardirqs last enabled at (499): [] _raw_spin_unlock_irqrestore+0x94/0xd0 hardirqs last disabled at (500): [] _raw_spin_lock_irqsave+0x44/0x120 softirqs last enabled at (0): [] copy_process.isra.4.part.5+0x640/0x1a80 softirqs last disabled at (0): [<>] 0x0 CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Call Trace: [c03d064cef50] [c0c8e6c4] dump_stack+0xe8/0x164 (unreliable) [c03d064cefa0] [c014ed78] ___might_sleep+0x2f8/0x310 [c03d064cf020] [c03ca084] __alloc_pages_nodemask+0x2a4/0x1560 [c03d064cf220] [c00c2530] pnv_alloc_tce_level.isra.0+0x90/0x130 [c03d064cf290] [c00c2888] pnv_tce+0x128/0x3b0 [c03d064cf360] [c00c2c00] pnv_tce_build+0xb0/0xf0 [c03d064cf3c0] [c00bbd9c] pnv_ioda2_tce_build+0x3c/0xb0 [c03d064cf400] [c004cfe0] ppc_iommu_map_sg+0x210/0x550 [c03d064cf510] [c004b7a4] dma_iommu_map_sg+0x74/0xb0 [c03d064cf530] [c0863944] ata_qc_issue+0x134/0x470 [c03d064cf5b0] [c0863ec4] ata_exec_internal_sg+0x244/0x5c0 [c03d064cf700] [c08642d0] ata_exec_internal+0x90/0xe0 [c03d064cf780] [c08650ac] ata_dev_read_id+0x2ec/0x640 [c03d064cf8d0] [c0878e28] ata_eh_recover+0x948/0x16d0 [c03d064cfa10] [c087d760] sata_pmp_error_handler+0x480/0xbf0 [c03d064cfbc0] [c0884624] ahci_error_handler+0x74/0xe0 [c03d064cfbf0] [c0879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0 [c03d064cfca0] [c087a544] ata_scsi_error+0xb4/0x100 [c03d064cfd00] [c0802450] scsi_error_handler+0x120/0x510 [c03d064cfdb0] [c0140c48] kthread+0x1b8/0x1c0 [c03d064cfe20] [c000bd8c] ret_from_kernel_thread+0x5c/0x70 ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) irq event stamp: 2305 hardirqs last enabled at (2305): [] fast_exc_return_irq+0x28/0x34 hardirqs last disabled at (2303): [] __do_softirq+0x4a0/0x654 WARNING: possible irq lock inversion dependency detected 
5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: G W softirqs last enabled at (2304): [] __do_softirq+0x524/0x654 softirqs last disabled at (2297): [] irq_exit+0x128/0x180 swapper/0/0 just changed the state of lock: 06cf56a6 (&(&host->lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120 but this lock took another, HARDIRQ-unsafe lock in the past: (fs_reclaim){+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 lock(fs_reclaim); local_irq_disable(); lock(&(&host->lock)->rlock); lock(fs_reclaim); lock(&(&host->lock)->rlock); *** DEADLOCK *** no locks held by swapper/0/0. the shortest dependencies between 2nd lock and 1st lock: -> (fs_reclaim){+.+.} ops: 167579 { HARDIRQ-ON-W at: lock_acquire+0xf8/0x2a0
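To make the on-demand scheme concrete, here is a minimal standalone C sketch of the idea, not the kernel code: only the first table level exists up front, and a missing second-level page is allocated the first time an entry in its range is mapped. The names and sizes are illustrative; in the kernel it is this late allocation that must use GFP_ATOMIC | __GFP_NOWARN, since the DMA map path may run with interrupts disabled.

    /* Standalone sketch of on-demand second-level TCE allocation. */
    #include <stdint.h>
    #include <stdlib.h>

    #define LEVEL_ENTRIES 512    /* entries per table level (illustrative) */

    struct tce_table {
        uint64_t *level1[LEVEL_ENTRIES];  /* only this array exists at creation */
    };

    static uint64_t *tce_slot(struct tce_table *tbl, size_t idx)
    {
        size_t l1 = idx / LEVEL_ENTRIES, l2 = idx % LEVEL_ENTRIES;

        if (!tbl->level1[l1]) {
            /* kernel equivalent: alloc_pages_node(nid, GFP_ATOMIC | __GFP_NOWARN, 0) */
            tbl->level1[l1] = calloc(LEVEL_ENTRIES, sizeof(uint64_t));
            if (!tbl->level1[l1])
                return NULL;     /* caller fails the DMA map */
        }
        return &tbl->level1[l1][l2];
    }

    int main(void)
    {
        struct tce_table tbl = { 0 };
        uint64_t *slot = tce_slot(&tbl, 12345);

        if (slot)
            *slot = 0xdeadbeef;  /* "map" a page */
        /* Freeing skips never-allocated levels, as pnv_tce_free now does. */
        for (size_t i = 0; i < LEVEL_ENTRIES; i++)
            free(tbl.level1[i]);
        return 0;
    }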
[PATCH AUTOSEL 5.2 17/70] PCI: rpaphp: Avoid a sometimes-uninitialized warning
From: Nathan Chancellor [ Upstream commit 0df3e42167caaf9f8c7b64de3da40a459979afe8 ] When building with -Wsometimes-uninitialized, clang warns: drivers/pci/hotplug/rpaphp_core.c:243:14: warning: variable 'fndit' is used uninitialized whenever 'for' loop exits because its condition is false [-Wsometimes-uninitialized] for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:256:6: note: uninitialized use occurs here if (fndit) ^ drivers/pci/hotplug/rpaphp_core.c:243:14: note: remove the condition if it is always true for (j = 0; j < entries; j++) { ^~~ drivers/pci/hotplug/rpaphp_core.c:233:14: note: initialize the variable 'fndit' to silence this warning int j, fndit; ^ = 0 fndit is only used to gate a sprintf call; moving that call into the loop simplifies the code, eliminates the local variable, and fixes the warning. Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info property") Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Acked-by: Tyrel Datwyler Acked-by: Joel Savitz Signed-off-by: Michael Ellerman Link: https://github.com/ClangBuiltLinux/linux/issues/504 Link: https://lore.kernel.org/r/20190603221157.58502-1-natechancel...@gmail.com Signed-off-by: Sasha Levin --- drivers/pci/hotplug/rpaphp_core.c | 18 +++--- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c index bcd5d357ca238..c3899ee1db995 100644 --- a/drivers/pci/hotplug/rpaphp_core.c +++ b/drivers/pci/hotplug/rpaphp_core.c @@ -230,7 +230,7 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, struct of_drc_info drc; const __be32 *value; char cell_drc_name[MAX_DRC_NAME_LEN]; - int j, fndit; + int j; info = of_find_property(dn->parent, "ibm,drc-info", NULL); if (info == NULL) @@ -245,17 +245,13 @@ static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name, /* Should now know end of current entry */ - if (my_index > drc.last_drc_index) - continue; - - fndit = 1; - break; + /* Found it */ + if (my_index <= drc.last_drc_index) { + sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, + my_index); + break; + } } - /* Found it */ - - if (fndit) - sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, - my_index); if (((drc_name == NULL) || (drc_name && !strcmp(drc_name, cell_drc_name))) && -- 2.20.1
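The warning pattern, and the shape of the fix, can be reproduced in a few lines of standalone C (illustrative code, not the driver): a flag assigned only inside a loop is read after it, so clang cannot prove it is initialized on the no-match path; moving the guarded work into the loop removes the flag entirely.

    #include <stdio.h>

    static void find_flag(const int *v, int n, int key)
    {
        int i, found;            /* left uninitialized if the loop finds nothing */

        for (i = 0; i < n; i++) {
            if (v[i] == key) {
                found = 1;
                break;
            }
        }
        if (found)               /* clang: 'found' may be used uninitialized */
            printf("hit at %d\n", i);
    }

    static void find_fixed(const int *v, int n, int key)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (v[i] == key) {
                printf("hit at %d\n", i);  /* work moved into the loop */
                break;
            }
        }
    }

    int main(void)
    {
        int v[] = { 1, 2, 3 };

        find_flag(v, 3, 2);      /* key is present, so no runtime misuse here */
        find_fixed(v, 3, 9);
        return 0;
    }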
[PATCH AUTOSEL 5.3 80/87] powerpc: dump kernel log before carrying out fadump or kdump
From: Ganesh Goudar [ Upstream commit e7ca44ed3ba77fc26cf32650bb71584896662474 ] Since commit 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path"), the pstore dmesg file is not updated when a dump is triggered from the HMC. That commit modified the system reset (sreset) handler to invoke fadump or kdump (if configured) without pushing dmesg to pstore. This leaves pstore holding stale dmesg data, which won't be much help if kdump fails to capture the dump. This patch fixes that by calling kmsg_dump() before heading to fadump or kdump. Fixes: 4388c9b3a6ee ("powerpc: Do not send system reset request through the oops path") Reviewed-by: Mahesh Salgaonkar Reviewed-by: Nicholas Piggin Signed-off-by: Ganesh Goudar Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190904075949.15607-1-ganes...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/traps.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 11caa0291254e..82f43535e6867 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -472,6 +472,7 @@ void system_reset_exception(struct pt_regs *regs) if (debugger(regs)) goto out; + kmsg_dump(KMSG_DUMP_OOPS); /* * A system reset is a request to dump, so we always send * it through the crashdump code (if fadump or kdump are -- 2.20.1
[PATCH AUTOSEL 5.3 71/87] powerpc/pseries: correctly track irq state in default idle
From: Nathan Lynch [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190910225244.25056-1-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/setup.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index f5940cc71c372..63462e96cf0ee 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -316,6 +316,9 @@ static void pseries_lpar_idle(void) * low power mode by ceding processor to hypervisor */ + if (!prep_irq_for_idle()) + return; + /* Indicate to hypervisor that we are idle. */ get_lppaca()->idle = 1; -- 2.20.1
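The contract being honored here can be shown with a toy userspace model, not the kernel code: if an interrupt is already pending when we are about to idle, entering the low-power wait could sleep through it, so the caller must check the prepare routine's result and bail out. prep_for_idle() and the pending flag below are illustrative stand-ins.

    #include <stdbool.h>
    #include <stdio.h>

    static bool irq_pending;    /* stand-in for the lazy-irq pending state */

    /* models prep_irq_for_idle(): false means "do not enter idle" */
    static bool prep_for_idle(void)
    {
        if (irq_pending)
            return false;       /* caller returns and the IRQ gets serviced */
        return true;            /* the real code also hard-enables MSR[EE] here */
    }

    static void lpar_idle(void)
    {
        if (!prep_for_idle())
            return;                     /* honor the result, as the fix does */
        puts("cede to hypervisor");     /* models the H_CEDE hcall */
    }

    int main(void)
    {
        irq_pending = true;
        lpar_idle();            /* prints nothing: we must not cede */
        irq_pending = false;
        lpar_idle();            /* safe to cede now */
        return 0;
    }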
[PATCH AUTOSEL 5.3 69/87] powerpc/imc: Dont create debugfs files for cpu-less nodes
From: Madhavan Srinivasan [ Upstream commit 41ba17f20ea835c489e77bd54e2da73184e22060 ] Commit 684d984038aa ("powerpc/powernv: Add debugfs interface for imc-mode and imc") added a debugfs interface for the nest imc pmu devices to support changing between different ucode modes, primarily for debug. But when doing so, the code did not consider the case of cpu-less nodes, so reading the _cmd_ or _mode_ file of a cpu-less node creates this crash. Faulting instruction address: 0xc00d0d58 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 67 PID: 5301 Comm: cat Not tainted 5.2.0-rc6-next-20190627+ #19 NIP: c00d0d58 LR: c049aa18 CTR:c00d0d50 REGS: c00020194548f9e0 TRAP: 0300 Not tainted (5.2.0-rc6-next-20190627+) MSR: 90009033 CR:28022822 XER: CFAR: c049aa14 DAR: 0003fc08 DSISR:4000 IRQMASK: 0 ... NIP imc_mem_get+0x8/0x20 LR simple_attr_read+0x118/0x170 Call Trace: simple_attr_read+0x70/0x170 (unreliable) debugfs_attr_read+0x6c/0xb0 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 The patch fixes the issue with a more robust NULL check on vbase. Before the patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0    imc_cmd_251 imc_cmd_253 imc_cmd_255 imc_mode_0 imc_mode_251 imc_mode_253 imc_mode_255 imc_cmd_250 imc_cmd_252 imc_cmd_254 imc_cmd_8    imc_mode_250 imc_mode_252 imc_mode_254 imc_mode_8 After the patch, ls output for the debugfs imc directory # ls /sys/kernel/debug/powerpc/imc/ imc_cmd_0 imc_cmd_8 imc_mode_0 imc_mode_8 The actual bug here is that we have two loops with potentially different loop counts. That is, in imc_get_mem_addr_nest() the loop count is obtained from the dt entries, but in export_imc_mode_and_cmd() the loop was based on the for_each_node() count. The patch bases the loop count in the latter on the struct imc_mem_info array instead. Ideally it would be better to have the array size in struct imc_pmu.
Fixes: 684d984038aa ('powerpc/powernv: Add debugfs interface for imc-mode and imc') Reported-by: Qian Cai Suggested-by: Michael Ellerman Signed-off-by: Madhavan Srinivasan Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190827101635.6942-1-ma...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/powernv/opal-imc.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 186109bdd41be..e04b20625cb94 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -53,9 +53,9 @@ static void export_imc_mode_and_cmd(struct device_node *node, struct imc_pmu *pmu_ptr) { static u64 loc, *imc_mode_addr, *imc_cmd_addr; - int chip = 0, nid; char mode[16], cmd[16]; u32 cb_offset; + struct imc_mem_info *ptr = pmu_ptr->mem_info; imc_debugfs_parent = debugfs_create_dir("imc", powerpc_debugfs_root); @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct device_node *node, if (of_property_read_u32(node, "cb_offset", &cb_offset)) cb_offset = IMC_CNTL_BLK_OFFSET; - for_each_node(nid) { - loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset; + while (ptr->vbase != NULL) { + loc = (u64)(ptr->vbase) + cb_offset; imc_mode_addr = (u64 *)(loc + IMC_CNTL_BLK_MODE_OFFSET); - sprintf(mode, "imc_mode_%d", nid); + sprintf(mode, "imc_mode_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(mode, 0600, imc_debugfs_parent, imc_mode_addr)) goto err; imc_cmd_addr = (u64 *)(loc + IMC_CNTL_BLK_CMD_OFFSET); - sprintf(cmd, "imc_cmd_%d", nid); + sprintf(cmd, "imc_cmd_%d", (u32)(ptr->id)); if (!imc_debugfs_create_x64(cmd, 0600, imc_debugfs_parent, imc_cmd_addr)) goto err; - chip++; + ptr++; } return; -- 2.20.1
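The essence of the fix is to drive the loop off the data that was actually allocated, a sentinel-terminated array, rather than off an unrelated count such as the number of possible nodes. A standalone sketch with illustrative types and values:

    #include <stdio.h>

    struct mem_info {
        int id;
        void *vbase;    /* a NULL vbase terminates the array */
    };

    int main(void)
    {
        int dummy[2];
        struct mem_info info[] = {
            { 0, &dummy[0] },
            { 8, &dummy[1] },
            { 0, NULL },    /* sentinel: no more chips */
        };

        /* iterate exactly the entries that exist, as the fixed loop does */
        for (struct mem_info *ptr = info; ptr->vbase != NULL; ptr++)
            printf("imc_mode_%d / imc_cmd_%d\n", ptr->id, ptr->id);
        return 0;
    }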
[PATCH AUTOSEL 5.3 68/87] powerpc/eeh: Clean up EEH PEs after recovery finishes
From: Oliver O'Halloran [ Upstream commit 799abe283e5103d48e079149579b4f167c95ea0e ] When the last device in an eeh_pe is removed the eeh_pe structure itself (and any empty parents) are freed since they are no longer needed. This results in a crash when a hotplug driver is involved since the following may occur: 1. Device is surprise removed. 2. Driver performs an MMIO, which fails and queues an eeh_event. 3. Hotplug driver receives a hotplug interrupt and removes any pci_devs that were under the slot. 4. pci_dev is torn down and the eeh_pe is freed. 5. The EEH event handler thread processes the eeh_event and crashes since the eeh_pe pointer in the eeh_event structure is no longer valid. Crashing is generally considered poor form. Instead of doing that, use the fact PEs are marked as EEH_PE_INVALID to keep them around until the end of the recovery cycle, at which point we can safely prune any empty PEs. Signed-off-by: Oliver O'Halloran Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190903101605.2890-2-ooh...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 36 ++-- arch/powerpc/kernel/eeh_event.c | 8 +++ arch/powerpc/kernel/eeh_pe.c | 23 +++- 3 files changed, 64 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 1fbe541856f5e..fe0c32fb9f96f 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -744,6 +744,33 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus, */ #define MAX_WAIT_FOR_RECOVERY 300 + +/* Walks the PE tree after processing an event to remove any stale PEs. + * + * NB: This needs to be recursive to ensure the leaf PEs get removed + * before their parents do. Although this is possible to do iteratively, + * we don't, since this is easier to read and we need to guarantee + * the leaf nodes will be handled first. + */ +static void eeh_pe_cleanup(struct eeh_pe *pe) +{ + struct eeh_pe *child_pe, *tmp; + + list_for_each_entry_safe(child_pe, tmp, &pe->child_list, child) + eeh_pe_cleanup(child_pe); + + if (pe->state & EEH_PE_KEEP) + return; + + if (!(pe->state & EEH_PE_INVALID)) + return; + + if (list_empty(&pe->edevs) && list_empty(&pe->child_list)) { + list_del(&pe->child); + kfree(pe); + } +} + /** * eeh_handle_normal_event - Handle EEH events on a specific PE * @pe: EEH PE - which should not be used after we return, as it may @@ -782,8 +809,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } - eeh_pe_state_mark(pe, EEH_PE_RECOVERING); - eeh_pe_update_time_stamp(pe); pe->freeze_count++; if (pe->freeze_count > eeh_max_freezes) { @@ -973,6 +998,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } } + + /* +* Clean up any PEs without devices. While marked as EEH_PE_RECOVERING +* we don't want to modify the PE tree structure so we do it here. +*/ + eeh_pe_cleanup(pe); eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true); } @@ -1045,6 +1076,7 @@ void eeh_handle_special_event(void) */ if (rc == EEH_NEXT_ERR_FROZEN_PE || rc == EEH_NEXT_ERR_FENCED_PHB) { + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); eeh_handle_normal_event(pe); } else { pci_lock_rescan_remove(); diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c index 64cfbe41174b2..e36653e5f76b3 100644 --- a/arch/powerpc/kernel/eeh_event.c +++ b/arch/powerpc/kernel/eeh_event.c @@ -121,6 +121,14 @@ int __eeh_send_failure_event(struct eeh_pe *pe) } event->pe = pe; + /* +* Mark the PE as recovering before inserting it in the queue.
+* This prevents the PE from being free()ed by a hotplug driver +* while the PE is sitting in the event queue. +*/ + if (pe) + eeh_pe_state_mark(pe, EEH_PE_RECOVERING); + /* We may or may not be called in an interrupt context */ spin_lock_irqsave(&eeh_eventlist_lock, flags); list_add(&event->list, &eeh_eventlist); diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c index 854cef7b18f4d..f0813d50e0b1c 100644 --- a/arch/powerpc/kernel/eeh_pe.c +++ b/arch/powerpc/kernel/eeh_pe.c @@ -491,6 +491,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev) int eeh_rmv_from_parent_pe(struct eeh_dev *edev) { struct eeh_pe *pe, *parent, *child; + bool keep, recover; int cnt; struct pci_dn *pdn = eeh_dev_to_pdn(edev); @@ -516,10 +517,21 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev) */ while (1) { parent =
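The leaf-first pruning that eeh_pe_cleanup() performs is a post-order tree walk. A standalone sketch of the idea, simplified to one child per PE and with a single keep flag standing in for the EEH_PE_KEEP/EEH_PE_INVALID checks (illustrative only):

    #include <stdlib.h>

    struct pe {
        struct pe *child;   /* simplified: at most one child per PE */
        int keep;           /* models a PE that still has devices, or EEH_PE_KEEP */
    };

    /* post-order walk: prune leaves before their parents, as eeh_pe_cleanup() must */
    static struct pe *pe_cleanup(struct pe *pe)
    {
        if (!pe)
            return NULL;
        pe->child = pe_cleanup(pe->child);
        if (pe->keep || pe->child)
            return pe;      /* still in use: leave it in the tree */
        free(pe);
        return NULL;
    }

    int main(void)
    {
        struct pe *root = calloc(1, sizeof(*root));
        struct pe *mid = calloc(1, sizeof(*mid));
        struct pe *leaf = calloc(1, sizeof(*leaf));

        if (!root || !mid || !leaf)
            return 1;
        root->keep = 1;     /* e.g. the PHB PE is always kept */
        root->child = mid;
        mid->child = leaf;  /* mid and leaf are empty: both get pruned */
        root = pe_cleanup(root);
        free(root);
        return 0;
    }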
[PATCH AUTOSEL 5.3 66/87] powerpc/64s/exception: machine check use correct cfar for late handler
From: Nicholas Piggin [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802105709.27696-8-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/exceptions-64s.S | 4 1 file changed, 4 insertions(+) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 6ba3cc2ef8abc..36c8a3652cf3a 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -1211,6 +1211,10 @@ FTR_SECTION_ELSE ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE) 9: /* Deliver the machine check to host kernel in V mode. */ +BEGIN_FTR_SECTION + ld r10,ORIG_GPR3(r1) + mtspr SPRN_CFAR,r10 +END_FTR_SECTION_IFSET(CPU_FTR_CFAR) MACHINE_CHECK_HANDLER_WINDUP EXCEPTION_PROLOG_0 PACA_EXMC b machine_check_pSeries_0 -- 2.20.1
[PATCH AUTOSEL 5.3 63/87] selftests/powerpc: Retry on host facility unavailable
From: Gustavo Romero [ Upstream commit 6652bf6408895b09d31fd4128a1589a1a0672823 ] TM test tm-unavailable must take into account aborts due to the host aborting a transaction because of a facility unavailable exception, just like it already does for aborts on reschedules (TM_CAUSE_KVM_RESCHED). Reported-by: Desnes A. Nunes do Rosario Tested-by: Desnes A. Nunes do Rosario Signed-off-by: Gustavo Romero Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/1566341651-19747-1-git-send-email-grom...@linux.vnet.ibm.com Signed-off-by: Sasha Levin --- tools/testing/selftests/powerpc/tm/tm.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/powerpc/tm/tm.h b/tools/testing/selftests/powerpc/tm/tm.h index 97f9f491c541a..c402464b038fc 100644 --- a/tools/testing/selftests/powerpc/tm/tm.h +++ b/tools/testing/selftests/powerpc/tm/tm.h @@ -55,7 +55,8 @@ static inline bool failure_is_unavailable(void) static inline bool failure_is_reschedule(void) { if ((failure_code() & TM_CAUSE_RESCHED) == TM_CAUSE_RESCHED || - (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED) + (failure_code() & TM_CAUSE_KVM_RESCHED) == TM_CAUSE_KVM_RESCHED || + (failure_code() & TM_CAUSE_KVM_FAC_UNAV) == TM_CAUSE_KVM_FAC_UNAV) return true; return false; -- 2.20.1
[PATCH AUTOSEL 5.3 51/87] powerpc/eeh: Clear stale EEH_DEV_NO_HANDLER flag
From: Sam Bobroff [ Upstream commit aa06e3d60e245284d1e55497eb3108828092818d ] The EEH_DEV_NO_HANDLER flag is used by the EEH system to prevent the use of driver callbacks in drivers that have been bound part way through the recovery process. This is necessary to prevent later stage handlers from being called when the earlier stage handlers haven't been, which can be confusing for drivers. However, the flag is set for all devices that are added after boot time and only cleared at the end of the EEH recovery process. This results in hot plugged devices erroneously having the flag set during the first recovery after they are added (causing their drivers' handlers to be incorrectly ignored). To remedy this, clear the flag at the beginning of recovery processing. The flag is still cleared at the end of recovery processing, although it is no longer really necessary. Also clear the flag during eeh_handle_special_event(), for the same reasons. Signed-off-by: Sam Bobroff Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/b8ca5629d27de74c957d4f4b250177d1b6fc4bbd.1565930772.git.sbobr...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/kernel/eeh_driver.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 89623962c7275..1fbe541856f5e 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -793,6 +793,10 @@ void eeh_handle_normal_event(struct eeh_pe *pe) result = PCI_ERS_RESULT_DISCONNECT; } + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Walk the various device drivers attached to this slot through * a reset sequence, giving each an opportunity to do what it needs * to accomplish the reset. Each child gets a report of the @@ -981,7 +985,8 @@ void eeh_handle_normal_event(struct eeh_pe *pe) */ void eeh_handle_special_event(void) { - struct eeh_pe *pe, *phb_pe; + struct eeh_pe *pe, *phb_pe, *tmp_pe; + struct eeh_dev *edev, *tmp_edev; struct pci_bus *bus; struct pci_controller *hose; unsigned long flags; @@ -1050,6 +1055,10 @@ void eeh_handle_special_event(void) (phb_pe->state & EEH_PE_RECOVERING)) continue; + eeh_for_each_pe(pe, tmp_pe) + eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) + edev->mode &= ~EEH_DEV_NO_HANDLER; + /* Notify all devices to be down */ eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true); eeh_set_channel_state(pe, pci_channel_io_perm_failure); -- 2.20.1
[PATCH AUTOSEL 5.3 49/87] powerpc/perf: fix imc allocation failure handling
From: Nicholas Piggin [ Upstream commit 10c4bd7cd28e77aeb8cfa65b23cb3c632ede2a49 ] The alloc_pages_node return value should be tested for failure before being passed to page_address. Tested-by: Anju T Sudhakar Signed-off-by: Nicholas Piggin Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190724084638.24982-3-npig...@gmail.com Signed-off-by: Sasha Levin --- arch/powerpc/perf/imc-pmu.c | 29 ++--- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index dea243185ea4b..cb50a9e1fd2d7 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -577,6 +577,7 @@ static int core_imc_mem_init(int cpu, int size) { int nid, rc = 0, core_id = (cpu / threads_per_core); struct imc_mem_info *mem_info; + struct page *page; /* * alloc_pages_node() will allocate memory for core in the @@ -587,11 +588,12 @@ static int core_imc_mem_init(int cpu, int size) mem_info->id = core_id; /* We need only vbase for core counters */ - mem_info->vbase = page_address(alloc_pages_node(nid, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!mem_info->vbase) + page = alloc_pages_node(nid, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + mem_info->vbase = page_address(page); /* Init the mutex */ core_imc_refc[core_id].id = core_id; @@ -849,15 +851,17 @@ static int thread_imc_mem_alloc(int cpu_id, int size) int nid = cpu_to_node(cpu_id); if (!local_mem) { + struct page *page; /* * This case could happen only once at start, since we dont * free the memory in cpu offline path. */ - local_mem = page_address(alloc_pages_node(nid, + page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(thread_imc_mem, cpu_id) = local_mem; } @@ -1095,11 +1099,14 @@ static int trace_imc_mem_alloc(int cpu_id, int size) int core_id = (cpu_id / threads_per_core); if (!local_mem) { - local_mem = page_address(alloc_pages_node(phys_id, - GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | - __GFP_NOWARN, get_order(size))); - if (!local_mem) + struct page *page; + + page = alloc_pages_node(phys_id, + GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE | + __GFP_NOWARN, get_order(size)); + if (!page) return -ENOMEM; + local_mem = page_address(page); per_cpu(trace_imc_mem, cpu_id) = local_mem; /* Initialise the counters for trace mode */ -- 2.20.1
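The general pattern being fixed: test the allocation itself before deriving the mapped address from it, because page_address() assumes a valid struct page. A standalone sketch with stand-ins for alloc_pages_node()/page_address() (the fake_* names are illustrative, not kernel APIs):

    #include <stdio.h>
    #include <stdlib.h>

    struct page { void *virt; };

    static struct page *fake_alloc_pages(int order)
    {
        struct page *p = malloc(sizeof(*p));

        if (p) {
            p->virt = calloc(1u << order, 4096);
            if (!p->virt) {
                free(p);
                return NULL;
            }
        }
        return p;
    }

    static void *fake_page_address(const struct page *p)
    {
        return p->virt;   /* dereferences p: must never be called with NULL */
    }

    static void *mem_init(int order)
    {
        struct page *page = fake_alloc_pages(order);

        if (!page)                          /* check the allocation first... */
            return NULL;
        return fake_page_address(page);     /* ...then derive the address */
    }

    int main(void)
    {
        void *vbase = mem_init(0);

        printf("vbase=%p\n", vbase);
        return 0;
    }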
[PATCH AUTOSEL 5.3 48/87] powerpc/pseries/mobility: use cond_resched when updating device tree
From: Nathan Lynch [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There are a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20190802192926.19277-4-nath...@linux.ibm.com Signed-off-by: Sasha Levin --- arch/powerpc/platforms/pseries/mobility.c | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c index fe812bebdf5e6..b571285f6c149 100644 --- a/arch/powerpc/platforms/pseries/mobility.c +++ b/arch/powerpc/platforms/pseries/mobility.c @@ -9,6 +9,7 @@ #include #include #include +#include <linux/sched.h> #include #include #include @@ -207,7 +208,11 @@ static int update_dt_node(__be32 phandle, s32 scope) prop_data += vd; } + + cond_resched(); } + + cond_resched(); } while (rtas_rc == 1); of_node_put(dn); @@ -310,8 +315,12 @@ int pseries_devicetree_update(s32 scope) add_dt_node(phandle, drc_index); break; } + + cond_resched(); } } + + cond_resched(); } while (rc == 1); kfree(rtas_buf); -- 2.20.1
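As a loose userspace analogy, with sched_yield() standing in for the kernel's cond_resched() and the loop bound illustrative, the fix amounts to yielding inside a loop whose iteration count the platform, not Linux, chooses:

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        long updates = 100000;  /* bound chosen by the "platform", not by us */

        for (long i = 0; i < updates; i++) {
            /* ... process one device tree update ... */
            sched_yield();      /* kernel equivalent: cond_resched(); */
        }
        puts("finished without monopolizing the cpu");
        return 0;
    }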