Re: [PATCH v3 3/5] powerpc/mm: add radix__create_section_mapping()
On 12/19/2016 02:34 PM, Aneesh Kumar K.V wrote:
> Reza Arbab writes:
>
>>> Add the linear page mapping function for radix, used by memory hotplug.
>>> This is similar to vmemmap_populate().
>
> Ok with this patch your first patch becomes useful. Can you merge that
> with this and rename mmu_linear_psize to mmu_hotplug_psize or even use
> mmu_virtual_psize. The linear naming is confusing.

The mmu_linear_psize variable refers to the page size used to create the
kernel linear mapping (the 0xc00 range) for the newly added memory
section. Why should the name be changed?
Re: [PATCH v3 0/5] powerpc/mm: enable memory hotplug on radix
On 12/19/2016 11:28 PM, Reza Arbab wrote:
> On Sat, Dec 17, 2016 at 01:38:40AM +1100, Balbir Singh wrote:
>> Do we care about alt maps yet?
>
> Good question. I'll try to see if/how altmaps might need special
> consideration here.

vmemmap alt-map support should be enabled separately; it should not be
mixed into this series, which just tries to get hotplug working on radix.
Re: [PATCH v4 07/12] powerpc: Add support to take additional parameter in MASKABLE_* macro
On Tuesday 20 December 2016 08:06 AM, Nicholas Piggin wrote:
> On Mon, 19 Dec 2016 13:37:03 +0530 Madhavan Srinivasan wrote:
>> To support addition of "bitmask" to MASKABLE_* macros, factor out the
>> EXCEPTION_PROLOG_1 macro. Currently soft_enabled is used as the flag to
>> determine the interrupt state. Patch extends the soft_enabled to be
>> used as a mask instead of a flag.
>
> This is really the core part of the patch -- after reversing the
> soft_enable logic to be a disable boolean, now it's being extended to
> be a disable mask. The exception macro changes just allow an interrupt
> type bit to be passed in later.
>
> I should have picked it up earlier, but if you do end up submitting
> another version, perhaps consider splitting out the disable mask change
> and putting it after patch 5.

Yes will do that. Makes sense.

Maddy

> Thanks,
> Nick
[RFC PATCH] powerpc/powernv: report error messages from opal
Recent versions of skiboot will raise an OPAL event (read: interrupt)
when firmware writes an error message to its internal console. In
conjunction they provide an OPAL call that the kernel can use to extract
these messages from the OPAL log so they can be written into the
kernel's log buffer, where someone will (hopefully) look at them.

For the companion skiboot patches see:
https://lists.ozlabs.org/pipermail/skiboot/2016-December/005861.html

Signed-off-by: Oliver O'Halloran
---
 arch/powerpc/include/asm/opal-api.h            |  5 +++-
 arch/powerpc/include/asm/opal.h                |  1 +
 arch/powerpc/platforms/powernv/opal-msglog.c   | 41 ++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0e2e57bcab50..cb9c0e6afb33 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -167,7 +167,8 @@
 #define OPAL_INT_EOI				124
 #define OPAL_INT_SET_MFRR			125
 #define OPAL_PCI_TCE_KILL			126
-#define OPAL_LAST				126
+#define OPAL_SCRAPE_LOG				128
+#define OPAL_LAST				128
 
 /* Device tree flags */
 
@@ -288,6 +289,7 @@ enum OpalPendingState {
 	OPAL_EVENT_PCI_ERROR		= 0x200,
 	OPAL_EVENT_DUMP_AVAIL		= 0x400,
 	OPAL_EVENT_MSG_PENDING		= 0x800,
+	OPAL_EVENT_LOG_PENDING		= 0x1000,
 };
 
 enum OpalThreadStatus {
@@ -406,6 +408,7 @@ enum opal_msg_type {
 	OPAL_MSG_DPO		= 5,
 	OPAL_MSG_PRD		= 6,
 	OPAL_MSG_OCC		= 7,
+	OPAL_MSG_LOG		= 8,
 	OPAL_MSG_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 5c7db0f1a708..2b3bd3219fb4 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -232,6 +232,7 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
 int64_t opal_rm_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
 			     uint32_t pe_num, uint32_t tce_size,
 			     uint64_t dma_addr, uint32_t npages);
+int64_t opal_scrape_log(int64_t *offset, char *buf, int64_t len, int64_t *lvl);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,

diff --git a/arch/powerpc/platforms/powernv/opal-msglog.c b/arch/powerpc/platforms/powernv/opal-msglog.c
index 39d6ff9e5630..78168f66fb24 100644
--- a/arch/powerpc/platforms/powernv/opal-msglog.c
+++ b/arch/powerpc/platforms/powernv/opal-msglog.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include
 
 /* OPAL in-memory console. Defined in OPAL source at core/console.c */
 struct memcons {
@@ -102,8 +103,36 @@ static struct bin_attribute opal_msglog_attr = {
 	.read = opal_msglog_read
 };
 
+static char *log_levels[] = { "Emergency", "Alert", "Critical", "Error", "Warning" };
+static int64_t offset = -1;
+
+static irqreturn_t opal_print_log(int irq, void *data)
+{
+	int64_t rc, log_lvl;
+	char buffer[320];
+
+	/*
+	 * only print one message per invocation of the IRQ handler
+	 */
+	rc = opal_scrape_log(&offset, buffer, sizeof(buffer), &log_lvl);
+
+	if (rc == OPAL_SUCCESS || rc == OPAL_PARTIAL) {
+		log_lvl = be64_to_cpu(log_lvl);
+		if (log_lvl > 4)
+			log_lvl = 4;
+
+		printk_emit(0, log_lvl, NULL, 0, "OPAL %s: %s%s\r\n",
+			log_levels[log_lvl], buffer,
+			rc == OPAL_PARTIAL ? "" : "");
+	}
+
+	return IRQ_HANDLED;
+}
+
 void __init opal_msglog_init(void)
 {
+	int virq, rc = -1;
 	u64 mcaddr;
 	struct memcons *mc;
@@ -123,6 +152,18 @@ void __init opal_msglog_init(void)
 		return;
 	}
 
+	virq = opal_event_request(ilog2(OPAL_EVENT_LOG_PENDING));
+	if (virq) {
+		rc = request_irq(virq, opal_print_log,
+				IRQF_TRIGGER_HIGH, "opal memcons", NULL);
+
+		if (rc)
+			irq_dispose_mapping(virq);
+	}
+
+	if (!virq || rc)
+		pr_warn("Unable to register OPAL log event handler\n");
+
 	opal_memcons = mc;
 }

diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 3aa40f1b20f5..c59d7da3fd1a 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -312,3 +312,4 @@ OPAL_CALL(opal_int_set_mfrr,		OPAL_INT_SET_MFRR);
 OPAL_CALL_REAL(opal_rm_int_set_mfrr,	OPAL_INT_SET_MFRR);
 OPAL_CALL(opal_pci_tce_kill,		OPAL_PCI_TCE_KILL);
 OPAL_CALL_REAL(opal_rm_pci_tce_kill,
Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
On Tuesday 20 December 2016 07:46 PM, Vipin K Parashar wrote:
> Added a check for the OPAL_WRONG_STATE error code returned from OPAL.
> Currently Linux flashes an "unexpected error" message over the console
> for this error. This will avoid throwing such a message and will return
> an I/O error for such OPAL failures.
>
> Signed-off-by: Vipin K Parashar
> ---
>  arch/powerpc/platforms/powernv/opal.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 2822935..ab91d53 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -866,9 +866,10 @@ int opal_error_code(int rc)
>  	case OPAL_NO_MEM:		return -ENOMEM;
>  	case OPAL_PERMISSION:		return -EPERM;
>  
> -	case OPAL_UNSUPPORTED:		return -EIO;
> -	case OPAL_HARDWARE:		return -EIO;
> -	case OPAL_INTERNAL_ERROR:	return -EIO;
> +	case OPAL_UNSUPPORTED:
> +	case OPAL_HARDWARE:
> +	case OPAL_INTERNAL_ERROR:
> +	case OPAL_WRONG_STATE:		return -EIO;

Looks good.

Reviewed-by: Mukesh Ojha

>  	default:
>  		pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
>  		return -EIO;
Re: [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list
On Sun, Dec 18, 2016 at 12:28:54PM +1100, Alexey Kardashevskiy wrote:
> VFIO on sPAPR already implements guest memory pre-registration
> when the entire guest RAM gets pinned. This can be used to translate
> the physical address of a guest page containing the TCE list
> from H_PUT_TCE_INDIRECT.
>
> This makes use of the pre-registered memory API to access TCE list
> pages in order to avoid unnecessary locking on the KVM memory
> reverse map, as we know that all of guest memory is pinned and
> we have a flat array mapping GPA to HPA, which makes it simpler and
> quicker to index into that array (even with looking up the
> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> lock the rmap entry, look up the user page tables, and unlock the rmap
> entry. Note that the rmap pointer is initialized to NULL
> where declared (not in this patch).
>
> If a requested chunk of memory has not been preregistered,
> this will fail with H_TOO_HARD so the virtual mode handler can
> handle the request.
> Signed-off-by: Alexey Kardashevskiy
> ---
> Changes:
> v2:
> * updated the commit log with David's comment
> ---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 -
>  1 file changed, 49 insertions(+), 16 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index d461c440889a..a3be4bd6188f 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> +{
> +	return mm_iommu_preregistered(vcpu->kvm->mm);
> +}
> +
> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> +		struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> +{
> +	return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> +}

I don't see that there's much point to these inlines.

>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> -	if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> -		return H_TOO_HARD;
> +	if (kvmppc_preregistered(vcpu)) {
> +		/*
> +		 * We get here if guest memory was pre-registered which
> +		 * is normally VFIO case and gpa->hpa translation does not
> +		 * depend on hpt.
> +		 */
> +		struct mm_iommu_table_group_mem_t *mem;
>  
> -	rmap = (void *) vmalloc_to_phys(rmap);
> +		if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> +			return H_TOO_HARD;
>  
> -	/*
> -	 * Synchronize with the MMU notifier callbacks in
> -	 * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> -	 * While we have the rmap lock, code running on other CPUs
> -	 * cannot finish unmapping the host real page that backs
> -	 * this guest real page, so we are OK to access the host
> -	 * real page.
> -	 */
> -	lock_rmap(rmap);
> -	if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> -		ret = H_TOO_HARD;
> -		goto unlock_exit;
> +		mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> +		if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> +			return H_TOO_HARD;
> +	} else {
> +		/*
> +		 * This is emulated devices case.
> +		 * We do not require memory to be preregistered in this case
> +		 * so lock rmap and do __find_linux_pte_or_hugepte().
> +		 */

Hmm. So this isn't wrong as such, but the logic and comments are both
misleading. The 'if' here isn't really about VFIO vs. emulated - it's
about whether the mm has *any* preregistered chunks, without any regard
to which particular device you're talking about. For example if your
guest has two PHBs, one with VFIO devices and the other with emulated
devices, then the emulated devices will still go through the "VFIO"
case here.

Really what you have here is a fast case when the tce_list is in
preregistered memory, and a fallback case when it isn't. But that's
obscured by the fact that if for whatever reason you have some
preregistered memory but it doesn't cover the tce_list, then you don't
go to the fallback case here, but instead fall right back to the
virtual mode handler.

So, I think you should either:

1) Fallback to the code below whenever you can't access the tce_list
via prereg memory, regardless of whether there's any _other_ prereg
memory

or

2) Drop the code below entirely and always return H_TOO_HARD if you
can't get the tce_list from prereg.

> + if
Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename
On Tuesday 20 December 2016 02:33 PM, Balbir Singh wrote:
> On 19/12/16 19:06, Madhavan Srinivasan wrote:
>> Move set_soft_enabled() from powerpc/kernel/irq.c to asm/hw_irq.c, to
>> force updates to paca->soft_enabled to be done via these access
>> functions. Add a "memory" clobber to hint to the compiler, since
>> paca->soft_enabled memory is the target here.
>>
>> +static inline notrace void soft_enabled_set(unsigned long enable)
>> +{
>> +	__asm__ __volatile__("stb %0,%1(13)"
>> +	: : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))
>> +	: "memory");
>> +}
>> +
>
> Can't we just rewrite this in "C"?
>
> local_paca->soft_enabled = enable

Yes, we did discuss this and parked it to be handled in the follow-on
patchset. Nick did suggest something of this sort:

__asm__ __volatile__("stb %1,%0"
	: "=m" (local_paca->soft_enabled)
	: "r" (enable));

> Balbir Singh.
Re: [PATCH v4 01/12] powerpc: Add #defs for paca->soft_enabled flags
On Tuesday 20 December 2016 01:01 PM, Balbir Singh wrote:
>> +/*
>> + * flags for paca->soft_enabled
>> + */
>> +#define IRQ_DISABLE_MASK_NONE	1
>> +#define IRQ_DISABLE_MASK_LINUX	0
>> +
>
> Isn't one just the complement of the other?

Yes. But these are initial cleanup patches, and in subsequent patches we
will reuse these as mask bits instead of flags.

Maddy

> Balbir
Re: [PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor
On Tue, 2016-12-20 at 22:40 +1100, Paul Mackerras wrote:
> Currently, if the kernel is running on a POWER9 processor under a
> hypervisor, it will try to use the radix MMU even though it doesn't
> have the necessary code to use radix under a hypervisor (it doesn't
> negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
> hcall). The result is that the guest kernel will crash when it tries
> to turn on the MMU, because it will still actually be using the HPT
> MMU, but it won't have set up any SLB or HPT entries. It does this
> because the only thing that the kernel looks at in deciding to use
> radix, on any platform, is the ibm,pa-features property on the cpu
> device nodes.
>
> This fixes it by looking for the /chosen/ibm,architecture-vec-5
> property, and if it exists, clearing the radix MMU feature bit.
> We do this before we decide whether to initialize for radix or HPT.
> This property is created by the hypervisor as a result of the guest
> calling the ibm,client-architecture-support method to indicate
> its capabilities, so it only exists on systems with a hypervisor.
> The reason for using this property is that in future, when we
> have support for using radix under a hypervisor, we will need
> to check this property to see whether the hypervisor agreed to
> us using radix.
>
> Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to enable Radix MMU")
> Cc: sta...@vger.kernel.org # v4.7+
> Signed-off-by: Paul Mackerras
> ---
>  arch/powerpc/mm/init_64.c | 27 +++
>  1 file changed, 27 insertions(+)
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a000c35..098531d 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -42,6 +42,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>  
>  #include
>  #include
> @@ -344,6 +346,28 @@ static int __init parse_disable_radix(char *p)
>  }
>  early_param("disable_radix", parse_disable_radix);
>  
> +/*
> + * If we're running under a hypervisor, we currently can't do radix
> + * since we don't have the code to do the H_REGISTER_PROC_TBL hcall.
> + * We tell that we're running under a hypervisor by looking for the
> + * /chosen/ibm,architecture-vec-5 property.
> + */
> +static void early_check_vec5(void)
> +{
> +	unsigned long root, chosen;
> +	int size;
> +	const u8 *vec5;
> +
> +	root = of_get_flat_dt_root();
> +	chosen = of_get_flat_dt_subnode_by_name(root, "chosen");
> +	if (chosen == -FDT_ERR_NOTFOUND)
> +		return;
> +	vec5 = of_get_flat_dt_prop(chosen, "ibm,architecture-vec-5", &size);
> +	if (!vec5)
> +		return;
> +	cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
> +}
> +

Given that currently radix guest support doesn't exist upstream, it's
sufficient to check for the existence of the vec5 node to determine that
we are a guest and thus can't run radix. Is it worth checking the
specific radix feature bit of the vec5 node so that this code is still
correct for determining the lack of radix support by the host platform
once guest radix kernels are (in the future) supported?

>  void __init mmu_early_init_devtree(void)
>  {
>  	/* Disable radix mode based on kernel command line. */
> @@ -351,6 +375,9 @@ void __init mmu_early_init_devtree(void)
>  		cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
>  
>  	if (early_radix_enabled())
> +		early_check_vec5();
> +
> +	if (early_radix_enabled())
>  		radix__early_init_devtree();
>  	else
>  		hash__early_init_devtree();
Re: [PATCH] drivers/pci/hotplug: Handle presence detection change properly
On Tue, Dec 20, 2016 at 10:37:09AM +0100, Greg KH wrote:
> On Tue, Dec 20, 2016 at 05:49:33PM +1100, Gavin Shan wrote:
>> The surprise hotplug is driven by interrupt in the PowerNV PCI hotplug
>> driver. In the interrupt handler, pnv_php_interrupt(), we bail when
>> pnv_pci_get_presence_state() wrongly returns zero. It causes the
>> presence change event to always be ignored incorrectly.
>>
>> This fixes the issue by bailing on an error (non-zero value) returned
>> from pnv_pci_get_presence_state().
>>
>> Fixes: 360aebd85a4 ("drivers/pci/hotplug: Support surprise hotplug in powernv driver")
>> Cc: sta...@vger.kernel.org #v4.9+
>> Reported-by: Hank Chang
>> Signed-off-by: Gavin Shan
>> Tested-by: Willie Liauw
>> ---
>>  drivers/pci/hotplug/pnv_php.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
>> index 56efaf7..38a2309 100644
>> --- a/drivers/pci/hotplug/pnv_php.c
>> +++ b/drivers/pci/hotplug/pnv_php.c
>> @@ -707,8 +707,12 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
>>  		added = !!(lsts & PCI_EXP_LNKSTA_DLLLA);
>>  	} else if (sts & PCI_EXP_SLTSTA_PDC) {
>>  		ret = pnv_pci_get_presence_state(php_slot->id, &presence);
>> -		if (!ret)
>> +		if (ret) {
>> +			dev_warn(&php_slot->dev, "PCI slot [%s] error %d handling PDC event (0x%04x)\n",
>> +				 php_slot->name, ret, sts);
>
> What can a user do with this warning?

This warning is to catch the developer's attention only. It's very rare
for pnv_pci_get_presence_state() to fail, as it simply checks a hardware
register bit (from the Slot Status register in the PCIe capability, or
the PHB's Hotplug Override register) in the skiboot firmware. The user
needs to retry the operation (hot-add or hot-remove) when seeing this
warning.

Greg, please let me know if the below message is better?

dev_warn(&php_slot->dev, "PCI slot [%s] error %d getting presence status on event 0x%04x, please retry the operation.\n",
	 php_slot->name, ret, sts);

Thanks,
Gavin
Re: [PATCH v3 01/15] stacktrace/x86: add function for detecting reliable stack traces
On Tue, Dec 20, 2016 at 10:39:16AM +0100, Petr Mladek wrote:
> On Mon 2016-12-19 11:25:49, Josh Poimboeuf wrote:
>> 3) probably some kind of runtime NMI stack checking feature to
>>    complement objtool, along with a lot of burn time to ensure there
>>    are no issues, particularly in entry code
>
> Could you please provide more details about this NMI stack checking?
> What is it supposed to protect that objtool could not?
> Will it run regularly or will it be just a random check?

save_stack_trace_tsk_reliable(current) would be called periodically from
an NMI handler, and a warning would be printed if it ever doesn't reach
the "end" of the stack (i.e., user-mode pt_regs). Due to the performance
impact it would probably only be a debug option.

It would verify the special hand-coded areas which objtool isn't smart
enough to understand, like entry code, ftrace, kprobes, bpf. It would
also make sure that objtool itself didn't miss anything.

-- 
Josh
Re: [PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor
On Tue, Dec 20, 2016 at 08:19:02PM +0530, Aneesh Kumar K.V wrote:
> Paul Mackerras writes:
>
>> Currently, if the kernel is running on a POWER9 processor under a
>> hypervisor, it will try to use the radix MMU even though it doesn't
>> have the necessary code to use radix under a hypervisor (it doesn't
>> negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
>> hcall). The result is that the guest kernel will crash when it tries
>> to turn on the MMU, because it will still actually be using the HPT
>> MMU, but it won't have set up any SLB or HPT entries. It does this
>> because the only thing that the kernel looks at in deciding to use
>> radix, on any platform, is the ibm,pa-features property on the cpu
>> device nodes.
>>
>> This fixes it by looking for the /chosen/ibm,architecture-vec-5
>> property, and if it exists, clearing the radix MMU feature bit.
>> We do this before we decide whether to initialize for radix or HPT.
>> This property is created by the hypervisor as a result of the guest
>> calling the ibm,client-architecture-support method to indicate
>> its capabilities, so it only exists on systems with a hypervisor.
>> The reason for using this property is that in future, when we
>> have support for using radix under a hypervisor, we will need
>> to check this property to see whether the hypervisor agreed to
>> us using radix.
>
> Hypervisor that doesn't support radix should clear the ibm,pa-features
> radix feature bit right ? We look at that before setting
> MMU_FTR_TYPE_RADIX. So how did we end up enabling radix in the above
> case ?

Two points in response:

1. The ibm,pa-features property specifies the capabilities of the
processor, not the platform. The radix bit would be set on POWER9 even
if the hypervisor doesn't support radix.

2. The hypervisor might well support radix, but the kernel definitely
cannot operate with radix under a hypervisor, because it doesn't put
the necessary things in the ibm,client-architecture-support call, and
it doesn't call the H_REGISTER_PROC_TBL hcall. That means that the CPU
has no way to find our 1st level (process scoped) radix trees even if
the partition was in radix mode. Most likely the hypervisor will put
the partition into HPT mode because the kernel didn't ask for radix in
the CAS call, but even if the hypervisor put the partition into radix
mode, it still wouldn't work.

Paul.
Re: [GIT PULL 00/29] perf/core improvements and fixes
* Arnaldo Carvalho de Melo <a...@kernel.org> wrote:

> Hi Ingo,
>
> Please consider pulling, I had most of this queued before your first
> pull req to Linus for 4.10, most are fixes, with 'perf sched timehist
> --idle' as a followup new feature to the 'perf sched timehist' command
> introduced in this window.
>
> One other thing that delayed this was the samples/bpf/ switch to
> tools/lib/bpf/ that involved fixing up merge clashes with net.git and
> also to properly test it, after more rounds than anticipated, but all
> seems ok now and it would be good to get these merge issues past us
> ASAP.
>
> [...]
Re: [PATCH net v4 0/4] fsl/fman: fixes for ARM
From: Madalin Bucur
Date: Mon, 19 Dec 2016 22:42:42 +0200

> The patch set fixes advertised speeds for QSGMII interfaces, disables
> the A007273 erratum workaround on non-PowerPC platforms where it does
> not apply, enables compilation on ARM64, and addresses a probing issue
> on non-PPC platforms.
>
> Changes from v3: removed redundant comment, added ack by Scott
> Changes from v2: merged fsl/fman changes to avoid a point of failure
> Changes from v1: unifying probing on all supported platforms

Series applied, thanks.
Re: [PATCH v3 13/15] livepatch: change to a per-task consistency model
On Thu 2016-12-08 12:08:38, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model. This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics. This is the
> biggest remaining piece needed to make livepatch more generally useful.
>
> [1] https://lkml.kernel.org/r/20141107140458.ga21...@suse.cz
>
> Signed-off-by: Josh Poimboeuf
> ---
> diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
> index 6c43f6e..f87e742 100644
> --- a/Documentation/livepatch/livepatch.txt
> +++ b/Documentation/livepatch/livepatch.txt

I like the description. Just a note that we will also need to review
the section about limitations. But I am not sure that we want to do it
in this patch. It might open a long discussion on its own.

> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index 1a5a93c..8e06fe5 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -28,18 +28,40 @@
>  
>  #include
>  
> +/* task patch states */
> +#define KLP_UNDEFINED	-1
> +#define KLP_UNPATCHED	 0
> +#define KLP_PATCHED	 1
> +
>  /**
>   * struct klp_func - function structure for live patching
>   * @old_name:	name of the function to be patched
>   * @new_func:	pointer to the patched function code
>   * @old_sympos: a hint indicating which symbol position the old function
>   *		can be found (optional)
> + * @immediate:  patch the func immediately, bypassing backtrace safety checks

There are more checks possible. I would use the same description as for
klp_object.

>   * @old_addr:	the address of the function being patched
>   * @kobj:	kobject for sysfs resources
>   * @stack_node:	list node for klp_ops func_stack list
>   * @old_size:	size of the old function
>   * @new_size:	size of the new function
>   * @patched:	the func has been added to the klp_ops list
> + * @transition:	the func is currently being applied or reverted
> + *
> @@ -86,6 +110,7 @@ struct klp_object {
>   * struct klp_patch - patch structure for live patching
>   * @mod:	reference to the live patch module
>   * @objs:	object entries for kernel objects to be patched
> + * @immediate:  patch all funcs immediately, bypassing safety mechanisms
>   * @list:	list node for global list of registered patches
>   * @kobj:	kobject for sysfs resources
>   * @enabled:	the patch is enabled (but operation may be incomplete)

[...]

> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index fc160c6..22c0c01 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -424,7 +477,10 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
>  		goto err;
>  	}
>  
> -	if (enabled) {
> +	if (patch == klp_transition_patch) {
> +		klp_reverse_transition();
> +		mod_delayed_work(system_wq, &klp_transition_work, 0);

I would put this mod_delayed_work() into klp_reverse_transition().
Also I would put that schedule_delayed_work() into
klp_try_complete_transition(). If I did not miss anything, it will
allow to move the klp_transition_work code to transition.c where it
logically belongs.

> +	} else if (enabled) {
>  		ret = __klp_enable_patch(patch);
>  		if (ret)
>  			goto err;

[...]

> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 5efa262..e79ebb5 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -29,6 +29,7 @@
>  #include
>  #include
>  #include "patch.h"
> +#include "transition.h"
>  
>  static LIST_HEAD(klp_ops);
>  
> @@ -54,15 +55,53 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  {
>  	struct klp_ops *ops;
>  	struct klp_func *func;
> +	int patch_state;
>  
>  	ops = container_of(fops, struct klp_ops, fops);
>  
>  	rcu_read_lock();
> +
>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
>  				      stack_node);
> -	if (WARN_ON_ONCE(!func))
> +
> +	if (!func)
>  		goto unlock;

Why did you remove the WARN_ON_ONCE(), please?

We still add the function on the stack before registering the ftrace
handler. Also we unregister the ftrace handler before removing the last
entry from the stack. AFAIK, unregister_ftrace_function() calls
rcu_synchronize() to make sure that no-one is inside the handler once
finished. Mirek knows more about it.

If this is not true, we have a problem. For example, we call kfree(ops)
after unregister_ftrace_function().

BTW: I thought that this change was really needed because of
klp_try_complete_transition(). But I think that the WARN could and
should stay after all. See below.

> +	/*
> +	 * Enforce the order of the ops->func_stack and func->transition reads.
> +
[GIT PULL 00/29] perf/core improvements and fixes
Hi Ingo,

Please consider pulling, I had most of this queued before your first
pull req to Linus for 4.10, most are fixes, with 'perf sched timehist
--idle' as a followup new feature to the 'perf sched timehist' command
introduced in this window.

One other thing that delayed this was the samples/bpf/ switch to
tools/lib/bpf/ that involved fixing up merge clashes with net.git and
also to properly test it, after more rounds than anticipated, but all
seems ok now and it would be good to get these merge issues past us
ASAP.

- Arnaldo

Test results at the end of this message, as usual.

The following changes since commit e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba:

  Merge tag 'docs-4.10' of git://git.lwn.net/linux (2016-12-12 21:58:13 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20161220

for you to fetch changes up to 9899694a7f67714216665b87318eb367e2c5c901:

  samples/bpf: Move open_raw_sock to separate header (2016-12-20 12:00:40 -0300)

perf/core improvements and fixes:

New features:

- Introduce 'perf sched timehist --idle', to analyse processes
  going to/from idle state (Namhyung Kim)

Fixes:

- Allow 'perf record -u user' to continue when facing races with threads
  going away after having scanned them via /proc (Jiri Olsa)

- Fix 'perf mem' --all-user/--all-kernel options (Jiri Olsa)

- Support jumps with multiple arguments (Ravi Bangoria)

- Fix jumps to before the function where they are located (Ravi Bangoria)

- Fix lock-pi help string (Davidlohr Bueso)

- Fix build of 'perf trace' in odd systems such as a RHEL PPC one (Jiri Olsa)

- Do not overwrite valid build id in 'perf diff' (Kan Liang)

- Don't throw error for zero length symbols, allowing the use of the TUI
  in PowerPC, where such symbols became more common recently (Ravi Bangoria)

Infrastructure:

- Switch samples/bpf/ over to tools/lib/bpf, removing libbpf
  duplication (Joe Stringer)

- Move headers check into bash script (Jiri Olsa)
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>

Arnaldo Carvalho de Melo (3):
      perf tools: Remove some needless __maybe_unused
      samples/bpf: Make perf_event_read() static
      samples/bpf: Be consistent with bpf_load_program bpf_insn parameter

Davidlohr Bueso (1):
      perf bench futex: Fix lock-pi help string

Jiri Olsa (7):
      perf tools: Move headers check into bash script
      perf mem: Fix --all-user/--all-kernel options
      perf evsel: Use variable instead of repeating lengthy FD macro
      perf thread_map: Add thread_map__remove function
      perf evsel: Allow to ignore missing pid
      perf record: Force ignore_missing_thread for uid option
      perf trace: Check if MAP_32BIT is defined (again)

Joe Stringer (8):
      tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
      tools lib bpf: use __u32 from linux/types.h
      tools lib bpf: Add flags to bpf_create_map()
      samples/bpf: Make samples more libbpf-centric
      samples/bpf: Switch over to libbpf
      tools lib bpf: Add bpf_prog_{attach,detach}
      samples/bpf: Remove perf_event_open() declaration
      samples/bpf: Move open_raw_sock to separate header

Kan Liang (1):
      perf diff: Do not overwrite valid build id

Namhyung Kim (6):
      perf sched timehist: Split is_idle_sample()
      perf sched timehist: Introduce struct idle_time_data
      perf sched timehist: Save callchain when entering idle
      perf sched timehist: Skip non-idle events when necessary
      perf sched timehist: Add -I/--idle-hist option
      perf sched timehist: Show callchains for idle stat

Ravi Bangoria (3):
      perf annotate: Support jump instruction with target as second operand
      perf annotate: Fix jump target outside of function address range
      perf annotate: Don't throw error for zero length symbols

 samples/bpf/Makefile            |  70 +--
 samples/bpf/README.rst          |   4 +-
 samples/bpf/bpf_load.c          |  21 +-
 samples/bpf/bpf_load.h          |   3 +
 samples/bpf/fds_example.c       |  13 +-
 samples/bpf/lathist_user.c      |   2 +-
 samples/bpf/libbpf.c            | 176 ---
 samples/bpf/libbpf.h            |  28 +-
 samples/bpf/lwt_len_hist_user.c |   6 +-
 samples/bpf/offwaketime_user.c  |   8 +-
 samples/bpf/sampleip_user.c     |   7 +-
 samples/bpf/sock_example.c      |  14 +-
 samples/bpf/sock_example.h      |  35 ++
 samples/bpf/sockex1_user.c      |   7 +-
 samples/bpf/sockex2_user.c      |   5 +-
 samples/bpf/sockex3_user.c
[PATCH 23/29] perf annotate: Don't throw error for zero length symbols
From: Ravi Bangoria

'perf report --tui' exits with an error when it finds a sample of a zero-length symbol (i.e. addr == sym->start == sym->end). These are actually valid samples. Don't exit the TUI; show the report with such symbols.

Reported-and-Tested-by: Anton Blanchard
Link: https://lkml.org/lkml/2016/10/8/189
Signed-off-by: Ravi Bangoria
Cc: Alexander Shishkin
Cc: Benjamin Herrenschmidt
Cc: Chris Riyder
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Masami Hiramatsu
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: sta...@kernel.org # v4.9+
Link: http://lkml.kernel.org/r/1479804050-5028-1-git-send-email-ravi.bango...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/util/annotate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index c81a3950a7fe..06cc04e5806a 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -647,7 +647,8 @@ static int __symbol__inc_addr_samples(struct symbol *sym, struct map *map,
 	pr_debug3("%s: addr=%#" PRIx64 "\n", __func__, map->unmap_ip(map, addr));
 
-	if (addr < sym->start || addr >= sym->end) {
+	if ((addr < sym->start || addr >= sym->end) &&
+	    (addr != sym->end || sym->start != sym->end)) {
 		pr_debug("%s(%d): ERANGE! sym->name=%s, start=%#" PRIx64 ", addr=%#" PRIx64 ", end=%#" PRIx64 "\n",
 			 __func__, __LINE__, sym->name, sym->start, addr, sym->end);
 		return -ERANGE;
-- 
2.9.3
[PATCH 15/29] perf annotate: Fix jump target outside of function address range
From: Ravi Bangoria

If the jump target is outside of the function range, perf is not handling it correctly. Especially when the target address is less than the function start address, the target offset will be negative. But the target address is declared unsigned, which converts the negative number into its 2's complement. See the example below. Here the target of the 'jmpq' instruction at 34cf8 is 34ac0, which is less than the function start address (34cf0).

34ac0 - 34cf0 = -0x230 = 0x...fdd0

Objdump output:

00034cf0 <__sigaction>:
__GI___sigaction():
   34cf0: lea    -0x20(%rdi),%eax
   34cf3: cmp    $0x1,%eax
   34cf6: jbe    34d00 <__sigaction+0x10>
   34cf8: jmpq   34ac0 <__GI___libc_sigaction>
   34cfd: nopl   (%rax)
   34d00: mov    0x386161(%rip),%rax    # 3bae68 <_DYNAMIC+0x2e8>
   34d07: movl   $0x16,%fs:(%rax)
   34d0e: mov    $0x...,%eax
   34d13: retq

perf annotate before applying the patch:

__GI___sigaction  /usr/lib64/libc-2.22.so
       lea    -0x20(%rdi),%eax
       cmp    $0x1,%eax
    v  jbe    10
    v  jmpq   fdd0
       nop
 10:   mov    _DYNAMIC+0x2e8,%rax
       movl   $0x16,%fs:(%rax)
       mov    $0x...,%eax
       retq

perf annotate after applying the patch:

__GI___sigaction  /usr/lib64/libc-2.22.so
       lea    -0x20(%rdi),%eax
       cmp    $0x1,%eax
    v  jbe    10
    ^  jmpq   34ac0 <__GI___libc_sigaction>
       nop
 10:   mov    _DYNAMIC+0x2e8,%rax
       movl   $0x16,%fs:(%rax)
       mov    $0x...,%eax
       retq

Signed-off-by: Ravi Bangoria
Cc: Alexander Shishkin
Cc: Chris Riyder
Cc: Kim Phillips
Cc: Markus Trippelsdorf
Cc: Masami Hiramatsu
Cc: Naveen N. Rao
Cc: Peter Zijlstra
Cc: Taeung Song
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1480953407-7605-3-git-send-email-ravi.bango...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/ui/browsers/annotate.c |  5 +++--
 tools/perf/util/annotate.c        | 14 +-
 tools/perf/util/annotate.h        |  5 +++--
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/tools/perf/ui/browsers/annotate.c b/tools/perf/ui/browsers/annotate.c
index ec7a30fad149..ba36aac340bc 100644
--- a/tools/perf/ui/browsers/annotate.c
+++ b/tools/perf/ui/browsers/annotate.c
@@ -215,7 +215,7 @@ static void annotate_browser__write(struct ui_browser *browser, void *entry, int
 	ui_browser__set_color(browser, color);
 	if (dl->ins.ops && dl->ins.ops->scnprintf) {
 		if (ins__is_jump(&dl->ins)) {
-			bool fwd = dl->ops.target.offset > (u64)dl->offset;
+			bool fwd = dl->ops.target.offset > dl->offset;
 
 			ui_browser__write_graph(browser, fwd ? SLSMG_DARROW_CHAR :
 							       SLSMG_UARROW_CHAR);
@@ -245,7 +245,8 @@ static bool disasm_line__is_valid_jump(struct disasm_line *dl, struct symbol *sym)
 {
 	if (!dl || !dl->ins.ops || !ins__is_jump(&dl->ins)
 	    || !disasm_line__has_offset(dl)
-	    || dl->ops.target.offset >= symbol__size(sym))
+	    || dl->ops.target.offset < 0
+	    || dl->ops.target.offset >= (s64)symbol__size(sym))
 		return false;
 
 	return true;
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 590244e5781e..c81a3950a7fe 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -230,10 +230,12 @@ static int jump__parse(struct arch *arch __maybe_unused, struct ins_operands *ops,
 	else
 		ops->target.addr = strtoull(ops->raw, NULL, 16);
 
-	if (s++ != NULL)
+	if (s++ != NULL) {
 		ops->target.offset = strtoull(s, NULL, 16);
-	else
-		ops->target.offset = UINT64_MAX;
+		ops->target.offset_avail = true;
+	} else {
+		ops->target.offset_avail = false;
+	}
 
 	return 0;
 }
@@ -241,7 +243,7 @@ static int jump__scnprintf(struct ins *ins, char *bf, size_t size, struct ins_operands *ops)
 {
-	if (!ops->target.addr)
+	if (!ops->target.addr || ops->target.offset < 0)
 		return ins__raw_scnprintf(ins, bf, size, ops);
 
 	return scnprintf(bf, size, "%-6.6s %" PRIx64, ins->name, ops->target.offset);
@@ -1209,9 +1211,11 @@ static int symbol__parse_objdump_line(struct symbol *sym, struct map *map,
 	if (dl == NULL)
[PATCH 14/29] perf annotate: Support jump instruction with target as second operand
From: Ravi Bangoria

Architectures like PowerPC have jump instructions that include a target address as a second operand. For example, 'bne cr7,0xc00f6154'. Add support for such instructions in perf annotate.

objdump o/p:

  c00f6140: ld     r9,1032(r31)
  c00f6144: cmpdi  cr7,r9,0
  c00f6148: bne    cr7,0xc00f6154
  c00f614c: ld     r9,2312(r30)
  c00f6150: std    r9,1032(r31)
  c00f6154: ld     r9,88(r31)

Corresponding perf annotate o/p:

Before patch:

       ld     r9,1032(r31)
       cmpdi  cr7,r9,0
    v  bne    3ff09f2c
       ld     r9,2312(r30)
       std    r9,1032(r31)
 74:   ld     r9,88(r31)

After patch:

       ld     r9,1032(r31)
       cmpdi  cr7,r9,0
    v  bne    74
       ld     r9,2312(r30)
       std    r9,1032(r31)
 74:   ld     r9,88(r31)

Signed-off-by: Ravi Bangoria
Cc: Alexander Shishkin
Cc: Chris Riyder
Cc: Kim Phillips
Cc: Markus Trippelsdorf
Cc: Masami Hiramatsu
Cc: Naveen N. Rao
Cc: Peter Zijlstra
Cc: Taeung Song
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1480953407-7605-2-git-send-email-ravi.bango...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/util/annotate.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index ea7e0de4b9c1..590244e5781e 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -223,8 +223,12 @@ bool ins__is_call(const struct ins *ins)
 static int jump__parse(struct arch *arch __maybe_unused, struct ins_operands *ops, struct map *map __maybe_unused)
 {
 	const char *s = strchr(ops->raw, '+');
+	const char *c = strchr(ops->raw, ',');
 
-	ops->target.addr = strtoull(ops->raw, NULL, 16);
+	if (c++ != NULL)
+		ops->target.addr = strtoull(c, NULL, 16);
+	else
+		ops->target.addr = strtoull(ops->raw, NULL, 16);
 
 	if (s++ != NULL)
 		ops->target.offset = strtoull(s, NULL, 16);
-- 
2.9.3
Re: [PATCH v3 3/5] powerpc/mm: add radix__create_section_mapping()
On Tue, Dec 20, 2016 at 05:28:40PM +1100, Balbir Singh wrote:
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +int radix__create_section_mapping(unsigned long start, unsigned long end)
>> +{
>> +	unsigned long page_size = 1 << mmu_psize_defs[mmu_linear_psize].shift;
>
> Can we refactor bits from radix_init_pgtable() and reuse?

Yes, that's my plan for v4.

-- 
Reza Arbab
Re: [PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor
Paul Mackerras writes:
> Currently, if the kernel is running on a POWER9 processor under a
> hypervisor, it will try to use the radix MMU even though it doesn't
> have the necessary code to use radix under a hypervisor (it doesn't
> negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
> hcall). The result is that the guest kernel will crash when it tries
> to turn on the MMU, because it will still actually be using the HPT
> MMU, but it won't have set up any SLB or HPT entries. It does this
> because the only thing that the kernel looks at in deciding to use
> radix, on any platform, is the ibm,pa-features property on the cpu
> device nodes.
>
> This fixes it by looking for the /chosen/ibm,architecture-vec-5
> property, and if it exists, clearing the radix MMU feature bit.
> We do this before we decide whether to initialize for radix or HPT.
> This property is created by the hypervisor as a result of the guest
> calling the ibm,client-architecture-support method to indicate
> its capabilities, so it only exists on systems with a hypervisor.
> The reason for using this property is that in future, when we
> have support for using radix under a hypervisor, we will need
> to check this property to see whether the hypervisor agreed to
> us using radix.

A hypervisor that doesn't support radix should clear the ibm,pa-features radix feature bit, right? We look at that before setting MMU_FTR_TYPE_RADIX. So how did we end up enabling radix in the above case?
[PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from failed OPAL calls
Added a check for the OPAL_WRONG_STATE error code returned from OPAL. Currently Linux flashes an "unexpected error" message on the console for this error. This change avoids throwing that message and returns an I/O error for such OPAL failures.

Signed-off-by: Vipin K Parashar
---
 arch/powerpc/platforms/powernv/opal.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 2822935..ab91d53 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -866,9 +866,10 @@ int opal_error_code(int rc)
 	case OPAL_NO_MEM:		return -ENOMEM;
 	case OPAL_PERMISSION:		return -EPERM;
 
-	case OPAL_UNSUPPORTED:		return -EIO;
-	case OPAL_HARDWARE:		return -EIO;
-	case OPAL_INTERNAL_ERROR:	return -EIO;
+	case OPAL_UNSUPPORTED:
+	case OPAL_HARDWARE:
+	case OPAL_INTERNAL_ERROR:
+	case OPAL_WRONG_STATE:		return -EIO;
 	default:
 		pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
 		return -EIO;
-- 
2.7.4
[PATCH] powerpc/perf/24x7: use rb_entry
To make the code clearer, use rb_entry() instead of container_of() to deal with the rbtree.

Signed-off-by: Geliang Tang
---
 arch/powerpc/perf/hv-24x7.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 7b2ca16..51bd3b4 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -547,7 +547,7 @@ static int event_uniq_add(struct rb_root *root, const char *name, int nl,
 		struct event_uniq *it;
 		int result;
 
-		it = container_of(*new, struct event_uniq, node);
+		it = rb_entry(*new, struct event_uniq, node);
 		result = ev_uniq_ord(name, nl, domain, it->name, it->nl,
 					it->domain);
-- 
2.9.3
Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename
On 20/12/16 20:32, Benjamin Herrenschmidt wrote:
> On Tue, 2016-12-20 at 20:03 +1100, Balbir Singh wrote:
>> On 19/12/16 19:06, Madhavan Srinivasan wrote:
>>> Move set_soft_enabled() from powerpc/kernel/irq.c to
>>> asm/hw_irq.c, to force updates to paca->soft_enabled to be
>>> done via these access functions. Add a "memory" clobber
>>> to hint the compiler, since paca->soft_enabled memory is the
>>> target here.
>>>
>>> +static inline notrace void soft_enabled_set(unsigned long enable)
>>> +{
>>> +	__asm__ __volatile__("stb %0,%1(13)"
>>> +	: : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))
>>> +	: "memory");
>>> +}
>>> +
>>
>> Can't we just rewrite this in "C"?
>>
>> local_paca->soft_enabled = enable
>
> Well, this is digging out another bloody can of baby eating worms...
>
> So the reason we wrote it like that is because we had gcc playing
> tricks on us.
>
> It has been proven to move around references to local_paca, even
> across preempt_disable boundaries.
>
> The above prevents it.
>
> I'm about 100% sure we have a bunch of other issues related to PACA
> access and gcc playing games though. I remember grepping the kernel
> disassembly once for gcc doing funny things with r13 such as copying it
> into another register or even saving and restoring it and my grep
> didn't come up empty ...

I checked arch_local_irq_restore() before suggesting it, not all functions. I tried a quick audit of the disassembly for code not using loads/stores on r13, and most of it was in hand-written assembly in the exception handling paths, secondary processor initialization, and wake up from idle. Having said that, I audited with one version of gcc on a little-endian vmlinux.

> Something we need to seriously look into one day. I have some ideas on
> how to clean that up in the code to make hardening it easier, let's
> talk next time Nick is in town.

Sure, let's leave it as is for now.

Balbir
[PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor
Currently, if the kernel is running on a POWER9 processor under a hypervisor, it will try to use the radix MMU even though it doesn't have the necessary code to use radix under a hypervisor (it doesn't negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL hcall). The result is that the guest kernel will crash when it tries to turn on the MMU, because it will still actually be using the HPT MMU, but it won't have set up any SLB or HPT entries. It does this because the only thing that the kernel looks at in deciding to use radix, on any platform, is the ibm,pa-features property on the cpu device nodes.

This fixes it by looking for the /chosen/ibm,architecture-vec-5 property, and if it exists, clearing the radix MMU feature bit. We do this before we decide whether to initialize for radix or HPT. This property is created by the hypervisor as a result of the guest calling the ibm,client-architecture-support method to indicate its capabilities, so it only exists on systems with a hypervisor. The reason for using this property is that in future, when we have support for using radix under a hypervisor, we will need to check this property to see whether the hypervisor agreed to us using radix.

Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to enable Radix MMU")
Cc: sta...@vger.kernel.org # v4.7+
Signed-off-by: Paul Mackerras
---
 arch/powerpc/mm/init_64.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a000c35..098531d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -42,6 +42,8 @@
 #include
 #include
 #include
+#include
+#include
 
 #include
 #include
@@ -344,6 +346,28 @@ static int __init parse_disable_radix(char *p)
 }
 early_param("disable_radix", parse_disable_radix);
 
+/*
+ * If we're running under a hypervisor, we currently can't do radix
+ * since we don't have the code to do the H_REGISTER_PROC_TBL hcall.
+ * We tell that we're running under a hypervisor by looking for the
+ * /chosen/ibm,architecture-vec-5 property.
+ */
+static void early_check_vec5(void)
+{
+	unsigned long root, chosen;
+	int size;
+	const u8 *vec5;
+
+	root = of_get_flat_dt_root();
+	chosen = of_get_flat_dt_subnode_by_name(root, "chosen");
+	if (chosen == -FDT_ERR_NOTFOUND)
+		return;
+	vec5 = of_get_flat_dt_prop(chosen, "ibm,architecture-vec-5", &size);
+	if (!vec5)
+		return;
+	cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+}
+
 void __init mmu_early_init_devtree(void)
 {
 	/* Disable radix mode based on kernel command line. */
@@ -351,6 +375,9 @@ void __init mmu_early_init_devtree(void)
 		cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
 
 	if (early_radix_enabled())
+		early_check_vec5();
+
+	if (early_radix_enabled())
 		radix__early_init_devtree();
 	else
 		hash__early_init_devtree();
-- 
2.7.4
Re: [PATCH v3 01/15] stacktrace/x86: add function for detecting reliable stack traces
On Mon 2016-12-19 11:25:49, Josh Poimboeuf wrote:
> On Mon, Dec 19, 2016 at 05:25:19PM +0100, Miroslav Benes wrote:
> > On Thu, 8 Dec 2016, Josh Poimboeuf wrote:
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index 215612c..b4a6663 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -155,6 +155,7 @@ config X86
> > >  	select HAVE_PERF_REGS
> > >  	select HAVE_PERF_USER_STACK_DUMP
> > >  	select HAVE_REGS_AND_STACK_ACCESS_API
> > > +	select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
> >
> > Tests to measure the possible performance penalty of frame pointers were
> > done by Mel Gorman. The outcome was quite clear. There IS a measurable
> > impact. The percentage depends on the workflow but I think it is safe to
> > say that FP usually takes 5-10 percent.
> >
> > If my understanding is correct there is no single culprit. Register
> > pressure is definitely not a problem. We ran simple benchmarks while
> > taking a register away from GCC (RBP or a common one). The impact is a
> > combination of more cacheline pressure, more memory accesses and the fact
> > that the kernel contains a lot of small functions.
> >
> > Thus, I think that DWARF should be the way to go here.
> >
> > Other than that the patch looks good to me.
>
> I agree that DWARF is generally a good idea, and I'm working toward it.
> However there's still quite a bit of work to get there.
>
> For this consistency model to work with DWARF on x86, we would need:
>
> 1) a reliable x86 DWARF unwinder with Linus's blessing
> 2) objtool DWARF support (I'm working on this at the moment)
> 3) probably some kind of runtime NMI stack checking feature to
>    complement objtool, along with a lot of burn time to ensure there are
>    no issues, particularly in entry code

Could you please provide more details about this NMI stack checking? What is it supposed to protect that objtool could not? Will it run regularly or will it be just a random check? The other points are obvious. But I do not know what to think about this NMI thing. And so I am curious :-)

> 4) port save_stack_trace_tsk_reliable() to work with DWARF

Best Regards,
Petr
Re: [PATCH] drivers/pci/hotplug: Handle presence detection change properly
On Tue, Dec 20, 2016 at 05:49:33PM +1100, Gavin Shan wrote:
> Surprise hotplug is driven by interrupt in the PowerNV PCI hotplug
> driver. In the interrupt handler, pnv_php_interrupt(), we wrongly
> bail when pnv_pci_get_presence_state() returns zero, which causes
> the presence change event to always be ignored.
>
> This fixes the issue by bailing on an error (non-zero value) returned
> from pnv_pci_get_presence_state().
>
> Fixes: 360aebd85a4 ("drivers/pci/hotplug: Support surprise hotplug in powernv driver")
> Cc: sta...@vger.kernel.org #v4.9+
> Reported-by: Hank Chang
> Signed-off-by: Gavin Shan
> Tested-by: Willie Liauw
> ---
>  drivers/pci/hotplug/pnv_php.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index 56efaf7..38a2309 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -707,8 +707,12 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
>  		added = !!(lsts & PCI_EXP_LNKSTA_DLLLA);
>  	} else if (sts & PCI_EXP_SLTSTA_PDC) {
>  		ret = pnv_pci_get_presence_state(php_slot->id, &presence);
> -		if (!ret)
> +		if (ret) {
> +			dev_warn(&php_slot->pdev->dev, "PCI slot [%s] error %d handling PDC event (0x%04x)\n",
> +				 php_slot->name, ret, sts);

What can a user do with this warning?
Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename
On Tue, 2016-12-20 at 20:03 +1100, Balbir Singh wrote:
> On 19/12/16 19:06, Madhavan Srinivasan wrote:
> > Move set_soft_enabled() from powerpc/kernel/irq.c to
> > asm/hw_irq.c, to force updates to paca->soft_enabled to be
> > done via these access functions. Add a "memory" clobber
> > to hint the compiler, since paca->soft_enabled memory is the
> > target here.
> >
> > +static inline notrace void soft_enabled_set(unsigned long enable)
> > +{
> > +	__asm__ __volatile__("stb %0,%1(13)"
> > +	: : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))
> > +	: "memory");
> > +}
> > +
>
> Can't we just rewrite this in "C"?
>
> local_paca->soft_enabled = enable

Well, this is digging out another bloody can of baby eating worms...

So the reason we wrote it like that is because we had gcc playing tricks on us. It has been proven to move around references to local_paca, even across preempt_disable boundaries. The above prevents it.

I'm about 100% sure we have a bunch of other issues related to PACA access and gcc playing games though. I remember grepping the kernel disassembly once for gcc doing funny things with r13, such as copying it into another register or even saving and restoring it, and my grep didn't come up empty ...

Something we need to seriously look into one day. I have some ideas on how to clean that up in the code to make hardening it easier; let's talk next time Nick is in town.

Cheers,
Ben.
Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename
On 19/12/16 19:06, Madhavan Srinivasan wrote:
> Move set_soft_enabled() from powerpc/kernel/irq.c to
> asm/hw_irq.c, to force updates to paca->soft_enabled to be
> done via these access functions. Add a "memory" clobber
> to hint the compiler, since paca->soft_enabled memory is the
> target here.
>
> +static inline notrace void soft_enabled_set(unsigned long enable)
> +{
> +	__asm__ __volatile__("stb %0,%1(13)"
> +	: : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))
> +	: "memory");
> +}
> +

Can't we just rewrite this in "C"?

local_paca->soft_enabled = enable

Balbir Singh.