Re: [PATCH v3 3/5] powerpc/mm: add radix__create_section_mapping()

2016-12-20 Thread Anshuman Khandual
On 12/19/2016 02:34 PM, Aneesh Kumar K.V wrote:
> Reza Arbab  writes:
> 
>> Add the linear page mapping function for radix, used by memory hotplug.
>> This is similar to vmemmap_populate().
>>
> Ok, with this patch your first patch becomes useful. Can you merge that
> with this and rename mmu_linear_psize to mmu_hotplug_psize, or even use
> mmu_virtual_psize? The linear naming is confusing.

The mmu_linear_psize variable refers to the page size used to create the
kernel linear mapping (the 0xc00 range) for the newly added memory
section. Why should the name be changed?



Re: [PATCH v3 0/5] powerpc/mm: enable memory hotplug on radix

2016-12-20 Thread Anshuman Khandual
On 12/19/2016 11:28 PM, Reza Arbab wrote:
> On Sat, Dec 17, 2016 at 01:38:40AM +1100, Balbir Singh wrote:
>> Do we care about alt maps yet?
> 
> Good question. I'll try to see if/how altmaps might need special
> consideration here.

vmemmap alt-map support should be enabled separately and should not be
mixed into this series, which just tries to get hotplug working on radix.



Re: [PATCH v4 07/12] powerpc: Add support to take additional parameter in MASKABLE_* macro

2016-12-20 Thread Madhavan Srinivasan



On Tuesday 20 December 2016 08:06 AM, Nicholas Piggin wrote:

On Mon, 19 Dec 2016 13:37:03 +0530
Madhavan Srinivasan  wrote:


To support addition of "bitmask" to MASKABLE_* macros,
factor out the EXCEPTION_PROLOG_1 macro.

Currently soft_enabled is used as the flag to determine
the interrupt state. This patch extends soft_enabled
to be used as a mask instead of a flag.

This is really the core part of the patch -- after reversing the
soft_enable logic to be a disable boolean, now it's being extended
to be a disable mask. The exception macro changes just allow an
interrupt type bit to be passed in later.
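To illustrate, here is a hedged sketch of how a boolean disable flag
generalizes into a per-interrupt-type disable mask (the constants and
helper below are illustrative, not from the patch):

/* Illustrative only -- names are hypothetical. */
#define IRQS_ENABLED		0x00
#define IRQS_DISABLED		0x01	/* ordinary local_irq_disable() */
#define IRQS_PMI_DISABLED	0x02	/* e.g. a perf-interrupt bit */

/* With a mask, each interrupt type can be soft-masked independently. */
static inline bool irq_type_soft_masked(unsigned long soft_mask,
					unsigned long irq_type_bit)
{
	return soft_mask & irq_type_bit;
}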

I should have picked it up earlier, but if you do end up submitting
another version, perhaps consider splitting the disable mask change
and putting it after patch 5.

Yes, will do that. Makes sense.

Maddy



Thanks,
Nick





[RFC PATCH] powerpc/powernv: report error messages from opal

2016-12-20 Thread Oliver O'Halloran
Recent versions of skiboot will raise an OPAL event (read: interrupt)
when firmware writes an error message to its internal console. In
conjunction, they provide an OPAL call that the kernel can use to extract
these messages from the OPAL log so they can be written into the
kernel's log buffer, where someone will (hopefully) look at them.

For the companion skiboot patches see:

https://lists.ozlabs.org/pipermail/skiboot/2016-December/005861.html

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/include/asm/opal-api.h|  5 +++-
 arch/powerpc/include/asm/opal.h|  1 +
 arch/powerpc/platforms/powernv/opal-msglog.c   | 41 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0e2e57bcab50..cb9c0e6afb33 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -167,7 +167,8 @@
 #define OPAL_INT_EOI   124
 #define OPAL_INT_SET_MFRR  125
 #define OPAL_PCI_TCE_KILL  126
-#define OPAL_LAST  126
+#define OPAL_SCRAPE_LOG128
+#define OPAL_LAST  128
 
 /* Device tree flags */
 
@@ -288,6 +289,7 @@ enum OpalPendingState {
OPAL_EVENT_PCI_ERROR   = 0x200,
OPAL_EVENT_DUMP_AVAIL  = 0x400,
OPAL_EVENT_MSG_PENDING = 0x800,
+   OPAL_EVENT_LOG_PENDING = 0x1000,
 };
 
 enum OpalThreadStatus {
@@ -406,6 +408,7 @@ enum opal_msg_type {
OPAL_MSG_DPO= 5,
OPAL_MSG_PRD= 6,
OPAL_MSG_OCC= 7,
+   OPAL_MSG_LOG= 8,
OPAL_MSG_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 5c7db0f1a708..2b3bd3219fb4 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -232,6 +232,7 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
 int64_t opal_rm_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
 uint32_t pe_num, uint32_t tce_size,
 uint64_t dma_addr, uint32_t npages);
+int64_t opal_scrape_log(int64_t *offset, char *buf, int64_t len, int64_t *lvl);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/platforms/powernv/opal-msglog.c b/arch/powerpc/platforms/powernv/opal-msglog.c
index 39d6ff9e5630..78168f66fb24 100644
--- a/arch/powerpc/platforms/powernv/opal-msglog.c
+++ b/arch/powerpc/platforms/powernv/opal-msglog.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* OPAL in-memory console. Defined in OPAL source at core/console.c */
 struct memcons {
@@ -102,8 +103,36 @@ static struct bin_attribute opal_msglog_attr = {
.read = opal_msglog_read
 };
 
+static char *log_levels[] = { "Emergency", "Alert", "Critical", "Error", "Warning" };
+static int64_t offset = -1;
+
+static irqreturn_t opal_print_log(int irq, void *data)
+{
+   int64_t rc, log_lvl;
+   char buffer[320];
+
+   /*
+* only print one message per invocation of the IRQ handler
+*/
+
+   rc = opal_scrape_log(&offset, buffer, sizeof(buffer), &log_lvl);
+
+   if (rc == OPAL_SUCCESS || rc == OPAL_PARTIAL) {
+   log_lvl = be64_to_cpu(log_lvl);
+   if (log_lvl > 4)
+   log_lvl = 4;
+
+   printk_emit(0, log_lvl, NULL, 0, "OPAL %s: %s%s\r\n",
+   log_levels[log_lvl], buffer,
+   rc == OPAL_PARTIAL ? "" : "");
+   }
+
+   return IRQ_HANDLED;
+}
+
 void __init opal_msglog_init(void)
 {
+   int virq, rc = -1;
u64 mcaddr;
struct memcons *mc;
 
@@ -123,6 +152,18 @@ void __init opal_msglog_init(void)
return;
}
 
+   virq = opal_event_request(ilog2(OPAL_EVENT_LOG_PENDING));
+   if (virq) {
+   rc = request_irq(virq, opal_print_log,
+   IRQF_TRIGGER_HIGH, "opal memcons", NULL);
+
+   if (rc)
+   irq_dispose_mapping(virq);
+   }
+
+   if (!virq || rc)
+   pr_warn("Unable to register OPAL log event handler\n");
+
opal_memcons = mc;
 }
 
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 3aa40f1b20f5..c59d7da3fd1a 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -312,3 +312,4 @@ OPAL_CALL(opal_int_set_mfrr,   OPAL_INT_SET_MFRR);
 OPAL_CALL_REAL(opal_rm_int_set_mfrr,   OPAL_INT_SET_MFRR);
 OPAL_CALL(opal_pci_tce_kill,   OPAL_PCI_TCE_KILL);
OPAL_CALL_REAL(opal_rm_pci_tce_kill,   OPAL_PCI_TCE_KILL);
+OPAL_CALL(opal_scrape_log,             OPAL_SCRAPE_LOG);

Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails

2016-12-20 Thread Mukesh Ojha



On Tuesday 20 December 2016 07:46 PM, Vipin K Parashar wrote:

Added a check for the OPAL_WRONG_STATE error code returned from OPAL.
Currently Linux flashes "unexpected error" on the console for this
error. This avoids throwing such a message and returns an I/O error
for such OPAL failures.

Signed-off-by: Vipin K Parashar 
---
  arch/powerpc/platforms/powernv/opal.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 2822935..ab91d53 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -866,9 +866,10 @@ int opal_error_code(int rc)
case OPAL_NO_MEM:   return -ENOMEM;
case OPAL_PERMISSION:   return -EPERM;

-   case OPAL_UNSUPPORTED:  return -EIO;
-   case OPAL_HARDWARE: return -EIO;
-   case OPAL_INTERNAL_ERROR:   return -EIO;
+   case OPAL_UNSUPPORTED:
+   case OPAL_HARDWARE:
+   case OPAL_INTERNAL_ERROR:
+   case OPAL_WRONG_STATE:  return -EIO;


Looks good.

Reviewed-by: Mukesh Ojha 


default:
pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
return -EIO;




Re: [PATCH kernel v2 05/11] KVM: PPC: Use preregistered memory API to access TCE list

2016-12-20 Thread David Gibson
On Sun, Dec 18, 2016 at 12:28:54PM +1100, Alexey Kardashevskiy wrote:
> VFIO on sPAPR already implements guest memory pre-registration
> when the entire guest RAM gets pinned. This can be used to translate
> the physical address of a guest page containing the TCE list
> from H_PUT_TCE_INDIRECT.
> 
> This makes use of the pre-registered memory API to access TCE list
> pages in order to avoid unnecessary locking on the KVM memory
> reverse map as we know that all of guest memory is pinned and
> we have a flat array mapping GPA to HPA which makes it simpler and
> quicker to index into that array (even with looking up the
> kernel page tables in vmalloc_to_phys) than it is to find the memslot,
> lock the rmap entry, look up the user page tables, and unlock the rmap
> entry. Note that the rmap pointer is initialized to NULL
> where declared (not in this patch).
> 
> If a requested chunk of memory has not been preregistered,
> this will fail with H_TOO_HARD so the virtual mode handler can
> handle the request.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * updated the commit log with David's comment
> ---
>  arch/powerpc/kvm/book3s_64_vio_hv.c | 65 
> -
>  1 file changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index d461c440889a..a3be4bd6188f 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -180,6 +180,17 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static inline bool kvmppc_preregistered(struct kvm_vcpu *vcpu)
> +{
> + return mm_iommu_preregistered(vcpu->kvm->mm);
> +}
> +
> +static struct mm_iommu_table_group_mem_t *kvmppc_rm_iommu_lookup(
> + struct kvm_vcpu *vcpu, unsigned long ua, unsigned long size)
> +{
> + return mm_iommu_lookup_rm(vcpu->kvm->mm, ua, size);
> +}

I don't see that there's much point to these inlines.

>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>   unsigned long ioba, unsigned long tce)
>  {
> @@ -260,23 +271,44 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>   if (ret != H_SUCCESS)
>   return ret;
>  
> - if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, &rmap))
> - return H_TOO_HARD;
> + if (kvmppc_preregistered(vcpu)) {
> + /*
> +  * We get here if guest memory was pre-registered which
> +  * is normally VFIO case and gpa->hpa translation does not
> +  * depend on hpt.
> +  */
> + struct mm_iommu_table_group_mem_t *mem;
>  
> - rmap = (void *) vmalloc_to_phys(rmap);
> + if (kvmppc_gpa_to_ua(vcpu->kvm, tce_list, &ua, NULL))
> + return H_TOO_HARD;
>  
> - /*
> -  * Synchronize with the MMU notifier callbacks in
> -  * book3s_64_mmu_hv.c (kvm_unmap_hva_hv etc.).
> -  * While we have the rmap lock, code running on other CPUs
> -  * cannot finish unmapping the host real page that backs
> -  * this guest real page, so we are OK to access the host
> -  * real page.
> -  */
> - lock_rmap(rmap);
> - if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
> - ret = H_TOO_HARD;
> - goto unlock_exit;
> + mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
> + if (!mem || mm_iommu_ua_to_hpa_rm(mem, ua, &tces))
> + return H_TOO_HARD;
> + } else {
> + /*
> +  * This is emulated devices case.
> +  * We do not require memory to be preregistered in this case
> +  * so lock rmap and do __find_linux_pte_or_hugepte().
> +  */

Hmm.  So this isn't wrong as such, but the logic and comments are
both misleading.  The 'if' here isn't really about VFIO vs. emulated -
it's about whether the mm has *any* preregistered chunks, without any
regard to which particular device you're talking about.  For example
if your guest has two PHBs, one with VFIO devices and the other with
emulated devices, then the emulated devices will still go through the
"VFIO" case here.

Really what you have here is a fast case when the tce_list is in
preregistered memory, and a fallback case when it isn't.  But that's
obscured by the fact that if for whatever reason you have some
preregistered memory but it doesn't cover the tce_list, then you don't
go to the fallback case here, but instead fall right back to the
virtual mode handler.

So, I think you should either:
1) Fallback to the code below whenever you can't access the
   tce_list via prereg memory, regardless of whether there's any
   _other_ prereg memory
or
2) Drop the code below entirely and always return H_TOO_HARD if
   you can't get the tce_list from prereg.
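As a hedged sketch, option 1 would shape the control flow roughly like
this (names taken from the patch under review; an outline, not a
complete implementation):

	mem = kvmppc_rm_iommu_lookup(vcpu, ua, IOMMU_PAGE_SIZE_4K);
	if (mem && !mm_iommu_ua_to_hpa_rm(mem, ua, &tces)) {
		/* fast path: tce_list is covered by prereg memory */
	} else {
		/* slow path: lock the rmap and translate via the HPT */
		lock_rmap(rmap);
		if (kvmppc_rm_ua_to_hpa(vcpu, ua, &tces)) {
			ret = H_TOO_HARD;
			goto unlock_exit;
		}
	}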

> + if 

Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename

2016-12-20 Thread Madhavan Srinivasan



On Tuesday 20 December 2016 02:33 PM, Balbir Singh wrote:


On 19/12/16 19:06, Madhavan Srinivasan wrote:

Move set_soft_enabled() from powerpc/kernel/irq.c to
asm/hw_irq.h, to force updates to paca->soft_enabled to be
done via these access functions. Add a "memory" clobber
to hint to the compiler that paca->soft_enabled memory is
the target here.
+static inline notrace void soft_enabled_set(unsigned long enable)
+{
+   __asm__ __volatile__("stb %0,%1(13)"
+   : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))
+   : "memory");
+}
+

Can't we just rewrite this in "C"?

local_paca->soft_enabled = enable


Yes, we did discuss this and parked it to be handled in a follow-on
patchset.

Nick did suggest something of this sort,

__asm__ __volatile__("stb %1,%0" : "=m" (local_paca->soft_enabled) : "r" 
(enable));
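For comparison, a hedged sketch of the difference between the two forms
(both store to paca->soft_enabled; only the constraints differ):

/* Current form: the "memory" clobber makes gcc assume all of memory may
 * be written, which orders the store against surrounding accesses. */
__asm__ __volatile__("stb %0,%1(13)"
	: : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))
	: "memory");

/* Nick's form: an "=m" output names the exact memory operand, so gcc
 * knows precisely what is written without a full memory clobber. */
__asm__ __volatile__("stb %1,%0"
	: "=m" (local_paca->soft_enabled) : "r" (enable));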





Balbir Singh.





Re: [PATCH v4 01/12] powerpc: Add #defs for paca->soft_enabled flags

2016-12-20 Thread Madhavan Srinivasan



On Tuesday 20 December 2016 01:01 PM, Balbir Singh wrote:




+/*
+ * flags for paca->soft_enabled
+ */
+#define IRQ_DISABLE_MASK_NONE	1
+#define IRQ_DISABLE_MASK_LINUX	0
+


Isn't one just the complement of the other?


Yes. But these are initial cleanup patches, and in subsequent
patches we will reuse these as mask bits instead of flags.

Maddy



Balbir





Re: [PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor

2016-12-20 Thread Suraj Jitindar Singh
On Tue, 2016-12-20 at 22:40 +1100, Paul Mackerras wrote:
> Currently, if the kernel is running on a POWER9 processor under a
> hypervisor, it will try to use the radix MMU even though it doesn't
> have the necessary code to use radix under a hypervisor (it doesn't
> negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
> hcall).  The result is that the guest kernel will crash when it tries
> to turn on the MMU, because it will still actually be using the HPT
> MMU, but it won't have set up any SLB or HPT entries.  It does this
> because the only thing that the kernel looks at in deciding to use
> radix, on any platform, is the ibm,pa-features property on the cpu
> device nodes.
> 
> This fixes it by looking for the /chosen/ibm,architecture-vec-5
> property, and if it exists, clearing the radix MMU feature bit.
> We do this before we decide whether to initialize for radix or HPT.
> This property is created by the hypervisor as a result of the guest
> calling the ibm,client-architecture-support method to indicate
> its capabilities, so it only exists on systems with a hypervisor.
> The reason for using this property is that in future, when we
> have support for using radix under a hypervisor, we will need
> to check this property to see whether the hypervisor agreed to
> us using radix.
> 
> Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to enable Radix MMU")
> Cc: sta...@vger.kernel.org # v4.7+
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/mm/init_64.c | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a000c35..098531d 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -42,6 +42,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -344,6 +346,28 @@ static int __init parse_disable_radix(char *p)
>  }
>  early_param("disable_radix", parse_disable_radix);
>  
> +/*
> + * If we're running under a hypervisor, we currently can't do radix
> + * since we don't have the code to do the H_REGISTER_PROC_TBL hcall.
> + * We can tell that we're running under a hypervisor by looking for the
> + * /chosen/ibm,architecture-vec-5 property.
> + */
> +static void early_check_vec5(void)
> +{
> + unsigned long root, chosen;
> + int size;
> + const u8 *vec5;
> +
> + root = of_get_flat_dt_root();
> + chosen = of_get_flat_dt_subnode_by_name(root, "chosen");
> + if (chosen == -FDT_ERR_NOTFOUND)
> + return;
> + vec5 = of_get_flat_dt_prop(chosen, "ibm,architecture-vec-5", &size);
> + if (!vec5)
> + return;
> + cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
> +}
> +
Given that currently radix guest support doesn't exist upstream, it's
sufficient to check for the existence of the vec5 node to determine
that we are a guest and thus can't run radix.

Is it worth checking the specific radix feature bit of the vec5 node so
that this code is still correct for determining the lack of radix
support by the host platform once guest radix kernels are (in the
future) supported?
>  void __init mmu_early_init_devtree(void)
>  {
>   /* Disable radix mode based on kernel command line. */
> @@ -351,6 +375,9 @@ void __init mmu_early_init_devtree(void)
>   cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
>  
>   if (early_radix_enabled())
> + early_check_vec5();
> +
> + if (early_radix_enabled())
>   radix__early_init_devtree();
>   else
>   hash__early_init_devtree();


Re: [PATCH] drivers/pci/hotplug: Handle presence detection change properly

2016-12-20 Thread Gavin Shan
On Tue, Dec 20, 2016 at 10:37:09AM +0100, Greg KH wrote:
>On Tue, Dec 20, 2016 at 05:49:33PM +1100, Gavin Shan wrote:
>> The surprise hotplug is driven by interrupt in the PowerNV PCI hotplug
>> driver. In the interrupt handler, pnv_php_interrupt(), we wrongly bail
>> when pnv_pci_get_presence_state() returns zero. It causes the
>> presence change event to always be ignored.
>> 
>> This fixes the issue by bailing on an error (non-zero value) returned
>> from pnv_pci_get_presence_state().
>> 
>> Fixes: 360aebd85a4 ("drivers/pci/hotplug: Support surprise hotplug in powernv driver")
>> Cc: sta...@vger.kernel.org #v4.9+
>> Reported-by: Hank Chang 
>> Signed-off-by: Gavin Shan 
>> Tested-by: Willie Liauw 
>> ---
>>  drivers/pci/hotplug/pnv_php.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
>> index 56efaf7..38a2309 100644
>> --- a/drivers/pci/hotplug/pnv_php.c
>> +++ b/drivers/pci/hotplug/pnv_php.c
>> @@ -707,8 +707,12 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
>>  added = !!(lsts & PCI_EXP_LNKSTA_DLLLA);
>>  } else if (sts & PCI_EXP_SLTSTA_PDC) {
>>  ret = pnv_pci_get_presence_state(php_slot->id, &presence);
>> -if (!ret)
>> +if (ret) {
>> +dev_warn(&pdev->dev, "PCI slot [%s] error %d handling PDC event (0x%04x)\n",
>> +php_slot->name, ret, sts);
>
>What can a user do with this warning?
>

This warning is to catch the developer's attention only. It's very
rare for pnv_pci_get_presence_state() to fail, as it simply checks a
hardware register bit (from the Slot Status register in the PCIe
capability, or the PHB's Hotplug Override register) in the skiboot
firmware.

The user needs to retry the operation (hot-add or hot-remove) when
seeing this warning. Greg, please let me know if the below message
is better:

dev_warn(&pdev->dev, "PCI slot [%s] error %d getting presence status on event 0x%04x, to retry the operation.\n",
 php_slot->name, ret, sts);

Thanks,
Gavin



Re: [PATCH v3 01/15] stacktrace/x86: add function for detecting reliable stack traces

2016-12-20 Thread Josh Poimboeuf
On Tue, Dec 20, 2016 at 10:39:16AM +0100, Petr Mladek wrote:
> On Mon 2016-12-19 11:25:49, Josh Poimboeuf wrote:
> > 3) probably some kind of runtime NMI stack checking feature to
> >complement objtool, along with a lot of burn time to ensure there are
> >no issues, particularly in entry code
> 
> Could you please provide more details about this NMI stack checking?
> What is it supposed to protect that objtool could not?
> Will it run regularly or will it be just a random check?

save_stack_trace_tsk_reliable(current) would be called periodically from
an NMI handler, and a warning would be printed if it ever doesn't reach
the "end" of the stack (i.e., user-mode pt_regs).  Due to the
performance impact it would probably only be a debug option.
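A hedged sketch of such a check, assuming the
save_stack_trace_tsk_reliable() API from this series (the NMI handler
wiring itself is omitted):

/* Debug-only sketch: from NMI context, verify the reliable unwinder can
 * walk the interrupted task's stack all the way to user-mode pt_regs.
 * A failure means some code (entry, ftrace, kprobes, bpf, ...) produced
 * a stack the unwinder can't follow. */
static void nmi_check_stack_reliability(void)
{
	unsigned long entries[64];
	struct stack_trace trace = {
		.entries	= entries,
		.max_entries	= ARRAY_SIZE(entries),
	};

	if (save_stack_trace_tsk_reliable(current, &trace))
		WARN_ONCE(1, "unreliable stack trace for %s/%d\n",
			  current->comm, current->pid);
}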

It would verify the special hand-coded areas which objtool isn't smart
enough to understand, like entry code, ftrace, kprobes, bpf.  It would
also make sure that objtool itself didn't miss anything.

-- 
Josh


Re: [PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor

2016-12-20 Thread Paul Mackerras
On Tue, Dec 20, 2016 at 08:19:02PM +0530, Aneesh Kumar K.V wrote:
> Paul Mackerras  writes:
> 
> > Currently, if the kernel is running on a POWER9 processor under a
> > hypervisor, it will try to use the radix MMU even though it doesn't
> > have the necessary code to use radix under a hypervisor (it doesn't
> > negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
> > hcall).  The result is that the guest kernel will crash when it tries
> > to turn on the MMU, because it will still actually be using the HPT
> > MMU, but it won't have set up any SLB or HPT entries.  It does this
> > because the only thing that the kernel looks at in deciding to use
> > radix, on any platform, is the ibm,pa-features property on the cpu
> > device nodes.
> >
> > This fixes it by looking for the /chosen/ibm,architecture-vec-5
> > property, and if it exists, clearing the radix MMU feature bit.
> > We do this before we decide whether to initialize for radix or HPT.
> > This property is created by the hypervisor as a result of the guest
> > calling the ibm,client-architecture-support method to indicate
> > its capabilities, so it only exists on systems with a hypervisor.
> > The reason for using this property is that in future, when we
> > have support for using radix under a hypervisor, we will need
> > to check this property to see whether the hypervisor agreed to
> > us using radix.
> 
> A hypervisor that doesn't support radix should clear the ibm,pa-features
> radix feature bit, right? We look at that before setting
> MMU_FTR_TYPE_RADIX. So how did we end up enabling radix in the above
> case?

Two points in response:

1. The ibm,pa-features property specifies the capabilities of the
processor, not the platform.  The radix bit would be set on POWER9
even if the hypervisor doesn't support radix.

2. The hypervisor might well support radix, but the kernel definitely
cannot operate with radix under a hypervisor, because it doesn't put
the necessary things in the ibm,client-architecture-support call, and
it doesn't call the H_REGISTER_PROC_TBL hcall.  That means that the
CPU has no way to find our 1st level (process scoped) radix trees even
if the partition was in radix mode.  Most likely the hypervisor will
put the partition into HPT mode because the kernel didn't ask for
radix in the CAS call, but even if the hypervisor put the partition
into radix mode, it still wouldn't work.

Paul.


Re: [GIT PULL 00/29] perf/core improvements and fixes

2016-12-20 Thread Ingo Molnar

* Arnaldo Carvalho de Melo <a...@kernel.org> wrote:

> Hi Ingo,
> 
> Please consider pulling, I had most of this queued before your first
> pull req to Linus for 4.10, most are fixes, with 'perf sched timehist --idle'
> as a followup new feature to the 'perf sched timehist' command introduced in
> this window.
>   
>   One other thing that delayed this was the samples/bpf/ switch to
> tools/lib/bpf/ that involved fixing up merge clashes with net.git and also
> properly testing it, after more rounds than anticipated, but all seems ok
> now and it would be good to get these merge issues past us ASAP.
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba:
> 
>   Merge tag 'docs-4.10' of git://git.lwn.net/linux (2016-12-12 21:58:13 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20161220
> 
> for you to fetch changes up to 9899694a7f67714216665b87318eb367e2c5c901:
> 
>   samples/bpf: Move open_raw_sock to separate header (2016-12-20 12:00:40 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Introduce 'perf sched timehist --idle', to analyse processes
>   going to/from idle state (Namhyung Kim)
> 
> Fixes:
> 
> - Allow 'perf record -u user' to continue when facing races with threads
>   going away after having scanned them via /proc (Jiri Olsa)
> 
> - Fix 'perf mem' --all-user/--all-kernel options (Jiri Olsa)
> 
> - Support jumps with multiple arguments (Ravi Bangoria)
> 
> - Fix jumps to before the function where they are located (Ravi Bangoria)
> 
> - Fix lock-pi help string (Davidlohr Bueso)
> 
> - Fix build of 'perf trace' in odd systems such as a RHEL PPC one (Jiri Olsa)
> 
> - Do not overwrite valid build id in 'perf diff' (Kan Liang)
> 
> - Don't throw error for zero length symbols, allowing the use of the TUI
>   in PowerPC, where such symbols became more common recently (Ravi Bangoria)
> 
> Infrastructure:
> 
> - Switch of samples/bpf/ to use tools/lib/bpf, removing libbpf
>   duplication (Joe Stringer)
> 
> - Move headers check into bash script (Jiri Olsa)
> 
> Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
> 
> 
> Arnaldo Carvalho de Melo (3):
>   perf tools: Remove some needless __maybe_unused
>   samples/bpf: Make perf_event_read() static
>   samples/bpf: Be consistent with bpf_load_program bpf_insn parameter
> 
> Davidlohr Bueso (1):
>   perf bench futex: Fix lock-pi help string
> 
> Jiri Olsa (7):
>   perf tools: Move headers check into bash script
>   perf mem: Fix --all-user/--all-kernel options
>   perf evsel: Use variable instead of repeating lengthy FD macro
>   perf thread_map: Add thread_map__remove function
>   perf evsel: Allow to ignore missing pid
>   perf record: Force ignore_missing_thread for uid option
>   perf trace: Check if MAP_32BIT is defined (again)
> 
> Joe Stringer (8):
>   tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
>   tools lib bpf: use __u32 from linux/types.h
>   tools lib bpf: Add flags to bpf_create_map()
>   samples/bpf: Make samples more libbpf-centric
>   samples/bpf: Switch over to libbpf
>   tools lib bpf: Add bpf_prog_{attach,detach}
>   samples/bpf: Remove perf_event_open() declaration
>   samples/bpf: Move open_raw_sock to separate header
> 
> Kan Liang (1):
>   perf diff: Do not overwrite valid build id
> 
> Namhyung Kim (6):
>   perf sched timehist: Split is_idle_sample()
>   perf sched timehist: Introduce struct idle_time_data
>   perf sched timehist: Save callchain when entering idle
>   perf sched timehist: Skip non-idle events when necessary
>   perf sched timehist: Add -I/--idle-hist option
>   perf sched timehist: Show callchains for idle stat
> 
> Ravi Bangoria (3):
>   perf annotate: Support jump instruction with target as second operand
>   perf annotate: Fix jump target outside of function address range
>   perf annotate: Don't throw error for zero length symbols
> 
>  samples/bpf/Makefile  |  70 +--
>  samples/bpf/README.rst|   4 +-
>  samples/bpf/bpf_load.c|  21 +-
>  samples/bpf/bpf_load.h|   3 +
>  samples/bpf/fds_example.c |  13 +-
>  samples/bpf/lathis

Re: [PATCH net v4 0/4] fsl/fman: fixes for ARM

2016-12-20 Thread David Miller
From: Madalin Bucur 
Date: Mon, 19 Dec 2016 22:42:42 +0200

> The patch set fixes advertised speeds for QSGMII interfaces, disables
> A007273 erratum workaround on non-PowerPC platforms where it does not
> apply, enables compilation on ARM64 and addresses a probing issue on
> non PPC platforms.
> 
> Changes from v3: removed redundant comment, added ack by Scott
> Changes from v2: merged fsl/fman changes to avoid a point of failure
> Changes from v1: unifying probing on all supported platforms

Series applied, thanks.


Re: [PATCH v3 13/15] livepatch: change to a per-task consistency model

2016-12-20 Thread Petr Mladek
On Thu 2016-12-08 12:08:38, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.
> 
> [1] https://lkml.kernel.org/r/20141107140458.ga21...@suse.cz
> 
> Signed-off-by: Josh Poimboeuf 
> ---
> diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
> index 6c43f6e..f87e742 100644
> --- a/Documentation/livepatch/livepatch.txt
> +++ b/Documentation/livepatch/livepatch.txt

I like the description.

Just a note that we will also need to review the section about
limitations. But I am not sure that we want to do it in this patch.
It might open a long discussion on its own.

> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index 1a5a93c..8e06fe5 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -28,18 +28,40 @@
>  
>  #include 
>  
> +/* task patch states */
> +#define KLP_UNDEFINED-1
> +#define KLP_UNPATCHED 0
> +#define KLP_PATCHED   1
> +
>  /**
>   * struct klp_func - function structure for live patching
>   * @old_name:name of the function to be patched
>   * @new_func:pointer to the patched function code
>   * @old_sympos: a hint indicating which symbol position the old function
>   *   can be found (optional)
> + * @immediate:  patch the func immediately, bypassing backtrace safety checks

There are more checks possible. I would use the same description
as for klp_object.


>   * @old_addr:the address of the function being patched
>   * @kobj:kobject for sysfs resources
>   * @stack_node:  list node for klp_ops func_stack list
>   * @old_size:size of the old function
>   * @new_size:size of the new function
>   * @patched: the func has been added to the klp_ops list
> + * @transition:  the func is currently being applied or reverted
> + *
> @@ -86,6 +110,7 @@ struct klp_object {
>   * struct klp_patch - patch structure for live patching
>   * @mod: reference to the live patch module
>   * @objs:object entries for kernel objects to be patched
> + * @immediate:  patch all funcs immediately, bypassing safety mechanisms
>   * @list:list node for global list of registered patches
>   * @kobj:kobject for sysfs resources
>   * @enabled: the patch is enabled (but operation may be incomplete)

[...]

> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index fc160c6..22c0c01 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -424,7 +477,10 @@ static ssize_t enabled_store(struct kobject *kobj, struct kobj_attribute *attr,
>   goto err;
>   }
>  
> - if (enabled) {
> + if (patch == klp_transition_patch) {
> + klp_reverse_transition();
> + mod_delayed_work(system_wq, &klp_transition_work, 0);

I would put this mod_delayed_work() into klp_reverse_transition().
Also I would put that schedule_delayed_work() into
klp_try_complete_transition().

If I did not miss anything, this will allow moving the
klp_transition_work code to transition.c, where it logically
belongs.

> + } else if (enabled) {
>   ret = __klp_enable_patch(patch);
>   if (ret)
>   goto err;

[...]

> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 5efa262..e79ebb5 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include "patch.h"
> +#include "transition.h"
>  
>  static LIST_HEAD(klp_ops);
>  
> @@ -54,15 +55,53 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  {
>   struct klp_ops *ops;
>   struct klp_func *func;
> + int patch_state;
>  
>   ops = container_of(fops, struct klp_ops, fops);
>  
>   rcu_read_lock();
> +
> + func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
> +   stack_node);
> - if (WARN_ON_ONCE(!func))
> +
> + if (!func)
>   goto unlock;

Why did you remove the WARN_ON_ONCE(), please?

We still add the function on the stack before registering
the ftrace handler. Also, we unregister the ftrace handler
before removing the last entry from the stack.

AFAIK, unregister_ftrace_function() calls rcu_synchronize()
to make sure that no one is inside the handler once it has finished.
Mirek knows more about it.

If this is not true, we have a problem. For example,
we call kfree(ops) after unregister_ftrace_function();
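
For reference, the teardown ordering being described looks roughly like
this (a sketch based on the existing klp_ops handling, not code from
this patch):

	if (list_is_singular(&ops->func_stack)) {
		/* waits, via RCU, for any handler still running */
		WARN_ON(unregister_ftrace_function(&ops->fops));
		list_del_rcu(&func->stack_node); /* now empty the stack */
		list_del(&ops->node);
		kfree(ops);	/* safe: the handler can no longer run */
	} else {
		list_del_rcu(&func->stack_node);
	}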

BTW: I thought that this change was really needed because of
klp_try_complete_transition(). But I think that the WARN
could and should stay after all. See below.


> + /*
> +  * Enforce the order of the ops->func_stack and func->transition reads.
> +  

[GIT PULL 00/29] perf/core improvements and fixes

2016-12-20 Thread Arnaldo Carvalho de Melo
Hi Ingo,

Please consider pulling, I had most of this queued before your first
pull req to Linus for 4.10, most are fixes, with 'perf sched timehist --idle'
as a followup new feature to the 'perf sched timehist' command introduced in
this window.

One other thing that delayed this was the samples/bpf/ switch to
tools/lib/bpf/ that involved fixing up merge clashes with net.git and also
properly testing it, after more rounds than anticipated, but all seems ok
now and it would be good to get these merge issues past us ASAP.

- Arnaldo

Test results at the end of this message, as usual.

The following changes since commit e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba:

  Merge tag 'docs-4.10' of git://git.lwn.net/linux (2016-12-12 21:58:13 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-20161220

for you to fetch changes up to 9899694a7f67714216665b87318eb367e2c5c901:

  samples/bpf: Move open_raw_sock to separate header (2016-12-20 12:00:40 -0300)


perf/core improvements and fixes:

New features:

- Introduce 'perf sched timehist --idle', to analyse processes
  going to/from idle state (Namhyung Kim)

Fixes:

- Allow 'perf record -u user' to continue when facing races with threads
  going away after having scanned them via /proc (Jiri Olsa)

- Fix 'perf mem' --all-user/--all-kernel options (Jiri Olsa)

- Support jumps with multiple arguments (Ravi Bangoria)

- Fix jumps to before the function where they are located (Ravi Bangoria)

- Fix lock-pi help string (Davidlohr Bueso)

- Fix build of 'perf trace' in odd systems such as a RHEL PPC one (Jiri Olsa)

- Do not overwrite valid build id in 'perf diff' (Kan Liang)

- Don't throw error for zero length symbols, allowing the use of the TUI
  in PowerPC, where such symbols became more common recently (Ravi Bangoria)

Infrastructure:

- Switch of samples/bpf/ to use tools/lib/bpf, removing libbpf
  duplication (Joe Stringer)

- Move headers check into bash script (Jiri Olsa)

Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>


Arnaldo Carvalho de Melo (3):
  perf tools: Remove some needless __maybe_unused
  samples/bpf: Make perf_event_read() static
  samples/bpf: Be consistent with bpf_load_program bpf_insn parameter

Davidlohr Bueso (1):
  perf bench futex: Fix lock-pi help string

Jiri Olsa (7):
  perf tools: Move headers check into bash script
  perf mem: Fix --all-user/--all-kernel options
  perf evsel: Use variable instead of repeating lengthy FD macro
  perf thread_map: Add thread_map__remove function
  perf evsel: Allow to ignore missing pid
  perf record: Force ignore_missing_thread for uid option
  perf trace: Check if MAP_32BIT is defined (again)

Joe Stringer (8):
  tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
  tools lib bpf: use __u32 from linux/types.h
  tools lib bpf: Add flags to bpf_create_map()
  samples/bpf: Make samples more libbpf-centric
  samples/bpf: Switch over to libbpf
  tools lib bpf: Add bpf_prog_{attach,detach}
  samples/bpf: Remove perf_event_open() declaration
  samples/bpf: Move open_raw_sock to separate header

Kan Liang (1):
  perf diff: Do not overwrite valid build id

Namhyung Kim (6):
  perf sched timehist: Split is_idle_sample()
  perf sched timehist: Introduce struct idle_time_data
  perf sched timehist: Save callchain when entering idle
  perf sched timehist: Skip non-idle events when necessary
  perf sched timehist: Add -I/--idle-hist option
  perf sched timehist: Show callchains for idle stat

Ravi Bangoria (3):
  perf annotate: Support jump instruction with target as second operand
  perf annotate: Fix jump target outside of function address range
  perf annotate: Don't throw error for zero length symbols

 samples/bpf/Makefile  |  70 +--
 samples/bpf/README.rst|   4 +-
 samples/bpf/bpf_load.c|  21 +-
 samples/bpf/bpf_load.h|   3 +
 samples/bpf/fds_example.c |  13 +-
 samples/bpf/lathist_user.c|   2 +-
 samples/bpf/libbpf.c  | 176 ---
 samples/bpf/libbpf.h  |  28 +-
 samples/bpf/lwt_len_hist_user.c   |   6 +-
 samples/bpf/offwaketime_user.c|   8 +-
 samples/bpf/sampleip_user.c   |   7 +-
 samples/bpf/sock_example.c|  14 +-
 samples/bpf/sock_example.h|  35 ++
 samples/bpf/sockex1_user.c|   7 +-
 samples/bpf/sockex2_user.c|   5 +-
 samples/bpf/sockex3_user.c   

[PATCH 23/29] perf annotate: Don't throw error for zero length symbols

2016-12-20 Thread Arnaldo Carvalho de Melo
From: Ravi Bangoria 

'perf report --tui' exits with an error when it finds a sample of a
zero-length symbol (i.e. addr == sym->start == sym->end). Actually these
are valid samples. Don't exit the TUI; show the report with such symbols.

Reported-and-Tested-by: Anton Blanchard 
Link: https://lkml.org/lkml/2016/10/8/189
Signed-off-by: Ravi Bangoria 
Cc: Alexander Shishkin 
Cc: Benjamin Herrenschmidt 
Cc: Chris Riyder 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Masami Hiramatsu 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: sta...@kernel.org # v4.9+
Link: 
http://lkml.kernel.org/r/1479804050-5028-1-git-send-email-ravi.bango...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/annotate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index c81a3950a7fe..06cc04e5806a 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -647,7 +647,8 @@ static int __symbol__inc_addr_samples(struct symbol *sym, struct map *map,
 
pr_debug3("%s: addr=%#" PRIx64 "\n", __func__, map->unmap_ip(map, addr));
 
-   if (addr < sym->start || addr >= sym->end) {
+   if ((addr < sym->start || addr >= sym->end) &&
+   (addr != sym->end || sym->start != sym->end)) {
pr_debug("%s(%d): ERANGE! sym->name=%s, start=%#" PRIx64 ", addr=%#" PRIx64 ", end=%#" PRIx64 "\n",
   __func__, __LINE__, sym->name, sym->start, addr, sym->end);
return -ERANGE;
-- 
2.9.3



[PATCH 15/29] perf annotate: Fix jump target outside of function address range

2016-12-20 Thread Arnaldo Carvalho de Melo
From: Ravi Bangoria 

If the jump target is outside of the function range, perf is not handling
it correctly, especially when the target address is less than the function
start address: the target offset will be negative, but the target address
is declared unsigned, so the negative number is converted into its 2's
complement. See the example below. Here the target of the 'jumpq'
instruction at 34cf8 is 34ac0, which is less than the function start
address (34cf0).

34ac0 - 34cf0 = -0x230 = 0xfffffffffffffdd0
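
As a minimal standalone illustration (not part of the patch) of that
wraparound:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	uint64_t target = 0x34ac0, start = 0x34cf0;
	uint64_t offset = target - start;  /* unsigned: wraps, never negative */

	printf("%#" PRIx64 "\n", offset);  /* prints 0xfffffffffffffdd0 */
	return 0;
}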

Objdump output:

  00034cf0 <__sigaction>:
  __GI___sigaction():
    34cf0: lea    -0x20(%rdi),%eax
    34cf3: cmp    $0x1,%eax
    34cf6: jbe    34d00 <__sigaction+0x10>
    34cf8: jmpq   34ac0 <__GI___libc_sigaction>
    34cfd: nopl   (%rax)
    34d00: mov    0x386161(%rip),%rax        # 3bae68 <_DYNAMIC+0x2e8>
    34d07: movl   $0x16,%fs:(%rax)
    34d0e: mov    $0xffffffff,%eax
    34d13: retq

perf annotate before applying patch:

  __GI___sigaction  /usr/lib64/libc-2.22.so
       lea    -0x20(%rdi),%eax
       cmp    $0x1,%eax
    v  jbe    10
    v  jmpq   fffffffffffffdd0
       nop
    10:mov    _DYNAMIC+0x2e8,%rax
       movl   $0x16,%fs:(%rax)
       mov    $0xffffffff,%eax
       retq

perf annotate after applying patch:

  __GI___sigaction  /usr/lib64/libc-2.22.so
       lea    -0x20(%rdi),%eax
       cmp    $0x1,%eax
    v  jbe    10
    ^  jmpq   34ac0 <__GI___libc_sigaction>
       nop
    10:mov    _DYNAMIC+0x2e8,%rax
       movl   $0x16,%fs:(%rax)
       mov    $0xffffffff,%eax
       retq

Signed-off-by: Ravi Bangoria 
Cc: Alexander Shishkin 
Cc: Chris Riyder 
Cc: Kim Phillips 
Cc: Markus Trippelsdorf 
Cc: Masami Hiramatsu 
Cc: Naveen N. Rao 
Cc: Peter Zijlstra 
Cc: Taeung Song 
Cc: linuxppc-dev@lists.ozlabs.org
Link: 
http://lkml.kernel.org/r/1480953407-7605-3-git-send-email-ravi.bango...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/ui/browsers/annotate.c |  5 +++--
 tools/perf/util/annotate.c| 14 +-
 tools/perf/util/annotate.h|  5 +++--
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/tools/perf/ui/browsers/annotate.c b/tools/perf/ui/browsers/annotate.c
index ec7a30fad149..ba36aac340bc 100644
--- a/tools/perf/ui/browsers/annotate.c
+++ b/tools/perf/ui/browsers/annotate.c
@@ -215,7 +215,7 @@ static void annotate_browser__write(struct ui_browser *browser, void *entry, int
ui_browser__set_color(browser, color);
if (dl->ins.ops && dl->ins.ops->scnprintf) {
if (ins__is_jump(>ins)) {
-   bool fwd = dl->ops.target.offset > (u64)dl->offset;
+   bool fwd = dl->ops.target.offset > dl->offset;
 
ui_browser__write_graph(browser, fwd ? SLSMG_DARROW_CHAR :
   SLSMG_UARROW_CHAR);
@@ -245,7 +245,8 @@ static bool disasm_line__is_valid_jump(struct disasm_line *dl, struct symbol *sy
 {
if (!dl || !dl->ins.ops || !ins__is_jump(>ins)
|| !disasm_line__has_offset(dl)
-   || dl->ops.target.offset >= symbol__size(sym))
+   || dl->ops.target.offset < 0
+   || dl->ops.target.offset >= (s64)symbol__size(sym))
return false;
 
return true;
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 590244e5781e..c81a3950a7fe 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -230,10 +230,12 @@ static int jump__parse(struct arch *arch __maybe_unused, struct ins_operands *op
else
ops->target.addr = strtoull(ops->raw, NULL, 16);
 
-   if (s++ != NULL)
+   if (s++ != NULL) {
ops->target.offset = strtoull(s, NULL, 16);
-   else
-   ops->target.offset = UINT64_MAX;
+   ops->target.offset_avail = true;
+   } else {
+   ops->target.offset_avail = false;
+   }
 
return 0;
 }
@@ -241,7 +243,7 @@ static int jump__parse(struct arch *arch __maybe_unused, struct ins_operands *op
 static int jump__scnprintf(struct ins *ins, char *bf, size_t size,
   struct ins_operands *ops)
 {
-   if (!ops->target.addr)
+   if (!ops->target.addr || ops->target.offset < 0)
return ins__raw_scnprintf(ins, bf, size, ops);
 
return scnprintf(bf, size, "%-6.6s %" PRIx64, ins->name, ops->target.offset);
@@ -1209,9 +1211,11 @@ static int symbol__parse_objdump_line(struct symbol *sym, struct map *map,
if (dl == NULL)

[PATCH 14/29] perf annotate: Support jump instruction with target as second operand

2016-12-20 Thread Arnaldo Carvalho de Melo
From: Ravi Bangoria 

Architectures like PowerPC have jump instructions that include a target
address as a second operand. For example, 'bne cr7,0xc0000000000f6154'.
Add support for such instructions in perf annotate.

objdump o/p:
  c0000000000f6140:   ld     r9,1032(r31)
  c0000000000f6144:   cmpdi  cr7,r9,0
  c0000000000f6148:   bne    cr7,0xc0000000000f6154
  c0000000000f614c:   ld     r9,2312(r30)
  c0000000000f6150:   std    r9,1032(r31)
  c0000000000f6154:   ld     r9,88(r31)

Corresponding perf annotate o/p:

Before patch:
     ld     r9,1032(r31)
     cmpdi  cr7,r9,0
  v  bne    3ff09f2c
     ld     r9,2312(r30)
     std    r9,1032(r31)
  74:ld     r9,88(r31)

After patch:
     ld     r9,1032(r31)
     cmpdi  cr7,r9,0
  v  bne    74
     ld     r9,2312(r30)
     std    r9,1032(r31)
  74:ld     r9,88(r31)

Signed-off-by: Ravi Bangoria 
Cc: Alexander Shishkin 
Cc: Chris Riyder 
Cc: Kim Phillips 
Cc: Markus Trippelsdorf 
Cc: Masami Hiramatsu 
Cc: Naveen N. Rao 
Cc: Peter Zijlstra 
Cc: Taeung Song 
Cc: linuxppc-dev@lists.ozlabs.org
Link: 
http://lkml.kernel.org/r/1480953407-7605-2-git-send-email-ravi.bango...@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/annotate.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index ea7e0de4b9c1..590244e5781e 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -223,8 +223,12 @@ bool ins__is_call(const struct ins *ins)
 static int jump__parse(struct arch *arch __maybe_unused, struct ins_operands 
*ops, struct map *map __maybe_unused)
 {
const char *s = strchr(ops->raw, '+');
+   const char *c = strchr(ops->raw, ',');
 
-   ops->target.addr = strtoull(ops->raw, NULL, 16);
+   if (c++ != NULL)
+   ops->target.addr = strtoull(c, NULL, 16);
+   else
+   ops->target.addr = strtoull(ops->raw, NULL, 16);
 
if (s++ != NULL)
ops->target.offset = strtoull(s, NULL, 16);
-- 
2.9.3



Re: [PATCH v3 3/5] powerpc/mm: add radix__create_section_mapping()

2016-12-20 Thread Reza Arbab

On Tue, Dec 20, 2016 at 05:28:40PM +1100, Balbir Singh wrote:

+#ifdef CONFIG_MEMORY_HOTPLUG
+int radix__create_section_mapping(unsigned long start, unsigned long end)
+{
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_linear_psize].shift;


Can we refactor bits from radix_init_pgtable() and reuse?


Yes, that's my plan for v4.

--
Reza Arbab



Re: [PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor

2016-12-20 Thread Aneesh Kumar K.V
Paul Mackerras  writes:

> Currently, if the kernel is running on a POWER9 processor under a
> hypervisor, it will try to use the radix MMU even though it doesn't
> have the necessary code to use radix under a hypervisor (it doesn't
> negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
> hcall).  The result is that the guest kernel will crash when it tries
> to turn on the MMU, because it will still actually be using the HPT
> MMU, but it won't have set up any SLB or HPT entries.  It does this
> because the only thing that the kernel looks at in deciding to use
> radix, on any platform, is the ibm,pa-features property on the cpu
> device nodes.
>
> This fixes it by looking for the /chosen/ibm,architecture-vec-5
> property, and if it exists, clearing the radix MMU feature bit.
> We do this before we decide whether to initialize for radix or HPT.
> This property is created by the hypervisor as a result of the guest
> calling the ibm,client-architecture-support method to indicate
> its capabilities, so it only exists on systems with a hypervisor.
> The reason for using this property is that in future, when we
> have support for using radix under a hypervisor, we will need
> to check this property to see whether the hypervisor agreed to
> us using radix.

A hypervisor that doesn't support radix should clear the ibm,pa-features
radix feature bit, right? We look at that before setting
MMU_FTR_TYPE_RADIX. So how did we end up enabling radix in the above
case?


>
> Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to enable Radix MMU")
> Cc: sta...@vger.kernel.org # v4.7+
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/mm/init_64.c | 27 +++
>  1 file changed, 27 insertions(+)
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a000c35..098531d 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -42,6 +42,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include 
>  #include 
> @@ -344,6 +346,28 @@ static int __init parse_disable_radix(char *p)
>  }
>  early_param("disable_radix", parse_disable_radix);
>
> +/*
> + * If we're running under a hypervisor, we currently can't do radix
> + * since we don't have the code to do the H_REGISTER_PROC_TBL hcall.
> + * We can tell that we're running under a hypervisor by looking for the
> + * /chosen/ibm,architecture-vec-5 property.
> + */
> +static void early_check_vec5(void)
> +{
> + unsigned long root, chosen;
> + int size;
> + const u8 *vec5;
> +
> + root = of_get_flat_dt_root();
> + chosen = of_get_flat_dt_subnode_by_name(root, "chosen");
> + if (chosen == -FDT_ERR_NOTFOUND)
> + return;
> + vec5 = of_get_flat_dt_prop(chosen, "ibm,architecture-vec-5", &size);
> + if (!vec5)
> + return;
> + cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
> +}
> +
>  void __init mmu_early_init_devtree(void)
>  {
>   /* Disable radix mode based on kernel command line. */
> @@ -351,6 +375,9 @@ void __init mmu_early_init_devtree(void)
>   cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
>
>   if (early_radix_enabled())
> + early_check_vec5();
> +
> + if (early_radix_enabled())
>   radix__early_init_devtree();
>   else
>   hash__early_init_devtree();
> -- 
> 2.7.4



[PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails

2016-12-20 Thread Vipin K Parashar
Added a check for the OPAL_WRONG_STATE error code returned from OPAL.
Currently Linux flashes "unexpected error" on the console for this
error. This avoids throwing such a message and returns an I/O error
for such OPAL failures.
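
For context, a hedged usage sketch of how a caller would consume
opal_error_code() (the OPAL wrapper named below is hypothetical, for
illustration only):

	int64_t rc = opal_some_call();	/* hypothetical OPAL wrapper */

	if (rc != OPAL_SUCCESS)
		return opal_error_code(rc);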

Signed-off-by: Vipin K Parashar 
---
 arch/powerpc/platforms/powernv/opal.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 2822935..ab91d53 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -866,9 +866,10 @@ int opal_error_code(int rc)
case OPAL_NO_MEM:   return -ENOMEM;
case OPAL_PERMISSION:   return -EPERM;
 
-   case OPAL_UNSUPPORTED:  return -EIO;
-   case OPAL_HARDWARE: return -EIO;
-   case OPAL_INTERNAL_ERROR:   return -EIO;
+   case OPAL_UNSUPPORTED:
+   case OPAL_HARDWARE:
+   case OPAL_INTERNAL_ERROR:
+   case OPAL_WRONG_STATE:  return -EIO;
default:
pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
return -EIO;
-- 
2.7.4



[PATCH] powerpc/perf/24x7: use rb_entry

2016-12-20 Thread Geliang Tang
To make the code clearer, use rb_entry() instead of container_of() to
deal with the rbtree.
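
For context, rb_entry() is itself defined in terms of container_of() in
include/linux/rbtree.h, so the change is purely cosmetic but states the
intent (rbtree node to containing structure) more directly:

	/* include/linux/rbtree.h */
	#define rb_entry(ptr, type, member) container_of(ptr, type, member)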

Signed-off-by: Geliang Tang 
---
 arch/powerpc/perf/hv-24x7.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 7b2ca16..51bd3b4 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -547,7 +547,7 @@ static int event_uniq_add(struct rb_root *root, const char *name, int nl,
struct event_uniq *it;
int result;
 
-   it = container_of(*new, struct event_uniq, node);
+   it = rb_entry(*new, struct event_uniq, node);
result = ev_uniq_ord(name, nl, domain, it->name, it->nl,
it->domain);
 
-- 
2.9.3



Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename

2016-12-20 Thread Balbir Singh


On 20/12/16 20:32, Benjamin Herrenschmidt wrote:
> On Tue, 2016-12-20 at 20:03 +1100, Balbir Singh wrote:
>>
>> On 19/12/16 19:06, Madhavan Srinivasan wrote:
>>> Move set_soft_enabled() from powerpc/kernel/irq.c to
>>> asm/hw_irq.h, to force updates to paca->soft_enabled to be
>>> done via these access functions. Add a "memory" clobber
>>> to hint to the compiler that paca->soft_enabled memory is
>>> the target here.
>>> +static inline notrace void soft_enabled_set(unsigned long enable)
>>> +{
>>> +   __asm__ __volatile__("stb %0,%1(13)"
>>> +   : : "r" (enable), "i" (offsetof(struct paca_struct,
>>> soft_enabled))
>>> +   : "memory");
>>> +}
>>> +
>>
>> Can't we just rewrite this in "C"?
>>
>> local_paca->soft_enabled = enable
> 
> Well, this is digging out another bloody can of baby eating worms...
> 
> So the reason we wrote it like that is because we had gcc playing
> tricks on us.
> 
> It has been proven to move references to local_paca around, even
> across preempt_disable boundaries.
> 
> The above prevents it.
> 
> I'm about 100% sure we have a bunch of other issues related to PACA
> access and gcc playing games, though. I remember grepping the kernel
> disassembly once for gcc doing funny things with r13, such as copying it
> into another register or even saving and restoring it, and my grep
> didn't come up empty ...
> 


I checked arch_local_irq_restore() before suggesting it, not all
functions. I did a quick audit of the disassembly for loads/stores
not using r13, and most of it was in hand-written assembly in the
exception handling paths, secondary processor initialization, and
wakeup from idle. Having said that, I audited with one version of gcc
on a little-endian vmlinux.


> Something we need to seriously look into one day. I have some ideas on
> how to clean that up in the code to make hardening it easier, let's
> talk next time Nick is in town.
> 

Sure, let's leave it as-is for now.

Balbir


[PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor

2016-12-20 Thread Paul Mackerras
Currently, if the kernel is running on a POWER9 processor under a
hypervisor, it will try to use the radix MMU even though it doesn't
have the necessary code to use radix under a hypervisor (it doesn't
negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
hcall).  The result is that the guest kernel will crash when it tries
to turn on the MMU, because it will still actually be using the HPT
MMU, but it won't have set up any SLB or HPT entries.  It does this
because the only thing that the kernel looks at in deciding to use
radix, on any platform, is the ibm,pa-features property on the cpu
device nodes.

This fixes it by looking for the /chosen/ibm,architecture-vec-5
property, and if it exists, clearing the radix MMU feature bit.
We do this before we decide whether to initialize for radix or HPT.
This property is created by the hypervisor as a result of the guest
calling the ibm,client-architecture-support method to indicate
its capabilities, so it only exists on systems with a hypervisor.
The reason for using this property is that in future, when we
have support for using radix under a hypervisor, we will need
to check this property to see whether the hypervisor agreed to
us using radix.

Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to enable Radix MMU")
Cc: sta...@vger.kernel.org # v4.7+
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/mm/init_64.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a000c35..098531d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -42,6 +42,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -344,6 +346,28 @@ static int __init parse_disable_radix(char *p)
 }
 early_param("disable_radix", parse_disable_radix);
 
+/*
+ * If we're running under a hypervisor, we currently can't do radix
+ * since we don't have the code to do the H_REGISTER_PROC_TBL hcall.
+ * We can tell that we're running under a hypervisor by looking for the
+ * /chosen/ibm,architecture-vec-5 property.
+ */
+static void early_check_vec5(void)
+{
+   unsigned long root, chosen;
+   int size;
+   const u8 *vec5;
+
+   root = of_get_flat_dt_root();
+   chosen = of_get_flat_dt_subnode_by_name(root, "chosen");
+   if (chosen == -FDT_ERR_NOTFOUND)
+   return;
+   vec5 = of_get_flat_dt_prop(chosen, "ibm,architecture-vec-5", &size);
+   if (!vec5)
+   return;
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+}
+
 void __init mmu_early_init_devtree(void)
 {
/* Disable radix mode based on kernel command line. */
@@ -351,6 +375,9 @@ void __init mmu_early_init_devtree(void)
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
 
if (early_radix_enabled())
+   early_check_vec5();
+
+   if (early_radix_enabled())
radix__early_init_devtree();
else
hash__early_init_devtree();
-- 
2.7.4



Re: [PATCH v3 01/15] stacktrace/x86: add function for detecting reliable stack traces

2016-12-20 Thread Petr Mladek
On Mon 2016-12-19 11:25:49, Josh Poimboeuf wrote:
> On Mon, Dec 19, 2016 at 05:25:19PM +0100, Miroslav Benes wrote:
> > On Thu, 8 Dec 2016, Josh Poimboeuf wrote:
> > 
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index 215612c..b4a6663 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -155,6 +155,7 @@ config X86
> > >   select HAVE_PERF_REGS
> > >   select HAVE_PERF_USER_STACK_DUMP
> > >   select HAVE_REGS_AND_STACK_ACCESS_API
> > > + select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
> > 
> > Tests to measure the possible performance penalty of frame pointers were
> > done by Mel Gorman. The outcome was quite clear. There IS a measurable
> > impact. The percentage depends on the workload, but I think it is safe to
> > say that FP usually takes 5-10 percent.
> > 
> > If my understanding is correct, there is no single culprit. Register 
> > pressure is definitely not a problem. We ran simple benchmarks while 
> > taking a register away from GCC (RBP or a common one). The impact is a 
> > combination of more cacheline pressure, more memory accesses and the fact 
> > that the kernel contains a lot of small functions.
> > 
> > Thus, I think that DWARF should be the way to go here.
> > 
> > Other than that the patch looks good to me.
> 
> I agree that DWARF is generally a good idea, and I'm working toward it.
> However there's still quite a bit of work to get there.
> 
> For this consistency model to work with DWARF on x86, we would need:
> 
> 1) a reliable x86 DWARF unwinder with Linus's blessing
> 2) objtool DWARF support (I'm working on this at the moment)
> 3) probably some kind of runtime NMI stack checking feature to
>complement objtool, along with a lot of burn time to ensure there are
>no issues, particularly in entry code

Could you please provide more details about this NMI stack checking?
What is it supposed to protect that objtool could not?
Will it run regularly or will it be just a random check?

The other points are obvious. But I do not know what to
think about this NMI thing. And so I am curious :-)


> 4) port save_stack_trace_tsk_reliable() to work with DWARF

Best Regards,
Petr


Re: [PATCH] drivers/pci/hotplug: Handle presence detection change properly

2016-12-20 Thread Greg KH
On Tue, Dec 20, 2016 at 05:49:33PM +1100, Gavin Shan wrote:
> The surprise hotplug is driven by interrupt in the PowerNV PCI hotplug
> driver. In the interrupt handler, pnv_php_interrupt(), we wrongly bail
> when pnv_pci_get_presence_state() returns zero. It causes the
> presence change event to always be ignored.
> 
> This fixes the issue by bailing on an error (non-zero value) returned
> from pnv_pci_get_presence_state().
> 
> Fixes: 360aebd85a4 ("drivers/pci/hotplug: Support surprise hotplug in powernv driver")
> Cc: sta...@vger.kernel.org #v4.9+
> Reported-by: Hank Chang 
> Signed-off-by: Gavin Shan 
> Tested-by: Willie Liauw 
> ---
>  drivers/pci/hotplug/pnv_php.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index 56efaf7..38a2309 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -707,8 +707,12 @@ static irqreturn_t pnv_php_interrupt(int irq, void *data)
>   added = !!(lsts & PCI_EXP_LNKSTA_DLLLA);
>   } else if (sts & PCI_EXP_SLTSTA_PDC) {
>   ret = pnv_pci_get_presence_state(php_slot->id, &presence);
> - if (!ret)
> + if (ret) {
> + dev_warn(&pdev->dev, "PCI slot [%s] error %d handling PDC event (0x%04x)\n",
> + php_slot->name, ret, sts);

What can a user do with this warning?



Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename

2016-12-20 Thread Benjamin Herrenschmidt
On Tue, 2016-12-20 at 20:03 +1100, Balbir Singh wrote:
> 
> On 19/12/16 19:06, Madhavan Srinivasan wrote:
> > Move set_soft_enabled() from powerpc/kernel/irq.c to
> > asm/hw_irq.h, to force updates to paca->soft_enabled to be
> > done via these access functions. Add a "memory" clobber
> > to hint to the compiler that paca->soft_enabled memory is
> > the target here.
> > +static inline notrace void soft_enabled_set(unsigned long enable)
> > +{
> > +   __asm__ __volatile__("stb %0,%1(13)"
> > +   : : "r" (enable), "i" (offsetof(struct paca_struct,
> > soft_enabled))
> > +   : "memory");
> > +}
> > +
> 
> Can't we just rewrite this in "C"?
> 
> local_paca->soft_enabled = enable

Well, this is digging out another bloody can of baby eating worms...

So the reason we wrote it like that is because we had gcc playing
tricks on us.

It has been proven to move references to local_paca around, even
across preempt_disable boundaries.

The above prevents it.

I'm about 100% sure we have a bunch of other issues related to PACA
access and gcc playing games, though. I remember grepping the kernel
disassembly once for gcc doing funny things with r13, such as copying it
into another register or even saving and restoring it, and my grep
didn't come up empty ...

Something we need to seriously look into one day. I have some ideas on
how to clean that up in the code to make hardening it easier, let's
talk next time Nick is in town.

Cheers,
Ben.



Re: [PATCH v4 02/12] powerpc: move set_soft_enabled() and rename

2016-12-20 Thread Balbir Singh


On 19/12/16 19:06, Madhavan Srinivasan wrote:
> Move set_soft_enabled() from powerpc/kernel/irq.c to
> asm/hw_irq.h, to force updates to paca->soft_enabled to be
> done via these access functions. Add a "memory" clobber
> to hint to the compiler that paca->soft_enabled memory is
> the target here.
> +static inline notrace void soft_enabled_set(unsigned long enable)
> +{
> + __asm__ __volatile__("stb %0,%1(13)"
> + : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))
> + : "memory");
> +}
> +

Can't we just rewrite this in "C"?

local_paca->soft_enabled = enable

Balbir Singh.