date:20170418

[RFC] arch/powerpc: Turn off irqs in switch_mm()

2017-04-18 Thread David Gibson

There seems to be a mismatch in expectations between the powerpc arch code
and the generic (and x86) code in terms of the irq state when switch_mm()
is called.

powerpc expects irqs to already be (soft) disabled when switch_mm() is
called, as made clear in the commit message of 9c1e105 "powerpc: Allow
perf_counters to access user memory at interrupt time".

That seems to be true when it's called from the schedule, but not for
use_mm().  This becomes clear when looking at the x86 code paths for
switch_mm().  There, switch_mm() itself disable irqs, with a
switch_mm_irqs_off() variant which expects that to be already done.

This patch addresses the problem, making the powerpc code mirror the x86
code.

Signed-off-by: David Gibson 
---
 arch/powerpc/include/asm/mmu_context.h | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

RH-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1437794

It seems that some more recent changes in vhost have made it more
likely to hit this problem, triggering a WARN.

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index b9e3f0a..0012f03 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -70,8 +70,9 @@ extern void drop_cop(unsigned long acop, struct mm_struct 
*mm);
  * switch_mm is the entry point called from the architecture independent
  * code in kernel/sched/core.c
  */
-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
-struct task_struct *tsk)
+static inline void switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
 {
/* Mark this context has been used on the new CPU */
if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next)))
@@ -110,6 +111,18 @@ static inline void switch_mm(struct mm_struct *prev, 
struct mm_struct *next,
switch_mmu_context(prev, next, tsk);
 }
 
+static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+struct task_struct *tsk)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   switch_mm_irqs_off(prev, next, tsk);
+   local_irq_restore(flags);
+}
+#define switch_mm_irqs_off switch_mm_irqs_off
+
+
 #define deactivate_mm(tsk,mm)  do { } while (0)
 
 /*
-- 
2.9.3

[PATCH] powerpc/crypto/crct10dif-vpmsum: Fix missing preempt_disable()

2017-04-18 Thread Michael Ellerman

In crct10dif_vpmsum() we call enable_kernel_altivec() without first
disabling preemption, which is not allowed.

It used to be sufficient just to call pagefault_disable(), because that
also disabled preemption. But the two were decoupled in commit 8222dbe21e79
("sched/preempt, mm/fault: Decouple preemption from the page fault
logic") in mid 2015.

The crct10dif-vpmsum code inherited this bug from the crc32c-vpmsum code
on which it was modelled.

So add the missing preempt_disable/enable(). We should also call
disable_kernel_fp(), although it does nothing by default, there is a
debug switch to make it active and all enables should be paired with
disables.

Fixes: b01df1c16c9a ("crypto: powerpc - Add CRC-T10DIF acceleration")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/crypto/crct10dif-vpmsum_glue.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/crypto/crct10dif-vpmsum_glue.c 
b/arch/powerpc/crypto/crct10dif-vpmsum_glue.c
index bebfc329f746..02ea277863d1 100644
--- a/arch/powerpc/crypto/crct10dif-vpmsum_glue.c
+++ b/arch/powerpc/crypto/crct10dif-vpmsum_glue.c
@@ -44,10 +44,13 @@ static u16 crct10dif_vpmsum(u16 crci, unsigned char const 
*p, size_t len)
 
if (len & ~VMX_ALIGN_MASK) {
crc <<= 16;
+   preempt_disable();
pagefault_disable();
enable_kernel_altivec();
crc = __crct10dif_vpmsum(crc, p, len & ~VMX_ALIGN_MASK);
+   disable_kernel_altivec();
pagefault_enable();
+   preempt_enable();
crc >>= 16;
}
 
-- 
2.7.4

Re: [PATCH v3 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-04-18 Thread Michael Ellerman

Peter Zijlstra  writes:

> On Tue, Apr 11, 2017 at 07:21:05AM +0530, Madhavan Srinivasan wrote:
>> From: Sukadev Bhattiprolu 
>> 
>> perf_mem_data_src is an union that is initialized via the ->val field
>> and accessed via the bitmap fields. For this to work on big endian
>> platforms (Which is broken now), we also need a big-endian represenation
>> of perf_mem_data_src. i.e, in a big endian system, if user request
>> PERF_SAMPLE_DATA_SRC (perf report -d), will get the default value from
>> perf_sample_data_init(), which is PERF_MEM_NA. Value for PERF_MEM_NA
>> is constructed using shifts:
>> 
>>   /* TLB access */
>>   #define PERF_MEM_TLB_NA0x01 /* not available */
>>   ...
>>   #define PERF_MEM_TLB_SHIFT 26
>> 
>>   #define PERF_MEM_S(a, s) \
>>  (((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
>> 
>>   #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
>>  PERF_MEM_S(LVL, NA)   |\
>>  PERF_MEM_S(SNOOP, NA) |\
>>  PERF_MEM_S(LOCK, NA)  |\
>>  PERF_MEM_S(TLB, NA))
>> 
>> Which works out as:
>> 
>>   ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))
>> 
>> Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
>> in CPU endian.
>> 
>> But then in the perf tool, the code uses the bitfields to inspect the
>> value, and currently the bitfields are defined using little endian
>> ordering.
>> 
>> So eg. in perf_mem__tlb_scnprintf() we see:
>>   data_src->val = 0x5080021
>>  op = 0x0
>> lvl = 0x0
>>   snoop = 0x0
>>lock = 0x0
>>dtlb = 0x0
>>rsvd = 0x5080021
>> 
>> Patch does a minimal fix of adding big endian definition of the bitfields
>> to match the values that are already exported by the kernel on big endian.
>> And it makes no change on little endian.
>
> I think it is important to note that there are no current big-endian
> users. So 'fixing' this will not break anybody and will ensure future
> users (next patch) will work correctly.

Actually that's only partly true. As I describe above the PERF_MEM_NA
value is currently exported on BE platforms when a user requests it.

So I added this text after the output from perf_mem__tlb_scnprintf():

  Because of the way the perf tool code is written this is still displayed to 
the
  user as "N/A", so there is no bug visible at the UI level.
  
  Currently there are no big endian architectures which export a meaningful
  value (ie. other than PERF_MEM_NA), so the extent of the bug on big endian
  platforms is that the PERF_MEM_NA value is exported incorrectly as described
  above. Subsequent patches will add support on big endian powerpc for 
populating
  the data source value.


Hope that is clear.

It also occurred to me that we don't actually have to redefine the whole
union, it's only the bitfields that matter, so we could reduce the diff
to:

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index c66a485a24ac..97152c79df6b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -894,12 +894,23 @@ enum perf_callchain_context {
 union perf_mem_data_src {
__u64 val;
struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
__u64   mem_op:5,   /* type of opcode */
mem_lvl:14, /* memory hierarchy level */
mem_snoop:5,/* snoop mode */
mem_lock:2, /* lock instr */
mem_dtlb:7, /* tlb access */
mem_rsvd:31;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+   __u64   mem_rsvd:31,
+   mem_dtlb:7, /* tlb access */
+   mem_lock:2, /* lock instr */
+   mem_snoop:5,/* snoop mode */
+   mem_lvl:14, /* memory hierarchy level */
+   mem_op:5;   /* type of opcode */
+#else
+#error "Unknown endianness"
+#endif
};
 };
 

That looks better to me, thoughts?

cheers

[PATCH v2] powerpc/powernv: Block PCI config access on BCM5718 during EEH recovery

2017-04-18 Thread Gavin Shan

Similar to what is done in commit b6541db13952 ("powerpc/eeh: Block
PCI config access upon frozen PE"), we need block PCI config access
for BCM5719 when recovering frozen error on them. Otherwise, an
unexpected recursive fenced PHB error is observed.

   0001:06:00.0 Ethernet controller: Broadcom Corporation \
NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
   0001:06:00.1 Ethernet controller: Broadcom Corporation \
NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)

Signed-off-by: Gavin Shan 
---
v2: s/BCM5719/BCM5718 in the subject
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 6fb5522..55a3c5f 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -412,11 +412,14 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
 * been set for the PE, we will set EEH_PE_CFG_BLOCKED for
 * that PE to block its config space.
 *
+* Broadcom BCM5718 2-ports NICs (14e4:1656)
 * Broadcom Austin 4-ports NICs (14e4:1657)
 * Broadcom Shiner 4-ports 1G NICs (14e4:168a)
 * Broadcom Shiner 2-ports 10G NICs (14e4:168e)
 */
if ((pdn->vendor_id == PCI_VENDOR_ID_BROADCOM &&
+pdn->device_id == 0x1656) ||
+   (pdn->vendor_id == PCI_VENDOR_ID_BROADCOM &&
 pdn->device_id == 0x1657) ||
(pdn->vendor_id == PCI_VENDOR_ID_BROADCOM &&
 pdn->device_id == 0x168a) ||
-- 
2.7.4

Re: [PATCH v2 1/2] fadump: reduce memory consumption for capture kernel

2017-04-18 Thread Michael Ellerman

Michal Suchánek  writes:
> On Mon, 17 Apr 2017 20:43:02 +0530
> Hari Bathini  wrote:
>> On Friday 14 April 2017 01:28 AM, Michal Suchánek wrote:
>> > more (optional) properties cannot be added?  
>> 
>> Kernel change seems simple over f/w enhancement..
>
> That certainly looks so when you are a kernel developer and can
> implement the change yourself compared to convincing some firmware
> developer that this feature makes sense.
>
> On the other hand, the proposed kernel-only solution introduces
> requirement that the maintainer does not like.
>
> For the platform as a whole does it make more sense to add a hack to
> the kernel or does it make sense to enhance the firmware to provide
> more options for firmware-assisted dump?

Unfortunately it doesn't really matter, because there is firmware out
there that implements the current behaviour and will never be updated.
So we have to work with what's there.

I don't think fadump_append is a hack, though it's not pretty I'll
admit.

cheers

Re: [PATCH v4 4/5] perf report: Show branch type statistics for stdio mode

2017-04-18 Thread Jin, Yao




On 4/19/2017 8:53 AM, Jin, Yao wrote:



On 4/19/2017 2:53 AM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:05AM +0800, Jin Yao wrote:

SNIP


+const char *branch_type_name(int type)
+{
+const char *branch_names[PERF_BR_MAX] = {
+"N/A",
+"JCC",
+"JMP",
+"IND_JMP",
+"CALL",
+"IND_CALL",
+"RET",
+"SYSCALL",
+"SYSRET",
+"IRQ",
+"INT",
+"IRET",
+"FAR_BRANCH",
+};
+
+if ((type >= 0) && (type < PERF_BR_MAX))
+return branch_names[type];
+
+return NULL;

looks like we should add util/branch.c with above functions
and merge it with util/parse-branch-options.c

we create new file even for less code ;-)

thanks,
jirka


Could we directly add branch_type_name() in util/parse-branch-options.c?

I just feel it's a bit waste of creating a new file for less code. :)

Thanks
Jin Yao


After considering again, yes, creating util/branch.c should be better. I 
will do that.


Thanks
Jin Yao

Re: powerpc: Drop include of linux/io.h from asm/io.h

2017-04-18 Thread Michael Ellerman

On Thu, 2017-04-13 at 13:14:36 UTC, Michael Ellerman wrote:
> Currently powerpc's asm/io.h includes linux/io.h, and linux/io.h
> includes asm/io.h.
> 
> This can cause problems because depending on which is included first the
> order of definitions between the two files will change.
> 
> The include of linux/io.h was added back in 2008 in commit b41e5fffe8b8
> ("[POWERPC] devres: Add devm_ioremap_prot()"). It's not entirely clear
> it was needed then, but devm_ioremap_prot() has since been removed
> entirely as unused, in dedd24a12fe6 ("powerpc: Remove unused
> devm_ioremap_prot()").
> 
> So it seems to be unnecessary and can potentially cause problems, so
> remove the include of linux/io.h from asm/io.h
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/590c369e7ecc00be736be39ae0c62d

cheers

Re: [1/4] powerpc: change the doorbell IPI calling convention

2017-04-18 Thread Michael Ellerman

On Thu, 2017-04-13 at 10:16:21 UTC, Nicholas Piggin wrote:
> Change the doorbell callers to know about their msgsnd addressing,
> rather than have them set a per-cpu target data tag at boot that gets
> sent to the cause_ipi functions. The data is only used for doorbell IPI
> functions, no other IPI types, so it makes sense to keep that detail
> local to doorbell.
> 
> Have the platform code understand doorbell IPIs, rather than the
> interrupt controller code understand them. Platform code can look at
> capabilities it has available and decide which to use.
> 
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/b866cc2199d6a6cdcefe4acfe4cfca

cheers

Re: [V4,7/7,remix] cxl: Add psl9 specific code

2017-04-18 Thread Michael Ellerman

On Wed, 2017-04-12 at 14:34:07 UTC, Frederic Barrat wrote:
> From: Christophe Lombard 
> 
> The new Coherent Accelerator Interface Architecture, level 2, for the
> IBM POWER9 brings new content and features:
> - POWER9 Service Layer
> - Registers
> - Radix mode
> - Process element entry
> - Dedicated-Shared Process Programming Model
> - Translation Fault Handling
> - CAPP
> - Memory Context ID
> If a valid mm_struct is found the memory context id is used for each
> transaction associated with the process handle. The PSL uses the
> context ID to find the corresponding process element.
> 
> Signed-off-by: Christophe Lombard 
> Acked-by: Frederic Barrat 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f24be42aab37c6d07c05126673138e

cheers

Re: powerpc/mm/hash: don't opencode VMALLOC_INDEX

2017-04-18 Thread Michael Ellerman

On Wed, 2017-04-12 at 04:40:22 UTC, "Aneesh Kumar K.V" wrote:
> Also remove wrong indentation to fix checkpatch.pl warning.
> 
> Signed-off-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/5f8122611b35c563a7c907485aafd8

cheers

Re: [V4,1/7] cxl: Read vsec perst load image

2017-04-18 Thread Michael Ellerman

On Fri, 2017-04-07 at 14:11:53 UTC, Christophe Lombard wrote:
> This bit is used to cause a flash image load for programmable
> CAIA-compliant implementation. If this bit is set to â0â, a power
> cycle of the adapter is required to load a programmable CAIA-com-
> pliant implementation from flash.
> This field will be used by the following patches.
> 
> Signed-off-by: Christophe Lombard 
> Reviewed-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/aba81433b50350fde68bf80fe9f75d

cheers

Re: [1/2] powerpc/64s: Add msgp facility unavailable log string

2017-04-18 Thread Michael Ellerman

On Fri, 2017-04-07 at 01:27:43 UTC, Nicholas Piggin wrote:
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/794464f4dea0b13dacad267c06a01f

cheers

Re: [RFC,1/3] powerpc: Allow platforms to force-enable CONFIG_SMP

2017-04-18 Thread Michael Ellerman

On Wed, 2017-04-05 at 02:44:49 UTC, Michael Ellerman wrote:
> Of the 64-bit Book3S platforms, only powermac supports booting on an
> actual non-SMP system. The other platforms can be built with SMP
> disabled, but it doesn't make a lot of sense given the CPUs they support
> are all multicore or multithreaded.
> 
> So give platforms the option of forcing SMP=y.
> 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/ebbe9d7d3a2ca0d62f1a2c08c7e7a3

cheers

Re: [PATCH] powerpc/powernv: Block PCI config access on BCM5719 during EEH recovery

2017-04-18 Thread Gavin Shan

On Wed, Apr 19, 2017 at 12:39:43PM +1000, Gavin Shan wrote:
>Similar to what is done in commit b6541db13952 ("powerpc/eeh: Block
>PCI config access upon frozen PE"), we need block PCI config access
>for BCM5719 when recovering frozen error on them. Otherwise, an
>unexpected recursive fenced PHB error is observed.
>
>   0001:06:00.0 Ethernet controller: Broadcom Corporation \
>NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
>   0001:06:00.1 Ethernet controller: Broadcom Corporation \
>NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
>

It says BCM5719 in the subject, but it should be BCM5718. I'll
send an updated one for that.

Thanks,
Gavin

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Frank Rowand

On 04/18/17 19:46, Steven Rostedt wrote:
> On Tue, 18 Apr 2017 17:07:17 -0700
> Frank Rowand  wrote:
> 
> 
>> As far as I know, there is no easy way to combine trace data and printk()
>> style data to create a single chronology of events.  If some of the
>> information needed to debug an issue is trace data and some is printk()
>> style data then it becomes more difficult to understand the overall
>> situation.
> 
> You mean like:
> 
>  # echo 1 > /sys/kernel/debug/tracing/events/printk/console/enable
> 
> Makes all printks also go into the ftrace ring buffer.

Thanks!  I was hoping there was going to be an easy answer like this.


> -- Steve
> 
>>
>> If Rob wants to convert printk() style data to trace data (and I can't
>> convince him otherwise) then I will have further comments on this specific
>> patch.
>>
> .
>

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Steven Rostedt

On Tue, 18 Apr 2017 18:42:32 -0700
Frank Rowand  wrote:

> And of course the other issue with using tracepoints is the extra space
> required to hold the tracepoint info.  With the pr_debug() approach, the
> space usage can be easily removed for a production kernel via a config
> option.

Now if you are saying you want to be able to enable debugging without
the tracing infrastructure I would agree. As the tracing infrastructure
is large. But I'm working on shrinking it more.

> 
> Tracepoints are wonderful technology, but not always the proper tool to
> use for debug info.

But if you are going to have tracing enabled regardless, adding a few
more tracepoints isn't going to make the difference.

-- Steve

> 
> > If Rob wants to convert printk() style data to trace data (and I can't
> > convince him otherwise) then I will have further comments on this specific
> > patch.
> >

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Steven Rostedt

On Tue, 18 Apr 2017 17:07:17 -0700
Frank Rowand  wrote:


> As far as I know, there is no easy way to combine trace data and printk()
> style data to create a single chronology of events.  If some of the
> information needed to debug an issue is trace data and some is printk()
> style data then it becomes more difficult to understand the overall
> situation.

You mean like:

 # echo 1 > /sys/kernel/debug/tracing/events/printk/console/enable

Makes all printks also go into the ftrace ring buffer.

-- Steve

> 
> If Rob wants to convert printk() style data to trace data (and I can't
> convince him otherwise) then I will have further comments on this specific
> patch.
>

[PATCH] powerpc/powernv: Block PCI config access on BCM5719 during EEH recovery

2017-04-18 Thread Gavin Shan

Similar to what is done in commit b6541db13952 ("powerpc/eeh: Block
PCI config access upon frozen PE"), we need block PCI config access
for BCM5719 when recovering frozen error on them. Otherwise, an
unexpected recursive fenced PHB error is observed.

   0001:06:00.0 Ethernet controller: Broadcom Corporation \
NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
   0001:06:00.1 Ethernet controller: Broadcom Corporation \
NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)

Signed-off-by: Gavin Shan 
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 6fb5522..55a3c5f 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -412,11 +412,14 @@ static void *pnv_eeh_probe(struct pci_dn *pdn, void *data)
 * been set for the PE, we will set EEH_PE_CFG_BLOCKED for
 * that PE to block its config space.
 *
+* Broadcom BCM5718 2-ports NICs (14e4:1656)
 * Broadcom Austin 4-ports NICs (14e4:1657)
 * Broadcom Shiner 4-ports 1G NICs (14e4:168a)
 * Broadcom Shiner 2-ports 10G NICs (14e4:168e)
 */
if ((pdn->vendor_id == PCI_VENDOR_ID_BROADCOM &&
+pdn->device_id == 0x1656) ||
+   (pdn->vendor_id == PCI_VENDOR_ID_BROADCOM &&
 pdn->device_id == 0x1657) ||
(pdn->vendor_id == PCI_VENDOR_ID_BROADCOM &&
 pdn->device_id == 0x168a) ||
-- 
2.7.4

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Frank Rowand

On 04/18/17 18:31, Michael Ellerman wrote:
> Frank Rowand  writes:
> 
>> On 04/17/17 17:32, Tyrel Datwyler wrote:
>>> This patch introduces event tracepoints for tracking a device_nodes
>>> reference cycle as well as reconfig notifications generated in response
>>> to node/property manipulations.
>>>
>>> With the recent upstreaming of the refcount API several device_node
>>> underflows and leaks have come to my attention in the pseries (DLPAR) 
>>> dynamic
>>> logical partitioning code (ie. POWER speak for hotplugging virtual and 
>>> physcial
>>> resources at runtime such as cpus or IOAs). These tracepoints provide a
>>> easy and quick mechanism for validating the reference counting of
>>> device_nodes during their lifetime.
>>>
>>> Further, when pseries lpars are migrated to a different machine we
>>> perform a live update of our device tree to bring it into alignment with the
>>> configuration of the new machine. The of_reconfig_notify trace point
>>> provides a mechanism that can be turned for debuging the device tree
>>> modifications with out having to build a custom kernel to get at the
>>> DEBUG code introduced by commit 00aa3720.
>>
>> I do not like changing individual (or small groups of) printk() style
>> debugging information to tracepoint style.
> 
> I'm not quite sure which printks() you're referring to.
> 
> The only printks that are removed in this series are under #ifdef DEBUG,
> and so are essentially not there unless you build a custom kernel.

Yes, I am talking about pr_debug(), pr_info(), pr_err(), etc.


> 
> They also only cover the reconfig case, which is actually less
> interesting than the much more common and bug-prone get/put logic.

When I was looking at the get/put issue I used pr_debug().


>> As far as I know, there is no easy way to combine trace data and printk()
>> style data to create a single chronology of events.  If some of the
>> information needed to debug an issue is trace data and some is printk()
>> style data then it becomes more difficult to understand the overall
>> situation.
> 
> If you enable CONFIG_PRINTK_TIME then you should be able to just sort
> the trace and the printk output by the timestamp. If you're really
> trying to correlate the two then you should probably just be using
> trace_printk().

Except the existing debug code that uses pr_debug() does not use
trace_printk().

And "just sort" does not apply to multi-line output like:

cpuhp/23-147   [023]    128.324827:
of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
cpuhp/23-147   [023]    128.324829:
of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
cpuhp/23-147   [023]    128.324829:
of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
cpuhp/23-147   [023]    128.324831:
of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
   drmgr-7284  [009]    128.439000:
of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
   drmgr-7284  [009]    128.439002:
of_reconfig_notify: action=DETACH_NODE, 
dn->full_name=/cpus/PowerPC,POWER8@10,
prop->name=null, old_prop->name=null
   drmgr-7284  [009]    128.439015:
of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
   drmgr-7284  [009]    128.439016:
of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4

I was kinda hoping that maybe someone had already created a tool to deal
with this issue.  But not too optimistic.


> But IMO this level of detail, tracing every get/put, does not belong in
> printk. Trace points are absolutely the right solution for this type of
> debugging.
> 
> cheers
> .
>

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Oliver O'Halloran

On Wed, Apr 19, 2017 at 2:46 AM, Rob Herring  wrote:
> On Mon, Apr 17, 2017 at 7:32 PM, Tyrel Datwyler
>  wrote:
>> This patch introduces event tracepoints for tracking a device_nodes
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>>
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR) dynamic
>> logical partitioning code (ie. POWER speak for hotplugging virtual and 
>> physcial
>> resources at runtime such as cpus or IOAs). These tracepoints provide a
>> easy and quick mechanism for validating the reference counting of
>> device_nodes during their lifetime.
>
> Not really relevant for this patch, but since you are looking at
> pseries and refcounting, the refcounting largely exists for pseries.
> It's also hard to get right as this type of fix is fairly common. It's
> now used for overlays, but we really probably only need to refcount
> the overlays or changesets as a whole, not at a node level. If you
> have any thoughts on how a different model of refcounting could work
> for pseries, I'd like to discuss it.

One idea I've been kicking around is differentiating short and long
term references to a node. I figure most leaks are due to a missing
of_node_put() within a stack frame so it might be possible to use the
ftrace infrastructure to detect and emit warnings if a short term
reference is leaked. Long term references are slightly harder to deal
with, but they're less common so we can add more detailed reference
tracking there (devm_of_get_node?).

Oliver

[PATCH] powerpc: POWER9 DD1 remove SAO feature

2017-04-18 Thread Nicholas Piggin

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cputable.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 9c3a44bb4694..456f584952f8 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -476,7 +476,8 @@ enum {
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300)
-#define CPU_FTRS_POWER9_DD1 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1)
+#define CPU_FTRS_POWER9_DD1 ((CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1) & \
+   (~CPU_FTR_SAO))
 #define CPU_FTRS_CELL  (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
-- 
2.11.0

[PATCH] powerpc: POWER9 remove ICSWX feature

2017-04-18 Thread Nicholas Piggin

POWER9 does not implement this instruction.

Fixes: c3ab300ea5 ("powerpc: Add POWER9 cputable entry")

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cputable.h| 2 +-
 arch/powerpc/platforms/Kconfig.cputype | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 4edbc2f7569a..9c3a44bb4694 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -473,7 +473,7 @@ enum {
CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \
CPU_FTR_DSCR | CPU_FTR_SAO  | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-   CPU_FTR_ICSWX | CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
+   CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300)
 #define CPU_FTRS_POWER9_DD1 (CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1)
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index ef4c4b8fc547..3baf821a186d 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -279,7 +279,8 @@ config PPC_ICSWX
 
  This option enables kernel support for the PowerPC Initiate
  Coprocessor Store Word (icswx) coprocessor instruction on POWER7
- or newer processors.
+ and POWER8 processors. POWER9 uses new copy/paste instructions
+ to invoke the coprocessor.
 
  This option is only useful if you have a processor that supports
  the icswx coprocessor instruction. It does not have any effect
-- 
2.11.0

Re: [PATCH v11 4/7] powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned

2017-04-18 Thread Yongji Xie

On 19 April 2017 at 09:47, Michael Ellerman  wrote:
> Bjorn Helgaas  writes:
>
>> On Mon, Apr 17, 2017 at 4:36 PM, Bjorn Helgaas  wrote:
>>> From: Yongji Xie 
>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>>> b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> index 6901a06da2f9..b724487cbd0f 100644
>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> @@ -3287,6 +3287,11 @@ static void pnv_pci_setup_bridge(struct pci_bus 
>>> *bus, unsigned long type)
>>> }
>>>  }
>>>
>>> +static resource_size_t pnv_pci_default_alignment(struct pci_dev *pdev)
>>> +{
>>> +   return PAGE_SIZE;
>>> +}
>>
>> Is it necessary that pcibios_default_alignment() take a pci_dev
>> pointer?
>
> It's not necessary given the current implementation, obviously.
>
> But it did strike me as a good idea to pass it in case we ever want to
> do anything device specific in future.
>
>> I'd like this better if it were:
>>
>>   resource_size_t pcibios_default_alignment(void) { ... }
>>
>> because the last patch relies on the assumption that all resources of
>> *all* devices will be realigned to the same alignment.
>
> But I guess that precludes doing anything device specific, at least
> without further changes. So in that case it would be better if the API
> didn't include the pci_dev.
>
> Hopefully Yongji can confirm that there were no plans to use the
> pci_dev in future patches.
>

Yes, seems like pci_dev pointer doesn't match the assumption
that all resources will be realigned to the same alignment. It's OK
to me to remove it.

Thanks,
Yongji

Re: [PATCH] scsi: ibmvfc: don't check for failure from mempool_alloc()

2017-04-18 Thread Martin K. Petersen

NeilBrown  writes:

> mempool_alloc() cannot fail when passed GFP_NOIO or any other gfp
> setting that is permitted to sleep.  So remove this pointless code.

Applied to 4.12/scsi-queue. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering

Re: [PATCH v11 4/7] powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned

2017-04-18 Thread Michael Ellerman

Bjorn Helgaas  writes:

> On Mon, Apr 17, 2017 at 4:36 PM, Bjorn Helgaas  wrote:
>> From: Yongji Xie 
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>> b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 6901a06da2f9..b724487cbd0f 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -3287,6 +3287,11 @@ static void pnv_pci_setup_bridge(struct pci_bus *bus, 
>> unsigned long type)
>> }
>>  }
>>
>> +static resource_size_t pnv_pci_default_alignment(struct pci_dev *pdev)
>> +{
>> +   return PAGE_SIZE;
>> +}
>
> Is it necessary that pcibios_default_alignment() take a pci_dev
> pointer?

It's not necessary given the current implementation, obviously.

But it did strike me as a good idea to pass it in case we ever want to
do anything device specific in future.

> I'd like this better if it were:
>
>   resource_size_t pcibios_default_alignment(void) { ... }
>
> because the last patch relies on the assumption that all resources of
> *all* devices will be realigned to the same alignment.

But I guess that precludes doing anything device specific, at least
without further changes. So in that case it would be better if the API
didn't include the pci_dev.

Hopefully Yongji can confirm that there were no plans to use the
pci_dev in future patches.

cheers

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Frank Rowand

On 04/18/17 17:07, Frank Rowand wrote:
> On 04/17/17 17:32, Tyrel Datwyler wrote:
>> This patch introduces event tracepoints for tracking a device_nodes
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>>
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR) dynamic
>> logical partitioning code (ie. POWER speak for hotplugging virtual and 
>> physcial
>> resources at runtime such as cpus or IOAs). These tracepoints provide a
>> easy and quick mechanism for validating the reference counting of
>> device_nodes during their lifetime.
>>
>> Further, when pseries lpars are migrated to a different machine we
>> perform a live update of our device tree to bring it into alignment with the
>> configuration of the new machine. The of_reconfig_notify trace point
>> provides a mechanism that can be turned for debuging the device tree
>> modifications with out having to build a custom kernel to get at the
>> DEBUG code introduced by commit 00aa3720.
> 
> I do not like changing individual (or small groups of) printk() style
> debugging information to tracepoint style.
> 
> As far as I know, there is no easy way to combine trace data and printk()
> style data to create a single chronology of events.  If some of the
> information needed to debug an issue is trace data and some is printk()
> style data then it becomes more difficult to understand the overall
> situation.

And of course the other issue with using tracepoints is the extra space
required to hold the tracepoint info.  With the pr_debug() approach, the
space usage can be easily removed for a production kernel via a config
option.

Tracepoints are wonderful technology, but not always the proper tool to
use for debug info.

> If Rob wants to convert printk() style data to trace data (and I can't
> convince him otherwise) then I will have further comments on this specific
> patch.
> 
> -Frank
> 
>>
>> The following trace events are provided: of_node_get, of_node_put,
>> of_node_release, and of_reconfig_notify. These trace points require a kernel
>> built with ftrace support to be enabled. In a typical environment where
>> debugfs is mounted at /sys/kernel/debug the entire set of tracepoints
>> can be set with the following:
>>
>>   echo "of:*" > /sys/kernel/debug/tracing/set_event
>>
>> or
>>
>>   echo 1 > /sys/kernel/debug/tracing/of/enable
>>
>> The following shows the trace point data from a DLPAR remove of a cpu
>> from a pseries lpar:
>>
>> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
>>
>> cpuhp/23-147   [023]    128.324827:
>>  of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829:
>>  of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829:
>>  of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324831:
>>  of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439000:
>>  of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439002:
>>  of_reconfig_notify: action=DETACH_NODE, 
>> dn->full_name=/cpus/PowerPC,POWER8@10,
>>  prop->name=null, old_prop->name=null
>>drmgr-7284  [009]    128.439015:
>>  of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439016:
>>  of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
>>
>> Signed-off-by: Tyrel Datwyler 
>> ---
>>  drivers/of/dynamic.c  | 30 ++-
>>  include/trace/events/of.h | 93 
>> +++
>>  2 files changed, 105 insertions(+), 18 deletions(-)
>>  create mode 100644 include/trace/events/of.h
>>
> 
> < snip >
> 
>

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Michael Ellerman

Frank Rowand  writes:

> On 04/17/17 17:32, Tyrel Datwyler wrote:
>> This patch introduces event tracepoints for tracking a device_nodes
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>> 
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR) dynamic
>> logical partitioning code (ie. POWER speak for hotplugging virtual and 
>> physcial
>> resources at runtime such as cpus or IOAs). These tracepoints provide a
>> easy and quick mechanism for validating the reference counting of
>> device_nodes during their lifetime.
>> 
>> Further, when pseries lpars are migrated to a different machine we
>> perform a live update of our device tree to bring it into alignment with the
>> configuration of the new machine. The of_reconfig_notify trace point
>> provides a mechanism that can be turned for debuging the device tree
>> modifications with out having to build a custom kernel to get at the
>> DEBUG code introduced by commit 00aa3720.
>
> I do not like changing individual (or small groups of) printk() style
> debugging information to tracepoint style.

I'm not quite sure which printks() you're referring to.

The only printks that are removed in this series are under #ifdef DEBUG,
and so are essentially not there unless you build a custom kernel.

They also only cover the reconfig case, which is actually less
interesting than the much more common and bug-prone get/put logic.

> As far as I know, there is no easy way to combine trace data and printk()
> style data to create a single chronology of events.  If some of the
> information needed to debug an issue is trace data and some is printk()
> style data then it becomes more difficult to understand the overall
> situation.

If you enable CONFIG_PRINTK_TIME then you should be able to just sort
the trace and the printk output by the timestamp. If you're really
trying to correlate the two then you should probably just be using
trace_printk().

But IMO this level of detail, tracing every get/put, does not belong in
printk. Trace points are absolutely the right solution for this type of
debugging.

cheers

Re: [PATCH v4 4/5] perf report: Show branch type statistics for stdio mode

2017-04-18 Thread Jin, Yao




On 4/19/2017 2:53 AM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:05AM +0800, Jin Yao wrote:

SNIP


+const char *branch_type_name(int type)
+{
+   const char *branch_names[PERF_BR_MAX] = {
+   "N/A",
+   "JCC",
+   "JMP",
+   "IND_JMP",
+   "CALL",
+   "IND_CALL",
+   "RET",
+   "SYSCALL",
+   "SYSRET",
+   "IRQ",
+   "INT",
+   "IRET",
+   "FAR_BRANCH",
+   };
+
+   if ((type >= 0) && (type < PERF_BR_MAX))
+   return branch_names[type];
+
+   return NULL;

looks like we should add util/branch.c with above functions
and merge it with util/parse-branch-options.c

we create new file even for less code ;-)

thanks,
jirka


Could we directly add branch_type_name() in util/parse-branch-options.c?

I just feel it's a bit waste of creating a new file for less code. :)

Thanks
Jin Yao

Re: [PATCH v4 4/5] perf report: Show branch type statistics for stdio mode

2017-04-18 Thread Jin, Yao




On 4/19/2017 2:53 AM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:05AM +0800, Jin Yao wrote:

SNIP


+static int hist_iter__branch_callback(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused,
+ bool single __maybe_unused,
+ void *arg)
+{
+   struct hist_entry *he = iter->he;
+   struct report *rep = arg;
+   struct branch_info *bi;
+
+   if (sort__mode == SORT_MODE__BRANCH) {

is this check necessary? the hist_iter__branch_callback
was set based on this check

jirka


Let me double check.

Thanks
Jin Yao

Re: [PATCH v4 5/5] perf report: Show branch type in callchain entry

2017-04-18 Thread Jin, Yao




On 4/19/2017 2:53 AM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:06AM +0800, Jin Yao wrote:

SNIP


+static int branch_type_str(struct branch_type_stat *stat,
+  char *bf, int bfsize)
+{
+   int i, j = 0, printed = 0;
+   u64 total = 0;
+
+   for (i = 0; i < PERF_BR_MAX; i++)
+   total += stat->counts[i];
+
+   if (total == 0)
+   return 0;
+
+   printed += scnprintf(bf + printed, bfsize - printed, " (");
+
+   if (stat->jcc_fwd > 0) {
+   j++;
+   printed += scnprintf(bf + printed, bfsize - printed,
+"JCC forward");
+   }
+
+   if (stat->jcc_bwd > 0) {
+   if (j++)
+   printed += scnprintf(bf + printed, bfsize - printed,
+" JCC backward");
+   else
+   printed += scnprintf(bf + printed, bfsize - printed,
+"JCC backward");
+   }
+
+   if (stat->cross_4k > 0) {
+   if (j++)
+   printed += scnprintf(bf + printed, bfsize - printed,
+" CROSS_4K");
+   else
+   printed += scnprintf(bf + printed, bfsize - printed,
+"CROSS_4K");
+   }

could that 2 legs if be shortened to just one scnprintf like (untested):

printed += scnprintf(bf + printed, bfsize - printed, "%s%s", j++ ? " " : "", 
"CROSS_4K");

I'd also probably use some kind of macro or function
with all that similar code, but I dont insist ;-)

thanks,
jirka


Thanks for this suggestion. I will use this kind of code. Of course, I 
will test. :)


Thanks
Jin Yao

Re: [PATCH v4 5/5] perf report: Show branch type in callchain entry

2017-04-18 Thread Jin, Yao




On 4/19/2017 2:53 AM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:06AM +0800, Jin Yao wrote:

SNIP


  static int counts_str_build(char *bf, int bfsize,
 u64 branch_count, u64 predicted_count,
 u64 abort_count, u64 cycles_count,
-u64 iter_count, u64 samples_count)
+u64 iter_count, u64 samples_count,
+struct branch_type_stat *brtype_stat)
  {
-   double predicted_percent = 0.0;
-   const char *null_str = "";
-   char iter_str[32];
-   char cycle_str[32];
-   char *istr, *cstr;
u64 cycles;
+   int printed, i = 0;
  
  	if (branch_count == 0)

return scnprintf(bf, bfsize, " (calltrace)");
  
+	printed = branch_type_str(brtype_stat, bf, bfsize);

+   if (printed)
+   i++;
+
cycles = cycles_count / branch_count;
+   if (cycles) {
+   if (i++)
+   printed += scnprintf(bf + printed, bfsize - printed,
+   " cycles:%" PRId64 "", cycles);
+   else
+   printed += scnprintf(bf + printed, bfsize - printed,
+   " (cycles:%" PRId64 "", cycles);
+   }
  
  	if (iter_count && samples_count) {

-   if (cycles > 0)
-   scnprintf(iter_str, sizeof(iter_str),
-" iterations:%" PRId64 "",
-iter_count / samples_count);
+   if (i++)
+   printed += scnprintf(bf + printed, bfsize - printed,
+   " iterations:%" PRId64 "",
+   iter_count / samples_count);
else
-   scnprintf(iter_str, sizeof(iter_str),
-"iterations:%" PRId64 "",
-iter_count / samples_count);
-   istr = iter_str;

could you please put the change from using iter_str
to bf into separate patch before the actual branch
display change?

it's hard to see if anything is broken ;-)

thanks,
jirka

Got it, I will separate the patches.

Thanks
Jin Yao

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Frank Rowand

On 04/17/17 17:32, Tyrel Datwyler wrote:
> This patch introduces event tracepoints for tracking a device_nodes
> reference cycle as well as reconfig notifications generated in response
> to node/property manipulations.
> 
> With the recent upstreaming of the refcount API several device_node
> underflows and leaks have come to my attention in the pseries (DLPAR) dynamic
> logical partitioning code (ie. POWER speak for hotplugging virtual and 
> physcial
> resources at runtime such as cpus or IOAs). These tracepoints provide a
> easy and quick mechanism for validating the reference counting of
> device_nodes during their lifetime.
> 
> Further, when pseries lpars are migrated to a different machine we
> perform a live update of our device tree to bring it into alignment with the
> configuration of the new machine. The of_reconfig_notify trace point
> provides a mechanism that can be turned for debuging the device tree
> modifications with out having to build a custom kernel to get at the
> DEBUG code introduced by commit 00aa3720.

I do not like changing individual (or small groups of) printk() style
debugging information to tracepoint style.

As far as I know, there is no easy way to combine trace data and printk()
style data to create a single chronology of events.  If some of the
information needed to debug an issue is trace data and some is printk()
style data then it becomes more difficult to understand the overall
situation.

If Rob wants to convert printk() style data to trace data (and I can't
convince him otherwise) then I will have further comments on this specific
patch.

-Frank

> 
> The following trace events are provided: of_node_get, of_node_put,
> of_node_release, and of_reconfig_notify. These trace points require a kernel
> built with ftrace support to be enabled. In a typical environment where
> debugfs is mounted at /sys/kernel/debug the entire set of tracepoints
> can be set with the following:
> 
>   echo "of:*" > /sys/kernel/debug/tracing/set_event
> 
> or
> 
>   echo 1 > /sys/kernel/debug/tracing/of/enable
> 
> The following shows the trace point data from a DLPAR remove of a cpu
> from a pseries lpar:
> 
> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
> 
> cpuhp/23-147   [023]    128.324827:
>   of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
>   of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
>   of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324831:
>   of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439000:
>   of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439002:
>   of_reconfig_notify: action=DETACH_NODE, 
> dn->full_name=/cpus/PowerPC,POWER8@10,
>   prop->name=null, old_prop->name=null
>drmgr-7284  [009]    128.439015:
>   of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439016:
>   of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
> 
> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/of/dynamic.c  | 30 ++-
>  include/trace/events/of.h | 93 
> +++
>  2 files changed, 105 insertions(+), 18 deletions(-)
>  create mode 100644 include/trace/events/of.h
> 

< snip >

Re: [PATCH 1/9] mm/huge_memory: Use zap_deposited_table() more

2017-04-18 Thread David Rientjes

On Wed, 12 Apr 2017, Oliver O'Halloran wrote:

> Depending flags of the PMD being zapped there may or may not be a
> deposited pgtable to be freed. In two of the three cases this is open
> coded while the third uses the zap_deposited_table() helper. This patch
> converts the others to use the helper to clean things up a bit.
> 
> Cc: "Aneesh Kumar K.V" 
> Cc: "Kirill A. Shutemov" 
> Cc: linux...@kvack.org
> Signed-off-by: Oliver O'Halloran 

Acked-by: David Rientjes

[patch V2 13/24] powerpc/powernv: Use stop_machine_cpuslocked()

2017-04-18 Thread Thomas Gleixner

set_subcores_per_core() holds get_online_cpus() while invoking stop_machine().

stop_machine() invokes get_online_cpus() as well. This is correct, but
prevents the conversion of the hotplug locking to a percpu rwsem.

Use stop_machine_cpuslocked() to avoid the nested call.

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Thomas Gleixner 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org

---
 arch/powerpc/platforms/powernv/subcore.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -356,7 +356,8 @@ static int set_subcores_per_core(int new
/* Ensure state is consistent before we call the other cpus */
mb();
 
-   stop_machine(cpu_update_split_mode, &new_mode, cpu_online_mask);
+   stop_machine_cpuslocked(cpu_update_split_mode, &new_mode,
+   cpu_online_mask);
 
put_online_cpus();

[patch V2 07/24] KVM/PPC/Book3S HV: Use cpuhp_setup_state_nocalls_cpuslocked()

2017-04-18 Thread Thomas Gleixner

From: Sebastian Andrzej Siewior 

kvmppc_alloc_host_rm_ops() holds get_online_cpus() while invoking
cpuhp_setup_state_nocalls().

cpuhp_setup_state_nocalls() invokes get_online_cpus() as well. This is
correct, but prevents the conversion of the hotplug locking to a percpu
rwsem.

Use cpuhp_setup_state_nocalls_cpuslocked() to avoid the nested call.

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Thomas Gleixner 
Cc: Alexander Graf 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: k...@vger.kernel.org
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org

---
 arch/powerpc/kvm/book3s_hv.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3336,10 +3336,10 @@ void kvmppc_alloc_host_rm_ops(void)
return;
}
 
-   cpuhp_setup_state_nocalls(CPUHP_KVM_PPC_BOOK3S_PREPARE,
- "ppc/kvm_book3s:prepare",
- kvmppc_set_host_core,
- kvmppc_clear_host_core);
+   cpuhp_setup_state_nocalls_cpuslocked(CPUHP_KVM_PPC_BOOK3S_PREPARE,
+"ppc/kvm_book3s:prepare",
+kvmppc_set_host_core,
+kvmppc_clear_host_core);
put_online_cpus();
 }

[PATCH][OPAL] cpu-features: add base and POWER8, POWER9 /cpus/ibm, powerpc-cpu-features dt

2017-04-18 Thread Nicholas Piggin

Since last time:
- Added binding specification and design overview documentation.
- Renamed the dt node to ibm,powerpc specific.
- Moved dependency building pass into its own function.

---
 core/Makefile.inc  |   2 +-
 core/cpufeatures.c | 896 +
 core/device.c  |   7 +
 core/init.c|   1 +
 .../ibm,powerpc-cpu-features/binding.txt   | 226 ++
 .../ibm,powerpc-cpu-features/design.txt| 157 
 include/device.h   |   1 +
 include/skiboot.h  |   5 +
 8 files changed, 1294 insertions(+), 1 deletion(-)
 create mode 100644 core/cpufeatures.c
 create mode 100644 doc/device-tree/ibm,powerpc-cpu-features/binding.txt
 create mode 100644 doc/device-tree/ibm,powerpc-cpu-features/design.txt

diff --git a/core/Makefile.inc b/core/Makefile.inc
index b09c30c0..7c247836 100644
--- a/core/Makefile.inc
+++ b/core/Makefile.inc
@@ -8,7 +8,7 @@ CORE_OBJS += pci-opal.o fast-reboot.o device.o exceptions.o 
trace.o affinity.o
 CORE_OBJS += vpd.o hostservices.o platform.o nvram.o nvram-format.o hmi.o
 CORE_OBJS += console-log.o ipmi.o time-utils.o pel.o pool.o errorlog.o
 CORE_OBJS += timer.o i2c.o rtc.o flash.o sensor.o ipmi-opal.o
-CORE_OBJS += flash-subpartition.o bitmap.o buddy.o pci-quirk.o
+CORE_OBJS += flash-subpartition.o bitmap.o buddy.o pci-quirk.o cpufeatures.o
 
 ifeq ($(SKIBOOT_GCOV),1)
 CORE_OBJS += gcov-profiling.o
diff --git a/core/cpufeatures.c b/core/cpufeatures.c
new file mode 100644
index ..1acedc1d
--- /dev/null
+++ b/core/cpufeatures.c
@@ -0,0 +1,896 @@
+/* Copyright 2017 IBM Corp.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ * implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/*
+ * This file deals with setting up the /cpus/features device tree
+ * by discovering CPU hardware and populating feature nodes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef DEBUG
+#define DBG(fmt, a...) prlog(PR_DEBUG, "CPUFT: " fmt, ##a)
+#else
+#define DBG(fmt, a...)
+#endif
+
+/* Device-tree visible constants follow */
+#define ISA_V2_07B 2070
+#define ISA_V3_0B  3000
+
+#define USABLE_PR  (1U << 0)
+#define USABLE_OS  (1U << 1)
+#define USABLE_HV  (1U << 2)
+
+#define HV_SUPPORT_HFSCR   (1U << 0)
+#define OS_SUPPORT_FSCR(1U << 0)
+
+/* Following are definitions for the match tables, not the DT binding itself */
+#define ISA_BASE   0
+
+#define HV_NONE0
+#define HV_CUSTOM  1
+#define HV_HFSCR   2
+
+#define OS_NONE0
+#define OS_CUSTOM  1
+#define OS_FSCR2
+
+/* CPU bitmasks for match table */
+#define CPU_P8_DD1 (1U << 0)
+#define CPU_P8_DD2 (1U << 1)
+#define CPU_P9_DD1 (1U << 2)
+#define CPU_P9_DD2 (1U << 3)
+
+#define CPU_P8 (CPU_P8_DD1|CPU_P8_DD2)
+#define CPU_P9 (CPU_P9_DD1|CPU_P9_DD2)
+#define CPU_ALL(CPU_P8|CPU_P9)
+
+struct cpu_feature {
+   const char *name;
+   uint32_t cpus_supported;
+   uint32_t isa;
+   uint32_t usable_privilege;
+   uint32_t hv_support;
+   uint32_t os_support;
+   uint32_t hfscr_bit_nr;
+   uint32_t fscr_bit_nr;
+   uint32_t hwcap_bit_nr;
+   const char *dependencies_names; /* space-delimited names */
+};
+
+/*
+ * The base (or NULL) cpu feature set is the CPU features available
+ * when no child nodes of the /cpus/ibm,powerpc-cpu-features node exist. The
+ * base feature set is POWER8 (ISAv2.07B), less features that are listed
+ * explicitly.
+ *
+ * XXX: currently, the feature dependencies are not necessarily captured
+ * exactly or completely. This is somewhat acceptable because all
+ * implementations must be aware of all these features.
+ */
+static const struct cpu_feature cpu_features_table[] = {
+   /*
+* Big endian as in ISAv2.07B, MSR_LE=0
+*/
+   { "big-endian",
+   CPU_ALL,
+   ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
+   HV_CUSTOM, OS_CUSTOM,
+   -1, -1, -1,
+   NULL, },
+
+   /*
+* Little endian as in ISAv2.07B, MSR_LE=1.
+*
+* When both big and little endian are defined, there is an LPCR ILE
+* bit and implementation specific way to switch HILE mode, MSR_SLE,
+* etc.
+*/
+   { "little-endian",
+   CPU_ALL,
+   ISA_BASE, USA

[PATCH 4/4] powerpc/64s: cpu-features: initial implementation

2017-04-18 Thread Nicholas Piggin

The /cpus/ibm,powerpc-cpu-features dt binding describes CPU features with
ascii names and extensible compatibility, privilege, and enablement metadata
that allows improved flexibility and compatibility with new hardware.

Design overview and specification of features is available in the OPAL
source.


Since last post:
- Updated to powerpc next, in particular XIVE merge
- Updated and expanded documentation in skiboot. Device tree binding
  documentation in Linux is a copy of the binding specification from
  skiboot and includes a pointer back there.
- Renamed usable-mask to usable-privilege to describe it better
- Changed hv-support and os-support to bitmasks for better extensibility
  for other recipes in future.


Looking at hfscr/fscr/lpcr/msr etc bits before and after the patch to
make sure we're doing things properly. There is one difference:

- PPC_FEATURE2_EBB is set independent of PMU init. EBB facility is
theoretically more general than PMU, I don't think this should be
a problem?

POWER9:
- CPU_FTR_ICSWX bit is now clear. Should this be cleared on POEWR9?

And Mambo has a few issues with POWER8:
- HFSCR bit 54 and 57 are now clear (mambo sets at init)
- PMAO_BUG is set. This is due to mambo setting architected POWER8 mode
and POWER8E PVR. Current kernels lose PMAO_BUG bit.
- CI_LARGE_PAGE is now set (mambo boot does not set it for some reason,
haven't looked at why).
---
 .../bindings/powerpc/ibm,powerpc-cpu-features.txt  | 233 
 arch/powerpc/Kconfig   |  16 +
 arch/powerpc/include/asm/cpu_has_feature.h |   4 +-
 arch/powerpc/include/asm/cpufeatures.h |  55 ++
 arch/powerpc/include/asm/cputable.h|   2 +
 arch/powerpc/include/uapi/asm/cputable.h   |   6 +
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/cpufeatures.c  | 636 +
 arch/powerpc/kernel/cputable.c |  37 +-
 arch/powerpc/kernel/prom.c | 338 ++-
 arch/powerpc/kernel/setup-common.c |   2 +-
 arch/powerpc/kernel/setup_64.c |  15 +-
 12 files changed, 1328 insertions(+), 17 deletions(-)
 create mode 100644 
Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt
 create mode 100644 arch/powerpc/include/asm/cpufeatures.h
 create mode 100644 arch/powerpc/kernel/cpufeatures.c

diff --git 
a/Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt 
b/Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt
new file mode 100644
index ..bf7fc71f3062
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt
@@ -0,0 +1,233 @@
+*** NOTE ***
+This document is copied from OPAL firmware
+(skiboot/doc/device-tree/ibm,powerpc-cpu-features/binding.txt)
+
+There is more complete overview and documentation of features in that
+source tree.  All patches and modifications should go there.
+
+ibm,powerpc-cpu-features binding
+
+
+This device tree binding describes CPU features available to software, with
+enablement, privilege, and compatibility metadata.
+
+More general description of design and implementation of this binding is
+found in design.txt, which also points to documentation of specific features.
+
+
+/cpus/ibm,powerpc-cpu-features node binding
+---
+
+Node: ibm,powerpc-cpu-features
+
+Description: Container of CPU feature nodes.
+
+The node name must be "ibm,powerpc-cpu-features" and it must be a child of the
+node "/cpus".
+
+The node is optional but should be provided by new OPAL firmware.
+
+Properties:
+
+- device_type
+  Usage: required
+  Value type: string
+  Definition: "cpu-features"
+
+- isa
+  Usage: required
+  Value type: 
+  Definition:
+
+  isa that the CPU is currently running in. This provides instruction set
+  compatibility, less the individual feature nodes. For example, an ISA v3.0
+  implementation that lacks the "transactional-memory" cpufeature node
+  should not use transactional memory facilities.
+
+  Value corresponds to the "Power ISA Version" multiplied by 1000.
+  For example, <3000> corresponds to Version 3.0, <2070> to Version 2.07.
+  The minor digit is available for revisions.
+
+/cpus/ibm,powerpc-cpu-features/example-feature node bindings
+
+
+Each child node of cpu-features represents a CPU feature / capability.
+
+Node: A string describing an architected CPU feature, e.g., "floating-point".
+
+Description: A feature or capability supported by the CPUs.
+
+The name of the node is a human readable string that forms the interface
+used to describe features to software. Features are currently documented
+in the code where they are implemented in skiboot/core/cpufeatures.c
+
+Presence of the node indicates the feature is available.
+
+Properties:
+
+- isa
+  Usage: re

[PATCH 3/4] of/fdt: introduce of_scan_flat_dt_subnodes and of_get_flat_dt_phandle

2017-04-18 Thread Nicholas Piggin

Introduce primitives for FDT parsing. These will be used for powerpc
cpufeatures node scanning, which has quite complex structure but should
be processed early.

Cc: devicet...@vger.kernel.org
Acked-by: Rob Herring 
Signed-off-by: Nicholas Piggin 
---
 drivers/of/fdt.c   | 38 ++
 include/linux/of_fdt.h |  6 ++
 2 files changed, 44 insertions(+)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index e5ce4b59e162..961ca97072a9 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -754,6 +754,36 @@ int __init of_scan_flat_dt(int (*it)(unsigned long node,
 }
 
 /**
+ * of_scan_flat_dt_subnodes - scan sub-nodes of a node call callback on each.
+ * @it: callback function
+ * @data: context data pointer
+ *
+ * This function is used to scan sub-nodes of a node.
+ */
+int __init of_scan_flat_dt_subnodes(unsigned long parent,
+   int (*it)(unsigned long node,
+ const char *uname,
+ void *data),
+   void *data)
+{
+   const void *blob = initial_boot_params;
+   int node;
+
+   fdt_for_each_subnode(node, blob, parent) {
+   const char *pathp;
+   int rc;
+
+   pathp = fdt_get_name(blob, node, NULL);
+   if (*pathp == '/')
+   pathp = kbasename(pathp);
+   rc = it(node, pathp, data);
+   if (rc)
+   return rc;
+   }
+   return 0;
+}
+
+/**
  * of_get_flat_dt_subnode_by_name - get the subnode by given name
  *
  * @node: the parent node
@@ -812,6 +842,14 @@ int __init of_flat_dt_match(unsigned long node, const char 
*const *compat)
return of_fdt_match(initial_boot_params, node, compat);
 }
 
+/**
+ * of_get_flat_dt_prop - Given a node in the flat blob, return the phandle
+ */
+uint32_t __init of_get_flat_dt_phandle(unsigned long node)
+{
+   return fdt_get_phandle(initial_boot_params, node);
+}
+
 struct fdt_scan_status {
const char *name;
int namelen;
diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
index 271b3fdf0070..1dfbfd0d8040 100644
--- a/include/linux/of_fdt.h
+++ b/include/linux/of_fdt.h
@@ -54,6 +54,11 @@ extern char __dtb_end[];
 extern int of_scan_flat_dt(int (*it)(unsigned long node, const char *uname,
 int depth, void *data),
   void *data);
+extern int of_scan_flat_dt_subnodes(unsigned long node,
+   int (*it)(unsigned long node,
+ const char *uname,
+ void *data),
+   void *data);
 extern int of_get_flat_dt_subnode_by_name(unsigned long node,
  const char *uname);
 extern const void *of_get_flat_dt_prop(unsigned long node, const char *name,
@@ -62,6 +67,7 @@ extern int of_flat_dt_is_compatible(unsigned long node, const 
char *name);
 extern int of_flat_dt_match(unsigned long node, const char *const *matches);
 extern unsigned long of_get_flat_dt_root(void);
 extern int of_get_flat_dt_size(void);
+extern uint32_t of_get_flat_dt_phandle(unsigned long node);
 
 extern int early_init_dt_scan_chosen(unsigned long node, const char *uname,
 int depth, void *data);
-- 
2.11.0

[PATCH 2/4] powerpc/64s: POWER9 no LPCR VRMASD bits

2017-04-18 Thread Nicholas Piggin

POWER9/ISAv3 has no VRMASD field in LPCR. Don't set reserved bits.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/cpu_setup_power.S | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index 1fce4ddd2e6c..10cb2896b2ae 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -30,7 +30,7 @@ _GLOBAL(__setup_cpu_power7)
mtspr   SPRN_LPID,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_tlb_power7
mtlrr11
blr
@@ -44,7 +44,7 @@ _GLOBAL(__restore_cpu_power7)
mtspr   SPRN_LPID,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_tlb_power7
mtlrr11
blr
@@ -62,7 +62,7 @@ _GLOBAL(__setup_cpu_power8)
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
li  r4,0 /* LPES = 0 */
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_HFSCR
bl  __init_tlb_power8
bl  __init_PMU_HV
@@ -84,7 +84,7 @@ _GLOBAL(__restore_cpu_power8)
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
li  r4,0 /* LPES = 0 */
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_HFSCR
bl  __init_tlb_power8
bl  __init_PMU_HV
@@ -108,7 +108,7 @@ _GLOBAL(__setup_cpu_power9)
LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
andcr3, r3, r4
li  r4,0 /* LPES = 0 */
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA300
bl  __init_HFSCR
bl  __init_tlb_power9
bl  __init_PMU_HV
@@ -132,7 +132,7 @@ _GLOBAL(__restore_cpu_power9)
LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
andcr3, r3, r4
li  r4,0 /* LPES = 0 */
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA300
bl  __init_HFSCR
bl  __init_tlb_power9
bl  __init_PMU_HV
@@ -150,7 +150,7 @@ __init_hvmode_206:
std r5,CPU_SPEC_FEATURES(r4)
blr
 
-__init_LPCR:
+__init_LPCR_ISA206:
/* Setup a sane LPCR:
 *   Called with initial LPCR in R3 and desired LPES 2-bit value in R4
 *
@@ -163,6 +163,11 @@ __init_LPCR:
 *
 * Other bits untouched for now
 */
+   li  r5,0x10
+   rldimi  r3,r5, LPCR_VRMASD_SH, 64-LPCR_VRMASD_SH-5
+
+   /* POWER9 has no VRMASD */
+__init_LPCR_ISA300:
rldimi  r3,r4, LPCR_LPES_SH, 64-LPCR_LPES_SH-2
ori r3,r3,(LPCR_PECE0|LPCR_PECE1|LPCR_PECE2)
li  r5,4
@@ -170,8 +175,6 @@ __init_LPCR:
clrrdi  r3,r3,1 /* clear HDICE */
li  r5,4
rldimi  r3,r5, LPCR_VC_SH, 0
-   li  r5,0x10
-   rldimi  r3,r5, LPCR_VRMASD_SH, 64-LPCR_VRMASD_SH-5
mtspr   SPRN_LPCR,r3
isync
blr
-- 
2.11.0

[PATCH 1/4] powerpc/64s: Revert setting LPCR LPES0 on POWER9

2017-04-18 Thread Nicholas Piggin

The XIVE enablement patches set LPES0 on POWER9 host. This bit sets
external interrupts to guest delivery mode that uses SRR[01]. The host's
EE interrupt handler expects HSRR[01] (for earlier CPUs). which is fine
because XIVE is configured not to deliver EE to the host (HVI is used
instead) so this should never be executed.

However a bug in interrupt controller code or odd configuration of
mambo/systemsim could result in the host getting EE. Keeping EE delivery
mode matching the host handler prevents strange crashes due to using
the wrong exception registers.

When running in guest mode and getting EE, the guest LPCR will be
loaded by KVM which contains the LPES0 bit.

Fixes: 08a1e650cc ("powerpc: Fixup LPCR:PECE and HEIC setting on POWER9")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/cpu_setup_power.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index 7013ae3d1675..1fce4ddd2e6c 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -107,7 +107,7 @@ _GLOBAL(__setup_cpu_power9)
or  r3, r3, r4
LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
andcr3, r3, r4
-   li  r4,(LPCR_LPES0 >> LPCR_LPES_SH)
+   li  r4,0 /* LPES = 0 */
bl  __init_LPCR
bl  __init_HFSCR
bl  __init_tlb_power9
@@ -131,7 +131,7 @@ _GLOBAL(__restore_cpu_power9)
or  r3, r3, r4
LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
andcr3, r3, r4
-   li  r4,(LPCR_LPES0 >> LPCR_LPES_SH)
+   li  r4,0 /* LPES = 0 */
bl  __init_LPCR
bl  __init_HFSCR
bl  __init_tlb_power9
-- 
2.11.0

[PATCH 0/4 v2] cpufeatures merge candidate

2017-04-18 Thread Nicholas Piggin

I included binding documentation in OPAL, and wrote a design
overview and implementation guideline document. Writing updated
documentation lead to some small changes in the specification.

Also a number of other small changes to Linux and OPAL.

Thanks,
Nick

Re: [PATCH v4 4/5] perf report: Show branch type statistics for stdio mode

2017-04-18 Thread Jiri Olsa

On Wed, Apr 12, 2017 at 06:21:05AM +0800, Jin Yao wrote:

SNIP

> +static int hist_iter__branch_callback(struct hist_entry_iter *iter,
> +   struct addr_location *al __maybe_unused,
> +   bool single __maybe_unused,
> +   void *arg)
> +{
> + struct hist_entry *he = iter->he;
> + struct report *rep = arg;
> + struct branch_info *bi;
> +
> + if (sort__mode == SORT_MODE__BRANCH) {

is this check necessary? the hist_iter__branch_callback
was set based on this check

jirka

Re: [PATCH v4 4/5] perf report: Show branch type statistics for stdio mode

2017-04-18 Thread Jiri Olsa

On Wed, Apr 12, 2017 at 06:21:05AM +0800, Jin Yao wrote:

SNIP

> +const char *branch_type_name(int type)
> +{
> + const char *branch_names[PERF_BR_MAX] = {
> + "N/A",
> + "JCC",
> + "JMP",
> + "IND_JMP",
> + "CALL",
> + "IND_CALL",
> + "RET",
> + "SYSCALL",
> + "SYSRET",
> + "IRQ",
> + "INT",
> + "IRET",
> + "FAR_BRANCH",
> + };
> +
> + if ((type >= 0) && (type < PERF_BR_MAX))
> + return branch_names[type];
> +
> + return NULL;

looks like we should add util/branch.c with above functions
and merge it with util/parse-branch-options.c

we create new file even for less code ;-)

thanks,
jirka

Re: [PATCH v4 5/5] perf report: Show branch type in callchain entry

2017-04-18 Thread Jiri Olsa

On Wed, Apr 12, 2017 at 06:21:06AM +0800, Jin Yao wrote:

SNIP

>  static int counts_str_build(char *bf, int bfsize,
>u64 branch_count, u64 predicted_count,
>u64 abort_count, u64 cycles_count,
> -  u64 iter_count, u64 samples_count)
> +  u64 iter_count, u64 samples_count,
> +  struct branch_type_stat *brtype_stat)
>  {
> - double predicted_percent = 0.0;
> - const char *null_str = "";
> - char iter_str[32];
> - char cycle_str[32];
> - char *istr, *cstr;
>   u64 cycles;
> + int printed, i = 0;
>  
>   if (branch_count == 0)
>   return scnprintf(bf, bfsize, " (calltrace)");
>  
> + printed = branch_type_str(brtype_stat, bf, bfsize);
> + if (printed)
> + i++;
> +
>   cycles = cycles_count / branch_count;
> + if (cycles) {
> + if (i++)
> + printed += scnprintf(bf + printed, bfsize - printed,
> + " cycles:%" PRId64 "", cycles);
> + else
> + printed += scnprintf(bf + printed, bfsize - printed,
> + " (cycles:%" PRId64 "", cycles);
> + }
>  
>   if (iter_count && samples_count) {
> - if (cycles > 0)
> - scnprintf(iter_str, sizeof(iter_str),
> -  " iterations:%" PRId64 "",
> -  iter_count / samples_count);
> + if (i++)
> + printed += scnprintf(bf + printed, bfsize - printed,
> + " iterations:%" PRId64 "",
> + iter_count / samples_count);
>   else
> - scnprintf(iter_str, sizeof(iter_str),
> -  "iterations:%" PRId64 "",
> -  iter_count / samples_count);
> - istr = iter_str;

could you please put the change from using iter_str
to bf into separate patch before the actual branch
display change?

it's hard to see if anything is broken ;-)

thanks,
jirka

Re: [PATCH v4 5/5] perf report: Show branch type in callchain entry

2017-04-18 Thread Jiri Olsa

On Wed, Apr 12, 2017 at 06:21:06AM +0800, Jin Yao wrote:

SNIP

> +static int branch_type_str(struct branch_type_stat *stat,
> +char *bf, int bfsize)
> +{
> + int i, j = 0, printed = 0;
> + u64 total = 0;
> +
> + for (i = 0; i < PERF_BR_MAX; i++)
> + total += stat->counts[i];
> +
> + if (total == 0)
> + return 0;
> +
> + printed += scnprintf(bf + printed, bfsize - printed, " (");
> +
> + if (stat->jcc_fwd > 0) {
> + j++;
> + printed += scnprintf(bf + printed, bfsize - printed,
> +  "JCC forward");
> + }
> +
> + if (stat->jcc_bwd > 0) {
> + if (j++)
> + printed += scnprintf(bf + printed, bfsize - printed,
> +  " JCC backward");
> + else
> + printed += scnprintf(bf + printed, bfsize - printed,
> +  "JCC backward");
> + }
> +
> + if (stat->cross_4k > 0) {
> + if (j++)
> + printed += scnprintf(bf + printed, bfsize - printed,
> +  " CROSS_4K");
> + else
> + printed += scnprintf(bf + printed, bfsize - printed,
> +  "CROSS_4K");
> + }

could that 2 legs if be shortened to just one scnprintf like (untested):

   printed += scnprintf(bf + printed, bfsize - printed, "%s%s", j++ ? " " : "", 
"CROSS_4K");

I'd also probably use some kind of macro or function
with all that similar code, but I dont insist ;-)

thanks,
jirka

[PATCH v7 4/10] powerpc/perf: Add generic IMC pmu groupand event functions

2017-04-18 Thread Anju T Sudhakar

From: Hemant Kumar 

Device tree IMC driver code parses the IMC units and their events. It
passes the information to IMC pmu code which is placed in powerpc/perf
as "imc-pmu.c".

Patch adds a set of generic imc pmu related event functions to be
used  by each imc pmu unit. Add code to setup format attribute and to
register imc pmus. Add a event_init function for nest_imc events.

Since, the IMC counters' data are periodically fed to a memory location,
the functions to read/update, start/stop, add/del can be generic and can
be used by all IMC PMU units.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h|   3 +
 arch/powerpc/perf/Makefile|   3 +
 arch/powerpc/perf/imc-pmu.c   | 269 ++
 arch/powerpc/platforms/powernv/opal-imc.c |  10 +-
 4 files changed, 283 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/perf/imc-pmu.c

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index d0193c8..6bbe184 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -92,4 +92,7 @@ struct imc_pmu {
 #define IMC_DOMAIN_NEST1
 #define IMC_DOMAIN_UNKNOWN -1
 
+extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+extern int __init init_imc_pmu(struct imc_events *events,int idx, struct 
imc_pmu *pmu_ptr);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 4d606b9..b29d918 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -6,6 +6,9 @@ obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
   isa207-common.o power8-pmu.o power9-pmu.o
+
+obj-$(CONFIG_HV_PERF_IMC_CTRS) += imc-pmu.o
+
 obj32-$(CONFIG_PPC_PERF_CTRS)  += mpc7450-pmu.o
 
 obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
new file mode 100644
index 000..f09a37a
--- /dev/null
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -0,0 +1,269 @@
+/*
+ * Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2017 Madhavan Srinivasan, IBM Corporation.
+ *   (C) 2017 Anju T Sudhakar, IBM Corporation.
+ *   (C) 2017 Hemant K Shaw, IBM Corporation.
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+/* Needed for sanity check */
+extern u64 nest_max_offset;
+
+PMU_FORMAT_ATTR(event, "config:0-20");
+static struct attribute *imc_format_attrs[] = {
+   &format_attr_event.attr,
+   NULL,
+};
+
+static struct attribute_group imc_format_group = {
+   .name = "format",
+   .attrs = imc_format_attrs,
+};
+
+static int nest_imc_event_init(struct perf_event *event)
+{
+   int chip_id;
+   u32 config = event->attr.config;
+   struct perchip_nest_info *pcni;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* Sampling not supported */
+   if (event->hw.sample_period)
+   return -EINVAL;
+
+   /* unsupported modes and filters */
+   if (event->attr.exclude_user   ||
+   event->attr.exclude_kernel ||
+   event->attr.exclude_hv ||
+   event->attr.exclude_idle   ||
+   event->attr.exclude_host   ||
+   event->attr.exclude_guest)
+   return -EINVAL;
+
+   if (event->cpu < 0)
+   return -EINVAL;
+
+   /* Sanity check for config (event offset) */
+   if (config > nest_max_offset)
+   return -EINVAL;
+
+   chip_id = topology_physical_package_id(event->cpu);
+   pcni = &nest_perchip_info[chip_id];
+
+   /*
+* Memory for Nest HW counter data could be in multiple pages.
+* Hence check and pick the right event base page for chip with
+* "chip_id" and add "config" to it".
+*/
+   event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
+   (config & ~PAGE_MASK);
+
+   return 0;
+}
+
+static void imc_read_counter(struct perf_event *event)
+{
+   u64 *addr, data;
+
+   /*
+* In-Memory Collection (IMC) counters are free flowing counters.
+* So we take a snapshot of the counter value on enable and save it
+* to calculate the delta at later stage to present the event counter
+* value.
+

[PATCH v7 07/10] powerpc/perf: PMU functions for Core IMC and hotplugging

2017-04-18 Thread Anju T Sudhakar

From: Hemant Kumar 

This patch adds the PMU function to initialize a core IMC event. It also
adds cpumask initialization function for core IMC PMU. For
initialization, a 8KB of memory is allocated per core where the data
for core IMC counters will be accumulated. The base address for this
page is sent to OPAL via an OPAL call which initializes various SCOMs
related to Core IMC initialization. Upon any errors, the pages are
free'ed and core IMC counters are disabled using the same OPAL call.

For CPU hotplugging, a cpumask is initialized which contains an online
CPU from each core. If a cpu goes offline, we check whether that cpu
belongs to the core imc cpumask, if yes, then, we migrate the PMU
context to any other online cpu (if available) in that core. If a cpu
comes back online, then this cpu will be added to the core imc cpumask
only if there was no other cpu from that core in the previous cpumask.

To register the hotplug functions for core_imc, a new state
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE is added to the list of existing
states.

Patch also adds OPAL device shutdown callback. Needed to disable the
IMC core engine to handle kexec.

Signed-off-by: Hemant Kumar 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h |   7 +
 arch/powerpc/include/asm/opal-api.h|  10 +-
 arch/powerpc/include/asm/opal.h|   2 +
 arch/powerpc/perf/imc-pmu.c| 381 -
 arch/powerpc/platforms/powernv/opal-imc.c  |   7 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 include/linux/cpuhotplug.h |   1 +
 7 files changed, 397 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 37fdd79..bf5fb7c 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -24,6 +24,7 @@
  */
 #define IMC_MAX_CHIPS  32
 #define IMC_MAX_PMUS   32
+#define IMC_MAX_CORES  32
 
 /*
  * This macro is used for memory buffer allocation of
@@ -38,6 +39,11 @@
 #define IMC_NEST_MAX_PAGES 64
 
 /*
+ * IMC Core engine expects 8K bytes of memory for counter collection.
+ */
+#define IMC_CORE_COUNTER_MEM   8192
+
+/*
  *Compatbility macros for IMC devices
  */
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
@@ -101,4 +107,5 @@ extern struct perchip_nest_info 
nest_perchip_info[IMC_MAX_CHIPS];
 extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 extern struct imc_pmu *core_imc_pmu;
 extern int __init init_imc_pmu(struct imc_events *events,int idx, struct 
imc_pmu *pmu_ptr);
+void core_imc_disable(void);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 23fc51e9..971918d 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -169,7 +169,8 @@
 #define OPAL_PCI_TCE_KILL  126
 #define OPAL_NMMU_SET_PTCR 127
 #define OPAL_NEST_IMC_COUNTERS_CONTROL 149
-#define OPAL_LAST  149
+#define OPAL_CORE_IMC_COUNTERS_CONTROL 150
+#define OPAL_LAST  150
 
 /* Device tree flags */
 
@@ -939,6 +940,13 @@ enum {
OPAL_NEST_IMC_START,
 };
 
+/* Operation argument to Core IMC */
+enum {
+   OPAL_CORE_IMC_DISABLE,
+   OPAL_CORE_IMC_ENABLE,
+   OPAL_CORE_IMC_INIT,
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index ffa4293..6364458 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -238,6 +238,8 @@ int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
  */
 int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
uint64_t value2, uint64_t value3);
+int64_t opal_core_imc_counters_control(uint64_t operation, uint64_t addr,
+   uint64_t value2, uint64_t value3);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 6fdac40..e98a715 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1,5 +1,5 @@
 /*
- * Nest Performance Monitor counter support.
+ * IMC Performance Monitor counter support.
  *
  * Copyright (C) 2017 Madhavan Srinivasan, IBM Corporation.
  *   (C) 2017 Anju T Sudhakar, IBM Corporation.
@@ -21,9 +21,21 @@ struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 static cpumask_t nest_imc_cpumask;
 
 static atomic_t nest_events;
+static atomic_t core_events;
 /* Used to avoid races in calling enable/disable nest-pmu units*/
 static DEFINE_MUTEX(imc_nest_reserve);
+/* Used to avoid races in calling enable/disable

[PATCH v7 10/10] powerpc/perf: Thread imc cpuhotplug support

2017-04-18 Thread Anju T Sudhakar

This patch adds support for thread IMC on cpuhotplug.

When a cpu goes offline, the LDBAR for that cpu is disabled, and when it comes
back online the previous ldbar value is written back to the LDBAR for that cpu.

To register the hotplug functions for thread_imc, a new state
CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE is added to the list of existing
states.

Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/perf/imc-pmu.c | 32 +++-
 include/linux/cpuhotplug.h  |  1 +
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 6c7d3ed..daf9151 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -984,6 +984,16 @@ static void cleanup_all_thread_imc_memory(void)
on_each_cpu(cleanup_thread_imc_memory, NULL, 1);
 }
 
+static void thread_imc_update_ldbar(unsigned int cpu_id)
+{
+   u64 ldbar_addr, ldbar_value;
+
+   ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
+   ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
+   (u64)THREAD_IMC_ENABLE;
+   mtspr(SPRN_LDBAR, ldbar_value);
+}
+
 /*
  * Allocates a page of memory for each of the online cpus, and, writes the
  * physical base address of that page to the LDBAR for that cpu. This starts
@@ -991,21 +1001,33 @@ static void cleanup_all_thread_imc_memory(void)
  */
 static void thread_imc_mem_alloc(void *dummy)
 {
-   u64 ldbar_addr, ldbar_value;
int cpu_id = smp_processor_id();
int phys_id = topology_physical_package_id(smp_processor_id());
 
per_cpu_add[cpu_id] = (u64)alloc_pages_exact_nid(phys_id,
(size_t)IMC_THREAD_COUNTER_MEM, GFP_KERNEL | 
__GFP_ZERO);
-   ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
-   ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
-   (u64)THREAD_IMC_ENABLE;
-   mtspr(SPRN_LDBAR, ldbar_value);
+   thread_imc_update_ldbar(cpu_id);
+}
+
+static int ppc_thread_imc_cpu_online(unsigned int cpu)
+{
+   thread_imc_update_ldbar(cpu);
+   return 0;
 }
 
+static int ppc_thread_imc_cpu_offline(unsigned int cpu)
+{
+   mtspr(SPRN_LDBAR, 0);
+   return 0;
+ }
+
 void thread_imc_cpu_init(void)
 {
on_each_cpu(thread_imc_mem_alloc, NULL, 1);
+   cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
+   "POWER_THREAD_IMC_ONLINE",
+   ppc_thread_imc_cpu_online,
+   ppc_thread_imc_cpu_offline);
 }
 
 /*
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index e7b7712..bbec927 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -139,6 +139,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
+   CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
CPUHP_AP_ONLINE_DYN,
-- 
2.7.4

[PATCH v7 06/10] powerpc/powernv: Core IMC events detection

2017-04-18 Thread Anju T Sudhakar

From: Hemant Kumar 

This patch adds support for detection of core IMC events along with the
Nest IMC events. It adds a new domain IMC_DOMAIN_CORE and its determined
with the help of the compatibility string "ibm,imc-counters-core" based
on the IMC device tree.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h|  4 +++-
 arch/powerpc/perf/imc-pmu.c   |  3 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 28 +---
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 1478d0f..37fdd79 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -42,6 +42,7 @@
  */
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
+#define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
 
 /*
  * Structure to hold per chip specific memory address
@@ -90,13 +91,14 @@ struct imc_pmu {
  * Domains for IMC PMUs
  */
 #define IMC_DOMAIN_NEST1
+#define IMC_DOMAIN_CORE2
 #define IMC_DOMAIN_UNKNOWN -1
 
 #define IMC_COUNTER_ENABLE 1
 #define IMC_COUNTER_DISABLE0
 
-
 extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+extern struct imc_pmu *core_imc_pmu;
 extern int __init init_imc_pmu(struct imc_events *events,int idx, struct 
imc_pmu *pmu_ptr);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index b86ef86..6fdac40 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -24,8 +24,11 @@ static atomic_t nest_events;
 /* Used to avoid races in calling enable/disable nest-pmu units*/
 static DEFINE_MUTEX(imc_nest_reserve);
 
+struct imc_pmu *core_imc_pmu;
+
 /* Needed for sanity check */
 extern u64 nest_max_offset;
+extern u64 core_max_offset;
 
 PMU_FORMAT_ATTR(event, "config:0-20");
 static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index 61f6d67..d712ef3 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -34,6 +34,7 @@
 #include 
 
 u64 nest_max_offset;
+u64 core_max_offset;
 
 static int imc_event_prop_update(char *name, struct imc_events *events)
 {
@@ -114,6 +115,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (nest_max_offset < value)
nest_max_offset = value;
break;
+   case IMC_DOMAIN_CORE:
+   if (core_max_offset < value)
+   core_max_offset = value;
+   break;
default:
/* Unknown domain, return */
return;
@@ -357,7 +362,7 @@ static struct imc_events *imc_events_setup(struct 
device_node *parent,
 /*
  * imc_pmu_create : Takes the parent device which is the pmu unit and a
  *  pmu_index as the inputs.
- * Allocates memory for the pmu, sets up its domain (NEST), and
+ * Allocates memory for the pmu, sets up its domain (NEST/CORE), and
  * calls imc_events_setup() to allocate memory for the events supported
  * by this pmu. Assigns a name for the pmu. Calls imc_events_node_parser()
  * to setup the individual events.
@@ -386,7 +391,10 @@ static int imc_pmu_create(struct device_node *parent, int 
pmu_index, int domain)
goto free_pmu;
 
/* Needed for hotplug/migration */
-   per_nest_pmu_arr[pmu_index] = pmu_ptr;
+   if (pmu_ptr->domain == IMC_DOMAIN_CORE)
+   core_imc_pmu = pmu_ptr;
+   else if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+   per_nest_pmu_arr[pmu_index] = pmu_ptr;
 
pp = of_find_property(parent, "name", NULL);
if (!pp) {
@@ -407,7 +415,10 @@ static int imc_pmu_create(struct device_node *parent, int 
pmu_index, int domain)
goto free_pmu;
}
/* Save the name to register it later */
-   sprintf(buf, "nest_%s", (char *)pp->value);
+   if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+   sprintf(buf, "nest_%s", (char *)pp->value);
+   else
+   sprintf(buf, "%s_imc", (char *)pp->value);
pmu_ptr->pmu.name = (char *)buf;
 
/*
@@ -461,6 +472,17 @@ static void __init imc_pmu_setup(struct device_node 
*parent)
return;
pmu_count++;
}
+   /*
+* Loop through the imc-counters tree for each compatible
+* "ibm,imc-counters-core", and update "struct imc_pmu".
+*/
+   for_each_compatible_node(child, NULL, IMC_DTB_CORE_COMPAT) {
+   domain = IMC_DOMAIN_CORE;
+   rc = imc_pmu_create(child, pmu_count, domain);
+   if (rc)
+

[PATCH v7 02/10] powerpc/powernv: Autoload IMC device driver module

2017-04-18 Thread Anju T Sudhakar

This patch does three things :
 - Enables "opal.c" to create a platform device for the IMC interface
   according to the appropriate compatibility string.
 - Find the reserved-memory region details from the system device tree
   and get the base address of HOMER (Reserved memory) region address for each 
chip.
 - We also get the Nest PMU counter data offsets (in the HOMER region)
   and their sizes. The offsets for the counters' data are fixed and
   won't change from chip to chip.

The device tree parsing logic is separated from the PMU creation
functions (which is done in subsequent patches).

Patch also adds a CONFIG_HV_PERF_IMC_CTRS for the IMC driver.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/platforms/powernv/Kconfig|  10 +++
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/opal-imc.c | 140 ++
 arch/powerpc/platforms/powernv/opal.c |  18 
 4 files changed, 169 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c

diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 3a07e4d..1b90a98 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -27,3 +27,13 @@ config OPAL_PRD
help
  This enables the opal-prd driver, a facility to run processor
  recovery diagnostics on OpenPower machines
+
+config HV_PERF_IMC_CTRS
+   bool "Hypervisor supplied In Memory Collection PMU events (Nest & Core)"
+   default y
+   depends on PERF_EVENTS && PPC_POWERNV
+   help
+ Enable access to hypervisor supplied in-memory collection counters
+ in perf. IMC counters are available from Power9 systems.
+
+  If unsure, select Y.
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..715e531 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_PPC_SCOM)+= opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
 obj-$(CONFIG_TRACEPOINTS)  += opal-tracepoints.o
 obj-$(CONFIG_OPAL_PRD) += opal-prd.o
+obj-$(CONFIG_HV_PERF_IMC_CTRS) += opal-imc.o
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
new file mode 100644
index 000..3a87000
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -0,0 +1,140 @@
+/*
+ * OPAL IMC interface detection driver
+ * Supported on POWERNV platform
+ *
+ * Copyright   (C) 2017 Madhavan Srinivasan, IBM Corporation.
+ * (C) 2017 Anju T Sudhakar, IBM Corporation.
+ * (C) 2017 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+
+/*
+ * imc_pmu_setup : Setup the IMC PMUs (children of "parent").
+ */
+static void __init imc_pmu_setup(struct device_node *parent)
+{
+   if (!parent)
+   return;
+}
+
+static int opal_imc_counters_probe(struct platform_device *pdev)
+{
+   struct device_node *imc_dev, *dn, *rm_node = NULL;
+   struct perchip_nest_info *pcni;
+   u32 pages, nest_offset, nest_size, chip_id;
+   int i = 0;
+   const __be32 *addrp;
+   u64 reg_addr, reg_size;
+
+   if (!pdev || !pdev->dev.of_node)
+   return -ENODEV;
+
+   /*
+* Check whether this is kdump kernel. If yes, just return.
+*/
+   if (is_kdump_kernel())
+   return -ENODEV;
+
+   imc_dev = pdev->dev.of_node;
+
+   /*
+* Nest counter data are saved in a reserved memory called HOMER.
+* "imc-nest-offset" identifies the counter data location within HOMER.
+* size : size of the entire nest-counters region
+*/
+   if (of_property_read_u32(imc_dev, "imc-nest-offset", &nest_offset))
+   goto err;
+
+   if (of_property_read_u32(imc_dev, "imc-nest-size", &nest_size))
+   goto err;
+
+   /* Sanity check */
+   if ((nest_size/PAGE_SIZE) > IMC_NEST_MAX_PAGES)
+   goto err;
+
+   /* Find the "HOMER region" for each chip */
+   rm_node = of_find_node_by_path("/reserved-memory");
+   if (!rm_node)
+   goto err;
+
+   /

[PATCH v7 03/10] powerpc/powernv: Detect supported IMC units and its events

2017-04-18 Thread Anju T Sudhakar

Parse device tree to detect IMC units. Traverse through each IMC unit
node to find supported events and corresponding unit/scale files (if any).

Here is the DTS file for reference:


https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts

The device tree for IMC counters starts at the node "imc-counters".
This node contains all the IMC PMU nodes and event nodes
for these IMC PMUs. The PMU nodes have an "events" property which has a
phandle value for the actual events node. The events are separated from
the PMU nodes to abstract out the common events. For example, PMU node
"mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since,
the events are common between these PMUs. These events have a different
prefix based on their relation to different PMUs, and hence, the PMU
nodes themselves contain an "events-prefix" property. The value for this
property concatenated to the event name, forms the actual event
name. Also, the PMU have a "reg" field as the base offset for the events
which belong to this PMU. This "reg" field is added to event's "reg" field 
in the "events" node, which gives us the location of the counter data. Kernel
code uses this offset as event configuration value.

Device tree parser code also looks for scale/unit property in the event
node and passes on the value as an event attr for perf interface to use
in the post processing by the perf tool. Some PMUs may have common scale
and unit properties which implies that all events supported by this PMU
inherit the scale and unit properties of the PMU itself. For those
events, we need to set the common unit and scale values.

For failure to initialize any unit or any event, disable that unit and
continue setting up the rest of them.

Signed-off-by: Hemant Kumar 
Signed-off-by: Anju T Sudhakar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/platforms/powernv/opal-imc.c | 413 ++
 1 file changed, 413 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index 3a87000..0ddaf7d 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -33,15 +33,428 @@
 #include 
 #include 
 
+u64 nest_max_offset;
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+static int imc_event_prop_update(char *name, struct imc_events *events)
+{
+   char *buf;
+
+   if (!events || !name)
+   return -EINVAL;
+
+   /* memory for content */
+   buf = kzalloc(IMC_MAX_NAME_VAL_LEN, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   events->ev_name = name;
+   events->ev_value = buf;
+   return 0;
+}
+
+static int imc_event_prop_str(struct property *pp, char *name,
+ struct imc_events *events)
+{
+   int ret;
+
+   ret = imc_event_prop_update(name, events);
+   if (ret)
+   return ret;
+
+   if (!pp->value || (strnlen(pp->value, pp->length) == pp->length) ||
+  (pp->length > IMC_MAX_NAME_VAL_LEN))
+   return -EINVAL;
+   strncpy(events->ev_value, (const char *)pp->value, pp->length);
+
+   return 0;
+}
+
+static int imc_event_prop_val(char *name, u32 val,
+ struct imc_events *events)
+{
+   int ret;
+
+   ret = imc_event_prop_update(name, events);
+   if (ret)
+   return ret;
+   snprintf(events->ev_value, IMC_MAX_NAME_VAL_LEN, "event=0x%x", val);
+
+   return 0;
+}
+
+static int set_event_property(struct property *pp, char *event_prop,
+ struct imc_events *events, char *ev_name)
+{
+   char *buf;
+   int ret;
+
+   buf = kzalloc(IMC_MAX_NAME_VAL_LEN, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   sprintf(buf, "%s.%s", ev_name, event_prop);
+   ret = imc_event_prop_str(pp, buf, events);
+   if (ret) {
+   if (events->ev_name)
+   kfree(events->ev_name);
+   if (events->ev_value)
+   kfree(events->ev_value);
+   }
+   return ret;
+}
+
+/*
+ * Updates the maximum offset for an event in the pmu with domain
+ * "pmu_domain".
+ */
+static void update_max_value(u32 value, int pmu_domain)
+{
+   switch (pmu_domain) {
+   case IMC_DOMAIN_NEST:
+   if (nest_max_offset < value)
+   nest_max_offset = value;
+   break;
+   default:
+   /* Unknown domain, return */
+   return;
+   }
+}
+
+/*
+ * imc_events_node_parser: Parse the event node "dev" and assign the parsed
+ * information to event "events".
+ *
+ * Parses the "reg", "scale" and "unit" properties of this event.
+ * "reg" gives us the event offset in the counter memory.
+ */
+static int imc_events_node_parser(struct device_node *dev,
+

[PATCH v7 9/10] powerpc/perf: Thread IMC PMU functions

2017-04-18 Thread Anju T Sudhakar

This patch adds the PMU functions required for event initialization,
read, update, add, del etc. for thread IMC PMU. Thread IMC PMUs are used
for per-task monitoring. 

For each CPU, a page of memory is allocated and is kept static i.e.,
these pages will exist till the machine shuts down. The base address of
this page is assigned to the ldbar of that cpu. As soon as we do that,
the thread IMC counters start running for that cpu and the data of these
counters are assigned to the page allocated. But we use this for
per-task monitoring. Whenever we start monitoring a task, the event is
added is onto the task. At that point, we read the initial value of the
event. Whenever, we stop monitoring the task, the final value is taken
and the difference is the event data.

Now, a task can move to a different cpu. Suppose a task X is moving from
cpu A to cpu B. When the task is scheduled out of A, we get an
event_del for A, and hence, the event data is updated. And, we stop
updating the X's event data. As soon as X moves on to B, event_add is
called for B, and we again update the event_data. And this is how it
keeps on updating the event data even when the task is scheduled on to
different cpus.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h|   5 +
 arch/powerpc/perf/imc-pmu.c   | 201 ++
 arch/powerpc/platforms/powernv/opal-imc.c |   3 +
 3 files changed, 209 insertions(+)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 6260e61..cc04712 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -42,6 +42,7 @@
  * IMC Core engine expects 8K bytes of memory for counter collection.
  */
 #define IMC_CORE_COUNTER_MEM   8192
+#define IMC_THREAD_COUNTER_MEM 8192
 
 /*
  *Compatbility macros for IMC devices
@@ -51,6 +52,9 @@
 #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
 #define IMC_DTB_THREAD_COMPAT  "ibm,imc-counters-thread"
 
+#define THREAD_IMC_LDBAR_MASK   0x0003e000
+#define THREAD_IMC_ENABLE   0x8000
+
 /*
  * Structure to hold per chip specific memory address
  * information for nest pmus. Nest Counter data are exported
@@ -110,4 +114,5 @@ extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 extern struct imc_pmu *core_imc_pmu;
 extern int __init init_imc_pmu(struct imc_events *events,int idx, struct 
imc_pmu *pmu_ptr);
 void core_imc_disable(void);
+void thread_imc_disable(void);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index ac69d81..b055748 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -38,6 +38,9 @@ static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES];
 static cpumask_t core_imc_cpumask;
 struct imc_pmu *core_imc_pmu;
 
+/* Maintains base address for all the cpus */
+static u64 per_cpu_add[NR_CPUS];
+
 /* Needed for sanity check */
 extern u64 nest_max_offset;
 extern u64 core_max_offset;
@@ -482,6 +485,56 @@ static int core_imc_event_init(struct perf_event *event)
return 0;
 }
 
+static int thread_imc_event_init(struct perf_event *event)
+{
+   struct task_struct *target;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* Sampling not supported */
+   if (event->hw.sample_period)
+   return -EINVAL;
+
+   event->hw.idx = -1;
+
+   /* Sanity check for config (event offset) */
+   if (event->attr.config > thread_max_offset)
+   return -EINVAL;
+
+   target = event->hw.target;
+
+   if (!target)
+   return -EINVAL;
+
+   event->pmu->task_ctx_nr = perf_sw_context;
+   return 0;
+}
+
+static void thread_imc_read_counter(struct perf_event *event)
+{
+   u64 *addr, data;
+   int cpu_id = smp_processor_id();
+
+   addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+   data = __be64_to_cpu(READ_ONCE(*addr));
+   local64_set(&event->hw.prev_count, data);
+}
+
+static void thread_imc_perf_event_update(struct perf_event *event)
+{
+   u64 counter_prev, counter_new, final_count, *addr;
+   int cpu_id = smp_processor_id();
+
+   addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+   counter_prev = local64_read(&event->hw.prev_count);
+   counter_new = __be64_to_cpu(READ_ONCE(*addr));
+   final_count = counter_new - counter_prev;
+
+   local64_set(&event->hw.prev_count, counter_new);
+   local64_add(final_count, &event->count);
+}
+
 static void imc_read_counter(struct perf_event *event)
 {
u64 *addr, data;
@@ -723,6 +776,84 @@ static int core_imc_event_add(struct perf_event *event, 
int flags)
 }
 
 
+static void thread_imc_event_start(struct perf_event *event, int flags)
+{
+   int rc;
+
+   /*
+* Core pmu unit

[PATCH v7 8/10] powerpc/powernv: Thread IMC events detection

2017-04-18 Thread Anju T Sudhakar

Patch adds support for detection of thread IMC events. It adds a new
domain IMC_DOMAIN_THREAD and it is determined with the help of the
compatibility string "ibm,imc-counters-thread" based on the IMC device
tree.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h|  2 ++
 arch/powerpc/perf/imc-pmu.c   |  1 +
 arch/powerpc/platforms/powernv/opal-imc.c | 18 +-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index bf5fb7c..6260e61 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -49,6 +49,7 @@
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
 #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
+#define IMC_DTB_THREAD_COMPAT  "ibm,imc-counters-thread"
 
 /*
  * Structure to hold per chip specific memory address
@@ -98,6 +99,7 @@ struct imc_pmu {
  */
 #define IMC_DOMAIN_NEST1
 #define IMC_DOMAIN_CORE2
+#define IMC_DOMAIN_THREAD  3
 #define IMC_DOMAIN_UNKNOWN -1
 
 #define IMC_COUNTER_ENABLE 1
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index aa9e8ba..ac69d81 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -41,6 +41,7 @@ struct imc_pmu *core_imc_pmu;
 /* Needed for sanity check */
 extern u64 nest_max_offset;
 extern u64 core_max_offset;
+extern u64 thread_max_offset;
 
 PMU_FORMAT_ATTR(event, "config:0-20");
 static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index 23507d7..940f6b9 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -35,6 +35,7 @@
 
 u64 nest_max_offset;
 u64 core_max_offset;
+u64 thread_max_offset;
 
 static int imc_event_prop_update(char *name, struct imc_events *events)
 {
@@ -119,6 +120,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (core_max_offset < value)
core_max_offset = value;
break;
+   case IMC_DOMAIN_THREAD:
+   if (thread_max_offset < value)
+   thread_max_offset = value;
+   break;
default:
/* Unknown domain, return */
return;
@@ -362,7 +367,7 @@ static struct imc_events *imc_events_setup(struct 
device_node *parent,
 /*
  * imc_pmu_create : Takes the parent device which is the pmu unit and a
  *  pmu_index as the inputs.
- * Allocates memory for the pmu, sets up its domain (NEST/CORE), and
+ * Allocates memory for the pmu, sets up its domain (NEST/CORE/THREAD), and
  * calls imc_events_setup() to allocate memory for the events supported
  * by this pmu. Assigns a name for the pmu. Calls imc_events_node_parser()
  * to setup the individual events.
@@ -483,6 +488,17 @@ static void __init imc_pmu_setup(struct device_node 
*parent)
return;
pmu_count++;
}
+   /*
+* Loop through the imc-counters tree for each compatible
+* "ibm,imc-counters-thread", and update "struct imc_pmu".
+*/
+   for_each_compatible_node(child, NULL, IMC_DTB_THREAD_COMPAT) {
+   domain = IMC_DOMAIN_THREAD;
+   rc = imc_pmu_create(child, pmu_count, domain);
+   if (rc)
+   return;
+   pmu_count++;
+   }
 }
 
 static int opal_imc_counters_probe(struct platform_device *pdev)
-- 
2.7.4

[PATCH v7 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support

2017-04-18 Thread Anju T Sudhakar

Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any
online CPU) from each chip for nest PMUs is designated to read counters.

On CPU hotplug, dying CPU is checked to see whether it is one of the
designated cpus, if yes, next online cpu from the same chip (for nest
units) is designated as new cpu to read counters. For this purpose, we
introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h |   4 +
 arch/powerpc/include/asm/opal-api.h|  13 +-
 arch/powerpc/include/asm/opal.h|  12 ++
 arch/powerpc/perf/imc-pmu.c| 250 -
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 include/linux/cpuhotplug.h |   1 +
 6 files changed, 275 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 6bbe184..1478d0f 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -92,6 +92,10 @@ struct imc_pmu {
 #define IMC_DOMAIN_NEST1
 #define IMC_DOMAIN_UNKNOWN -1
 
+#define IMC_COUNTER_ENABLE 1
+#define IMC_COUNTER_DISABLE0
+
+
 extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 extern int __init init_imc_pmu(struct imc_events *events,int idx, struct 
imc_pmu *pmu_ptr);
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index a0aa285..23fc51e9 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -168,7 +168,8 @@
 #define OPAL_INT_SET_MFRR  125
 #define OPAL_PCI_TCE_KILL  126
 #define OPAL_NMMU_SET_PTCR 127
-#define OPAL_LAST  127
+#define OPAL_NEST_IMC_COUNTERS_CONTROL 149
+#define OPAL_LAST  149
 
 /* Device tree flags */
 
@@ -928,6 +929,16 @@ enum {
OPAL_PCI_TCE_KILL_ALL,
 };
 
+/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */
+enum {
+   OPAL_NEST_IMC_PRODUCTION_MODE = 1,
+};
+
+enum {
+   OPAL_NEST_IMC_STOP,
+   OPAL_NEST_IMC_START,
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 1ff03a6..ffa4293 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -227,6 +227,18 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t 
kill_type,
  uint64_t dma_addr, uint32_t npages);
 int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
 
+/*
+ * OPAL_NEST_IMC_COUNTERS_CONTROL:
+ * mode -- Target mode for the microcode to operation, currently only
+ *"PRODUCTION_MODE" is supported, but in-plan to support other modes
+ *like "DEBUG" mode specific to nest units.
+ *
+ * value1, value2, value3 -- Based on mode parameter, input values for these
+ *  will differ. These are documented in detail in OPAL-API docs.
+ */
+int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
+   uint64_t value2, uint64_t value3);
+
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
   int depth, void *data);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 0dbab77..b86ef86 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -18,6 +18,11 @@
 
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+static cpumask_t nest_imc_cpumask;
+
+static atomic_t nest_events;
+/* Used to avoid races in calling enable/disable nest-pmu units*/
+static DEFINE_MUTEX(imc_nest_reserve);
 
 /* Needed for sanity check */
 extern u64 nest_max_offset;
@@ -33,6 +38,161 @@ static struct attribute_group imc_format_group = {
.attrs = imc_format_attrs,
 };
 
+/* Get the cpumask printed to a buffer "buf" */
+static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   cpumask_t *active_mask;
+
+   active_mask = &nest_imc_cpumask;
+   return cpumap_print_to_pagebuf(true, buf, active_mask);
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL);
+
+static struct attribute *imc_pmu_cpumask_attrs[] = {
+   &dev_attr_cpumask.attr,
+   NULL,
+};
+
+static struct attribute_group imc_pmu_cpumask_attr_group = {
+   .attrs = imc_pmu_cpumask_attrs,
+};
+
+/*
+ * nest_init : Initializes the nest imc engine for the current chip.
+ * by default the nest engine is disabled.
+ */
+static void nest_init(int *cpu_opal_rc)
+{
+

[PATCH v7 01/10] powerpc/powernv: Data structure and macros definitions for IMC

2017-04-18 Thread Anju T Sudhakar

From: Hemant Kumar 

Create a new header file to add the data structures and
macros needed for In-Memory Collection (IMC) counter support.

Signed-off-by: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/imc-pmu.h | 95 ++
 1 file changed, 95 insertions(+)
 create mode 100644 arch/powerpc/include/asm/imc-pmu.h

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
new file mode 100644
index 000..d0193c8
--- /dev/null
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -0,0 +1,95 @@
+#ifndef PPC_POWERNV_IMC_PMU_DEF_H
+#define PPC_POWERNV_IMC_PMU_DEF_H
+
+/*
+ * IMC Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2017 Madhavan Srinivasan, IBM Corporation.
+ *   (C) 2017 Anju T Sudhakar, IBM Corporation.
+ *   (C) 2017 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * For static allocation of some of the structures.
+ */
+#define IMC_MAX_CHIPS  32
+#define IMC_MAX_PMUS   32
+
+/*
+ * This macro is used for memory buffer allocation of
+ * event names and event string
+ */
+#define IMC_MAX_NAME_VAL_LEN   96
+
+/*
+ * Currently Microcode supports a max of 256KB of counter memory
+ * in the reserved memory region. Max pages to mmap (considering 4K PAGESIZE).
+ */
+#define IMC_NEST_MAX_PAGES 64
+
+/*
+ *Compatbility macros for IMC devices
+ */
+#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
+#define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
+
+/*
+ * Structure to hold per chip specific memory address
+ * information for nest pmus. Nest Counter data are exported
+ * in per-chip reserved memory region by the PORE Engine.
+ */
+struct perchip_nest_info {
+   u32 chip_id;
+   u64 pbase;
+   u64 vbase[IMC_NEST_MAX_PAGES];
+   u64 size;
+};
+
+/*
+ * Place holder for nest pmu events and values.
+ */
+struct imc_events {
+   char *ev_name;
+   char *ev_value;
+};
+
+#define IMC_FORMAT_ATTR0
+#define IMC_CPUMASK_ATTR   1
+#define IMC_EVENT_ATTR 2
+#define IMC_NULL_ATTR  3
+
+/*
+ * Device tree parser code detects IMC pmu support and
+ * registers new IMC pmus. This structure will
+ * hold the pmu functions and attrs for each imc pmu and
+ * will be referenced at the time of pmu registration.
+ */
+struct imc_pmu {
+   struct pmu pmu;
+   int domain;
+   /*
+* Attribute groups for the PMU. Slot 0 used for
+* format attribute, slot 1 used for cpusmask attribute,
+* slot 2 used for event attribute. Slot 3 keep as
+* NULL.
+*/
+   const struct attribute_group *attr_groups[4];
+};
+
+/*
+ * Domains for IMC PMUs
+ */
+#define IMC_DOMAIN_NEST1
+#define IMC_DOMAIN_UNKNOWN -1
+
+#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
-- 
2.7.4

[PATCH v7 00/10] IMC Instrumentation Support

2017-04-18 Thread Anju T Sudhakar

Power9 has In-Memory-Collection (IMC) infrastructure which contains
various Performance Monitoring Units (PMUs) at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by a Nest IMC microcode which runs
in the OCC (On-Chip Controller) complex. The microcode collects the
counter data and moves the nest IMC counter data to memory.

The Core and Thread IMC PMU counters are handled in the core. Core
level PMU counters give us the IMC counters' data per core and thread
level PMU counters give us the IMC counters' data per CPU thread.

This patchset enables the nest IMC, core IMC and thread IMC
PMUs and is based on the initial work done by Madhavan Srinivasan.
"Nest Instrumentation Support" :
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132078.html

v1 for this patchset can be found here :
https://lwn.net/Articles/705475/

Nest events:
Per-chip nest instrumentation provides various per-chip metrics
such as memory, powerbus, Xlink and Alink bandwidth.

Core events:
Per-core IMC instrumentation provides various per-core metrics
such as non-idle cycles, non-idle instructions, various cache and
memory related metrics etc.

Thread events:
All the events for thread level are same as core level with the
difference being in the domain. These are per-cpu metrics.

PMU Events' Information:
OPAL obtains the IMC PMU and event information from the IMC Catalog
and passes on to the kernel via the device tree. The events' information
contains :
 - Event name
 - Event Offset
 - Event description
and, maybe :
 - Event scale
 - Event unit

Some PMUs may have a common scale and unit values for all their
supported events. For those cases, the scale and unit properties for
those events must be inherited from the PMU.

The event offset in the memory is where the counter data gets
accumulated.

The OPAL-side patches are posted upstream :
https://lists.ozlabs.org/pipermail/skiboot/2017-April/006880.html

The kernel discovers the IMC counters information in the device tree
at the "imc-counters" device node which has a compatible field
"ibm,opal-in-memory-counters".

Parsing of the Events' information:
To parse the IMC PMUs and events information, the kernel has to
discover the "imc-counters" node and walk through the pmu and event
nodes.

Here is an excerpt of the dt showing the imc-counters with
mcs0 (nest), core and thread node:
https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts


/dts-v1/;

[...]

/dts-v1/;

/ {
name = "";
compatible = "ibm,opal-in-memory-counters";
#address-cells = <0x1>;
#size-cells = <0x1>;
imc-nest-offset = <0x32>;
imc-nest-size = <0x3>;
version-id = "";

NEST_MCS: nest-mcs-events {
#address-cells = <0x1>;
#size-cells = <0x1>;

event@0 {
event-name = "RRTO_QFULL_NO_DISP" ;
reg = <0x0 0x8>;
desc = "RRTO not dispatched in MCS0 due to capacity - 
pulses once for each time a valid RRTO op is not dispatched due to a command 
list full condition" ;
};
event@8 {
event-name = "WRTO_QFULL_NO_DISP" ;
reg = <0x8 0x8>;
desc = "WRTO not dispatched in MCS0 due to capacity - 
pulses once for each time a valid WRTO op is not dispatched due to a command 
list full condition" ;
};
[...]
mcs0 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS0_";
unit = "";
scale = "";
reg = <0x118 0x8>;
events = < &NEST_MCS >;
};

mcs1 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS1_";
unit = "";
scale = "";
reg = <0x198 0x8>;
events = < &NEST_MCS >;
};
[...]
CORE_EVENTS: core-events {
#address-cells = <0x1>;
#size-cells = <0x1>;

event@e0 {
event-name = "0THRD_NON_IDLE_PCYC" ;
reg = <0xe0 0x8>;
desc = "The number of processor cycles when all threads 
are idle" ;
};
event@120 {
event-name = "1THRD_NON_IDLE_PCYC" ;
reg = <0x120 0x8>;
desc = "The number of processor cycles when exactly one 
SMT thread is executing non-idle code" ;
};
[...]
   core {
compatible = "ibm,imc-counters-core";
events-prefix = "CPM_";
unit = "";
scale = "";
reg = <0x0 0x8>;
events = < &CORE_EVENTS >;
};

thread {
compatible

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Rob Herring

On Mon, Apr 17, 2017 at 7:32 PM, Tyrel Datwyler
 wrote:
> This patch introduces event tracepoints for tracking a device_nodes
> reference cycle as well as reconfig notifications generated in response
> to node/property manipulations.
>
> With the recent upstreaming of the refcount API several device_node
> underflows and leaks have come to my attention in the pseries (DLPAR) dynamic
> logical partitioning code (ie. POWER speak for hotplugging virtual and 
> physcial
> resources at runtime such as cpus or IOAs). These tracepoints provide a
> easy and quick mechanism for validating the reference counting of
> device_nodes during their lifetime.

Not really relevant for this patch, but since you are looking at
pseries and refcounting, the refcounting largely exists for pseries.
It's also hard to get right as this type of fix is fairly common. It's
now used for overlays, but we really probably only need to refcount
the overlays or changesets as a whole, not at a node level. If you
have any thoughts on how a different model of refcounting could work
for pseries, I'd like to discuss it.

> Further, when pseries lpars are migrated to a different machine we
> perform a live update of our device tree to bring it into alignment with the
> configuration of the new machine. The of_reconfig_notify trace point
> provides a mechanism that can be turned for debuging the device tree
> modifications with out having to build a custom kernel to get at the
> DEBUG code introduced by commit 00aa3720.
>
> The following trace events are provided: of_node_get, of_node_put,
> of_node_release, and of_reconfig_notify. These trace points require a kernel
> built with ftrace support to be enabled. In a typical environment where
> debugfs is mounted at /sys/kernel/debug the entire set of tracepoints
> can be set with the following:
>
>   echo "of:*" > /sys/kernel/debug/tracing/set_event
>
> or
>
>   echo 1 > /sys/kernel/debug/tracing/of/enable
>
> The following shows the trace point data from a DLPAR remove of a cpu
> from a pseries lpar:
>
> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
>
> cpuhp/23-147   [023]    128.324827:
> of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
> of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
> of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324831:
> of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439000:
> of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439002:
> of_reconfig_notify: action=DETACH_NODE, 
> dn->full_name=/cpus/PowerPC,POWER8@10,
> prop->name=null, old_prop->name=null
>drmgr-7284  [009]    128.439015:
> of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439016:
> of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
>
> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/of/dynamic.c  | 30 ++-
>  include/trace/events/of.h | 93 
> +++
>  2 files changed, 105 insertions(+), 18 deletions(-)
>  create mode 100644 include/trace/events/of.h
>
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index 888fdbc..85c0966 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -16,6 +16,9 @@
>
>  #include "of_private.h"
>
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  /**
>   * of_node_get() - Increment refcount of a node
>   * @node:  Node to inc refcount, NULL is supported to simplify writing of
> @@ -25,8 +28,10 @@
>   */
>  struct device_node *of_node_get(struct device_node *node)
>  {
> -   if (node)
> +   if (node) {
> kobject_get(&node->kobj);
> +   trace_of_node_get(refcount_read(&node->kobj.kref.refcount), 
> node->full_name);

Seems like there should be a kobj wrapper to read the refcount.

> +   }
> return node;
>  }
>  EXPORT_SYMBOL(of_node_get);
> @@ -38,8 +43,10 @@ struct device_node *of_node_get(struct device_node *node)
>   */
>  void of_node_put(struct device_node *node)
>  {
> -   if (node)
> +   if (node) {
> +   trace_of_node_put(refcount_read(&node->kobj.kref.refcount) - 
> 1, node->full_name);
> kobject_put(&node->kobj);
> +   }
>  }
>  EXPORT_SYMBOL(of_node_put);
>
> @@ -92,24 +99,9 @@ int of_reconfig_notifier_unregister(struct notifier_block 
> *nb)
>  int of_reconfig_notify(unsigned long action, struct of_reconfig_data *p)
>  {
> int rc;
> -#ifdef DEBUG
> -   struct of_reconfig_data *pr = p;
>
> -   switch (action) {
> -   case OF_RECONFIG_ATTACH_NODE:
> -   case OF_RECONFIG_DETACH_NODE:
> -   pr_debug("notify %-15s %s\n", action_names[action

[PATCH v2] powerpc/book3s: mce: Move add_taint() later in virtual mode.

2017-04-18 Thread Mahesh J Salgaonkar

From: Mahesh Salgaonkar 

machine_check_early() gets called in real mode. The very first time when
add_taint() is called, it prints a warning which ends up calling opal
call (that uses OPAL_CALL wrapper) for writing it to console. If we get a
very first machine check while we are in opal we are doomed. OPAL_CALL
overwrites the PACASAVEDMSR in r13 and in this case when we are done with
MCE handling the original opal call will use this new MSR on it's way
back to opal_return. This usually leads unexpected behaviour or kernel
to panic. Instead move the add_taint() call later in the virtual mode
where it is safe to call.

This is broken with current FW level. We got lucky so far for not getting
very first MCE hit while in OPAL. But easily reproducible on Mambo.
This should go to stable.

Fixes: 27ea2c420cad powerpc: Set the correct kernel taint on machine check 
errors.
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/mce.c   |2 ++
 arch/powerpc/kernel/traps.c |4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index a1475e6..b23b323 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -221,6 +221,8 @@ static void machine_check_process_queued_event(struct 
irq_work *work)
 {
int index;
 
+   add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
/*
 * For now just print it to console.
 * TODO: log this error event to FSP or nvram.
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index ff365f9..af97e81 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -306,8 +306,6 @@ long machine_check_early(struct pt_regs *regs)
 
__this_cpu_inc(irq_stat.mce_exceptions);
 
-   add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
-
if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
handled = cur_cpu_spec->machine_check_early(regs);
return handled;
@@ -741,6 +739,8 @@ void machine_check_exception(struct pt_regs *regs)
 
__this_cpu_inc(irq_stat.mce_exceptions);
 
+   add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
/* See if any machine dependent calls. In theory, we would want
 * to call the CPU first, and call the ppc_md. one if the CPU
 * one returns a positive number. However there is existing code

Re: [PATCH v2 1/2] fadump: reduce memory consumption for capture kernel

2017-04-18 Thread Michal Suchánek

On Mon, 17 Apr 2017 20:43:02 +0530
Hari Bathini  wrote:

> On Friday 14 April 2017 01:28 AM, Michal Suchánek wrote:
> > On Thu, 13 Apr 2017 01:59:13 +0530
> > Hari Bathini  wrote:
> >  
> >> On Friday 07 April 2017 07:16 PM, Michael Ellerman wrote:  
> >>> Hari Bathini  writes:  
>  On Friday 07 April 2017 07:24 AM, Michael Ellerman wrote:  
> > My preference would be that the fadump kernel "just works". If
> > it's using too much memory then the fadump kernel should do
> > whatever it needs to use less memory, eg. shrinking nr_cpu_ids
> > etc. Do we actually know *why* the fadump kernel is running out
> > of memory? Obviously large numbers of CPUs is one of the main
> > drivers (lots of stacks required). But other than that what is
> > causing the memory pressure? I would like some data on that
> > before we proceed.  
>  Almost the same amount of memory in comparison with the memory
>  required to boot the production kernel but that is unwarranted
>  for fadump (dump capture) kernel.  
> >>> That's not data! :)
> >>>
> >>> The dump kernel is booted with *much* less memory than the
> >>> production kernel (that's the whole issue!) and so it doesn't need
> >>> to create struct pages for all that memory, which means it should
> >>> need less memory.
> >>>
> >>> The vfs caches are also sized based on the available memory, so
> >>> they should also shrink in the dump kernel.
> >>>
> >>> I want some actual numbers on what's driving the memory usage.
> >>>
> >>> I tried some of these parameters to see how much memory they would
> >>> save:  
> >> Hi Michael,
> >>
> >> Tried to get data to show parameters like numa=off &
> >> cgroup_disable=memory matter too but parameter nr_cpus=1 is making
> >> parameters like numa=off, cgroup_disable=memory insignificant.
> >> Also, these parameters not using much of early memory reservations
> >> is making quantification of memory saved for each of them that
> >> much more difficult. But I would still like to argue that passing
> >> additional parameters to fadump is better than enforcing nr_cpus=1
> >> in the kernel for:
> >>
> >> a) With makedumpfile tool supporting multi-threading it would
> >> make sense to leave the choice of how many CPUs to have, to the
> >> user.
> >>
> >> b) Parameters like udev.children-max=2 can help to reduce the
> >> number of parallel executed events bringing down the memory
> >> pressure on fadump kernel (when it is booted with more than one
> >> CPU).
> >>
> >> c) Ease of maintainability is better (considering any new
> >> kernel features with some memory to save or stability to gain on
> >> disabling, possible platform supports) with append approach over
> >> enforcing these parameters
> >>in the kernel.
> >>
> >> d) It would give user the flexibility to disable unwanted
> >> kernel features in fadump kernel (numa=off,
> >> cgroup_disable=memory). For every feature enabled in the
> >> production kernel, fadump kernel will have the choice to
> >>opt out of it, provided there is such cmdline option.  
> > Hello,  
> 
> Hi Michal,
> 
> > can't the extra parameters be passed in the devicetree?  
> 
> Hmmm.. possible. Without change in f/w, this may not be guaranteed
> though.
> 
> > The docs say that the kernel can tell it's a fadump crash kernel by
> > checking the devicetree ibm,dump-kernel property. Is there any
> > reason  
> 
> This node is exported by firmware
> 
> > more (optional) properties cannot be added?  
> 
> Kernel change seems simple over f/w enhancement..

That certainly looks so when you are a kernel developer and can
implement the change yourself compared to convincing some firmware
developer that this feature makes sense.

On the other hand, the proposed kernel-only solution introduces
requirement that the maintainer does not like.

For the platform as a whole does it make more sense to add a hack to
the kernel or does it make sense to enhance the firmware to provide
more options for firmware-assisted dump?

Thanks

Michal

Re: [PATCH v11 4/7] powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned

2017-04-18 Thread Bjorn Helgaas

On Mon, Apr 17, 2017 at 4:36 PM, Bjorn Helgaas  wrote:
> From: Yongji Xie 
>
> This overrides pcibios_default_alignment() to set default alignment
> to PAGE_SIZE for all PCI devices on PowerNV platform. Thus sub-page
> BARs would not share a page and could be mapped into guest when VFIO
> passthrough them.
>
> Signed-off-by: Yongji Xie 
> Signed-off-by: Bjorn Helgaas 
> ---
>  arch/powerpc/include/asm/machdep.h|2 ++
>  arch/powerpc/kernel/pci-common.c  |8 
>  arch/powerpc/platforms/powernv/pci-ioda.c |7 +++
>  3 files changed, 17 insertions(+)

> +resource_size_t pcibios_default_alignment(struct pci_dev *pdev)
> +{
> +   if (ppc_md.pcibios_default_alignment)
> +   return ppc_md.pcibios_default_alignment(pdev);
> +
> +   return 0;
> +}
> +
>  #ifdef CONFIG_PCI_IOV
>  resource_size_t pcibios_iov_resource_alignment(struct pci_dev *pdev, int 
> resno)
>  {
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 6901a06da2f9..b724487cbd0f 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3287,6 +3287,11 @@ static void pnv_pci_setup_bridge(struct pci_bus *bus, 
> unsigned long type)
> }
>  }
>
> +static resource_size_t pnv_pci_default_alignment(struct pci_dev *pdev)
> +{
> +   return PAGE_SIZE;
> +}

Is it necessary that pcibios_default_alignment() take a pci_dev
pointer?  I'd like this better if it were:

  resource_size_t pcibios_default_alignment(void) { ... }

because the last patch relies on the assumption that all resources of
*all* devices will be realigned to the same alignment.

Bjorn

RE: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a function

2017-04-18 Thread David Laight

From: Naveen N. Rao
> Sent: 12 April 2017 11:58
...
> +kprobe_opcode_t *kprobe_lookup_name(const char *name)
> +{
...
> + char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
> + const char *modsym;
> + bool dot_appended = false;
> + if ((modsym = strchr(name, ':')) != NULL) {
> + modsym++;
> + if (*modsym != '\0' && *modsym != '.') {
> + /* Convert to  */
> + strncpy(dot_name, name, modsym - name);
> + dot_name[modsym - name] = '.';
> + dot_name[modsym - name + 1] = '\0';
> + strncat(dot_name, modsym,
> + sizeof(dot_name) - (modsym - name) - 2);
> + dot_appended = true;

If the ':' is 'a way down' name[] then although the strncpy() won't
overrun dot_name[] the rest of the code can.

The strncat() call is particularly borked.

David

Re: powerpc/64: Fix HMI exception on LE with CONFIG_RELOCATABLE=y

2017-04-18 Thread Michael Ellerman

On Tue, 2017-04-18 at 05:04:46 UTC, Michael Ellerman wrote:
> Prior to commit 2337d207288f ("powerpc/64: CONFIG_RELOCATABLE support for hmi
> interrupts"), the branch from hmi_exception_early() to 
> hmi_exception_realmode()
> was just a bl hmi_exception_realmode, which the linker would turn into a bl to
> the local entry point of hmi_exception_realmode. This was broken when
> CONFIG_RELOCATABLE=y because hmi_exception_realmode() is not in the low part 
> of
> the kernel text that is copied down to 0x0.
> 
> But in fixing that, we added a new bug on little endian kernels. Because the
> branch is now a bctrl when CONFIG_RELOCATABLE=y, we branch to the global entry
> point of hmi_exception_realmode(). The global entry point must be called with
> r12 containing the address of hmi_exception_realmode(), because it uses that
> value to calculate the TOC value (r2).
> 
> This may manifest as a checkstop, because we take a junk value from r12 which
> came from HSRR1, add a small constant to it and then use that as the TOC
> pointer. The HSRR1 value will have 0x9 as the top nibble, which puts it above
> RAM and somewhere in MMIO space.
> 
> Fix it by changing the BRANCH_LINK_TO_FAR() macro to always use r12 to load 
> the
> label we're branching to. This means r12 will be setup correctly on LE, fixing
> this bug, and r12 is also volatile across function calls on BE so it's a good
> choice anyway.
> 
> Fixes: 2337d207288f ("powerpc/64: CONFIG_RELOCATABLE support for hmi 
> interrupts")
> Reported-by: Mahesh Salgaonkar 
> Acked-by: Nicholas Piggin 
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/be5c5e843c4afa1c8397cb740b6032

cheers

Re: [v2] ppc64/kprobe: Fix oops when kprobed on 'stdu' instruction

2017-04-18 Thread Michael Ellerman

On Tue, 2017-04-11 at 05:08:13 UTC, Ravi Bangoria wrote:
> If we set a kprobe on a 'stdu' instruction on powerpc64, we see a kernel 
> OOPS:
> 
>   [ 1275.165932] Bad kernel stack pointer cd93c840 at c0009868
>   [ 1275.166378] Oops: Bad kernel stack pointer, sig: 6 [#1]
>   ...
>   GPR00: c01fcd93cb30 cd93c840 c15c5e00 cd93c840
>   ...
>   [ 1275.178305] NIP [c0009868] resume_kernel+0x2c/0x58
>   [ 1275.178594] LR [c0006208] program_check_common+0x108/0x180
> 
> Basically, on 64 bit system, when user probes on 'stdu' instruction,
> kernel does not emulate actual store in emulate_step itself because it
> may corrupt exception frame. So kernel does actual store operation in
> exception return code i.e. resume_kernel().
> 
> resume_kernel() loads the saved stack pointer from memory using lwz,
> effectively loading a corrupt (32bit) address, causing the kernel crash.
> 
> Fix this by loading the 64bit value instead.
> 
> Fixes: be96f63375a1 ("powerpc: Split out instruction analysis part of 
> emulate_step()") 
> Signed-off-by: Ravi Bangoria 
> Reviewed-by: Naveen N. Rao 
> Reviewed-by: Ananth N Mavinakayanahalli 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/9e1ba4f27f018742a1aa95d11e3510

cheers

Re: [PATCH] powerpc/32: Remove redundant test in transfer_to_handler()

2017-04-18 Thread Christophe LEROY




Le 14/04/2017 à 15:31, Christophe Leroy a écrit :

As stated in the comment on top of transfer_to_handler(),
all callers set cr0.eq if the exception occurred in kernel mode
(i.e. MSR:PR = 0)

Therefore, it is not needed to do the test again


Indeed, all callers set cr0.eq as required  except that some of
them just destroy it by inserting another test before calling 
transfer_to_handler().


So this patch cannot be applied as is at the time being.

Christophe



Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/entry_32.S | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index a38600949f3a..0651a7438fa6 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -141,7 +141,6 @@ transfer_to_handler:
stw r2,GPR2(r11)
stw r12,_NIP(r11)
stw r9,_MSR(r11)
-   andi.   r2,r9,MSR_PR
mfctr   r12
mfspr   r2,SPRN_XER
stw r12,_CTR(r11)

[PATCH] powerpc/xive: fix spelling mistake: "initialize"

2017-04-18 Thread Colin King

From: Colin Ian King 

Trivial fix to spelling mistake in pr_info message.

Signed-off-by: Colin Ian King 
---
 arch/powerpc/sysdev/xive/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index d9cd7f705f21..480364c1293c 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1287,7 +1287,7 @@ bool xive_core_init(const struct xive_ops *ops, void 
__iomem *area, u32 offset,
/* Get ready for interrupts */
xive_setup_cpu();
 
-   pr_info("Interrupt handling intialized with %s backend\n",
+   pr_info("Interrupt handling initialized with %s backend\n",
xive_ops->name);
pr_info("Using priority %d for all interrupts\n", max_prio);
 
-- 
2.11.0

Re: [PATCH] powerpc/32: Fix protection of kernel RAM after freeing unused memory

2017-04-18 Thread Christophe LEROY




Le 18/04/2017 à 08:40, Michael Ellerman a écrit :

Christophe Leroy  writes:



diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index a65c0b4c0669..d506bd61b629 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -323,6 +323,26 @@ get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t 
**ptep, pmd_t **pmdp)
 return(retval);
 }

+void remap_init_ram(void)
+{
+   unsigned long start = (unsigned long)_sinittext & PAGE_MASK;
+   unsigned long end = (unsigned long)_einittext;
+   unsigned long va;
+
+   for (va = start; va < end; va += PAGE_SIZE) {
+   pte_t *kpte;
+   pmd_t *kpmd;
+   pte_t pte = pfn_pte(__pa(va) >> PAGE_SHIFT, PAGE_KERNEL);
+
+   if (!get_pteptr(&init_mm, va, &kpte, &kpmd))
+   continue;
+   __set_pte_at(&init_mm, va, kpte, pte, 0);
+   wmb();
+   pte_unmap(kpte);
+   }
+   flush_tlb_kernel_range(start, end);
+}


Can we just use unmap_kernel_range() ?


We only want to remove the X bit.
I think unmap_kernel_range() will unmap the area, wheareas we want to 
keep it as part of the linear data area.


Christophe



Is this sufficient on all 32-bit PPC? (I have no idea)

cheers

[PATCH V2] powerpc/mm: Update mm context addr limit correctly.

2017-04-18 Thread Aneesh Kumar K.V

We added the addr < TASK_SIZE check to avoid updating addr_limit unnecessarily 
and
also to avoid calling slice_flush_segments on all the cpus. This had the side
effect of having different behaviour when using an addr value above TASK_SIZE
before updating addr_limit and after updating addr_limit as show by below
output:

requesting with hint 0x0
Addr returned 0x7fff893a
requesting with hint 0x
Addr returned 0x7fff891b  <= 1st return
requesting with hint 0x1
Addr returned 0x1
requesting with hint 0x
Addr returned 0x18941< second return

After fix:
requesting with hint 0x0
Addr returned 0x7fff8bc0
requesting with hint 0x
Addr returned 0x18bc8< 1st return
requesting with hint 0x1
Addr returned 0x1
requesting with hint 0x
Addr returned 0x18bc6< second return

This also bring the ABI in-line with what is proposed on x86. ie, to use
a hint addr above 128TB up to and including (2^64)-1 as an indication to search
the full address space.

Fixes: f4ea6dcb08ea2c (powerpc/mm: Enable mappings above 128TB)
Signed-off-by: Aneesh Kumar K.V 
---

Changes from V1:
* Update with correct Fixes: sha1
* Also add information about how hint addr is used 

 arch/powerpc/mm/mmap.c  | 6 --
 arch/powerpc/mm/slice.c | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index b2111baa0da6..355b6fe8a1e6 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -97,7 +97,8 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned 
long addr,
struct vm_area_struct *vma;
struct vm_unmapped_area_info info;
 
-   if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
+   if (unlikely(addr > mm->context.addr_limit &&
+mm->context.addr_limit != TASK_SIZE))
mm->context.addr_limit = TASK_SIZE;
 
if (len > mm->context.addr_limit - mmap_min_addr)
@@ -139,7 +140,8 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
unsigned long addr = addr0;
struct vm_unmapped_area_info info;
 
-   if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
+   if (unlikely(addr > mm->context.addr_limit &&
+mm->context.addr_limit != TASK_SIZE))
mm->context.addr_limit = TASK_SIZE;
 
/* requested length too big for entire address space */
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 251b6bae7023..2d2d9760d057 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -419,7 +419,8 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
/*
 * Check if we need to expland slice area.
 */
-   if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE)) {
+   if (unlikely(addr > mm->context.addr_limit &&
+mm->context.addr_limit != TASK_SIZE)) {
mm->context.addr_limit = TASK_SIZE;
on_each_cpu(slice_flush_segments, mm, 1);
}
-- 
2.7.4

[PATCH] powerpc/32: Move entry_32 functions just after HEAD functions.

2017-04-18 Thread Christophe Leroy

By default, PPC8xx PINs an ITLB on the first 8M of memory in order
to avoid any ITLB miss on kernel code.
However, with some debug functions like DEBUG_PAGEALLOC and
(soon to come) DEBUG_RODATA, the PINned TLB is invalidated soon
after startup so ITLB missed start to happen also on the kernel code.

In order to avoid any ITLB miss in a critical section, we have to
ensure that their is no page boundary crossed between the setup of
a new value in SRR0/SRR1 and the associated RFI. This cannot be done
easily if entry_32 functions sits in the middle of other .text
functions. By placing entry_32 just after the .head section (as already
done for entry_64 on PPC64), we can more easily ensure the issue
doesn't happen.

Ex: with this change, all entry_32 functions playing with srr0/srr1
appears between 0xc0002000 and 0xc0002fff on PPC_8xx.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Makefile| 2 +-
 arch/powerpc/kernel/Makefile | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 19b0d1a81959..8aa1dafb14eb 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -243,7 +243,7 @@ head-$(CONFIG_40x)  := 
arch/powerpc/kernel/head_40x.o
 head-$(CONFIG_44x) := arch/powerpc/kernel/head_44x.o
 head-$(CONFIG_FSL_BOOKE)   := arch/powerpc/kernel/head_fsl_booke.o
 
-head-$(CONFIG_PPC64)   += arch/powerpc/kernel/entry_64.o
+head-y += arch/powerpc/kernel/entry_$(BITS).o
 head-$(CONFIG_PPC_FPU) += arch/powerpc/kernel/fpu.o
 head-$(CONFIG_ALTIVEC) += arch/powerpc/kernel/vector.o
 head-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE)  += arch/powerpc/kernel/prom_init.o
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 811f441a125f..eb2db44c5fe9 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -90,7 +90,7 @@ extra-y   += vmlinux.lds
 
 obj-$(CONFIG_RELOCATABLE)  += reloc_$(BITS).o
 
-obj-$(CONFIG_PPC32)+= entry_32.o setup_32.o
+obj-$(CONFIG_PPC32)+= setup_32.o
 obj-$(CONFIG_PPC64)+= dma-iommu.o iommu.o
 obj-$(CONFIG_KGDB) += kgdb.o
 obj-$(CONFIG_BOOTX_TEXT)   += btext.o
@@ -154,7 +154,7 @@ UBSAN_SANITIZE_vdso.o := n
 
 extra-$(CONFIG_PPC_FPU)+= fpu.o
 extra-$(CONFIG_ALTIVEC)+= vector.o
-extra-$(CONFIG_PPC64)  += entry_64.o
+extra-y+= entry_$(BITS).o
 extra-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init.o
 
 extra-y+= systbl_chk.i
-- 
2.12.0

68 matches

Mail list logo