Re: [PATCH v2 5/5] powerpc: kprobes: emulate instructions on kprobe handler re-entry

2017-04-12 Thread Naveen N. Rao
On 2017/04/13 01:37PM, Masami Hiramatsu wrote:
> On Wed, 12 Apr 2017 16:28:28 +0530
> "Naveen N. Rao"  wrote:
> 
> > On kprobe handler re-entry, try to emulate the instruction rather than
> > single stepping always.
> > 
> 
> > As a related change, remove the duplicate saving of msr as that is
> > already done in set_current_kprobe()
> 
> If so, this part might be separated as a cleanup patch...

Sure, thanks for the review!

- Naveen



Re: [PATCH v2 4/5] powerpc: kprobes: factor out code to emulate instruction into a helper

2017-04-12 Thread Naveen N. Rao
On 2017/04/13 01:34PM, Masami Hiramatsu wrote:
> On Wed, 12 Apr 2017 16:28:27 +0530
> "Naveen N. Rao"  wrote:
> 
> > This helper will be used in a subsequent patch to emulate instructions
> > on re-entering the kprobe handler. No functional change.
> 
> In this case, please merge this patch into the next patch which
> actually uses the factored out function unless that changes
> too much.

Ok, will do.

Thanks,
Naveen



Re: [PATCH v2 3/5] powerpc: introduce a new helper to obtain function entry points

2017-04-12 Thread Naveen N. Rao
On 2017/04/13 01:32PM, Masami Hiramatsu wrote:
> On Wed, 12 Apr 2017 16:28:26 +0530
> "Naveen N. Rao"  wrote:
> 
> > kprobe_lookup_name() is specific to the kprobe subsystem and may not
> > always return the function entry point (in a subsequent patch for
> > KPROBES_ON_FTRACE).
> 
> If so, please move this patch into that series. It is hard to review
> patches that depend on another series.

:-)
This patch was originally the first in this series to try avoiding the 
need for converting kprobe_lookup_name() in optprobes.c. But, with the 
re-shuffle, this is more suitable in the other series. I will move it.

Thanks,
Naveen



Re: [PATCH v2 0/5] powerpc: a few kprobe fixes and refactoring

2017-04-12 Thread Naveen N. Rao
On 2017/04/13 12:02PM, Masami Hiramatsu wrote:
> Hi Naveen,

Hi Masami,

> 
> BTW, I saw you sent 3 different series, are there any
> conflict each other? or can we pick those independently?

Yes, all three of these patch series are based on powerpc/next, and they 
do depend on each other, as they are all about powerpc kprobes.

Patches 1 and 2 in this series touch generic kprobes bits and Michael 
was planning on putting those in a topic branch so that -tip can pull 
them too.

Apart from those two, your optprobes patch 3/5 
(https://patchwork.ozlabs.org/patch/749934/) also touches generic code, 
but it is needed for KPROBES_ON_FTRACE on powerpc. So, I've posted that 
as part of my series. We could probably also put that in the topic 
branch.


Thanks,
Naveen



Re: [patch 05/13] powerpc/smp: Replace open coded task affinity logic

2017-04-12 Thread Michael Ellerman
Thomas Gleixner  writes:

> Init task invokes smp_ops->setup_cpu() from smp_cpus_done(). Init task can
> run on any online CPU at this point, but the setup_cpu() callback is
> required to be invoked on the boot CPU. This is achieved by temporarily
> setting the affinity of the calling user space thread to the requested CPU
> and resetting it to the original affinity afterwards.
>
> That's racy vs. CPU hotplug and concurrent affinity settings for that
> thread resulting in code executing on the wrong CPU and overwriting the
> new affinity setting.
>
> That's actually not a problem in this context as neither CPU hotplug nor
> affinity settings can happen, but the access to task_struct::cpus_allowed
> is about to be restricted.
>
> Replace it with a call to work_on_cpu_safe() which achieves the same result.
>
> Signed-off-by: Thomas Gleixner 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/smp.c |   26 +++---
>  1 file changed, 11 insertions(+), 15 deletions(-)

LGTM.

Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [PATCH 1/2] powerpc/mm: fix up pgtable dump flags

2017-04-12 Thread Michael Ellerman
Oliver O'Halloran  writes:

> On Wed, Apr 12, 2017 at 4:52 PM, Michael Ellerman  wrote:
>> Rashmica Gupta  writes:
>>
>>> On 31/03/17 12:37, Oliver O'Halloran wrote:
 On Book3s we have two PTE flags used to mark cache-inhibited mappings:
 _PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page
 table dumper only looks at the generic _PAGE_NO_CACHE which is
 defined to be _PAGE_TOLERANT. This patch modifies the dumper so
 both flags are shown in the dump.

 Cc: Rashmica Gupta 
 Signed-off-by: Oliver O'Halloran 
>>
>>> Should we also add in _PAGE_SAO  that is in Book3s?
>>
>> I don't think we ever expect to see it in the kernel page tables. But if
>> we did that would be "interesting".
>>
>> I've forgotten what the code does with unknown bits, does it already
>> print them in some way?
>
> Currently it just traverses the list of known bits and prints out a
> message for each. Printing any unknown bits is probably a good idea.
> I'll send another patch to add that though and leave this one as-is.

Thanks.

cheers


Re: [PATCH 3/9] powerpc/mm: Add _PAGE_DEVMAP for ppc64.

2017-04-12 Thread Aneesh Kumar K.V
Oliver O'Halloran  writes:

> From: "Aneesh Kumar K.V" 
>
> Add a _PAGE_DEVMAP bit for PTE and DAX PMD entires. PowerPC doesn't
> currently support PUD faults so we haven't extended it to the PUD
> level.
>
> Cc: Aneesh Kumar K.V 
> Signed-off-by: Oliver O'Halloran 


A few changes would be needed. We now need to make sure a devmap
pmd entry is not confused with THP, i.e.,

we should compare against _PAGE_PTE and _PAGE_DEVMAP in
pmd_trans_huge(). Hash already has one bit we use to differentiate
between hugetlb and THP. Maybe we can generalize this and come up with a
way to differentiate THP, hugetlb and pmd devmap entries?


Also, I don't see you handling get_user_pages_fast()?


-aneesh



[RFC][PATCH] powerpc/mm: convert tlbie to tlbiel when no batch is active

2017-04-12 Thread Balbir Singh
In hpte_need_flush(), when batch->active is false, do the checks that
__flush_tlb_pending() does and use a local flush when one will do.
I've verified the changes with tlbie tracing and now see local flushes
where applicable, and I've also run some basic LTP test cases on top of
these changes on a powernv machine.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/mm/tlb_hash64.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
index 4517aa4..1268e3a 100644
--- a/arch/powerpc/mm/tlb_hash64.c
+++ b/arch/powerpc/mm/tlb_hash64.c
@@ -93,12 +93,17 @@ void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 
/*
 * Check if we have an active batch on this CPU. If not, just
-* flush now and return. For now, we don global invalidates
-* in that case, might be worth testing the mm cpu mask though
-* and decide to use local invalidates instead...
+* flush now and return.
 */
if (!batch->active) {
-   flush_hash_page(vpn, rpte, psize, ssize, 0);
+   const struct cpumask *tmp;
+   int local = 0;
+
+   tmp = cpumask_of(smp_processor_id());
+   if (cpumask_equal(mm_cpumask(mm), tmp))
+   local = 1;
+
+   flush_hash_page(vpn, rpte, psize, ssize, local);
put_cpu_var(ppc64_tlb_batch);
return;
}
-- 
2.9.3



Re: [PATCH v2 5/5] powerpc: kprobes: emulate instructions on kprobe handler re-entry

2017-04-12 Thread Masami Hiramatsu
On Wed, 12 Apr 2017 16:28:28 +0530
"Naveen N. Rao"  wrote:

> On kprobe handler re-entry, try to emulate the instruction rather than
> single stepping always.
> 

> As a related change, remove the duplicate saving of msr as that is
> already done in set_current_kprobe()

If so, this part might be separated as a cleanup patch...

Thanks,

> 
> Acked-by: Ananth N Mavinakayanahalli 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/kprobes.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 8b48f7d046bd..005bd4a75902 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -273,10 +273,17 @@ int __kprobes kprobe_handler(struct pt_regs *regs)
>*/
>   save_previous_kprobe(kcb);
>   set_current_kprobe(p, regs, kcb);
> - kcb->kprobe_saved_msr = regs->msr;
>   kprobes_inc_nmissed_count(p);
>   prepare_singlestep(p, regs);
>   kcb->kprobe_status = KPROBE_REENTER;
> + if (p->ainsn.boostable >= 0) {
> + ret = try_to_emulate(p, regs);
> +
> + if (ret > 0) {
> + restore_previous_kprobe(kcb);
> + return 1;
> + }
> + }
>   return 1;
>   } else {
>   if (*addr != BREAKPOINT_INSTRUCTION) {
> -- 
> 2.12.1
> 


-- 
Masami Hiramatsu 


Re: [PATCH v2 4/5] powerpc: kprobes: factor out code to emulate instruction into a helper

2017-04-12 Thread Masami Hiramatsu
On Wed, 12 Apr 2017 16:28:27 +0530
"Naveen N. Rao"  wrote:

> This helper will be used in a subsequent patch to emulate instructions
> on re-entering the kprobe handler. No functional change.

In this case, please merge this patch into the next patch which
actually uses the factored out function unless that changes
too much.

Thank you,

> 
> Acked-by: Ananth N Mavinakayanahalli 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/kprobes.c | 52 ++-
>  1 file changed, 31 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 0732a0291ace..8b48f7d046bd 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -207,6 +207,35 @@ void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
>   regs->link = (unsigned long)kretprobe_trampoline;
>  }
>  
> +int __kprobes try_to_emulate(struct kprobe *p, struct pt_regs *regs)
> +{
> + int ret;
> + unsigned int insn = *p->ainsn.insn;
> +
> + /* regs->nip is also adjusted if emulate_step returns 1 */
> + ret = emulate_step(regs, insn);
> + if (ret > 0) {
> + /*
> +  * Once this instruction has been boosted
> +  * successfully, set the boostable flag
> +  */
> + if (unlikely(p->ainsn.boostable == 0))
> + p->ainsn.boostable = 1;
> + } else if (ret < 0) {
> + /*
> +  * We don't allow kprobes on mtmsr(d)/rfi(d), etc.
> +  * So, we should never get here... but, its still
> +  * good to catch them, just in case...
> +  */
> + printk("Can't step on instruction %x\n", insn);
> + BUG();
> + } else if (ret == 0)
> + /* This instruction can't be boosted */
> + p->ainsn.boostable = -1;
> +
> + return ret;
> +}
> +
>  int __kprobes kprobe_handler(struct pt_regs *regs)
>  {
>   struct kprobe *p;
> @@ -302,18 +331,9 @@ int __kprobes kprobe_handler(struct pt_regs *regs)
>  
>  ss_probe:
>   if (p->ainsn.boostable >= 0) {
> - unsigned int insn = *p->ainsn.insn;
> + ret = try_to_emulate(p, regs);
>  
> - /* regs->nip is also adjusted if emulate_step returns 1 */
> - ret = emulate_step(regs, insn);
>   if (ret > 0) {
> - /*
> -  * Once this instruction has been boosted
> -  * successfully, set the boostable flag
> -  */
> - if (unlikely(p->ainsn.boostable == 0))
> - p->ainsn.boostable = 1;
> -
>   if (p->post_handler)
>   p->post_handler(p, regs, 0);
>  
> @@ -321,17 +341,7 @@ int __kprobes kprobe_handler(struct pt_regs *regs)
>   reset_current_kprobe();
>   preempt_enable_no_resched();
>   return 1;
> - } else if (ret < 0) {
> - /*
> -  * We don't allow kprobes on mtmsr(d)/rfi(d), etc.
> -  * So, we should never get here... but, its still
> -  * good to catch them, just in case...
> -  */
> - printk("Can't step on instruction %x\n", insn);
> - BUG();
> - } else if (ret == 0)
> - /* This instruction can't be boosted */
> - p->ainsn.boostable = -1;
> + }
>   }
>   prepare_singlestep(p, regs);
>   kcb->kprobe_status = KPROBE_HIT_SS;
> -- 
> 2.12.1
> 


-- 
Masami Hiramatsu 


Re: [PATCH v2 3/5] powerpc: introduce a new helper to obtain function entry points

2017-04-12 Thread Masami Hiramatsu
On Wed, 12 Apr 2017 16:28:26 +0530
"Naveen N. Rao"  wrote:

> kprobe_lookup_name() is specific to the kprobe subsystem and may not
> always return the function entry point (in a subsequent patch for
> KPROBES_ON_FTRACE).

If so, please move this patch into that series. It is hard to review
patches that depend on another series.

Thank you,

> For looking up function entry points, introduce a
> separate helper and use the same in optprobes.c
> 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/include/asm/code-patching.h | 37
>  arch/powerpc/kernel/optprobes.c  |  6 +++---
>  2 files changed, 40 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index 8ab937771068..3e994f404434 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -12,6 +12,8 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  /* Flags for create_branch:
>   * "b"   == create_branch(addr, target, 0);
> @@ -99,6 +101,41 @@ static inline unsigned long ppc_global_function_entry(void *func)
>  #endif
>  }
>  
> +/*
> + * Wrapper around kallsyms_lookup() to return function entry address:
> + * - For ABIv1, we lookup the dot variant.
> + * - For ABIv2, we return the local entry point.
> + */
> +static inline unsigned long ppc_kallsyms_lookup_name(const char *name)
> +{
> + unsigned long addr;
> +#ifdef PPC64_ELF_ABI_v1
> + /* check for dot variant */
> + char dot_name[1 + KSYM_NAME_LEN];
> + bool dot_appended = false;
> + if (name[0] != '.') {
> + dot_name[0] = '.';
> + dot_name[1] = '\0';
> + strncat(dot_name, name, KSYM_NAME_LEN - 2);
> + dot_appended = true;
> + } else {
> + dot_name[0] = '\0';
> + strncat(dot_name, name, KSYM_NAME_LEN - 1);
> + }
> + addr = kallsyms_lookup_name(dot_name);
> + if (!addr && dot_appended)
> + /* Let's try the original non-dot symbol lookup */
> + addr = kallsyms_lookup_name(name);
> +#elif defined(PPC64_ELF_ABI_v2)
> + addr = kallsyms_lookup_name(name);
> + if (addr)
> + addr = ppc_function_entry((void *)addr);
> +#else
> + addr = kallsyms_lookup_name(name);
> +#endif
> + return addr;
> +}
> +
>  #ifdef CONFIG_PPC64
>  /*
>   * Some instruction encodings commonly used in dynamic ftracing
> diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
> index ce81a322251c..ec60ed0d4aad 100644
> --- a/arch/powerpc/kernel/optprobes.c
> +++ b/arch/powerpc/kernel/optprobes.c
> @@ -243,10 +243,10 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)
>   /*
>* 2. branch to optimized_callback() and emulate_step()
>*/
> - op_callback_addr = kprobe_lookup_name("optimized_callback", 0);
> - emulate_step_addr = kprobe_lookup_name("emulate_step", 0);
> + op_callback_addr = (kprobe_opcode_t *)ppc_kallsyms_lookup_name("optimized_callback");
> + emulate_step_addr = (kprobe_opcode_t *)ppc_kallsyms_lookup_name("emulate_step");
>   if (!op_callback_addr || !emulate_step_addr) {
> - WARN(1, "kprobe_lookup_name() failed\n");
> - WARN(1, "kprobe_lookup_name() failed\n");
> + WARN(1, "Unable to lookup optimized_callback()/emulate_step()\n");
>   goto error;
>   }
>  
> -- 
> 2.12.1
> 


-- 
Masami Hiramatsu 


Re: [PATCH v2 2/5] powerpc: kprobes: fix handling of function offsets on ABIv2

2017-04-12 Thread Masami Hiramatsu
On Wed, 12 Apr 2017 16:28:25 +0530
"Naveen N. Rao"  wrote:

> commit 239aeba76409 ("perf powerpc: Fix kprobe and kretprobe handling
> with kallsyms on ppc64le") changed how we use the offset field in struct
> kprobe on ABIv2. perf now offsets from the GEP (Global entry point) if an
> offset is specified and otherwise chooses the LEP (Local entry point).
> 
> Fix the same in kernel for kprobe API users. We do this by extending
> kprobe_lookup_name() to accept an additional parameter to indicate the
> offset specified with the kprobe registration. If offset is 0, we return
> the local function entry and return the global entry point otherwise.
> 
> With:
>   # cd /sys/kernel/debug/tracing/
>   # echo "p _do_fork" >> kprobe_events
>   # echo "p _do_fork+0x10" >> kprobe_events
> 
> before this patch:
>   # cat ../kprobes/list
>   c00d0748  k  _do_fork+0x8[DISABLED]
>   c00d0758  k  _do_fork+0x18[DISABLED]
>   c00412b0  k  kretprobe_trampoline+0x0[OPTIMIZED]
> 
> and after:
>   # cat ../kprobes/list
>   c00d04c8  k  _do_fork+0x8[DISABLED]
>   c00d04d0  k  _do_fork+0x10[DISABLED]
>   c00412b0  k  kretprobe_trampoline+0x0[OPTIMIZED]
> 
> Acked-by: Ananth N Mavinakayanahalli 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/kprobes.c   | 4 ++--
>  arch/powerpc/kernel/optprobes.c | 4 ++--
>  include/linux/kprobes.h | 2 +-
>  kernel/kprobes.c| 7 ---
>  4 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index a7aa7394954d..0732a0291ace 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -42,14 +42,14 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>  
>  struct kretprobe_blackpoint kretprobe_blacklist[] = {{NULL, NULL}};
>  
> -kprobe_opcode_t *kprobe_lookup_name(const char *name)
> +kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)

Hmm, if we do this change, it is natural that kprobe_lookup_name()
returns the address + offset.

Thank you,



-- 
Masami Hiramatsu 


Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv

2017-04-12 Thread Oliver O'Halloran
On Wed, Apr 12, 2017 at 11:53 AM, Balbir Singh  wrote:
> On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
>> From: Rashmica Gupta 
>>
>> Adds support for removing bolted (i.e. kernel linear mapping) mappings on
>> powernv. This is needed to support memory hot unplug operations which
>> are required for the teardown of DAX/PMEM devices.
>>
>> Cc: Rashmica Gupta 
>> Cc: Anton Blanchard 
>> Signed-off-by: Oliver O'Halloran 
>> ---
>> Could the original author of this add their S-o-b? I pulled it out of
>> Rashmica's memtrace patch, but I remember someone saying Anton wrote
>> it originally.
>> ---
>>  arch/powerpc/mm/hash_native_64.c | 31 +++
>>  1 file changed, 31 insertions(+)
>>
>> diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
>> index 65bb8f33b399..9ba91d4905a4 100644
>> --- a/arch/powerpc/mm/hash_native_64.c
>> +++ b/arch/powerpc/mm/hash_native_64.c
>> @@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
>>   tlbie(vpn, psize, psize, ssize, 0);
>>  }
>>
>> +/*
>> + * Remove a bolted kernel entry. Memory hotplug uses this.
>> + *
>> + * No need to lock here because we should be the only user.
>
> As long as this is after the necessary isolation and is called from
> arch_remove_memory(), I think we should be fine
>
>> + */
>> +static int native_hpte_removebolted(unsigned long ea, int psize, int ssize)
>> +{
>> + unsigned long vpn;
>> + unsigned long vsid;
>> + long slot;
>> + struct hash_pte *hptep;
>> +
>> + vsid = get_kernel_vsid(ea, ssize);
>> + vpn = hpt_vpn(ea, vsid, ssize);
>> +
>> + slot = native_hpte_find(vpn, psize, ssize);
>> + if (slot == -1)
>> + return -ENOENT;
>
> If slot == -1, it means someone else removed the HPTE entry? Are we racing?
> I suspect we should never hit this situation during hotunplug, specifically
> since this is bolted.

Or the slot was never populated in the first place. I'd rather keep
the current behaviour since it aligns with the behaviour of
pSeries_lpar_hpte_removebolted and we might hit these situations in
the future if the sub-section hotplug patches are merged (big if...).

>
>> +
>> + hptep = htab_address + slot;
>> +
>> + /* Invalidate the hpte */
>> + hptep->v = 0;
>
> Under DEBUG or otherwise, I would add more checks like
>
> 1. was hpte_v & HPTE_V_VALID and BOLTED set? If not, we've already invalidated
> that hpte and we can skip the tlbie. Since this was bolted you might be right
> that it is always valid and bolted

A VM_WARN_ON() if the bolted bit is clear might be appropriate. We
don't need to check the valid bit since native_hpte_find() will fail
if it's cleared.

>
>> +
>> + /* Invalidate the TLB */
>> + tlbie(vpn, psize, psize, ssize, 0);
>
> The API also does not clear linear_map_hash_slots[] under DEBUG_PAGEALLOC

I'm not sure what API you're referring to here. The tracking for
linear_map_hash_slots[] is agnostic of mmu_hash_ops so we shouldn't be
touching it here. It also looks like DEBUG_PAGEALLOC is a bit broken
with hotplugged memory anyway so I think that's a fix for a different
patch.

>
>> + return 0;
>> +}
>> +
>> +
>
> Balbir Singh.


Re: [PATCH 3/3] powernv:idle: Set LPCR_UPRT on wakeup from deep-stop

2017-04-12 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 09:28 +0530, Aneesh Kumar K.V wrote:
> >   #endif
> >    mtctr   r12
> >    bctrl
> > +/*
> > + * cur_cpu_spec->cpu_restore would restore LPCR to a
> > + * sane value that is set at early boot time,
> > + * thereby clearing LPCR_UPRT.
> > + * LPCR_UPRT is required if we are running in Radix mode.
> > + * Set it here if that be the case.
> > + */
> > +BEGIN_MMU_FTR_SECTION
> > + mfspr   r3, SPRN_LPCR
> > + LOAD_REG_IMMEDIATE(r4, LPCR_UPRT)
> > + or  r3, r3, r4
> > + mtspr   SPRN_LPCR, r3
> > +END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)

We are probably better off saving the value somewhere during boot
and just "blasting" it whole back.

Cheers
Ben.


Re: [PATCH 3/3] powernv:idle: Set LPCR_UPRT on wakeup from deep-stop

2017-04-12 Thread Aneesh Kumar K.V
"Gautham R. Shenoy"  writes:

> From: "Gautham R. Shenoy" 
>
> On wakeup from a deep-stop used for CPU-Hotplug, we invoke
> cur_cpu_spec->cpu_restore() which would set sane default values to
> various SPRs including LPCR.
>
> On POWER9, the cpu_restore_power9() call would restore LPCR to a
> sane value that is set at early boot time, thereby clearing LPCR_UPRT.
>
> However, LPCR_UPRT is required to be set if we are running in Radix
> mode. If this is not set we will end up with a crash when we enable
> IR,DR.
>
> To fix this, after returning from cur_cpu_spec->cpu_restore() in the
> idle exit path, set LPCR_UPRT if we are running in Radix mode.
>
> Cc: Aneesh Kumar K.V 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/kernel/idle_book3s.S | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
> index 6a9bd28..39a9b63 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -804,6 +804,19 @@ no_segments:
>  #endif
>   mtctr   r12
>   bctrl
> +/*
> + * cur_cpu_spec->cpu_restore would restore LPCR to a
> + * sane value that is set at early boot time,
> + * thereby clearing LPCR_UPRT.
> + * LPCR_UPRT is required if we are running in Radix mode.
> + * Set it here if that be the case.
> + */
> +BEGIN_MMU_FTR_SECTION
> + mfspr   r3, SPRN_LPCR
> + LOAD_REG_IMMEDIATE(r4, LPCR_UPRT)
> + or  r3, r3, r4
> + mtspr   SPRN_LPCR, r3
> +END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)

What about LPCR_HR ?

>  
>  hypervisor_state_restored:
>  
> -- 
> 1.9.4

-aneesh



Re: [PATCH v4 0/5] perf report: Show branch type

2017-04-12 Thread Jin, Yao



On 4/13/2017 10:00 AM, Jin, Yao wrote:

On 4/12/2017 6:58 PM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:01AM +0800, Jin Yao wrote:

SNIP


3. Use 2 bits in perf_branch_entry for a "cross" metrics checking
for branch cross 4K or 2M area. It's an approximate computing
for checking if the branch cross 4K page or 2MB page.

For example:

perf record -g --branch-filter any,save_type 

perf report --stdio

  JCC forward:  27.7%
 JCC backward:   9.8%
  JMP:   0.0%
  IND_JMP:   6.5%
 CALL:  26.6%
 IND_CALL:   0.0%
  RET:  29.3%
 IRET:   0.0%
 CROSS_4K:   0.0%
 CROSS_2M:  14.3%

got mangled perf report --stdio output for:


[root@ibm-x3650m4-02 perf]# ./perf record -j any,save_type kill
kill: not enough arguments
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data (18 samples) ]

[root@ibm-x3650m4-02 perf]# ./perf report --stdio -f | head -30
# To display the perf.data header info, please use 
--header/--header-only options.

#
#
# Total Lost Samples: 0
#
# Samples: 253  of event 'cycles'
# Event count (approx.): 253
#
# Overhead  Command  Source Shared Object  Source 
SymbolTarget 
SymbolBasic Block Cycles
#   ...   
... 
...  ..

#
  8.30%  perf
Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all.constprop.17  
[k] native_write_msr -

  7.91%  perf
Um  [kernel.vmlinux]  [k] intel_pmu_lbr_enable_all 
[k] __intel_pmu_enable_all.constprop.17  -

  7.91%  perf
Um  [kernel.vmlinux]  [k] native_write_msr 
[k] intel_pmu_lbr_enable_all -
  6.32%  kill libc-2.24.so  [.] 
_dl_addr [.] 
_dl_addr -

  5.93%  perf
Um  [kernel.vmlinux]  [k] perf_iterate_ctx 
[k] perf_iterate_ctx -
  2.77%  kill libc-2.24.so  [.] 
malloc   [.] 
malloc   -
  1.98%  kill libc-2.24.so  [.] 
_int_malloc  [.] 
_int_malloc  -
  1.58%  kill [kernel.vmlinux]  [k] 
__rb_insert_augmented[k] 
__rb_insert_augmented-

  1.58%  perf
Um  [kernel.vmlinux]  [k] perf_event_exec  
[k] perf_event_exec  -
  1.19%  kill [kernel.vmlinux]  [k] 
anon_vma_interval_tree_insert[k] 
anon_vma_interval_tree_insert-
  1.19%  kill [kernel.vmlinux]  [k] 
free_pgd_range   [k] 
free_pgd_range   -
  1.19%  kill [kernel.vmlinux]  [k] 
n_tty_write  [k] 
n_tty_write  -

  1.19%  perf
Um  [kernel.vmlinux]  [k] native_sched_clock   
[k] sched_clock  -

...
SNIP


jirka


Sorry, I looked at this issue at midnight in Shanghai and misunderstood 
the above output as just a mail formatting issue. Sorry about that.


Now I have rechecked the output, and yes, the perf report output is 
mangled. But my patch doesn't touch the associated code.


Anyway, I removed my patches, pulled the latest updates from the 
perf/core branch, and ran tests to check whether it's a regression. 
I tested on both HSW and SKL.


1. On HSW.

root@hsw:/tmp# perf record -j any kill
.. /* SNIP */
For more details see kill(1).
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (9 samples) ]

root@hsw:/tmp# perf report --stdio
# To display the perf.data header info, please use 
--header/--header-only options.

#
#
# Total Lost Samples: 0
#
# Samples: 144  of event 'cycles'
# Event count (approx.): 144
#
# Overhead  Command  Source Shared Object  Source 
SymbolTarget SymbolBasic Block 
Cycles
#   ...   
...  ... 
..

#
10.42%  kill libc-2.23.so  [.] 
read_alias_file  [.] read_alias_file  -
 9.72%  kill [kernel.vmlinux]  [k] 
update_load_avg  [k] update_load_avg  -

 9.03%  perf
Um  [unknown] [k]  [k] 
 -
 8.33%  kill libc-2.23.so  [.] 
_int_malloc  [.] _int_malloc  -

.. /* SNIP */
 0.69%  kill [kernel.vmlinux]  [k] 
_raw_spin_lock   [k] unmap_page_range -

 0.69%  perf
Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all   [k] 
native_write_msr -

 0.69%  

Re: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a function

2017-04-12 Thread Masami Hiramatsu
On Wed, 12 Apr 2017 16:28:24 +0530
"Naveen N. Rao"  wrote:

> The macro is now pretty long and ugly on powerpc. In the light of
> further changes needed here, convert it to a __weak variant to be
> overridden with a nicer looking function.

Looks good to me.

Acked-by: Masami Hiramatsu 

Thanks!

> 
> Suggested-by: Masami Hiramatsu 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/include/asm/kprobes.h | 53 --
>  arch/powerpc/kernel/kprobes.c  | 58 ++
>  arch/powerpc/kernel/optprobes.c|  4 +--
>  include/linux/kprobes.h|  1 +
>  kernel/kprobes.c   | 20 ++---
>  5 files changed, 69 insertions(+), 67 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
> index 0503c98b2117..a843884aafaf 100644
> --- a/arch/powerpc/include/asm/kprobes.h
> +++ b/arch/powerpc/include/asm/kprobes.h
> @@ -61,59 +61,6 @@ extern kprobe_opcode_t optprobe_template_end[];
>  #define MAX_OPTINSN_SIZE (optprobe_template_end - optprobe_template_entry)
>  #define RELATIVEJUMP_SIZEsizeof(kprobe_opcode_t) /* 4 bytes */
>  
> -#ifdef PPC64_ELF_ABI_v2
> -/* PPC64 ABIv2 needs local entry point */
> -#define kprobe_lookup_name(name, addr)   \
> -{\
> - addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);   \
> - if (addr)   \
> - addr = (kprobe_opcode_t *)ppc_function_entry(addr); \
> -}
> -#elif defined(PPC64_ELF_ABI_v1)
> -/*
> - * 64bit powerpc ABIv1 uses function descriptors:
> - * - Check for the dot variant of the symbol first.
> - * - If that fails, try looking up the symbol provided.
> - *
> - * This ensures we always get to the actual symbol and not the descriptor.
> - * Also handle  format.
> - */
> -#define kprobe_lookup_name(name, addr)   \
> -{\
> - char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN]; \
> - const char *modsym; \
> - bool dot_appended = false;  \
> - if ((modsym = strchr(name, ':')) != NULL) { \
> - modsym++;   \
> - if (*modsym != '\0' && *modsym != '.') {\
> - /* Convert to  */   \
> - strncpy(dot_name, name, modsym - name); \
> - dot_name[modsym - name] = '.';  \
> - dot_name[modsym - name + 1] = '\0'; \
> - strncat(dot_name, modsym,   \
> - sizeof(dot_name) - (modsym - name) - 2);\
> - dot_appended = true;\
> - } else {\
> - dot_name[0] = '\0'; \
> - strncat(dot_name, name, sizeof(dot_name) - 1);  \
> - }   \
> - } else if (name[0] != '.') {\
> - dot_name[0] = '.';  \
> - dot_name[1] = '\0'; \
> - strncat(dot_name, name, KSYM_NAME_LEN - 2); \
> - dot_appended = true;\
> - } else {\
> - dot_name[0] = '\0'; \
> - strncat(dot_name, name, KSYM_NAME_LEN - 1); \
> - }   \
> - addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);   \
> - if (!addr && dot_appended) {\
> - /* Let's try the original non-dot symbol lookup */  \
> - addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);   \
> - }   \
> -}
> -#endif
> -
>  #define flush_insn_slot(p)   do { } while (0)
>  #define kretprobe_blacklist_size 0
>  
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 331751701fed..a7aa7394954d 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -42,6 +42,64 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>  
>  struct kretprobe_blackpoint kretprobe_blacklist[] = {{NULL, NULL}};
>  
> 

Re: [PATCH v2 0/5] powerpc: a few kprobe fixes and refactoring

2017-04-12 Thread Masami Hiramatsu
Hi Naveen,

BTW, I saw you sent 3 different series, are there any
conflict each other? or can we pick those independently?

Thanks,

On Wed, 12 Apr 2017 16:28:23 +0530
"Naveen N. Rao"  wrote:

> v1:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1334843.html
> 
> For v2, this series has been re-ordered and rebased on top of
> powerpc/next so as to make it easier to resolve conflicts with -tip. No
> other changes.
> 
> - Naveen
> 
> 
> Naveen N. Rao (5):
>   kprobes: convert kprobe_lookup_name() to a function
>   powerpc: kprobes: fix handling of function offsets on ABIv2
>   powerpc: introduce a new helper to obtain function entry points
>   powerpc: kprobes: factor out code to emulate instruction into a helper
>   powerpc: kprobes: emulate instructions on kprobe handler re-entry
> 
>  arch/powerpc/include/asm/code-patching.h |  37 ++
>  arch/powerpc/include/asm/kprobes.h   |  53 --
>  arch/powerpc/kernel/kprobes.c| 119 +--
>  arch/powerpc/kernel/optprobes.c  |   6 +-
>  include/linux/kprobes.h  |   1 +
>  kernel/kprobes.c |  21 +++---
>  6 files changed, 147 insertions(+), 90 deletions(-)
> 
> -- 
> 2.12.1
> 


-- 
Masami Hiramatsu 


Re: [PATCH 1/2] powerpc/mm: fix up pgtable dump flags

2017-04-12 Thread Oliver O'Halloran
On Wed, Apr 12, 2017 at 4:52 PM, Michael Ellerman  wrote:
> Rashmica Gupta  writes:
>
>> On 31/03/17 12:37, Oliver O'Halloran wrote:
>>> On Book3s we have two PTE flags used to mark cache-inhibited mappings:
>>> _PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page
>>> table dumper only looks at the generic _PAGE_NO_CACHE which is
>>> defined to be _PAGE_TOLERANT. This patch modifies the dumper so
>>> both flags are shown in the dump.
>>>
>>> Cc: Rashmica Gupta 
>>> Signed-off-by: Oliver O'Halloran 
>
>> Should we also add in _PAGE_SAO  that is in Book3s?
>
> I don't think we ever expect to see it in the kernel page tables. But if
> we did that would be "interesting".
>
> I've forgotten what the code does with unknown bits, does it already
> print them in some way?

Currently it just traverses the list of known bits and prints out a
message for each. Printing any unknown bits is probably a good idea.
I'll send another patch to add that though and leave this one as-is.

> If not we should either add that or add _PAGE_SAO and everything else
> that could possibly ever be there.

ok


Re: [PATCH] powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages

2017-04-12 Thread Aneesh Kumar K.V



On Wednesday 12 April 2017 03:41 PM, Michael Ellerman wrote:

Recently in commit f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB"),
we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This
makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to
calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong.

The end result is that the PGD (Page Global Directory, ie top level page table)
of the kernel (aka. swapper_pg_dir), is too small.

This generally doesn't lead to a crash, as we don't use the full range in normal
operation. However if we try to dump the kernel pagetables we can trigger a
crash because we walk off the end of the pgd into other memory and eventually
try to dereference something bogus:

  $ cat /sys/kernel/debug/kernel_pagetables
  Unable to handle kernel paging request for data at address 0xe8fece00
  Faulting instruction address: 0xc0072314
  cpu 0xc: Vector: 380 (Data SLB Access) at [c000daa13890]
  pc: c0072314: ptdump_show+0x164/0x430
  lr: c0072550: ptdump_show+0x3a0/0x430
 dar: e802cf00
  seq_read+0xf8/0x560
  full_proxy_read+0x84/0xc0
  __vfs_read+0x6c/0x1d0
  vfs_read+0xbc/0x1b0
  SyS_read+0x6c/0x110
  system_call+0x38/0xfc

The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be
the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move
the calculation into asm-offsets.c where we can do it easily using
max().
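
As a standalone illustration of the sizing arithmetic (index constants taken from the commit message; sizeof(pgd_t) assumed to be 8 bytes on 64-bit Book3S, not verified here):

```c
#include <assert.h>
#include <stddef.h>

/* Values from the commit message: after f6eedbba7a26, the hash MMU PGD
 * index (15) exceeds the radix one (13) when built with 64K pages. */
#define H_PGD_INDEX_SIZE	15
#define RADIX_PGD_INDEX_SIZE	13
#define MAX(a, b)		((a) > (b) ? (a) : (b))

#define PGD_ENTRY_SIZE		8	/* assumed sizeof(pgd_t) on 64-bit */

/* Pre-fix sizing: based on the radix index only -- too small for hash. */
static size_t pgd_table_size_old(void)
{
	return (size_t)PGD_ENTRY_SIZE << RADIX_PGD_INDEX_SIZE;
}

/* Post-fix sizing: max() of both MMUs, as now computed in asm-offsets.c. */
static size_t pgd_table_size_fixed(void)
{
	return (size_t)PGD_ENTRY_SIZE << MAX(RADIX_PGD_INDEX_SIZE,
					     H_PGD_INDEX_SIZE);
}
```

With these assumed values the old computation allocates 64KB where 256KB is needed, which is why walking the full pgd range ran off the end of the table.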

Fixes: f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
Signed-off-by: Michael Ellerman 


 Reviewed-by: Aneesh Kumar K.V 



---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 4 
 arch/powerpc/kernel/asm-offsets.c| 4 ++--
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index fb72ff6b98e6..fb8380a2d8d5 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -232,10 +232,6 @@ extern unsigned long __pte_frag_nr;
 extern unsigned long __pte_frag_size_shift;
 #define PTE_FRAG_SIZE_SHIFT __pte_frag_size_shift
 #define PTE_FRAG_SIZE (1UL << PTE_FRAG_SIZE_SHIFT)
-/*
- * Pgtable size used by swapper, init in asm code
- */
-#define MAX_PGD_TABLE_SIZE (sizeof(pgd_t) << RADIX_PGD_INDEX_SIZE)

 #define PTRS_PER_PTE   (1 << PTE_INDEX_SIZE)
 #define PTRS_PER_PMD   (1 << PMD_INDEX_SIZE)
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index e7c8229a8812..8e1163426ccb 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -400,8 +400,8 @@ int main(void)
DEFINE(BUG_ENTRY_SIZE, sizeof(struct bug_entry));
 #endif

-#ifdef MAX_PGD_TABLE_SIZE
-   DEFINE(PGD_TABLE_SIZE, MAX_PGD_TABLE_SIZE);
+#ifdef CONFIG_PPC_BOOK3S_64
+   DEFINE(PGD_TABLE_SIZE, (sizeof(pgd_t) << max(RADIX_PGD_INDEX_SIZE, H_PGD_INDEX_SIZE)));
 #else
DEFINE(PGD_TABLE_SIZE, PGD_TABLE_SIZE);
 #endif





Re: [PATCH v4 0/5] perf report: Show branch type

2017-04-12 Thread Jin, Yao



On 4/12/2017 6:58 PM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:01AM +0800, Jin Yao wrote:

SNIP


3. Use 2 bits in perf_branch_entry for a "cross" metrics checking
for branch cross 4K or 2M area. It's an approximate computing
for checking if the branch cross 4K page or 2MB page.

For example:

perf record -g --branch-filter any,save_type 

perf report --stdio

  JCC forward:  27.7%
 JCC backward:   9.8%
  JMP:   0.0%
  IND_JMP:   6.5%
 CALL:  26.6%
 IND_CALL:   0.0%
  RET:  29.3%
 IRET:   0.0%
 CROSS_4K:   0.0%
 CROSS_2M:  14.3%

got mangled perf report --stdio output for:


[root@ibm-x3650m4-02 perf]# ./perf record -j any,save_type kill
kill: not enough arguments
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data (18 samples) ]

[root@ibm-x3650m4-02 perf]# ./perf report --stdio -f | head -30
# To display the perf.data header info, please use --header/--header-only 
options.
#
#
# Total Lost Samples: 0
#
# Samples: 253  of event 'cycles'
# Event count (approx.): 253
#
# Overhead  Command  Source Shared Object  Source Symbol
Target SymbolBasic Block Cycles
#   ...    
...  
...  ..
#
  8.30%  perf
Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all.constprop.17  [k] 
native_write_msr -
  7.91%  perf
Um  [kernel.vmlinux]  [k] intel_pmu_lbr_enable_all [k] 
__intel_pmu_enable_all.constprop.17  -
  7.91%  perf
Um  [kernel.vmlinux]  [k] native_write_msr [k] 
intel_pmu_lbr_enable_all -
  6.32%  kill libc-2.24.so  [.] _dl_addr
 [.] _dl_addr -
  5.93%  perf
Um  [kernel.vmlinux]  [k] perf_iterate_ctx [k] 
perf_iterate_ctx -
  2.77%  kill libc-2.24.so  [.] malloc  
 [.] malloc   -
  1.98%  kill libc-2.24.so  [.] _int_malloc 
 [.] _int_malloc  -
  1.58%  kill [kernel.vmlinux]  [k] __rb_insert_augmented   
 [k] __rb_insert_augmented-
  1.58%  perf
Um  [kernel.vmlinux]  [k] perf_event_exec  [k] 
perf_event_exec  -
  1.19%  kill [kernel.vmlinux]  [k] anon_vma_interval_tree_insert   
 [k] anon_vma_interval_tree_insert-
  1.19%  kill [kernel.vmlinux]  [k] free_pgd_range  
 [k] free_pgd_range   -
  1.19%  kill [kernel.vmlinux]  [k] n_tty_write 
 [k] n_tty_write  -
  1.19%  perf
Um  [kernel.vmlinux]  [k] native_sched_clock   [k] 
sched_clock  -
...
SNIP


jirka


Sorry, I looked at this issue at midnight in Shanghai and misunderstood 
that the above output was only a mail format issue. Sorry about that.


Now I recheck the output, and yes, the perf report output is mangled. 
But my patch doesn't touch the associated code.


Anyway, I removed my patches, pulled the latest update from the perf/core 
branch and ran tests to check if it's a regression issue. I tested on both 
HSW and SKL.


1. On HSW.

root@hsw:/tmp# perf record -j any kill
.. /* SNIP */
For more details see kill(1).
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (9 samples) ]

root@hsw:/tmp# perf report --stdio
# To display the perf.data header info, please use 
--header/--header-only options.

#
#
# Total Lost Samples: 0
#
# Samples: 144  of event 'cycles'
# Event count (approx.): 144
#
# Overhead  Command  Source Shared Object  Source 
SymbolTarget SymbolBasic Block 
Cycles
#   ...   
...  ... 
..

#
10.42%  kill libc-2.23.so  [.] 
read_alias_file  [.] read_alias_file  -
 9.72%  kill [kernel.vmlinux]  [k] 
update_load_avg  [k] update_load_avg  -

 9.03%  perf
Um  [unknown] [k]  [k] 
 -
 8.33%  kill libc-2.23.so  [.] 
_int_malloc  [.] _int_malloc  -

.. /* SNIP */
 0.69%  kill [kernel.vmlinux]  [k] 
_raw_spin_lock   [k] unmap_page_range -

 0.69%  perf
Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all   [k] 
native_write_msr -

 0.69%  perf
Um  [kernel.vmlinux]  [k] 

Re: [PATCH] powerpc/64s: catch external interrupts going to host in POWER9

2017-04-12 Thread Nicholas Piggin
On Thu, 13 Apr 2017 07:34:51 +1000
Benjamin Herrenschmidt  wrote:

> On Thu, 2017-04-13 at 00:12 +1000, Nicholas Piggin wrote:
> > Yeah sure that sounds good. How's this then?  
> 
> I suppose so :-) When I was testing all that I had a "b ." at 0x500 and
> 0x4500 and I didn't hit them :)

If only for the benefit of poorly configured systemsim users
like me. Could remove it when the dust settles.


Re: powerpc: Avoid taking a data miss on every userspace instruction miss

2017-04-12 Thread Anton Blanchard
Hi Balbir,

> FYI: The version you applied does not have checks for is_write

Yeah, we decided to do that in a follow up patch. I'm ok if someone
gets to it before me :)

Anton


Re: powerpc: Avoid taking a data miss on every userspace instruction miss

2017-04-12 Thread Balbir Singh
On Thu, 2017-04-06 at 23:06 +1000, Michael Ellerman wrote:
> On Mon, 2017-04-03 at 06:41:02 UTC, Anton Blanchard wrote:
> > From: Anton Blanchard 
> > 
> Applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/a7a9dcd882a67b68568868b988289f
>

FYI: The version you applied does not have checks for is_write

Balbir Singh. 


[v4 2/2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

2017-04-12 Thread Matt Brown
The raid6 Q syndrome check has been optimised using the vpermxor
instruction. This instruction was made available with POWER8, ISA version
2.07. It allows for both vperm and vxor instructions to be done in a single
instruction. This has been tested for correctness on a ppc64le vm with a
basic RAID6 setup containing 5 drives.
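
As a rough scalar sketch (illustrative only -- the patch's vpermxor.uc is a vectorised, unrolled version of this loop), the per-byte GF(2^8) arithmetic involved is the multiply-by-2 step with reduction polynomial 0x11d, which the vpermxor variant folds into permute-and-xor table lookups:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* GF(2^8) multiply-by-2, reduction polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/* Reference P/Q generation over 'disks' data buffers of 'bytes' each. */
static void raid6_gen_ref(int disks, size_t bytes, uint8_t **dptr,
			  uint8_t *p, uint8_t *q)
{
	for (size_t i = 0; i < bytes; i++) {
		uint8_t wp = 0, wq = 0;

		/* Horner's rule: Q = sum over z of g^z * D_z */
		for (int z = disks - 1; z >= 0; z--) {
			wq = gf_mul2(wq) ^ dptr[z][i];
			wp ^= dptr[z][i];	/* P is plain parity */
		}
		p[i] = wp;
		q[i] = wq;
	}
}
```

The raid6test harness mentioned above checks exactly this kind of reference result against the accelerated gen() implementations.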

The performance benchmarks are from the raid6test in the /lib/raid6/test
directory. These results are from an IBM Firestone machine with ppc64le
architecture. The benchmark results show a 35% speed increase over the best
existing algorithm for powerpc (altivec). The raid6test has also been run
on a big-endian ppc64 vm to ensure it also works for big-endian
architectures.

Performance benchmarks:
raid6: altivecx4 gen() 18773 MB/s
raid6: altivecx8 gen() 19438 MB/s

raid6: vpermxor4 gen() 25112 MB/s
raid6: vpermxor8 gen() 26279 MB/s

Note: Fixed minor bug in pq.h regarding missing and mismatched ifdef
statements.

Signed-off-by: Matt Brown 
---
 include/linux/raid/pq.h |   4 ++
 lib/raid6/Makefile  |  27 -
 lib/raid6/algos.c   |   4 ++
 lib/raid6/altivec.uc|   3 ++
 lib/raid6/test/Makefile |  14 ++-
 lib/raid6/vpermxor.uc   | 104 
 6 files changed, 154 insertions(+), 2 deletions(-)
 create mode 100644 lib/raid6/vpermxor.uc

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..3df9aa6 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2;
 extern const struct raid6_calls raid6_avx512x4;
 extern const struct raid6_calls raid6_tilegx8;
 extern const struct raid6_calls raid6_s390vx8;
+extern const struct raid6_calls raid6_vpermxor1;
+extern const struct raid6_calls raid6_vpermxor2;
+extern const struct raid6_calls raid6_vpermxor4;
+extern const struct raid6_calls raid6_vpermxor8;
 
 struct raid6_recov_calls {
void (*data2)(int, size_t, int, int, void **);
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 3057011..db095a7 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -4,7 +4,8 @@ raid6_pq-y  += algos.o recov.o tables.o int1.o int2.o int4.o \
   int8.o int16.o int32.o
 
raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o
-raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
+raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
+  vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
 raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
 raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
@@ -88,6 +89,30 @@ $(obj)/altivec8.c:   UNROLL := 8
 $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
$(call if_changed,unroll)
 
+CFLAGS_vpermxor1.o += $(altivec_flags)
+targets += vpermxor1.c
+$(obj)/vpermxor1.c: UNROLL := 1
+$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor2.o += $(altivec_flags)
+targets += vpermxor2.c
+$(obj)/vpermxor2.c: UNROLL := 2
+$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor4.o += $(altivec_flags)
+targets += vpermxor4.c
+$(obj)/vpermxor4.c: UNROLL := 4
+$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor8.o += $(altivec_flags)
+targets += vpermxor8.c
+$(obj)/vpermxor8.c: UNROLL := 8
+$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
 CFLAGS_neon1.o += $(NEON_FLAGS)
 targets += neon1.c
 $(obj)/neon1.c:   UNROLL := 1
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 7857049..edd4f69 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = {
 	&raid6_altivec2,
 	&raid6_altivec4,
 	&raid6_altivec8,
+	&raid6_vpermxor1,
+	&raid6_vpermxor2,
+	&raid6_vpermxor4,
+	&raid6_vpermxor8,
 #endif
 #if defined(CONFIG_TILEGX)
 	&raid6_tilegx8,
diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
index 682aae8..d20ed0d 100644
--- a/lib/raid6/altivec.uc
+++ b/lib/raid6/altivec.uc
@@ -24,10 +24,13 @@
 
 #include <linux/raid/pq.h>
 
+#ifdef CONFIG_ALTIVEC
+
 #include <altivec.h>
 #ifdef __KERNEL__
 # include <asm/cputable.h>
 # include <asm/switch_to.h>
+#endif /* __KERNEL__ */
 
 /*
  * This is the C data type to use.  We use a vector of
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 2c7b60e..9c333e9 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -97,6 +97,18 @@ altivec4.c: altivec.uc ../unroll.awk
 altivec8.c: altivec.uc ../unroll.awk
$(AWK) ../unroll.awk -vN=8 < altivec.uc > $@
 
+vpermxor1.c: vpermxor.uc ../unroll.awk
+   $(AWK) ../unroll.awk -vN=1 < vpermxor.uc > $@

[v4 1/2] lib/raid6: Build proper files on corresponding arch

2017-04-12 Thread Matt Brown
Previously the raid6 test Makefile did not correctly build the files for
testing on PowerPC. This patch fixes the bug, so that all appropriate files
for PowerPC are built.

Signed-off-by: Matt Brown 
---
Changlog
v2 - v4
- fixup whitespace
- change versioning to match other patch
---
 lib/raid6/test/Makefile | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 9c333e9..b64a267 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -44,10 +44,12 @@ else ifeq ($(HAS_NEON),yes)
 CFLAGS += -DCONFIG_KERNEL_MODE_NEON=1
 else
 HAS_ALTIVEC := $(shell printf '\#include <altivec.h>\nvector int a;\n' |\
- gcc -c -x c - >&/dev/null && \
- rm ./-.o && echo yes)
+ gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
 ifeq ($(HAS_ALTIVEC),yes)
-OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
+CFLAGS += -I../../../arch/powerpc/include
+CFLAGS += -DCONFIG_ALTIVEC
+OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
+vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 endif
 endif
 ifeq ($(ARCH),tilegx)
-- 
2.9.3



Re: [PATCH] powerpc/64s: catch external interrupts going to host in POWER9

2017-04-12 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 00:12 +1000, Nicholas Piggin wrote:
> Yeah sure that sounds good. How's this then?

I suppose so :-) When I was testing all that I had a "b ." at 0x500 and
0x4500 and I didn't hit them :)


Re: [PATCH v2 1/2] fadump: reduce memory consumption for capture kernel

2017-04-12 Thread Hari Bathini



On Friday 07 April 2017 07:16 PM, Michael Ellerman wrote:

Hari Bathini  writes:

On Friday 07 April 2017 07:24 AM, Michael Ellerman wrote:

My preference would be that the fadump kernel "just works". If it's
using too much memory then the fadump kernel should do whatever it needs
to use less memory, eg. shrinking nr_cpu_ids etc.
Do we actually know *why* the fadump kernel is running out of memory?
Obviously large numbers of CPUs is one of the main drivers (lots of
stacks required). But other than that what is causing the memory
pressure? I would like some data on that before we proceed.

Almost the same amount of memory in comparison with the memory
required to boot the production kernel but that is unwarranted for fadump
(dump capture) kernel.

That's not data! :)

The dump kernel is booted with *much* less memory than the production
kernel (that's the whole issue!) and so it doesn't need to create struct
pages for all that memory, which means it should need less memory.

The vfs caches are also sized based on the available memory, so they
should also shrink in the dump kernel.

I want some actual numbers on what's driving the memory usage.

I tried some of these parameters to see how much memory they would save:


Hi Michael,

I tried to get data showing that parameters like numa=off & cgroup_disable=memory
matter too, but nr_cpus=1 makes parameters like numa=off and
cgroup_disable=memory insignificant. Also, these parameters not using much
of early memory reservations is making quantification of memory saved for
each of them that much more difficult. But I would still like to argue that
passing additional parameters to fadump is better than enforcing nr_cpus=1
in the kernel for:

  a) With makedumpfile tool supporting multi-threading it would make sense
 to leave the choice of how many CPUs to have, to the user.

  b) Parameters like udev.children-max=2 can help to reduce the number of
 parallel executed events bringing down the memory pressure on fadump
 kernel (when it is booted with more than one CPU).

  c) Ease of maintainability is better (considering any new kernel features
     with some memory to save or stability to gain on disabling, possible
     platform supports) with append approach over enforcing these parameters
     in the kernel.

  d) It would give user the flexibility to disable unwanted kernel features
     in fadump kernel (numa=off, cgroup_disable=memory). For every feature
     enabled in the production kernel, fadump kernel will have the choice to
     opt out of it, provided there is such cmdline option.


So, if parameters like
cgroup_disable=memory,

0 bytes saved.


transparent_hugepages=never,

0 bytes saved.


numa=off,

64KB saved.


nr_cpus=1,

3MB saved (vs 16 CPUs)



Hmmm... On a system with single core and 8GB memory, fadump kernel captures
dump successfully with 272MB passing nr_cpus=1 while it needed 320MB (+48MB)
to do the same without nr_cpus=1. So, while the early reservations saved is
only a couple of megabytes, it rubs off further in the boot process to
reduce memory consumption by nearly 50MB :)


Now maybe on your system those do save memory for some reason, but
please prove it to me. Otherwise I'm inclined to merge:

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 8ff0dd4e77a7..03f1f253c372 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -79,8 +79,10 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 * dump data waiting for us.
 */
fdm_active = of_get_flat_dt_prop(node, "ibm,kernel-dump", NULL);
-   if (fdm_active)
+   if (fdm_active) {
fw_dump.dump_active = 1;
+   nr_cpu_ids = 1;
+   }

/* Get the sizes required to store dump data for the firmware provided
 * dump sections.




Based on your suggestion, I am thinking of something like the below:

--
powerpc/fadump: reduce memory consumption for capture kernel

With fadump (dump capture) kernel booting like a regular kernel, it almost
needs the same amount of memory to boot as the production kernel, which is
unwarranted for a dump capture kernel. But with no option to disable some
of the unnecessary subsystems in fadump kernel, that much memory is wasted
on fadump, depriving the production kernel of that memory.

Introduce kernel parameter 'fadump_append=' that would take regular kernel
parameters as a comma separated list, to be enforced when fadump is active.
This 'fadump_append=' parameter can be leveraged to pass parameters like
nr_cpus=1, cgroup_disable=memory and numa=off, to disable unwarranted
resources/subsystems.
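
For instance, the three parameters mentioned above could be passed to the capture kernel with something like (illustrative, assuming the proposed syntax):

```
fadump_append=nr_cpus=1,numa=off,cgroup_disable=memory
```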

Also, ensure the log "Firmware-assisted dump is active" is printed early
in the boot process to put the subsequent fadump messages in context.

Suggested-by: Michael Ellerman 
Signed-off-by: Hari Bathini 
---
 

[patch 05/13] powerpc/smp: Replace open coded task affinity logic

2017-04-12 Thread Thomas Gleixner
Init task invokes smp_ops->setup_cpu() from smp_cpus_done(). Init task can
run on any online CPU at this point, but the setup_cpu() callback requires
to be invoked on the boot CPU. This is achieved by temporarily setting the
affinity of the calling user space thread to the requested CPU and reset it
to the original affinity afterwards.

That's racy vs. CPU hotplug and concurrent affinity settings for that
thread resulting in code executing on the wrong CPU and overwriting the
new affinity setting.

That's actually not a problem in this context as neither CPU hotplug nor
affinity settings can happen, but the access to task_struct::cpus_allowed
is about to be restricted.

Replace it with a call to work_on_cpu_safe() which achieves the same result.

Signed-off-by: Thomas Gleixner 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/smp.c |   26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -787,24 +787,21 @@ static struct sched_domain_topology_leve
{ NULL, },
 };
 
-void __init smp_cpus_done(unsigned int max_cpus)
+static __init long smp_setup_cpu_workfn(void *data __always_unused)
 {
-   cpumask_var_t old_mask;
+   smp_ops->setup_cpu(boot_cpuid);
+   return 0;
+}
 
-   /* We want the setup_cpu() here to be called from CPU 0, but our
-* init thread may have been "borrowed" by another CPU in the meantime
-* se we pin us down to CPU 0 for a short while
+void __init smp_cpus_done(unsigned int max_cpus)
+{
+   /*
+* We want the setup_cpu() here to be called on the boot CPU, but
+* init might run on any CPU, so make sure it's invoked on the boot
+* CPU.
 */
-   alloc_cpumask_var(&old_mask, GFP_NOWAIT);
-   cpumask_copy(old_mask, &current->cpus_allowed);
-   set_cpus_allowed_ptr(current, cpumask_of(boot_cpuid));
-   
if (smp_ops && smp_ops->setup_cpu)
-   smp_ops->setup_cpu(boot_cpuid);
-
-   set_cpus_allowed_ptr(current, old_mask);
-
-   free_cpumask_var(old_mask);
+   work_on_cpu_safe(boot_cpuid, smp_setup_cpu_workfn, NULL);
 
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
@@ -812,7 +809,6 @@ void __init smp_cpus_done(unsigned int m
dump_numa_cpu_topology();
 
set_sched_topology(powerpc_topology);
-
 }
 
 #ifdef CONFIG_HOTPLUG_CPU




Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-12 Thread Tyrel Datwyler
On 04/11/2017 07:10 PM, Michael Ellerman wrote:
> Tyrel Datwyler  writes:
>> On 04/11/2017 02:00 AM, Michael Ellerman wrote:
>>> Tyrel Datwyler  writes:
 I started looking at it when Bharata submitted a patch trying to fix the
 issue for CPUs, but got side tracked by other things. I suspect that
 this underflow has actually been an issue for quite some time, and we
 are just now becoming aware of it thanks to the refcount_t patchset being
 merged.
>>>
>>> Yes I agree. Which means it might be broken in existing distros.
>>
>> Definitely. I did some profiling last night, and I understand the
>> hotplug case. It turns out to be as I suggested in the original thread
>> about CPUs. When the devicetree code was worked to move the tree out of
>> proc and into sysfs the sysfs detach code added a of_node_put to remove
>> the original of_init reference. pSeries Being the sole original
>> *dynamic* device tree user we had always issued a of_node_put in our
>> dlpar specific detach function to achieve that end. So, this should be a
>> pretty straight forward trivial fix.
> 
> Excellent, thanks.
> 
>> However, for the case where devices are present at boot it appears we are
>> leaking a lot of references resulting in the device nodes never actually
>> being released/freed after a dlpar remove. In the CPU case after boot I
>> count 8 more references taken than the hotplug case, and corresponding
>> of_node_put's are not called at dlpar remove time either. That will take
>> some time to track them down, review and clean up.

I found our reference leak. In topology_init() we call register_cpu()
for each possible logical cpu id. For any logical cpu present a
reference to the device node of the cpu core is grabbed and added to
cpu->dev.of_node. Which matches what I'm seeing on a Power8 lpar, 8
extraneous references which is equal to the 8 hardware threads of a core.

> 
> Yes that is a perennial problem unfortunately which we've never come up
> with a good solution for.
> 
> The (old) patch below might help track some of them down. I remember
> having a script to process the output of the trace and find mismatches,
> but I can't find it right now - but I'm sure you can hack up something
> :)

Haha, this patch is almost identical to what I hacked up Monday to get
an idea of where the refcounts were at. Probably wouldn't hurt to try
and upstream it into the driver/of tree.

-Tyrel

> 
> cheers
> 
> 
> diff --git a/arch/powerpc/include/asm/trace.h b/arch/powerpc/include/asm/trace.h
> index 32e36b16773f..ad32365082a0 100644
> --- a/arch/powerpc/include/asm/trace.h
> +++ b/arch/powerpc/include/asm/trace.h
> @@ -168,6 +168,44 @@ TRACE_EVENT(hash_fault,
> __entry->addr, __entry->access, __entry->trap)
>  );
>  
> +TRACE_EVENT(of_node_get,
> +
> + TP_PROTO(struct device_node *dn, int val),
> +
> + TP_ARGS(dn, val),
> +
> + TP_STRUCT__entry(
> + __field(struct device_node *, dn)
> + __field(int, val)
> + ),
> +
> + TP_fast_assign(
> + __entry->dn = dn;
> + __entry->val = val;
> + ),
> +
> + TP_printk("get %d -> %d %s", __entry->val - 1, __entry->val, __entry->dn->full_name)
> +);
> +
> +TRACE_EVENT(of_node_put,
> +
> + TP_PROTO(struct device_node *dn, int val),
> +
> + TP_ARGS(dn, val),
> +
> + TP_STRUCT__entry(
> + __field(struct device_node *, dn)
> + __field(int, val)
> + ),
> +
> + TP_fast_assign(
> + __entry->dn = dn;
> + __entry->val = val;
> + ),
> +
> + TP_printk("put %d -> %d %s", __entry->val + 1, __entry->val, __entry->dn->full_name)
> +);
> +
>  #endif /* _TRACE_POWERPC_H */
>  
>  #undef TRACE_INCLUDE_PATH
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index c647bd1b6903..f5c3d761f3cd 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -14,6 +14,8 @@
>  
>  #include "of_private.h"
>  
> +#include <asm/trace.h>
> +
>  /**
>   * of_node_get() - Increment refcount of a node
>   * @node:Node to inc refcount, NULL is supported to simplify writing of
> @@ -23,8 +25,12 @@
>   */
>  struct device_node *of_node_get(struct device_node *node)
>  {
> - if (node)
> + if (node) {
>   kobject_get(&node->kobj);
> +
> + trace_of_node_get(node, atomic_read(&node->kobj.kref.refcount));
> + }
> +
>   return node;
>  }
>  EXPORT_SYMBOL(of_node_get);
> @@ -36,8 +42,10 @@ EXPORT_SYMBOL(of_node_get);
>   */
>  void of_node_put(struct device_node *node)
>  {
> - if (node)
> + if (node) {
>   kobject_put(&node->kobj);
> + trace_of_node_put(node, atomic_read(&node->kobj.kref.refcount));
> + }
>  }
>  EXPORT_SYMBOL(of_node_put);
>  
> 



[PATCH 17/17] cxlflash: Introduce hardware queue steering

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

As an enhancement to distribute requests to multiple hardware queues, add
the infrastructure to hash a SCSI command into a particular hardware queue.
Support the following scenarios when deriving which queue to use: single
queue, tagging when SCSI-MQ enabled, and simple hash via CPU ID when
SCSI-MQ is disabled. Rather than altering the existing send API, the
derived hardware queue is stored in the AFU command where it can be used
for sending a command to the chosen hardware queue.
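
A standalone sketch of the steering decision described above (mode names mirror the patch, but this is an illustration, not the driver code -- in the real patch HWQ_MODE_TAG derives the queue via blk_mq_unique_tag()/blk_mq_unique_tag_to_hwq() and HWQ_MODE_CPU uses smp_processor_id()):

```c
#include <assert.h>

enum hwq_mode { HWQ_MODE_RR, HWQ_MODE_TAG, HWQ_MODE_CPU };

struct afu_sketch {
	unsigned int num_hwqs;
	unsigned int hwq_rr_count;
	enum hwq_mode mode;
};

/* Pick a hardware queue; mq_hwq and cpu stand in for the block-layer
 * tag mapping and the submitting CPU id. */
static unsigned int pick_hwq(struct afu_sketch *afu, unsigned int mq_hwq,
			     unsigned int cpu)
{
	if (afu->num_hwqs == 1)
		return 0;

	switch (afu->mode) {
	case HWQ_MODE_RR:	/* spread requests evenly, no affinity */
		return afu->hwq_rr_count++ % afu->num_hwqs;
	case HWQ_MODE_TAG:	/* trust the block layer's MQ mapping */
		return mq_hwq;
	case HWQ_MODE_CPU:	/* affinitize to the submitting CPU */
		return cpu % afu->num_hwqs;
	}
	return 0;
}
```

Keeping the derived index in the command itself, as the patch does, avoids changing the send API.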

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |  12 -
 drivers/scsi/cxlflash/main.c   | 120 +++--
 2 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 8fd7a1f..256af81 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -96,6 +96,13 @@ enum cxlflash_state {
STATE_FAILTERM  /* Failed/terminating state, error out users/threads */
 };
 
+enum cxlflash_hwq_mode {
+   HWQ_MODE_RR,/* Roundrobin (default) */
+   HWQ_MODE_TAG,   /* Distribute based on block MQ tag */
+   HWQ_MODE_CPU,   /* CPU affinity */
+   MAX_HWQ_MODE
+};
+
 /*
  * Each context has its own set of resource handles that is visible
  * only from that context.
@@ -146,9 +153,9 @@ struct afu_cmd {
struct scsi_cmnd *scp;
struct completion cevent;
struct list_head queue;
+   u32 hwq_index;
 
u8 cmd_tmf:1;
-   u32 hwq_index;
 
/* As per the SISLITE spec the IOARCB EA has to be 16-byte aligned.
 * However for performance reasons the IOARCB/IOASA should be
@@ -213,8 +220,11 @@ struct afu {
atomic_t cmds_active;   /* Number of currently active AFU commands */
u64 hb;
u32 internal_lun;   /* User-desired LUN mode for this AFU */
+
u32 num_hwqs;   /* Number of hardware queues */
u32 desired_hwqs;   /* Desired h/w queues, effective on AFU reset */
+   enum cxlflash_hwq_mode hwq_mode; /* Steering mode for h/w queues */
+   u32 hwq_rr_count;   /* Count to distribute traffic for roundrobin */
 
char version[16];
u64 interface_version;
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 113797a..a7d57c3 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -358,6 +358,43 @@ static int wait_resp(struct afu *afu, struct afu_cmd *cmd)
 }
 
 /**
+ * cmd_to_target_hwq() - selects a target hardware queue for a SCSI command
+ * @host:  SCSI host associated with device.
+ * @scp:   SCSI command to send.
+ * @afu:   SCSI command to send.
+ *
+ * Hashes a command based upon the hardware queue mode.
+ *
+ * Return: Trusted index of target hardware queue
+ */
+static u32 cmd_to_target_hwq(struct Scsi_Host *host, struct scsi_cmnd *scp,
+struct afu *afu)
+{
+   u32 tag;
+   u32 hwq = 0;
+
+   if (afu->num_hwqs == 1)
+   return 0;
+
+   switch (afu->hwq_mode) {
+   case HWQ_MODE_RR:
+   hwq = afu->hwq_rr_count++ % afu->num_hwqs;
+   break;
+   case HWQ_MODE_TAG:
+   tag = blk_mq_unique_tag(scp->request);
+   hwq = blk_mq_unique_tag_to_hwq(tag);
+   break;
+   case HWQ_MODE_CPU:
+   hwq = smp_processor_id() % afu->num_hwqs;
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   }
+
+   return hwq;
+}
+
+/**
  * send_tmf() - sends a Task Management Function (TMF)
  * @afu:   AFU to checkout from.
  * @scp:   SCSI command from stack.
@@ -368,10 +405,12 @@ static int wait_resp(struct afu *afu, struct afu_cmd *cmd)
  */
 static int send_tmf(struct afu *afu, struct scsi_cmnd *scp, u64 tmfcmd)
 {
-   struct cxlflash_cfg *cfg = shost_priv(scp->device->host);
+   struct Scsi_Host *host = scp->device->host;
+   struct cxlflash_cfg *cfg = shost_priv(host);
struct afu_cmd *cmd = sc_to_afucz(scp);
struct device *dev = &cfg->dev->dev;
-   struct hwq *hwq = get_hwq(afu, PRIMARY_HWQ);
+   int hwq_index = cmd_to_target_hwq(host, scp, afu);
+   struct hwq *hwq = get_hwq(afu, hwq_index);
ulong lock_flags;
int rc = 0;
ulong to;
@@ -388,7 +427,7 @@ static int send_tmf(struct afu *afu, struct scsi_cmnd *scp, u64 tmfcmd)
cmd->scp = scp;
cmd->parent = afu;
cmd->cmd_tmf = true;
-   cmd->hwq_index = hwq->index;
+   cmd->hwq_index = hwq_index;
 
cmd->rcb.ctx_id = hwq->ctx_hndl;
cmd->rcb.msi = SISL_MSI_RRQ_UPDATED;
@@ -448,7 +487,8 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp)
struct device *dev = &cfg->dev->dev;
struct afu_cmd *cmd = sc_to_afucz(scp);

[PATCH 16/17] cxlflash: Add hardware queues attribute

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

As staging for supporting multiple hardware queues, add an attribute to
show and set the current number of hardware queues for the host. Support
specifying a hard limit or a CPU affinitized value. This will allow the
number of hardware queues to be tuned by a system administrator.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |  10 ++--
 drivers/scsi/cxlflash/main.c   | 112 -
 2 files changed, 106 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index b5858ae..8fd7a1f 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -60,7 +60,9 @@ extern const struct file_operations cxlflash_cxl_fops;
 /* SQ for master issued cmds */
 #define NUM_SQ_ENTRY   CXLFLASH_MAX_CMDS
 
-#define CXLFLASH_NUM_HWQS  1
+/* Hardware queue definitions */
+#define CXLFLASH_DEF_HWQS  1
+#define CXLFLASH_MAX_HWQS  8
#define PRIMARY_HWQ    0
 
 
@@ -201,7 +203,7 @@ struct hwq {
 } __aligned(cache_line_size());
 
 struct afu {
-   struct hwq hwqs[CXLFLASH_NUM_HWQS];
+   struct hwq hwqs[CXLFLASH_MAX_HWQS];
int (*send_cmd)(struct afu *, struct afu_cmd *);
void (*context_reset)(struct afu_cmd *);
 
@@ -211,6 +213,8 @@ struct afu {
atomic_t cmds_active;   /* Number of currently active AFU commands */
u64 hb;
u32 internal_lun;   /* User-desired LUN mode for this AFU */
+   u32 num_hwqs;   /* Number of hardware queues */
+   u32 desired_hwqs;   /* Desired h/w queues, effective on AFU reset */
 
char version[16];
u64 interface_version;
@@ -221,7 +225,7 @@ struct afu {
 
 static inline struct hwq *get_hwq(struct afu *afu, u32 index)
 {
-   WARN_ON(index >= CXLFLASH_NUM_HWQS);
+   WARN_ON(index >= CXLFLASH_MAX_HWQS);
 
return &afu->hwqs[index];
 }
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 5d06869..113797a 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -566,7 +566,7 @@ static void stop_afu(struct cxlflash_cfg *cfg)
ssleep(1);
 
if (afu_is_irqpoll_enabled(afu)) {
-   for (i = 0; i < CXLFLASH_NUM_HWQS; i++) {
+   for (i = 0; i < afu->num_hwqs; i++) {
hwq = get_hwq(afu, i);
 
irq_poll_disable(&hwq->irqpoll);
@@ -676,13 +676,13 @@ static void term_afu(struct cxlflash_cfg *cfg)
 * 2) Unmap the problem state area
 * 3) Stop each master context
 */
-   for (k = CXLFLASH_NUM_HWQS - 1; k >= 0; k--)
+   for (k = cfg->afu->num_hwqs - 1; k >= 0; k--)
term_intr(cfg, UNMAP_THREE, k);
 
if (cfg->afu)
stop_afu(cfg);
 
-   for (k = CXLFLASH_NUM_HWQS - 1; k >= 0; k--)
+   for (k = cfg->afu->num_hwqs - 1; k >= 0; k--)
term_mc(cfg, k);
 
dev_dbg(dev, "%s: returning\n", __func__);
@@ -823,6 +823,7 @@ static int alloc_mem(struct cxlflash_cfg *cfg)
goto out;
}
cfg->afu->parent = cfg;
+   cfg->afu->desired_hwqs = CXLFLASH_DEF_HWQS;
cfg->afu->afu_map = NULL;
 out:
return rc;
@@ -1116,7 +1117,7 @@ static void afu_err_intr_init(struct afu *afu)
/* IOARRIN yet), so there is nothing to clear. */
 
/* set LISN#, it is always sent to the context that wrote IOARRIN */
-   for (i = 0; i < CXLFLASH_NUM_HWQS; i++) {
+   for (i = 0; i < afu->num_hwqs; i++) {
hwq = get_hwq(afu, i);
 
writeq_be(SISL_MSI_SYNC_ERROR, &hwq->host_map->ctx_ctrl);
@@ -1551,7 +1552,7 @@ static void init_pcr(struct cxlflash_cfg *cfg)
}
 
/* Copy frequently used fields into hwq */
-   for (i = 0; i < CXLFLASH_NUM_HWQS; i++) {
+   for (i = 0; i < afu->num_hwqs; i++) {
hwq = get_hwq(afu, i);
 
hwq->ctx_hndl = (u16) cxl_process_element(hwq->ctx);
@@ -1586,7 +1587,7 @@ static int init_global(struct cxlflash_cfg *cfg)
}
 
/* Set up RRQ and SQ in HWQ for master issued cmds */
-   for (i = 0; i < CXLFLASH_NUM_HWQS; i++) {
+   for (i = 0; i < afu->num_hwqs; i++) {
hwq = get_hwq(afu, i);
hmap = hwq->host_map;
 
@@ -1640,7 +1641,7 @@ static int init_global(struct cxlflash_cfg *cfg)
/* Set up master's own CTX_CAP to allow real mode, host translation */
/* tables, afu cmds and read/write GSCSI cmds. */
/* First, unlock ctx_cap write by reading mbox */
-   for (i = 0; i < CXLFLASH_NUM_HWQS; i++) {
+   for (i = 0; i < afu->num_hwqs; i++) {
hwq = get_hwq(afu, i);
 

[PATCH 15/17] cxlflash: Support multiple hardware queues

2017-04-12 Thread Uma Krishnan
Introduce multiple hardware queues to improve legacy I/O path performance.
Each hardware queue comprises a master context and associated I/O
resources. The hardware queues are initially implemented as a static array
embedded in the AFU. This will be transitioned to a dynamic allocation in a
later series to improve the memory footprint of the driver.
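
A minimal userspace model of the arrangement described above: the AFU embeds a fixed hwq array, an accessor returns one entry, and a trivial round-robin policy spreads commands across the active queues. The structure members and the selection policy here are simplified stand-ins for the driver's code:

```c
#include <assert.h>

#define CXLFLASH_MAX_HWQS 8

struct hwq { int index; };                      /* per-queue state, simplified */
struct afu {
        struct hwq hwqs[CXLFLASH_MAX_HWQS];     /* static array, as in the patch */
        unsigned int num_hwqs;                  /* queues currently active */
        unsigned int rr;                        /* round-robin cursor (illustrative) */
};

static struct hwq *get_hwq(struct afu *afu, unsigned int index)
{
        return &afu->hwqs[index % afu->num_hwqs];       /* keep index in range */
}

/* Spread successive commands across the active queues. */
static unsigned int next_hwq_rr(struct afu *afu)
{
        return afu->rr++ % afu->num_hwqs;
}
```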

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h|  41 ++--
 drivers/scsi/cxlflash/main.c  | 426 --
 drivers/scsi/cxlflash/superpipe.c |   6 +-
 3 files changed, 309 insertions(+), 164 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index c69cdcf..b5858ae 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -60,6 +60,9 @@ extern const struct file_operations cxlflash_cxl_fops;
 /* SQ for master issued cmds */
 #define NUM_SQ_ENTRY   CXLFLASH_MAX_CMDS
 
+#define CXLFLASH_NUM_HWQS  1
+#define PRIMARY_HWQ    0
+
 
 static inline void check_sizes(void)
 {
@@ -98,7 +101,6 @@ enum cxlflash_state {
 
 struct cxlflash_cfg {
struct afu *afu;
-   struct cxl_context *mcctx;
 
struct pci_dev *dev;
struct pci_device_id *dev_id;
@@ -144,6 +146,7 @@ struct afu_cmd {
struct list_head queue;
 
u8 cmd_tmf:1;
+   u32 hwq_index;
 
/* As per the SISLITE spec the IOARCB EA has to be 16-byte aligned.
 * However for performance reasons the IOARCB/IOASA should be
@@ -164,7 +167,7 @@ static inline struct afu_cmd *sc_to_afucz(struct scsi_cmnd 
*sc)
return afuc;
 }
 
-struct afu {
+struct hwq {
/* Stuff requiring alignment go first. */
struct sisl_ioarcb sq[NUM_SQ_ENTRY];/* 16K SQ */
u64 rrq_entry[NUM_RRQ_ENTRY];   /* 2K RRQ */
@@ -172,17 +175,13 @@ struct afu {
/* Beware of alignment till here. Preferably introduce new
 * fields after this point
 */
-
-   int (*send_cmd)(struct afu *, struct afu_cmd *);
-   void (*context_reset)(struct afu_cmd *);
-
-   /* AFU HW */
+   struct afu *afu;
+   struct cxl_context *ctx;
struct cxl_ioctl_start_work work;
-   struct cxlflash_afu_map __iomem *afu_map;   /* entire MMIO map */
struct sisl_host_map __iomem *host_map; /* MC host map */
struct sisl_ctrl_map __iomem *ctrl_map; /* MC control map */
-
ctx_hndl_t ctx_hndl;/* master's context handle */
+   u32 index;  /* Index of this hwq */
 
atomic_t hsq_credits;
spinlock_t hsq_slock;
@@ -194,9 +193,22 @@ struct afu {
u64 *hrrq_end;
u64 *hrrq_curr;
bool toggle;
-   atomic_t cmds_active;   /* Number of currently active AFU commands */
+
s64 room;
spinlock_t rrin_slock; /* Lock to rrin queuing and cmd_room updates */
+
+   struct irq_poll irqpoll;
+} __aligned(cache_line_size());
+
+struct afu {
+   struct hwq hwqs[CXLFLASH_NUM_HWQS];
+   int (*send_cmd)(struct afu *, struct afu_cmd *);
+   void (*context_reset)(struct afu_cmd *);
+
+   /* AFU HW */
+   struct cxlflash_afu_map __iomem *afu_map;   /* entire MMIO map */
+
+   atomic_t cmds_active;   /* Number of currently active AFU commands */
u64 hb;
u32 internal_lun;   /* User-desired LUN mode for this AFU */
 
@@ -204,11 +216,16 @@ struct afu {
u64 interface_version;
 
u32 irqpoll_weight;
-   struct irq_poll irqpoll;
struct cxlflash_cfg *parent; /* Pointer back to parent cxlflash_cfg */
-
 };
 
+static inline struct hwq *get_hwq(struct afu *afu, u32 index)
+{
+   WARN_ON(index >= CXLFLASH_NUM_HWQS);
+
return &afu->hwqs[index];
+}
+
 static inline bool afu_is_irqpoll_enabled(struct afu *afu)
 {
return !!afu->irqpoll_weight;
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index c60936f..5d06869 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -223,8 +223,9 @@ static void context_reset(struct afu_cmd *cmd, __be64 __iomem *reset_reg)
 static void context_reset_ioarrin(struct afu_cmd *cmd)
 {
struct afu *afu = cmd->parent;
+   struct hwq *hwq = get_hwq(afu, cmd->hwq_index);
 
-   context_reset(cmd, &afu->host_map->ioarrin);
+   context_reset(cmd, &hwq->host_map->ioarrin);
 }
 
 /**
@@ -234,8 +235,9 @@ static void context_reset_ioarrin(struct afu_cmd *cmd)
 static void context_reset_sq(struct afu_cmd *cmd)
 {
struct afu *afu = cmd->parent;
+   struct hwq *hwq = get_hwq(afu, cmd->hwq_index);
 
-   context_reset(cmd, &afu->host_map->sq_ctx_reset);
+   context_reset(cmd, &hwq->host_map->sq_ctx_reset);
 }
 
 /**
@@ -250,6 +252,7 @@ static int send_cmd_ioarrin(struct afu *afu, struct afu_cmd *cmd)
 {
struct cxlflash_cfg *cfg = afu->parent;
struct device *dev = &cfg->dev->dev;
+ 

[PATCH 14/17] cxlflash: Improve asynchronous interrupt processing

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

The method used to decode asynchronous interrupts involves unnecessary
loops to match up set bits with their corresponding entries in the
asynchronous interrupt information table. This algorithm is wasteful
and does not scale well as new status bits are supported.

As an improvement, use the for_each_set_bit() service to iterate over
the asynchronous status bits and refactor the information table such
that it can be indexed by bit position.
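
A userspace sketch of the bit-position decode (the kernel uses for_each_set_bit(); here the lowest set bit is peeled off with __builtin_ctzll()). The table contents and the decode_status() helper are illustrative, not the driver's definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Table indexed by bit position; entries are illustrative only. */
static const char *ainfo[] = {
        [0] = "link up",
        [1] = "link down",
        [2] = "login succeeded",
        [3] = "login failed",
};

/* Peel off each set bit, lowest first, and look its entry up directly --
 * a userspace stand-in for the kernel's for_each_set_bit(). */
static int decode_status(uint64_t status, const char **out, int max)
{
        int n = 0;

        while (status && n < max) {
                int bit = __builtin_ctzll(status);      /* lowest set bit */

                status &= status - 1;                   /* clear it */
                if (bit < (int)(sizeof(ainfo) / sizeof(ainfo[0])))
                        out[n++] = ainfo[bit];
        }
        return n;
}
```

Indexing by bit position removes the inner search loop entirely, which is why the patch reorders the table.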

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 94 
 1 file changed, 42 insertions(+), 52 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index f5c952c..c60936f 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1017,52 +1017,6 @@ static void afu_link_reset(struct afu *afu, int port, __be64 __iomem *fc_regs)
dev_dbg(dev, "%s: returning port_sel=%016llx\n", __func__, port_sel);
 }
 
-/*
- * Asynchronous interrupt information table
- *
- * NOTE: The checkpatch script considers the BUILD_SISL_ASTATUS_FC_PORT macro
- * as complex and complains because it is not wrapped with parentheses/braces.
- */
-#define ASTATUS_FC(_a, _b, _c, _d)  \
-   { SISL_ASTATUS_FC##_a##_##_b, _c, _a, (_d) }
-
-#define BUILD_SISL_ASTATUS_FC_PORT(_a)  \
-   ASTATUS_FC(_a, OTHER, "other error", CLR_FC_ERROR | LINK_RESET), \
-   ASTATUS_FC(_a, LOGO, "target initiated LOGO", 0),\
-   ASTATUS_FC(_a, CRC_T, "CRC threshold exceeded", LINK_RESET), \
-   ASTATUS_FC(_a, LOGI_R, "login timed out, retrying", LINK_RESET), \
-   ASTATUS_FC(_a, LOGI_F, "login failed", CLR_FC_ERROR),\
-   ASTATUS_FC(_a, LOGI_S, "login succeeded", SCAN_HOST),\
-   ASTATUS_FC(_a, LINK_DN, "link down", 0), \
-   ASTATUS_FC(_a, LINK_UP, "link up", 0)
-
-static const struct asyc_intr_info ainfo[] = {
-   BUILD_SISL_ASTATUS_FC_PORT(2),
-   BUILD_SISL_ASTATUS_FC_PORT(3),
-   BUILD_SISL_ASTATUS_FC_PORT(0),
-   BUILD_SISL_ASTATUS_FC_PORT(1),
-   { 0x0, "", 0, 0 }
-};
-
-/**
- * find_ainfo() - locates and returns asynchronous interrupt information
- * @status:Status code set by AFU on error.
- *
- * Return: The located information or NULL when the status code is invalid.
- */
-static const struct asyc_intr_info *find_ainfo(u64 status)
-{
-   const struct asyc_intr_info *info;
-
-   BUILD_BUG_ON(ainfo[ARRAY_SIZE(ainfo) - 1].status != 0);
-
-   for (info = &ainfo[0]; info->status; info++)
-   if (info->status == status)
-   return info;
-
-   return NULL;
-}
-
 /**
  * afu_err_intr_init() - clears and initializes the AFU for error interrupts
  * @afu:   AFU associated with the host.
@@ -1293,6 +1247,35 @@ static irqreturn_t cxlflash_rrq_irq(int irq, void *data)
return IRQ_HANDLED;
 }
 
+/*
+ * Asynchronous interrupt information table
+ *
+ * NOTE:
+ * - Order matters here as this array is indexed by bit position.
+ *
+ * - The checkpatch script considers the BUILD_SISL_ASTATUS_FC_PORT macro
+ *   as complex and complains due to a lack of parentheses/braces.
+ */
+#define ASTATUS_FC(_a, _b, _c, _d)  \
+   { SISL_ASTATUS_FC##_a##_##_b, _c, _a, (_d) }
+
+#define BUILD_SISL_ASTATUS_FC_PORT(_a)  \
+   ASTATUS_FC(_a, LINK_UP, "link up", 0),   \
+   ASTATUS_FC(_a, LINK_DN, "link down", 0), \
+   ASTATUS_FC(_a, LOGI_S, "login succeeded", SCAN_HOST),\
+   ASTATUS_FC(_a, LOGI_F, "login failed", CLR_FC_ERROR),\
+   ASTATUS_FC(_a, LOGI_R, "login timed out, retrying", LINK_RESET), \
+   ASTATUS_FC(_a, CRC_T, "CRC threshold exceeded", LINK_RESET), \
+   ASTATUS_FC(_a, LOGO, "target initiated LOGO", 0),\
+   ASTATUS_FC(_a, OTHER, "other error", CLR_FC_ERROR | LINK_RESET)
+
+static const struct asyc_intr_info ainfo[] = {
+   BUILD_SISL_ASTATUS_FC_PORT(1),
+   BUILD_SISL_ASTATUS_FC_PORT(0),
+   BUILD_SISL_ASTATUS_FC_PORT(3),
+   BUILD_SISL_ASTATUS_FC_PORT(2)
+};
+
 /**
  * cxlflash_async_err_irq() - interrupt handler for asynchronous errors
  * @irq:   Interrupt number.
@@ -1305,18 +1288,18 @@ static irqreturn_t cxlflash_async_err_irq(int irq, void *data)
struct afu *afu = (struct afu *)data;
struct cxlflash_cfg *cfg = afu->parent;
struct device *dev = &cfg->dev->dev;
-   u64 reg_unmasked;
const struct asyc_intr_info *info;
struct sisl_global_map __iomem *global = &afu->afu_map->global;
__be64 __iomem *fc_port_regs;
+   u64 reg_unmasked;
u64 reg;
+   

[PATCH 13/17] cxlflash: Fix warnings/errors

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

As a general cleanup, address all reasonable checkpatch warnings and
errors. These include enforcing comment styles and adding named
identifiers to function prototypes.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h| 27 ++---
 drivers/scsi/cxlflash/sislite.h   | 27 +++--
 drivers/scsi/cxlflash/superpipe.h | 51 ++-
 drivers/scsi/cxlflash/vlun.h  |  2 +-
 4 files changed, 57 insertions(+), 50 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 455fd4d..c69cdcf 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -36,20 +36,19 @@ extern const struct file_operations cxlflash_cxl_fops;
 #define PORTMASK2CHAN(_x)  (ilog2((_x)))   /* port mask to channel */
 #define PORTNUM2CHAN(_x)   ((_x) - 1)  /* port number to channel */
 
-#define CXLFLASH_BLOCK_SIZE  4096  /* 4K blocks */
+#define CXLFLASH_BLOCK_SIZE    4096    /* 4K blocks */
 #define CXLFLASH_MAX_XFER_SIZE 16777216    /* 16MB transfer */
 #define CXLFLASH_MAX_SECTORS   (CXLFLASH_MAX_XFER_SIZE/512)/* SCSI wants
-  max_sectors
-  in units of
-  512 byte
-  sectors
-   */
+* max_sectors
+* in units of
+* 512 byte
+* sectors
+*/
 
 #define MAX_RHT_PER_CONTEXT (PAGE_SIZE / sizeof(struct sisl_rht_entry))
 
 /* AFU command retry limit */
-#define MC_RETRY_CNT 5 /* sufficient for SCSI check and
-  certain AFU errors */
+#define MC_RETRY_CNT   5   /* Sufficient for SCSI and certain AFU errors */
 
 /* Command management definitions */
 #define CXLFLASH_MAX_CMDS   256
@@ -262,14 +261,14 @@ static inline __be64 __iomem *get_fc_port_luns(struct 
cxlflash_cfg *cfg, int i)
return >fc_port_luns[CHAN2BANKPORT(i)][0];
 }
 
-int cxlflash_afu_sync(struct afu *, ctx_hndl_t, res_hndl_t, u8);
+int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t c, res_hndl_t r, u8 mode);
 void cxlflash_list_init(void);
 void cxlflash_term_global_luns(void);
 void cxlflash_free_errpage(void);
-int cxlflash_ioctl(struct scsi_device *, int, void __user *);
-void cxlflash_stop_term_user_contexts(struct cxlflash_cfg *);
-int cxlflash_mark_contexts_error(struct cxlflash_cfg *);
-void cxlflash_term_local_luns(struct cxlflash_cfg *);
-void cxlflash_restore_luntable(struct cxlflash_cfg *);
+int cxlflash_ioctl(struct scsi_device *sdev, int cmd, void __user *arg);
+void cxlflash_stop_term_user_contexts(struct cxlflash_cfg *cfg);
+int cxlflash_mark_contexts_error(struct cxlflash_cfg *cfg);
+void cxlflash_term_local_luns(struct cxlflash_cfg *cfg);
+void cxlflash_restore_luntable(struct cxlflash_cfg *cfg);
 
 #endif /* ifndef _CXLFLASH_COMMON_H */
diff --git a/drivers/scsi/cxlflash/sislite.h b/drivers/scsi/cxlflash/sislite.h
index 0e52bbb..a768360 100644
--- a/drivers/scsi/cxlflash/sislite.h
+++ b/drivers/scsi/cxlflash/sislite.h
@@ -90,15 +90,15 @@ struct sisl_rc {
#define SISL_AFU_RC_RHT_UNALIGNED 0x02U /* should never happen */
 #define SISL_AFU_RC_RHT_OUT_OF_BOUNDS 0x03u/* user error */
 #define SISL_AFU_RC_RHT_DMA_ERR   0x04u/* see afu_extra
-  may retry if afu_retry is off
-  possible on master exit
+* may retry if afu_retry is off
+* possible on master exit
 */
#define SISL_AFU_RC_RHT_RW_PERM   0x05u /* no RW perms, user error */
#define SISL_AFU_RC_LXT_UNALIGNED 0x12U /* should never happen */
 #define SISL_AFU_RC_LXT_OUT_OF_BOUNDS 0x13u/* user error */
 #define SISL_AFU_RC_LXT_DMA_ERR   0x14u/* see afu_extra
-  may retry if afu_retry is off
-  possible on master exit
+* may retry if afu_retry is off
+* possible on master exit

[PATCH 11/17] cxlflash: Remove unnecessary DMA mapping

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Devices supported by the cxlflash driver are fully coherent and do not
require a bus address mapping. Avoid unnecessary path length by using
the virtual address and length already present in the scatter-gather
entry.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index ebba3c9..3c4a833 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -176,7 +176,6 @@ static void cmd_complete(struct afu_cmd *cmd)
dev_dbg_ratelimited(dev, "%s:scp=%p result=%08x ioasc=%08x\n",
__func__, scp, scp->result, cmd->sa.ioasc);
 
-   scsi_dma_unmap(scp);
scp->scsi_done(scp);
 
if (cmd_is_tmf) {
@@ -445,7 +444,6 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp)
struct scatterlist *sg = scsi_sglist(scp);
u16 req_flags = SISL_REQ_FLAGS_SUP_UNDERRUN;
ulong lock_flags;
-   int nseg = 0;
int rc = 0;
 
dev_dbg_ratelimited(dev, "%s: (scp=%p) %d/%d/%d/%llu "
@@ -487,15 +485,8 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp)
}
 
if (likely(sg)) {
-   nseg = scsi_dma_map(scp);
-   if (unlikely(nseg < 0)) {
-   dev_err(dev, "%s: Fail DMA map\n", __func__);
-   rc = SCSI_MLQUEUE_HOST_BUSY;
-   goto out;
-   }
-
-   cmd->rcb.data_len = sg_dma_len(sg);
-   cmd->rcb.data_ea = sg_dma_address(sg);
+   cmd->rcb.data_len = sg->length;
+   cmd->rcb.data_ea = (uintptr_t)sg_virt(sg);
}
 
cmd->scp = scp;
@@ -513,8 +504,6 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp)
memcpy(cmd->rcb.cdb, scp->cmnd, sizeof(cmd->rcb.cdb));
 
rc = afu->send_cmd(afu, cmd);
-   if (unlikely(rc))
-   scsi_dma_unmap(scp);
 out:
return rc;
 }
-- 
2.1.0



[PATCH 12/17] cxlflash: Fix power-of-two validations

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Validation statements to enforce assumptions about specific defines
are not being evaluated by the compiler because they reside in a
routine that is not used. To activate them, call the
routine as part of module initialization. As an additional, related
cleanup, remove the now-defunct CXLFLASH_NUM_CMDS.
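
In userspace C11 the analogous check can live at file scope via _Static_assert(), where it is evaluated whether or not any routine is called — the property the patch restores for the kernel's BUILD_BUG_ON_NOT_POWER_OF_2(). A sketch using the classic power-of-two idiom (the macro name here is illustrative):

```c
#include <assert.h>

/* Classic power-of-two idiom, usable at compile time; the macro name is
 * illustrative, standing in for BUILD_BUG_ON_NOT_POWER_OF_2(). */
#define IS_POWER_OF_2(x) ((x) != 0 && (((x) & ((x) - 1)) == 0))

#define CXLFLASH_MAX_CMDS 256

/* Evaluated at file scope -- no caller needed for the check to fire. */
_Static_assert(IS_POWER_OF_2(CXLFLASH_MAX_CMDS),
               "CXLFLASH_MAX_CMDS must be a power of 2");
```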

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h | 7 +--
 drivers/scsi/cxlflash/main.c   | 1 +
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 17aa74a..455fd4d 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -52,12 +52,6 @@ extern const struct file_operations cxlflash_cxl_fops;
   certain AFU errors */
 
 /* Command management definitions */
-#define CXLFLASH_NUM_CMDS  (2 * CXLFLASH_MAX_CMDS) /* Must be a pow2 for
-  alignment and more
-  efficient array
-  index derivation
-*/
-
 #define CXLFLASH_MAX_CMDS   256
 #define CXLFLASH_MAX_CMDS_PER_LUN   CXLFLASH_MAX_CMDS
 
@@ -71,6 +65,7 @@ extern const struct file_operations cxlflash_cxl_fops;
 static inline void check_sizes(void)
 {
BUILD_BUG_ON_NOT_POWER_OF_2(CXLFLASH_NUM_FC_PORTS_PER_BANK);
+   BUILD_BUG_ON_NOT_POWER_OF_2(CXLFLASH_MAX_CMDS);
 }
 
/* AFU defines a fixed size of 4K for command buffers (borrow 4K page define) */
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 3c4a833..f5c952c 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -2847,6 +2847,7 @@ static struct pci_driver cxlflash_driver = {
  */
 static int __init init_cxlflash(void)
 {
+   check_sizes();
cxlflash_list_init();
 
return pci_register_driver(&cxlflash_driver);
-- 
2.1.0



[PATCH 10/17] cxlflash: Fence EEH during probe

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

An EEH during probe can lead to a crash as the recovery thread races
with the probe thread. To avoid this issue, introduce new states to
fence out EEH recovery until probe has completed. Also ensure the reset
wait queue is flushed during device removal to avoid orphaned threads.
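
A tiny userspace model of the fencing decisions described above — just the state predicates, no locking or wait queues. The enum values mirror the patch, but this is not driver code:

```c
#include <assert.h>

/* States mirror the patch; the helpers model the two decisions it adds. */
enum state {
        STATE_PROBING,  /* initial state during probe */
        STATE_PROBED,   /* probe done but an EEH occurred meanwhile */
        STATE_NORMAL,
        STATE_RESET,
        STATE_FAILTERM,
};

/* error_detected() keeps sleeping on reset_waitq while in these states. */
static int eeh_must_wait(enum state s)
{
        return s == STATE_RESET || s == STATE_PROBING;
}

/* At the end of probe: if an EEH thread is already waiting, park in
 * PROBED and let recovery drive the device to NORMAL; else go NORMAL. */
static enum state probe_exit_state(int eeh_waiter_present)
{
        return eeh_waiter_present ? STATE_PROBED : STATE_NORMAL;
}
```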

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h|  2 ++
 drivers/scsi/cxlflash/main.c  | 25 +
 drivers/scsi/cxlflash/superpipe.c |  8 +---
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 28bb716..17aa74a 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -90,6 +90,8 @@ enum cxlflash_init_state {
 };
 
 enum cxlflash_state {
+   STATE_PROBING,  /* Initial state during probe */
+   STATE_PROBED,   /* Temporary state, probe completed but EEH occurred */
STATE_NORMAL,   /* Normal running state, everything good */
STATE_RESET,/* Reset state, trying to reset/recover */
STATE_FAILTERM  /* Failed/terminating state, error out users/threads */
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 568cd63..ebba3c9 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -470,6 +470,8 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp)
spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);
 
switch (cfg->state) {
+   case STATE_PROBING:
+   case STATE_PROBED:
case STATE_RESET:
dev_dbg_ratelimited(dev, "%s: device is in reset\n", __func__);
rc = SCSI_MLQUEUE_HOST_BUSY;
@@ -719,7 +721,8 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool wait)
  * cxlflash_remove() - PCI entry point to tear down host
  * @pdev:  PCI device associated with the host.
  *
- * Safe to use as a cleanup in partially allocated/initialized state.
+ * Safe to use as a cleanup in partially allocated/initialized state. Note that
+ * the reset_waitq is flushed as part of the stop/termination of user contexts.
  */
 static void cxlflash_remove(struct pci_dev *pdev)
 {
@@ -752,7 +755,6 @@ static void cxlflash_remove(struct pci_dev *pdev)
case INIT_STATE_SCSI:
cxlflash_term_local_luns(cfg);
scsi_remove_host(cfg->host);
-   /* fall through */
case INIT_STATE_AFU:
term_afu(cfg);
case INIT_STATE_PCI:
@@ -2624,6 +2626,15 @@ static void cxlflash_worker_thread(struct work_struct *work)
  * @pdev:  PCI device associated with the host.
  * @dev_id:PCI device id associated with device.
  *
+ * The device will initially start out in a 'probing' state and
+ * transition to the 'normal' state at the end of a successful
+ * probe. Should an EEH event occur during probe, the notification
+ * thread (error_detected()) will wait until the probe handler
+ * is nearly complete. At that time, the device will be moved to
+ * a 'probed' state and the EEH thread woken up to drive the slot
+ * reset and recovery (device moves to 'normal' state). Meanwhile,
+ * the probe will be allowed to exit successfully.
+ *
  * Return: 0 on success, -errno on failure
  */
 static int cxlflash_probe(struct pci_dev *pdev,
@@ -2707,7 +2718,7 @@ static int cxlflash_probe(struct pci_dev *pdev,
cfg->init_state = INIT_STATE_PCI;
 
rc = init_afu(cfg);
-   if (rc) {
+   if (rc && !wq_has_sleeper(&cfg->reset_waitq)) {
dev_err(dev, "%s: init_afu failed rc=%d\n", __func__, rc);
goto out_remove;
}
@@ -2720,6 +2731,11 @@ static int cxlflash_probe(struct pci_dev *pdev,
}
cfg->init_state = INIT_STATE_SCSI;
 
+   if (wq_has_sleeper(&cfg->reset_waitq)) {
+   cfg->state = STATE_PROBED;
+   wake_up_all(&cfg->reset_waitq);
+   } else
+   cfg->state = STATE_NORMAL;
 out:
dev_dbg(dev, "%s: returning rc=%d\n", __func__, rc);
return rc;
@@ -2750,7 +2766,8 @@ static pci_ers_result_t cxlflash_pci_error_detected(struct pci_dev *pdev,
 
switch (state) {
case pci_channel_io_frozen:
-   wait_event(cfg->reset_waitq, cfg->state != STATE_RESET);
+   wait_event(cfg->reset_waitq, cfg->state != STATE_RESET &&
+cfg->state != STATE_PROBING);
if (cfg->state == STATE_FAILTERM)
return PCI_ERS_RESULT_DISCONNECT;
 
diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c
index 488330f..158fa00 100644
--- a/drivers/scsi/cxlflash/superpipe.c
+++ b/drivers/scsi/cxlflash/superpipe.c
@@ -78,17 +78,18 @@ void cxlflash_free_errpage(void)
  * memory freed. This is accomplished by putting the contexts in 

[PATCH 09/17] cxlflash: Support up to 4 ports

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Update the driver to allow for future cards with 4 ports.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c| 78 -
 drivers/scsi/cxlflash/sislite.h |  6 ++--
 2 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 64ad76b..568cd63 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1419,7 +1419,7 @@ static int read_vpd(struct cxlflash_cfg *cfg, u64 wwpn[])
ssize_t vpd_size;
char vpd_data[CXLFLASH_VPD_LEN];
char tmp_buf[WWPN_BUF_LEN] = { 0 };
-   char *wwpn_vpd_tags[MAX_FC_PORTS] = { "V5", "V6" };
+   char *wwpn_vpd_tags[MAX_FC_PORTS] = { "V5", "V6", "V7", "V8" };
 
/* Get the VPD data from the device */
vpd_size = cxl_read_adapter_vpd(pdev, vpd_data, sizeof(vpd_data));
@@ -2175,6 +2175,40 @@ static ssize_t port1_show(struct device *dev,
 }
 
 /**
+ * port2_show() - queries and presents the current status of port 2
+ * @dev:   Generic device associated with the host owning the port.
+ * @attr:  Device attribute representing the port.
+ * @buf:   Buffer of length PAGE_SIZE to report back port status in ASCII.
+ *
+ * Return: The size of the ASCII string returned in @buf.
+ */
+static ssize_t port2_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+   struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
+
+   return cxlflash_show_port_status(2, cfg, buf);
+}
+
+/**
+ * port3_show() - queries and presents the current status of port 3
+ * @dev:   Generic device associated with the host owning the port.
+ * @attr:  Device attribute representing the port.
+ * @buf:   Buffer of length PAGE_SIZE to report back port status in ASCII.
+ *
+ * Return: The size of the ASCII string returned in @buf.
+ */
+static ssize_t port3_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+   struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
+
+   return cxlflash_show_port_status(3, cfg, buf);
+}
+
+/**
  * lun_mode_show() - presents the current LUN mode of the host
  * @dev:   Generic device associated with the host.
  * @attr:  Device attribute representing the LUN mode.
@@ -2327,6 +2361,40 @@ static ssize_t port1_lun_table_show(struct device *dev,
 }
 
 /**
+ * port2_lun_table_show() - presents the current LUN table of port 2
+ * @dev:   Generic device associated with the host owning the port.
+ * @attr:  Device attribute representing the port.
+ * @buf:   Buffer of length PAGE_SIZE to report back port status in ASCII.
+ *
+ * Return: The size of the ASCII string returned in @buf.
+ */
+static ssize_t port2_lun_table_show(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
+
+   return cxlflash_show_port_lun_table(2, cfg, buf);
+}
+
+/**
+ * port3_lun_table_show() - presents the current LUN table of port 3
+ * @dev:   Generic device associated with the host owning the port.
+ * @attr:  Device attribute representing the port.
+ * @buf:   Buffer of length PAGE_SIZE to report back port status in ASCII.
+ *
+ * Return: The size of the ASCII string returned in @buf.
+ */
+static ssize_t port3_lun_table_show(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
+
+   return cxlflash_show_port_lun_table(3, cfg, buf);
+}
+
+/**
  * irqpoll_weight_show() - presents the current IRQ poll weight for the host
  * @dev:   Generic device associated with the host.
  * @attr:  Device attribute representing the IRQ poll weight.
@@ -2417,19 +2485,27 @@ static ssize_t mode_show(struct device *dev,
  */
 static DEVICE_ATTR_RO(port0);
 static DEVICE_ATTR_RO(port1);
+static DEVICE_ATTR_RO(port2);
+static DEVICE_ATTR_RO(port3);
 static DEVICE_ATTR_RW(lun_mode);
 static DEVICE_ATTR_RO(ioctl_version);
 static DEVICE_ATTR_RO(port0_lun_table);
 static DEVICE_ATTR_RO(port1_lun_table);
+static DEVICE_ATTR_RO(port2_lun_table);
+static DEVICE_ATTR_RO(port3_lun_table);
 static DEVICE_ATTR_RW(irqpoll_weight);
 
 static struct device_attribute *cxlflash_host_attrs[] = {
&dev_attr_port0,
&dev_attr_port1,
+   &dev_attr_port2,
+   &dev_attr_port3,
&dev_attr_lun_mode,
&dev_attr_ioctl_version,
&dev_attr_port0_lun_table,
&dev_attr_port1_lun_table,
+   &dev_attr_port2_lun_table,
+   &dev_attr_port3_lun_table,
&dev_attr_irqpoll_weight,
NULL
 };
diff 

[PATCH 08/17] cxlflash: SISlite updates to support 4 ports

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Update the SISlite header to support 4 ports as outlined in the
SISlite specification. Address fallout from structure renames and
refreshed organization throughout the driver. Determine the number
of ports supported by a card from the global port selection mask
register reset value.
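
The bank/port arithmetic this enables can be sketched in plain C: with a power-of-two number of ports per bank, a channel number splits into a bank index (shift) and a within-bank port index (mask). The constants below assume 2 ports per bank purely for illustration:

```c
#include <assert.h>

#define PORTS_PER_BANK 2        /* assumed power of 2, per the driver's check */
#define BANK_SHIFT 1            /* ilog2(PORTS_PER_BANK) */

/* Userspace mirror of the CHAN2PORTBANK()/CHAN2BANKPORT() split. */
static int chan2portbank(int chan) { return chan >> BANK_SHIFT; }
static int chan2bankport(int chan) { return chan & (PORTS_PER_BANK - 1); }
```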

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h  | 25 ---
 drivers/scsi/cxlflash/main.c| 77 +
 drivers/scsi/cxlflash/sislite.h | 96 -
 3 files changed, 141 insertions(+), 57 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index e6a7c97..28bb716 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -26,8 +26,11 @@
 extern const struct file_operations cxlflash_cxl_fops;
 
#define MAX_CONTEXT    CXLFLASH_MAX_CONTEXT    /* num contexts per afu */
-#define NUM_FC_PORTS   CXLFLASH_NUM_FC_PORTS   /* ports per AFU */
-#define MAX_FC_PORTS   CXLFLASH_MAX_FC_PORTS   /* ports per AFU */
+#define MAX_FC_PORTS   CXLFLASH_MAX_FC_PORTS   /* max ports per AFU */
+#define LEGACY_FC_PORTS    2       /* legacy ports per AFU */
+
+#define CHAN2PORTBANK(_x)  ((_x) >> ilog2(CXLFLASH_NUM_FC_PORTS_PER_BANK))
+#define CHAN2BANKPORT(_x)  ((_x) & (CXLFLASH_NUM_FC_PORTS_PER_BANK - 1))
 
 #define CHAN2PORTMASK(_x)  (1 << (_x)) /* channel to port mask */
 #define PORTMASK2CHAN(_x)  (ilog2((_x)))   /* port mask to channel */
@@ -67,7 +70,7 @@ extern const struct file_operations cxlflash_cxl_fops;
 
 static inline void check_sizes(void)
 {
-   BUILD_BUG_ON_NOT_POWER_OF_2(CXLFLASH_NUM_CMDS);
+   BUILD_BUG_ON_NOT_POWER_OF_2(CXLFLASH_NUM_FC_PORTS_PER_BANK);
 }
 
/* AFU defines a fixed size of 4K for command buffers (borrow 4K page define) */
@@ -240,18 +243,26 @@ static inline u64 lun_to_lunid(u64 lun)
return be64_to_cpu(lun_id);
 }
 
-static inline __be64 __iomem *get_fc_port_regs(struct cxlflash_cfg *cfg, int i)
+static inline struct fc_port_bank __iomem *get_fc_port_bank(
+   struct cxlflash_cfg *cfg, int i)
 {
struct afu *afu = cfg->afu;
 
-   return &afu->afu_map->global.fc_regs[i][0];
+   return &afu->afu_map->global.bank[CHAN2PORTBANK(i)];
+}
+
+static inline __be64 __iomem *get_fc_port_regs(struct cxlflash_cfg *cfg, int i)
+{
+   struct fc_port_bank __iomem *fcpb = get_fc_port_bank(cfg, i);
+
+   return &fcpb->fc_port_regs[CHAN2BANKPORT(i)][0];
 }
 
 static inline __be64 __iomem *get_fc_port_luns(struct cxlflash_cfg *cfg, int i)
 {
-   struct afu *afu = cfg->afu;
+   struct fc_port_bank __iomem *fcpb = get_fc_port_bank(cfg, i);
 
-   return &afu->afu_map->global.fc_port[i][0];
+   return &fcpb->fc_port_luns[CHAN2BANKPORT(i)][0];
 }
 
 int cxlflash_afu_sync(struct afu *, ctx_hndl_t, res_hndl_t, u8);
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index e198605..64ad76b 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1028,25 +1028,29 @@ static void afu_link_reset(struct afu *afu, int port, __be64 __iomem *fc_regs)
 
 /*
  * Asynchronous interrupt information table
+ *
+ * NOTE: The checkpatch script considers the BUILD_SISL_ASTATUS_FC_PORT macro
+ * as complex and complains because it is not wrapped with parentheses/braces.
  */
+#define ASTATUS_FC(_a, _b, _c, _d)  \
+   { SISL_ASTATUS_FC##_a##_##_b, _c, _a, (_d) }
+
+#define BUILD_SISL_ASTATUS_FC_PORT(_a)  \
+   ASTATUS_FC(_a, OTHER, "other error", CLR_FC_ERROR | LINK_RESET), \
+   ASTATUS_FC(_a, LOGO, "target initiated LOGO", 0),\
+   ASTATUS_FC(_a, CRC_T, "CRC threshold exceeded", LINK_RESET), \
+   ASTATUS_FC(_a, LOGI_R, "login timed out, retrying", LINK_RESET), \
+   ASTATUS_FC(_a, LOGI_F, "login failed", CLR_FC_ERROR),\
+   ASTATUS_FC(_a, LOGI_S, "login succeeded", SCAN_HOST),\
+   ASTATUS_FC(_a, LINK_DN, "link down", 0), \
+   ASTATUS_FC(_a, LINK_UP, "link up", 0)
+
 static const struct asyc_intr_info ainfo[] = {
-   {SISL_ASTATUS_FC0_OTHER, "other error", 0, CLR_FC_ERROR | LINK_RESET},
-   {SISL_ASTATUS_FC0_LOGO, "target initiated LOGO", 0, 0},
-   {SISL_ASTATUS_FC0_CRC_T, "CRC threshold exceeded", 0, LINK_RESET},
-   {SISL_ASTATUS_FC0_LOGI_R, "login timed out, retrying", 0, LINK_RESET},
-   {SISL_ASTATUS_FC0_LOGI_F, "login failed", 0, CLR_FC_ERROR},
-   {SISL_ASTATUS_FC0_LOGI_S, "login succeeded", 0, SCAN_HOST},
-   {SISL_ASTATUS_FC0_LINK_DN, "link down", 0, 0},
-   {SISL_ASTATUS_FC0_LINK_UP, "link up", 0, 0},
-   {SISL_ASTATUS_FC1_OTHER, "other error", 1, CLR_FC_ERROR | LINK_RESET},
-   {SISL_ASTATUS_FC1_LOGO, 

[PATCH 07/17] cxlflash: Hide FC internals behind common access routine

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

As staging to support FC-related updates to the SISlite specification,
introduce helper routines to obtain references to FC resources that exist
within the global map. This will allow changes to the underlying global
map structure without impacting existing code paths.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h | 14 
 drivers/scsi/cxlflash/main.c   | 72 +++---
 drivers/scsi/cxlflash/vlun.c   | 16 +-
 3 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index ee23e81..e6a7c97 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -240,6 +240,20 @@ static inline u64 lun_to_lunid(u64 lun)
return be64_to_cpu(lun_id);
 }
 
+static inline __be64 __iomem *get_fc_port_regs(struct cxlflash_cfg *cfg, int i)
+{
+   struct afu *afu = cfg->afu;
+
+   return &afu->afu_map->global.fc_regs[i][0];
+}
+
+static inline __be64 __iomem *get_fc_port_luns(struct cxlflash_cfg *cfg, int i)
+{
+   struct afu *afu = cfg->afu;
+
+   return &afu->afu_map->global.fc_port[i][0];
+}
+
 int cxlflash_afu_sync(struct afu *, ctx_hndl_t, res_hndl_t, u8);
 void cxlflash_list_init(void);
 void cxlflash_term_global_luns(void);
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 04e1a8e..e198605 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -670,8 +670,8 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool 
wait)
 {
struct afu *afu = cfg->afu;
struct device *dev = &cfg->dev->dev;
-   struct sisl_global_map __iomem *global;
struct dev_dependent_vals *ddv;
+   __be64 __iomem *fc_port_regs;
u64 reg, status;
int i, retry_cnt = 0;
 
@@ -684,13 +684,13 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, 
bool wait)
return;
}
 
-   global = &afu->afu_map->global;
-
/* Notify AFU */
for (i = 0; i < cfg->num_fc_ports; i++) {
-   reg = readq_be(&global->fc_regs[i][FC_CONFIG2 / 8]);
+   fc_port_regs = get_fc_port_regs(cfg, i);
+
+   reg = readq_be(&fc_port_regs[FC_CONFIG2 / 8]);
reg |= SISL_FC_SHUTDOWN_NORMAL;
-   writeq_be(reg, &global->fc_regs[i][FC_CONFIG2 / 8]);
+   writeq_be(reg, &fc_port_regs[FC_CONFIG2 / 8]);
}
 
if (!wait)
@@ -698,9 +698,11 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool 
wait)
 
/* Wait up to 1.5 seconds for shutdown processing to complete */
for (i = 0; i < cfg->num_fc_ports; i++) {
+   fc_port_regs = get_fc_port_regs(cfg, i);
retry_cnt = 0;
+
while (true) {
-   status = readq_be(&global->fc_regs[i][FC_STATUS / 8]);
+   status = readq_be(&fc_port_regs[FC_STATUS / 8]);
if (status & SISL_STATUS_SHUTDOWN_COMPLETE)
break;
if (++retry_cnt >= MC_RETRY_CNT) {
@@ -1071,6 +1073,7 @@ static const struct asyc_intr_info *find_ainfo(u64 status)
 static void afu_err_intr_init(struct afu *afu)
 {
struct cxlflash_cfg *cfg = afu->parent;
+   __be64 __iomem *fc_port_regs;
int i;
u64 reg;
 
@@ -1099,17 +1102,19 @@ static void afu_err_intr_init(struct afu *afu)
writeq_be(-1ULL, &afu->afu_map->global.regs.aintr_clear);
 
/* Clear/Set internal lun bits */
-   reg = readq_be(&afu->afu_map->global.fc_regs[0][FC_CONFIG2 / 8]);
+   fc_port_regs = get_fc_port_regs(cfg, 0);
+   reg = readq_be(&fc_port_regs[FC_CONFIG2 / 8]);
reg &= SISL_FC_INTERNAL_MASK;
if (afu->internal_lun)
reg |= ((u64)(afu->internal_lun - 1) << SISL_FC_INTERNAL_SHIFT);
-   writeq_be(reg, &afu->afu_map->global.fc_regs[0][FC_CONFIG2 / 8]);
+   writeq_be(reg, &fc_port_regs[FC_CONFIG2 / 8]);
 
/* now clear FC errors */
for (i = 0; i < cfg->num_fc_ports; i++) {
-   writeq_be(0xFFFFFFFFU,
- &afu->afu_map->global.fc_regs[i][FC_ERROR / 8]);
-   writeq_be(0, &afu->afu_map->global.fc_regs[i][FC_ERRCAP / 8]);
+   fc_port_regs = get_fc_port_regs(cfg, i);
+
+   writeq_be(0xFFFFFFFFU, &fc_port_regs[FC_ERROR / 8]);
+   writeq_be(0, &fc_port_regs[FC_ERRCAP / 8]);
}
 
/* sync interrupts for master's IOARRIN write */
@@ -1306,6 +1311,7 @@ static irqreturn_t cxlflash_async_err_irq(int irq, void 
*data)
u64 reg_unmasked;
const struct asyc_intr_info *info;
struct sisl_global_map __iomem *global = &afu->afu_map->global;
+   __be64 __iomem *fc_port_regs;
u64 reg;
u8 port;
int i;
@@ -1329,10 +1335,11 @@ static irqreturn_t cxlflash_async_err_irq(int irq, void 

[PATCH 06/17] cxlflash: Remove port configuration assumptions

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

At present, the cxlflash driver only supports hardware with two FC
ports. The code was initially designed with this assumption and is
dependent on having two FC ports - adding more ports will break logic
within the driver.

To mitigate this issue, remove the existing port assumptions and
transition the code to support more than two ports. As a side effect,
clarify the interpretation of the DK_CXLFLASH_ALL_PORTS_ACTIVE flag.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 Documentation/powerpc/cxlflash.txt |  5 +++
 drivers/scsi/cxlflash/common.h |  4 ++
 drivers/scsi/cxlflash/lunmgt.c |  4 +-
 drivers/scsi/cxlflash/main.c   | 13 +++---
 drivers/scsi/cxlflash/sislite.h|  2 +-
 drivers/scsi/cxlflash/superpipe.c  |  2 +-
 drivers/scsi/cxlflash/superpipe.h  |  3 --
 drivers/scsi/cxlflash/vlun.c   | 89 +-
 8 files changed, 77 insertions(+), 45 deletions(-)

diff --git a/Documentation/powerpc/cxlflash.txt 
b/Documentation/powerpc/cxlflash.txt
index 6d9a2ed..66b4496 100644
--- a/Documentation/powerpc/cxlflash.txt
+++ b/Documentation/powerpc/cxlflash.txt
@@ -239,6 +239,11 @@ DK_CXLFLASH_USER_VIRTUAL
 resource handle that is provided is already referencing provisioned
 storage. This is reflected by the last LBA being a non-zero value.
 
+When a LUN is accessible from more than one port, this ioctl will
+return with the DK_CXLFLASH_ALL_PORTS_ACTIVE return flag set. This
+provides the user with a hint that I/O can be retried in the event
+of an I/O error as the LUN can be reached over multiple paths.
+
 DK_CXLFLASH_VLUN_RESIZE
 ---
 This ioctl is responsible for resizing a previously created virtual
diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 6a04867..ee23e81 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -29,6 +29,10 @@ extern const struct file_operations cxlflash_cxl_fops;
 #define NUM_FC_PORTS   CXLFLASH_NUM_FC_PORTS   /* ports per AFU */
#define MAX_FC_PORTS   CXLFLASH_MAX_FC_PORTS   /* max ports per AFU */
 
+#define CHAN2PORTMASK(_x)  (1 << (_x)) /* channel to port mask */
+#define PORTMASK2CHAN(_x)  (ilog2((_x)))   /* port mask to channel */
+#define PORTNUM2CHAN(_x)   ((_x) - 1)  /* port number to channel */
+
 #define CXLFLASH_BLOCK_SIZE4096/* 4K blocks */
 #define CXLFLASH_MAX_XFER_SIZE 16777216/* 16MB transfer */
 #define CXLFLASH_MAX_SECTORS   (CXLFLASH_MAX_XFER_SIZE/512)/* SCSI wants
diff --git a/drivers/scsi/cxlflash/lunmgt.c b/drivers/scsi/cxlflash/lunmgt.c
index 0efed17..4d232e2 100644
--- a/drivers/scsi/cxlflash/lunmgt.c
+++ b/drivers/scsi/cxlflash/lunmgt.c
@@ -252,7 +252,7 @@ int cxlflash_manage_lun(struct scsi_device *sdev,
 * in unpacked, AFU-friendly format, and hang LUN reference in
 * the sdev.
 */
-   lli->port_sel |= CHAN2PORT(chan);
+   lli->port_sel |= CHAN2PORTMASK(chan);
lli->lun_id[chan] = lun_to_lunid(sdev->lun);
sdev->hostdata = lli;
} else if (flags & DK_CXLFLASH_MANAGE_LUN_DISABLE_SUPERPIPE) {
@@ -264,7 +264,7 @@ int cxlflash_manage_lun(struct scsi_device *sdev,
 * tracking when no more references exist.
 */
sdev->hostdata = NULL;
-   lli->port_sel &= ~CHAN2PORT(chan);
+   lli->port_sel &= ~CHAN2PORTMASK(chan);
if (lli->port_sel == 0U)
lli->in_table = false;
}
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 3f9c869..04e1a8e 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -365,7 +365,6 @@ static int wait_resp(struct afu *afu, struct afu_cmd *cmd)
  */
 static int send_tmf(struct afu *afu, struct scsi_cmnd *scp, u64 tmfcmd)
 {
-   u32 port_sel = scp->device->channel + 1;
struct cxlflash_cfg *cfg = shost_priv(scp->device->host);
struct afu_cmd *cmd = sc_to_afucz(scp);
struct device *dev = &cfg->dev->dev;
@@ -388,7 +387,7 @@ static int send_tmf(struct afu *afu, struct scsi_cmnd *scp, 
u64 tmfcmd)
 
cmd->rcb.ctx_id = afu->ctx_hndl;
cmd->rcb.msi = SISL_MSI_RRQ_UPDATED;
-   cmd->rcb.port_sel = port_sel;
+   cmd->rcb.port_sel = CHAN2PORTMASK(scp->device->channel);
cmd->rcb.lun_id = lun_to_lunid(scp->device->lun);
cmd->rcb.req_flags = (SISL_REQ_FLAGS_PORT_LUN_ID |
  SISL_REQ_FLAGS_SUP_UNDERRUN |
@@ -444,7 +443,6 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, 
struct scsi_cmnd *scp)
struct device *dev = >dev->dev;
struct afu_cmd *cmd = sc_to_afucz(scp);
struct scatterlist *sg = 

[PATCH 05/17] cxlflash: Support dynamic number of FC ports

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Transition from a static number of FC ports to a value that is derived
during probe. For now, a static value is used, but it will later be
derived from the type of card being configured.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h|  7 ++--
 drivers/scsi/cxlflash/main.c  | 71 ---
 drivers/scsi/cxlflash/main.h  |  2 --
 drivers/scsi/cxlflash/sislite.h   |  1 +
 drivers/scsi/cxlflash/superpipe.h |  2 +-
 5 files changed, 51 insertions(+), 32 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 3ff05f1..6a04867 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -25,7 +25,9 @@
 
 extern const struct file_operations cxlflash_cxl_fops;
 
-#define MAX_CONTEXT  CXLFLASH_MAX_CONTEXT   /* num contexts per afu */
+#define MAX_CONTEXT    CXLFLASH_MAX_CONTEXT    /* num contexts per afu */
+#define NUM_FC_PORTS   CXLFLASH_NUM_FC_PORTS   /* ports per AFU */
+#define MAX_FC_PORTS   CXLFLASH_MAX_FC_PORTS   /* max ports per AFU */
 
 #define CXLFLASH_BLOCK_SIZE4096/* 4K blocks */
 #define CXLFLASH_MAX_XFER_SIZE 16777216/* 16MB transfer */
@@ -98,6 +100,7 @@ struct cxlflash_cfg {
struct pci_dev *dev;
struct pci_device_id *dev_id;
struct Scsi_Host *host;
+   int num_fc_ports;
 
ulong cxlflash_regs_pci;
 
@@ -118,7 +121,7 @@ struct cxlflash_cfg {
struct file_operations cxl_fops;
 
/* Parameters that are LUN table related */
-   int last_lun_index[CXLFLASH_NUM_FC_PORTS];
+   int last_lun_index[MAX_FC_PORTS];
int promote_lun_index;
struct list_head lluns; /* list of llun_info structs */
 
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 157d806..3f9c869 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -689,7 +689,7 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool 
wait)
global = &afu->afu_map->global;
 
/* Notify AFU */
-   for (i = 0; i < NUM_FC_PORTS; i++) {
+   for (i = 0; i < cfg->num_fc_ports; i++) {
reg = readq_be(&global->fc_regs[i][FC_CONFIG2 / 8]);
reg |= SISL_FC_SHUTDOWN_NORMAL;
writeq_be(reg, &global->fc_regs[i][FC_CONFIG2 / 8]);
@@ -699,7 +699,7 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool 
wait)
return;
 
/* Wait up to 1.5 seconds for shutdown processing to complete */
-   for (i = 0; i < NUM_FC_PORTS; i++) {
+   for (i = 0; i < cfg->num_fc_ports; i++) {
retry_cnt = 0;
while (true) {
status = readq_be(&global->fc_regs[i][FC_STATUS / 8]);
@@ -1072,6 +1072,7 @@ static const struct asyc_intr_info *find_ainfo(u64 status)
  */
 static void afu_err_intr_init(struct afu *afu)
 {
+   struct cxlflash_cfg *cfg = afu->parent;
int i;
u64 reg;
 
@@ -1107,7 +1108,7 @@ static void afu_err_intr_init(struct afu *afu)
writeq_be(reg, &afu->afu_map->global.fc_regs[0][FC_CONFIG2 / 8]);
 
/* now clear FC errors */
-   for (i = 0; i < NUM_FC_PORTS; i++) {
+   for (i = 0; i < cfg->num_fc_ports; i++) {
writeq_be(0xFFFFFFFFU,
  &afu->afu_map->global.fc_regs[i][FC_ERROR / 8]);
writeq_be(0, &afu->afu_map->global.fc_regs[i][FC_ERRCAP / 8]);
@@ -1394,7 +1395,7 @@ static int start_context(struct cxlflash_cfg *cfg)
 /**
  * read_vpd() - obtains the WWPNs from VPD
  * @cfg:   Internal structure associated with the host.
- * @wwpn:  Array of size NUM_FC_PORTS to pass back WWPNs
+ * @wwpn:  Array of size MAX_FC_PORTS to pass back WWPNs
  *
  * Return: 0 on success, -errno on failure
  */
@@ -1407,7 +1408,7 @@ static int read_vpd(struct cxlflash_cfg *cfg, u64 wwpn[])
ssize_t vpd_size;
char vpd_data[CXLFLASH_VPD_LEN];
char tmp_buf[WWPN_BUF_LEN] = { 0 };
-   char *wwpn_vpd_tags[NUM_FC_PORTS] = { "V5", "V6" };
+   char *wwpn_vpd_tags[MAX_FC_PORTS] = { "V5", "V6" };
 
/* Get the VPD data from the device */
vpd_size = cxl_read_adapter_vpd(pdev, vpd_data, sizeof(vpd_data));
@@ -1445,7 +1446,7 @@ static int read_vpd(struct cxlflash_cfg *cfg, u64 wwpn[])
 * because the conversion service requires that the ASCII
 * string be terminated.
 */
-   for (k = 0; k < NUM_FC_PORTS; k++) {
+   for (k = 0; k < cfg->num_fc_ports; k++) {
j = ro_size;
i = ro_start + PCI_VPD_LRDT_TAG_SIZE;
 
@@ -1474,6 +1475,8 @@ static int read_vpd(struct cxlflash_cfg *cfg, u64 wwpn[])
rc = -ENODEV;
goto out;
}
+
+   dev_dbg(dev, "%s: wwpn%d=%016llx\n", __func__, k, wwpn[k]);
}
 
 out:
@@ -1520,7 +1523,7 @@ static 

[PATCH 04/17] cxlflash: Update sysfs helper routines to pass config structure

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

As staging for future function, pass the config pointer instead of the
AFU pointer for port-related sysfs helper routines.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 30d68af..157d806 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -2058,13 +2058,16 @@ static int cxlflash_change_queue_depth(struct 
scsi_device *sdev, int qdepth)
 /**
  * cxlflash_show_port_status() - queries and presents the current port status
  * @port:  Desired port for status reporting.
- * @afu:   AFU owning the specified port.
+ * @cfg:   Internal structure associated with the host.
  * @buf:   Buffer of length PAGE_SIZE to report back port status in ASCII.
  *
  * Return: The size of the ASCII string returned in @buf.
  */
-static ssize_t cxlflash_show_port_status(u32 port, struct afu *afu, char *buf)
+static ssize_t cxlflash_show_port_status(u32 port,
+struct cxlflash_cfg *cfg,
+char *buf)
 {
+   struct afu *afu = cfg->afu;
char *disp_status;
u64 status;
__be64 __iomem *fc_regs;
@@ -2099,9 +2102,8 @@ static ssize_t port0_show(struct device *dev,
  char *buf)
 {
struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
-   struct afu *afu = cfg->afu;
 
-   return cxlflash_show_port_status(0, afu, buf);
+   return cxlflash_show_port_status(0, cfg, buf);
 }
 
 /**
@@ -2117,9 +2119,8 @@ static ssize_t port1_show(struct device *dev,
  char *buf)
 {
struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
-   struct afu *afu = cfg->afu;
 
-   return cxlflash_show_port_status(1, afu, buf);
+   return cxlflash_show_port_status(1, cfg, buf);
 }
 
 /**
@@ -2208,15 +2209,16 @@ static ssize_t ioctl_version_show(struct device *dev,
 /**
  * cxlflash_show_port_lun_table() - queries and presents the port LUN table
  * @port:  Desired port for status reporting.
- * @afu:   AFU owning the specified port.
+ * @cfg:   Internal structure associated with the host.
  * @buf:   Buffer of length PAGE_SIZE to report back port status in ASCII.
  *
  * Return: The size of the ASCII string returned in @buf.
  */
 static ssize_t cxlflash_show_port_lun_table(u32 port,
-   struct afu *afu,
+   struct cxlflash_cfg *cfg,
char *buf)
 {
+   struct afu *afu = cfg->afu;
int i;
ssize_t bytes = 0;
__be64 __iomem *fc_port;
@@ -2245,9 +2247,8 @@ static ssize_t port0_lun_table_show(struct device *dev,
char *buf)
 {
struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
-   struct afu *afu = cfg->afu;
 
-   return cxlflash_show_port_lun_table(0, afu, buf);
+   return cxlflash_show_port_lun_table(0, cfg, buf);
 }
 
 /**
@@ -2263,9 +2264,8 @@ static ssize_t port1_lun_table_show(struct device *dev,
char *buf)
 {
struct cxlflash_cfg *cfg = shost_priv(class_to_shost(dev));
-   struct afu *afu = cfg->afu;
 
-   return cxlflash_show_port_lun_table(1, afu, buf);
+   return cxlflash_show_port_lun_table(1, cfg, buf);
 }
 
 /**
-- 
2.1.0



[PATCH 03/17] cxlflash: Implement IRQ polling for RRQ processing

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Currently, RRQ processing takes place on hardware interrupt context. This
can be a heavy burden in some environments due to the overhead encountered
while completing RRQ entries. In an effort to improve system performance,
use the IRQ polling API to schedule this processing on softirq context.

This functionality will be disabled by default until suitable starting
values can be established for the hardware supported by this driver.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |   8 +++
 drivers/scsi/cxlflash/main.c   | 123 +++--
 2 files changed, 127 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 9d56b8c..3ff05f1 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -15,6 +15,7 @@
 #ifndef _CXLFLASH_COMMON_H
 #define _CXLFLASH_COMMON_H
 
+#include <linux/irq_poll.h>
 #include 
 #include 
 #include 
@@ -196,10 +197,17 @@ struct afu {
char version[16];
u64 interface_version;
 
+   u32 irqpoll_weight;
+   struct irq_poll irqpoll;
struct cxlflash_cfg *parent; /* Pointer back to parent cxlflash_cfg */
 
 };
 
+static inline bool afu_is_irqpoll_enabled(struct afu *afu)
+{
+   return !!afu->irqpoll_weight;
+}
+
 static inline bool afu_is_cmd_mode(struct afu *afu, u64 cmd_mode)
 {
u64 afu_cap = afu->interface_version >> SISL_INTVER_CAP_SHIFT;
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 8c207ba..30d68af 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -554,7 +554,7 @@ static void free_mem(struct cxlflash_cfg *cfg)
  * Safe to call with AFU in a partially allocated/initialized state.
  *
  * Cancels scheduled worker threads, waits for any active internal AFU
- * commands to timeout and then unmaps the MMIO space.
+ * commands to timeout, disables IRQ polling and then unmaps the MMIO space.
  */
 static void stop_afu(struct cxlflash_cfg *cfg)
 {
@@ -565,6 +565,8 @@ static void stop_afu(struct cxlflash_cfg *cfg)
if (likely(afu)) {
while (atomic_read(&afu->cmds_active))
ssleep(1);
+   if (afu_is_irqpoll_enabled(afu))
+   irq_poll_disable(&afu->irqpoll);
if (likely(afu->afu_map)) {
cxl_psa_unmap((void __iomem *)afu->afu_map);
afu->afu_map = NULL;
@@ -1158,12 +1160,13 @@ static irqreturn_t cxlflash_sync_err_irq(int irq, void 
*data)
  * process_hrrq() - process the read-response queue
  * @afu:   AFU associated with the host.
  * @doneq: Queue of commands harvested from the RRQ.
+ * @budget:Threshold of RRQ entries to process.
  *
  * This routine must be called holding the disabled RRQ spin lock.
  *
  * Return: The number of entries processed.
  */
-static int process_hrrq(struct afu *afu, struct list_head *doneq)
+static int process_hrrq(struct afu *afu, struct list_head *doneq, int budget)
 {
struct afu_cmd *cmd;
struct sisl_ioasa *ioasa;
@@ -1175,7 +1178,7 @@ static int process_hrrq(struct afu *afu, struct list_head 
*doneq)
*hrrq_end = afu->hrrq_end,
*hrrq_curr = afu->hrrq_curr;
 
-   /* Process however many RRQ entries that are ready */
+   /* Process ready RRQ entries up to the specified budget (if any) */
while (true) {
entry = *hrrq_curr;
 
@@ -1204,6 +1207,9 @@ static int process_hrrq(struct afu *afu, struct list_head 
*doneq)
 
atomic_inc(&afu->hsq_credits);
num_hrrq++;
+
+   if (budget > 0 && num_hrrq >= budget)
+   break;
}
 
afu->hrrq_curr = hrrq_curr;
@@ -1229,6 +1235,32 @@ static void process_cmd_doneq(struct list_head *doneq)
 }
 
 /**
+ * cxlflash_irqpoll() - process a queue of harvested RRQ commands
+ * @irqpoll:   IRQ poll structure associated with queue to poll.
+ * @budget:Threshold of RRQ entries to process per poll.
+ *
+ * Return: The number of entries processed.
+ */
+static int cxlflash_irqpoll(struct irq_poll *irqpoll, int budget)
+{
+   struct afu *afu = container_of(irqpoll, struct afu, irqpoll);
+   unsigned long hrrq_flags;
+   LIST_HEAD(doneq);
+   int num_entries = 0;
+
+   spin_lock_irqsave(&afu->hrrq_slock, hrrq_flags);
+
+   num_entries = process_hrrq(afu, &doneq, budget);
+   if (num_entries < budget)
+   irq_poll_complete(irqpoll);
+
+   spin_unlock_irqrestore(&afu->hrrq_slock, hrrq_flags);
+
+   process_cmd_doneq(&doneq);
+   return num_entries;
+}
+
+/**
  * cxlflash_rrq_irq() - interrupt handler for read-response queue (normal path)
  * @irq:   Interrupt number.
  * @data:  Private data provided at interrupt registration, the AFU.
@@ -1243,7 +1275,14 @@ static irqreturn_t 

[PATCH 02/17] cxlflash: Serialize RRQ access and support offlevel processing

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

As further staging to support processing the HRRQ by other means, access
to the HRRQ needs to be serialized by a disabled lock. This will allow
safe access in other non-hardware interrupt contexts. In an effort to
minimize the period where interrupts are disabled, support is added to
queue up commands harvested from the RRQ such that they can be processed
with hardware interrupts enabled. While this doesn't offer any improvement
when processing on a hardware interrupt, it will help once IRQ polling is
supported and the command completions can execute on softirq context.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |  2 ++
 drivers/scsi/cxlflash/main.c   | 42 +++---
 2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index d11dcc5..9d56b8c 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -134,6 +134,7 @@ struct afu_cmd {
struct afu *parent;
struct scsi_cmnd *scp;
struct completion cevent;
+   struct list_head queue;
 
u8 cmd_tmf:1;
 
@@ -181,6 +182,7 @@ struct afu {
struct sisl_ioarcb *hsq_start;
struct sisl_ioarcb *hsq_end;
struct sisl_ioarcb *hsq_curr;
+   spinlock_t hrrq_slock;
u64 *hrrq_start;
u64 *hrrq_end;
u64 *hrrq_curr;
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 30c09593c..8c207ba 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1157,10 +1157,13 @@ static irqreturn_t cxlflash_sync_err_irq(int irq, void 
*data)
 /**
  * process_hrrq() - process the read-response queue
  * @afu:   AFU associated with the host.
+ * @doneq: Queue of commands harvested from the RRQ.
+ *
+ * This routine must be called holding the disabled RRQ spin lock.
  *
  * Return: The number of entries processed.
  */
-static int process_hrrq(struct afu *afu)
+static int process_hrrq(struct afu *afu, struct list_head *doneq)
 {
struct afu_cmd *cmd;
struct sisl_ioasa *ioasa;
@@ -1189,7 +1192,7 @@ static int process_hrrq(struct afu *afu)
cmd = container_of(ioarcb, struct afu_cmd, rcb);
}
 
-   cmd_complete(cmd);
+   list_add_tail(&cmd->queue, doneq);
 
/* Advance to next entry or wrap and flip the toggle bit */
if (hrrq_curr < hrrq_end)
@@ -1210,17 +1213,43 @@ static int process_hrrq(struct afu *afu)
 }
 
 /**
+ * process_cmd_doneq() - process a queue of harvested RRQ commands
+ * @doneq: Queue of completed commands.
+ *
+ * Note that upon return the queue can no longer be trusted.
+ */
+static void process_cmd_doneq(struct list_head *doneq)
+{
+   struct afu_cmd *cmd, *tmp;
+
+   WARN_ON(list_empty(doneq));
+
+   list_for_each_entry_safe(cmd, tmp, doneq, queue)
+   cmd_complete(cmd);
+}
+
+/**
  * cxlflash_rrq_irq() - interrupt handler for read-response queue (normal path)
  * @irq:   Interrupt number.
  * @data:  Private data provided at interrupt registration, the AFU.
  *
- * Return: Always return IRQ_HANDLED.
+ * Return: IRQ_HANDLED or IRQ_NONE when no ready entries found.
  */
 static irqreturn_t cxlflash_rrq_irq(int irq, void *data)
 {
struct afu *afu = (struct afu *)data;
+   unsigned long hrrq_flags;
+   LIST_HEAD(doneq);
+   int num_entries = 0;
 
-   process_hrrq(afu);
+   spin_lock_irqsave(&afu->hrrq_slock, hrrq_flags);
+   num_entries = process_hrrq(afu, &doneq);
+   spin_unlock_irqrestore(&afu->hrrq_slock, hrrq_flags);
+
+   if (num_entries == 0)
+   return IRQ_NONE;
+
+   process_cmd_doneq(&doneq);
return IRQ_HANDLED;
 }
 
@@ -1540,14 +1569,13 @@ static int start_afu(struct cxlflash_cfg *cfg)
 
init_pcr(cfg);
 
-   /* After an AFU reset, RRQ entries are stale, clear them */
+   /* Initialize RRQ */
memset(&afu->rrq_entry, 0, sizeof(afu->rrq_entry));
-
-   /* Initialize RRQ pointers */
afu->hrrq_start = &afu->rrq_entry[0];
afu->hrrq_end = &afu->rrq_entry[NUM_RRQ_ENTRY - 1];
afu->hrrq_curr = afu->hrrq_start;
afu->toggle = 1;
+   spin_lock_init(&afu->hrrq_slock);
 
/* Initialize SQ */
if (afu_is_sq_cmd_mode(afu)) {
-- 
2.1.0



[PATCH 01/17] cxlflash: Separate RRQ processing from the RRQ interrupt handler

2017-04-12 Thread Uma Krishnan
From: "Matthew R. Ochs" 

In order to support processing the HRRQ by other means (e.g. polling),
the processing portion of the current RRQ interrupt handler needs to be
broken out into a separate routine. This will allow RRQ processing from
places other than the RRQ hardware interrupt handler.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 3061d80..30c09593c 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1155,19 +1155,18 @@ static irqreturn_t cxlflash_sync_err_irq(int irq, void 
*data)
 }
 
 /**
- * cxlflash_rrq_irq() - interrupt handler for read-response queue (normal path)
- * @irq:   Interrupt number.
- * @data:  Private data provided at interrupt registration, the AFU.
+ * process_hrrq() - process the read-response queue
+ * @afu:   AFU associated with the host.
  *
- * Return: Always return IRQ_HANDLED.
+ * Return: The number of entries processed.
  */
-static irqreturn_t cxlflash_rrq_irq(int irq, void *data)
+static int process_hrrq(struct afu *afu)
 {
-   struct afu *afu = (struct afu *)data;
struct afu_cmd *cmd;
struct sisl_ioasa *ioasa;
struct sisl_ioarcb *ioarcb;
bool toggle = afu->toggle;
+   int num_hrrq = 0;
u64 entry,
*hrrq_start = afu->hrrq_start,
*hrrq_end = afu->hrrq_end,
@@ -1201,11 +1200,27 @@ static irqreturn_t cxlflash_rrq_irq(int irq, void *data)
}
 
atomic_inc(&afu->hsq_credits);
+   num_hrrq++;
}
 
afu->hrrq_curr = hrrq_curr;
afu->toggle = toggle;
 
+   return num_hrrq;
+}
+
+/**
+ * cxlflash_rrq_irq() - interrupt handler for read-response queue (normal path)
+ * @irq:   Interrupt number.
+ * @data:  Private data provided at interrupt registration, the AFU.
+ *
+ * Return: Always return IRQ_HANDLED.
+ */
+static irqreturn_t cxlflash_rrq_irq(int irq, void *data)
+{
+   struct afu *afu = (struct afu *)data;
+
+   process_hrrq(afu);
return IRQ_HANDLED;
 }
 
-- 
2.1.0



[PATCH 00/17] cxlflash: Enhancements and miscellaneous fixes

2017-04-12 Thread Uma Krishnan
This patch series contains miscellaneous patches and adds 4-port device
support. The series also includes patches to improve performance of the
driver in the legacy I/O path.

This series is intended for 4.12 and is bisectable.

Matthew R. Ochs (16):
  cxlflash: Separate RRQ processing from the RRQ interrupt handler
  cxlflash: Serialize RRQ access and support offlevel processing
  cxlflash: Implement IRQ polling for RRQ processing
  cxlflash: Update sysfs helper routines to pass config structure
  cxlflash: Support dynamic number of FC ports
  cxlflash: Remove port configuration assumptions
  cxlflash: Hide FC internals behind common access routine
  cxlflash: SISlite updates to support 4 ports
  cxlflash: Support up to 4 ports
  cxlflash: Fence EEH during probe
  cxlflash: Remove unnecessary DMA mapping
  cxlflash: Fix power-of-two validations
  cxlflash: Fix warnings/errors
  cxlflash: Improve asynchronous interrupt processing
  cxlflash: Add hardware queues attribute
  cxlflash: Introduce hardware queue steering

Uma Krishnan (1):
  cxlflash: Support multiple hardware queues

 Documentation/powerpc/cxlflash.txt |5 +
 drivers/scsi/cxlflash/common.h |  137 +++--
 drivers/scsi/cxlflash/lunmgt.c |4 +-
 drivers/scsi/cxlflash/main.c   | 1162 +++-
 drivers/scsi/cxlflash/main.h   |2 -
 drivers/scsi/cxlflash/sislite.h|  124 ++--
 drivers/scsi/cxlflash/superpipe.c  |   16 +-
 drivers/scsi/cxlflash/superpipe.h  |   56 +-
 drivers/scsi/cxlflash/vlun.c   |   99 +--
 drivers/scsi/cxlflash/vlun.h   |2 +-
 10 files changed, 1182 insertions(+), 425 deletions(-)

-- 
2.1.0



[PATCH 5/5] powerpc: Enable support for new 'ibm, dynamic-memory-v2' devtree property

2017-04-12 Thread Michael Bringmann
prom_init.c: Enable support for the new DRC device tree property
"ibm,dynamic-memory-v2" in the initial handshake between the Linux kernel
and the front end processor.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/kernel/prom_init.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 102b1a1..6ab7b6a 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -869,7 +869,7 @@ struct ibm_arch_vec __cacheline_aligned 
ibm_architecture_vec = {
.mmu = 0,
.hash_ext = 0,
.radix_ext = 0,
-   .byte22 = OV5_FEAT(OV5_DRC_INFO),
+   .byte22 = OV5_FEAT(OV5_DRC_INFO) | OV5_FEAT(OV5_DYN_MEM_V2),
},
 
/* option vector 6: IBM PAPR hints */



[PATCH 4/5] pseries/hotplug init: Convert new DRC memory property for hotplug runtime

2017-04-12 Thread Michael Bringmann
hotplug_init: Simplify the code needed for runtime memory hotplug and
maintenance with a conversion routine that transforms the compressed
property "ibm,dynamic-memory-v2" to the form of "ibm,dynamic-memory"
within the "ibm,dynamic-reconfiguration-memory" property.  Thus only
a single set of routines should be required at runtime to parse, edit,
and manipulate the memory representation in the device tree.  Similarly,
any userspace applications that need this information will only need
to recognize the older format to be able to continue to operate.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/platforms/pseries/Makefile |4 -
 arch/powerpc/platforms/pseries/hotplug-memory.c |   96 +++
 2 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 8f4ba08..87eb665 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -5,14 +5,14 @@ obj-y := lpar.o hvCall.o nvram.o reconfig.o \
   of_helpers.o \
   setup.o iommu.o event_sources.o ras.o \
   firmware.o power.o dlpar.o mobility.o rng.o \
-  pci.o pci_dlpar.o eeh_pseries.o msi.o
+  pci.o pci_dlpar.o eeh_pseries.o msi.o \
+  hotplug-memory.o
 obj-$(CONFIG_SMP)  += smp.o
 obj-$(CONFIG_SCANLOG)  += scanlog.o
 obj-$(CONFIG_KEXEC_CORE)   += kexec.o
 obj-$(CONFIG_PSERIES_ENERGY)   += pseries_energy.o
 
 obj-$(CONFIG_HOTPLUG_CPU)  += hotplug-cpu.o
-obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o
 
 obj-$(CONFIG_HVC_CONSOLE)  += hvconsole.o
 obj-$(CONFIG_HVCS) += hvcserver.o
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index e104c71..92f41a1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -24,8 +24,6 @@
 #include 
 #include "pseries.h"
 
-static bool rtas_hp_event;
-
 unsigned long pseries_memory_block_size(void)
 {
struct device_node *np;
@@ -69,6 +67,10 @@ unsigned long pseries_memory_block_size(void)
return memblock_size;
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static bool rtas_hp_event;
+
 static void dlpar_free_property(struct property *prop)
 {
kfree(prop->name);
@@ -1165,11 +1167,101 @@ static int pseries_memory_notifier(struct notifier_block *nb,
 static struct notifier_block pseries_mem_nb = {
.notifier_call = pseries_memory_notifier,
 };
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static int pseries_rewrite_dynamic_memory_v2(void)
+{
+   unsigned long memblock_size;
+   struct device_node *dn;
+   struct property *prop, *prop_v2;
+   __be32 *p;
+   struct of_drconf_cell *lmbs;
+   u32 num_lmb_desc_sets, num_lmbs;
+   int i, j, k;
+
+   dn = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (!dn)
+   return -EINVAL;
+
+   prop_v2 = of_find_property(dn, "ibm,dynamic-memory-v2", NULL);
+   if (!prop_v2)
+   return -EINVAL;
+
+   memblock_size = pseries_memory_block_size();
+   if (!memblock_size)
+   return -EINVAL;
+
+   /* The first int of the property is the number of lmb sets
+* described by the property.
+*/
+   p = (__be32 *)prop_v2->value;
+   num_lmb_desc_sets = be32_to_cpu(*p++);
+
+   /* Count the number of LMBs for generating the alternate format
+*/
+   for (i = 0, num_lmbs = 0; i < num_lmb_desc_sets; i++) {
+   struct of_drconf_cell_v2 drmem;
+
+   read_drconf_cell_v2(&drmem, (const __be32 **)&p);
+   num_lmbs += drmem.num_seq_lmbs;
+   }
+
+   /* Create an empty copy of the new 'ibm,dynamic-memory' property
+*/
+   prop = kzalloc(sizeof(*prop), GFP_KERNEL);
+   if (!prop)
+   return -ENOMEM;
+   prop->name = kstrdup("ibm,dynamic-memory", GFP_KERNEL);
+   prop->length = dyn_mem_v2_len(num_lmbs);
+   prop->value = kzalloc(prop->length, GFP_KERNEL);
+
+   /* Copy/expand the ibm,dynamic-memory-v2 format to produce the
+* ibm,dynamic-memory format.
+*/
+   p = (__be32 *)prop->value;
+   *p = cpu_to_be32(num_lmbs);
+   p++;
+   lmbs = (struct of_drconf_cell *)p;
+
+   p = (__be32 *)prop_v2->value;
+   p++;
+
+   for (i = 0, k = 0; i < num_lmb_desc_sets; i++) {
+   struct of_drconf_cell_v2 drmem;
+
+   read_drconf_cell_v2(&drmem, (const __be32 **)&p);
+
+   for (j = 0; j < drmem.num_seq_lmbs; j++) {
+   lmbs[k+j].base_addr = be64_to_cpu(drmem.base_addr);
+   lmbs[k+j].drc_index = be32_to_cpu(drmem.drc_index);
+   lmbs[k+j].aa_index  = 
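
The loop above expands each compressed v2 set (which describes `num_seq_lmbs` consecutive LMBs) into individual `ibm,dynamic-memory` entries. A hedged userspace sketch of that expansion is below; the struct names mirror the patch, and the per-LMB increments (base address advancing by the memblock size, consecutive DRC indexes, shared `aa_index`/`flags`) are assumptions drawn from the "sequential LMBs" layout described in the series, not copied from the truncated hunk:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, host-endian mirrors of the patch's structures. */
struct lmb_v1 {
	uint64_t base_addr;
	uint32_t drc_index;
	uint32_t aa_index;
	uint32_t flags;
};

struct lmb_set_v2 {
	uint32_t num_seq_lmbs;
	uint64_t base_addr;
	uint32_t drc_index;
	uint32_t aa_index;
	uint32_t flags;
};

/*
 * Expand compressed v2 sets into per-LMB v1 entries.  Assumption: LMBs in a
 * set are laid out back to back (base advances by memblock_size) with
 * consecutive DRC indexes; aa_index and flags are shared by the whole set.
 * Returns the number of v1 entries written to @out.
 */
static size_t expand_dynamic_memory_v2(const struct lmb_set_v2 *sets,
				       size_t nsets, uint64_t memblock_size,
				       struct lmb_v1 *out)
{
	size_t k = 0;

	for (size_t i = 0; i < nsets; i++) {
		for (uint32_t j = 0; j < sets[i].num_seq_lmbs; j++) {
			out[k].base_addr = sets[i].base_addr +
					   (uint64_t)j * memblock_size;
			out[k].drc_index = sets[i].drc_index + j;
			out[k].aa_index  = sets[i].aa_index;
			out[k].flags     = sets[i].flags;
			k++;
		}
	}
	return k;
}
```

With a 256MB memblock, one set of three sequential LMBs expands to three entries whose bases step by 0x10000000.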

[PATCH 3/5] powerpc/memory: Parse new memory property to initialize structures.

2017-04-12 Thread Michael Bringmann
powerpc/memory: Add parallel routines to parse the new property
"ibm,dynamic-memory-v2" property when it is present, and then to
finish initialization of the relevant memory structures with the
operating system.  This code is shared between the boot-time
initialization functions and the runtime functions for memory
hotplug, so it needs to be able to handle both formats.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/include/asm/prom.h |8 ++
 arch/powerpc/mm/numa.c  |  193 +--
 2 files changed, 152 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 77d76d8..b919c1e 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -117,6 +117,14 @@ extern int of_one_drc_info(struct property **prop, void **curval,
u32 *sequential_inc_p,
u32 *last_drc_index_p);
 
+static inline int dyn_mem_v2_len(int entries)
+{
+   /* Calculate for counter + number of cells that follow */
+   int drconf_v2_cells = (n_mem_addr_cells + 4);
+   int drconf_v2_cells_len = (drconf_v2_cells * sizeof(unsigned int));
+   return (((entries) * drconf_v2_cells_len) + sizeof(unsigned int));
+}
+
 /*
  * There are two methods for telling firmware what our capabilities are.
  * Newer machines have an "ibm,client-architecture-support" method on the
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 4fdc5ff..b035a8a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -425,30 +425,55 @@ void read_drconf_cell_v2(struct of_drconf_cell_v2 *drmem, const __be32 **cellp)
 }
 
 /*
- * Retrieve and validate the ibm,dynamic-memory property of the device tree.
+ * Retrieve and validate the ibm,dynamic-memory[-v2] property of the
+ * device tree.
+ *
+ * The layout of the ibm,dynamic-memory property is a number N of memory
+ * block description list entries followed by N memory block description
+ * list entries.  Each memory block description list entry contains
+ * information as laid out in the of_drconf_cell struct above.
  *
- * The layout of the ibm,dynamic-memory property is a number N of memblock
- * list entries followed by N memblock list entries.  Each memblock list entry
- * contains information as laid out in the of_drconf_cell struct above.
+ * The layout of the ibm,dynamic-memory-v2 property is a number N of memory
+ * block set description list entries, followed by N memory block set
+ * description set entries.
  */
 static int of_get_drconf_memory(struct device_node *memory, const __be32 **dm)
 {
const __be32 *prop;
u32 len, entries;
 
-   prop = of_get_property(memory, "ibm,dynamic-memory", &len);
-   if (!prop || len < sizeof(unsigned int))
-   return 0;
+   if (firmware_has_feature(FW_FEATURE_DYN_MEM_V2)) {
 
-   entries = of_read_number(prop++, 1);
-   prop = of_get_property(memory, "ibm,dynamic-memory-v2", &len);
+   if (!prop || len < sizeof(unsigned int))
+   return 0;
 
-   /* Now that we know the number of entries, revalidate the size
-* of the property read in to ensure we have everything
-*/
-   if (len < (entries * (n_mem_addr_cells + 4) + 1) * sizeof(unsigned int))
-   return 0;
+   entries = of_read_number(prop++, 1);
+
+   /* Now that we know the number of set entries, revalidate the
+* size of the property read in to ensure we have everything.
+*/
+   if (len < dyn_mem_v2_len(entries))
+   return 0;
+
+   *dm = prop;
+   } else {
+   prop = of_get_property(memory, "ibm,dynamic-memory", &len);
+   if (!prop || len < sizeof(unsigned int))
+   return 0;
+
+   entries = of_read_number(prop++, 1);
+
+   /* Now that we know the number of entries, revalidate the size
+* of the property read in to ensure we have everything
+*/
+   if (len < (entries * (n_mem_addr_cells + 4) + 1) *
+  sizeof(unsigned int))
+   return 0;
+
+   *dm = prop;
+   }
 
-   *dm = prop;
return entries;
 }
 
@@ -511,7 +536,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
-static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
+static int of_drconf_to_nid_single(u32 drmem_flags, u32 drmem_aa_index,
   struct assoc_arrays *aa)
 {
int default_nid = 0;
@@ -519,16 +544,16 @@ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
int index;
 
if (min_common_depth > 0 && min_common_depth <= aa->array_sz 
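
The `dyn_mem_v2_len()` helper introduced in `prom.h` above sizes the v2 property as one `unsigned int` counter plus, per set entry, `n_mem_addr_cells` cells for the base address and four more cells (sequence count, drc_index, aa_index, flags). A standalone copy of that arithmetic, runnable off-kernel, is below; `n_mem_addr_cells = 2` is an assumption (a 64-bit base address encoded as two 32-bit cells):

```c
#include <assert.h>

/* Standalone copy of the patch's sizing logic.  n_mem_addr_cells = 2 is
 * assumed here (64-bit base address as two 32-bit cells); in the kernel it
 * comes from the device tree's #address-cells. */
static int n_mem_addr_cells = 2;

static int dyn_mem_v2_len(int entries)
{
	/* counter + number of cells that follow per entry */
	int drconf_v2_cells = n_mem_addr_cells + 4;
	int drconf_v2_cells_len = drconf_v2_cells * (int)sizeof(unsigned int);

	return entries * drconf_v2_cells_len + (int)sizeof(unsigned int);
}
```

So each set entry occupies 6 cells (24 bytes) and the property carries a 4-byte leading counter: one entry is 28 bytes, three entries are 76.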

[PATCH 2/5] powerpc/memory: Parse new memory property to register blocks.

2017-04-12 Thread Michael Bringmann
powerpc/memory: Add parallel routines to parse the new property
"ibm,dynamic-memory-v2" property when it is present, and then to
register the relevant memory blocks with the operating system.
This property format is intended to provide a more compact
representation of memory when communicating with the front end
processor, especially when describing vast amounts of RAM.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/include/asm/firmware.h   |4 +
 arch/powerpc/include/asm/prom.h   |   25 +-
 arch/powerpc/kernel/prom.c|  125 -
 arch/powerpc/mm/numa.c|   20 -
 arch/powerpc/platforms/pseries/firmware.c |1 
 5 files changed, 146 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 329d537..062e5f5 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRNASM_CONST(0x0002)
 #define FW_FEATURE_DRC_INFOASM_CONST(0x0004)
+#define FW_FEATURE_DYN_MEM_V2  ASM_CONST(0x0008)
 
 #ifndef __ASSEMBLY__
 
@@ -68,7 +69,8 @@ enum {
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
-   FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRC_INFO,
+   FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRC_INFO |
+   FW_FEATURE_DYN_MEM_V2,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index d469d7c..77d76d8 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -69,6 +69,8 @@ struct boot_param_header {
  * OF address retreival & translation
  */
 
+extern int n_mem_addr_cells;
+
 /* Parse the ibm,dma-window property of an OF node into the busno, phys and
  * size parameters.
  */
@@ -81,8 +83,9 @@ void of_parse_dma_window(struct device_node *dn, const __be32 *dma_window,
 extern int of_get_ibm_chip_id(struct device_node *np);
 
 /* The of_drconf_cell struct defines the layout of the LMB array
- * specified in the device tree property
- * ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory
+ * specified in the device tree properties,
+ * ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory
+ * ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory-v2
  */
 struct of_drconf_cell {
u64 base_addr;
@@ -92,9 +95,20 @@ struct of_drconf_cell {
u32 flags;
 };
 
-#define DRCONF_MEM_ASSIGNED0x0008
-#define DRCONF_MEM_AI_INVALID  0x0040
-#define DRCONF_MEM_RESERVED0x0080
+#define DRCONF_MEM_ASSIGNED0x0008
+#define DRCONF_MEM_AI_INVALID  0x0040
+#define DRCONF_MEM_RESERVED0x0080
+
+struct of_drconf_cell_v2 {
+   u32 num_seq_lmbs;
+   u64 base_addr;
+   u32 drc_index;
+   u32 aa_index;
+   u32 flags;
+} __attribute__((packed));
+
+extern void read_drconf_cell_v2(struct of_drconf_cell_v2 *drmem,
+   const __be32 **cellp);
 
 extern int of_one_drc_info(struct property **prop, void **curval,
char **dtype, char **dname,
@@ -180,6 +194,7 @@ extern int of_one_drc_info(struct property **prop, void **curval,
 /* Radix Table Extensions */
 #define OV5_RADIX_GTSE 0x1A40  /* Guest Translation Shoot Down Avail */
 #define OV5_DRC_INFO   0x1640  /* Redef Prop Structures: drc-info   */
+#define OV5_DYN_MEM_V2 0x1680  /* Redef Prop Structures: dyn-mem-v2   */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX  0x02/* Linux is our OS */
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index bca4abd..1bc0a36 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -76,6 +76,21 @@
 static phys_addr_t first_memblock_size;
 static int __initdata boot_cpu_count;
 
+#ifdef CONFIG_PPC64
+static int if_iommu_is_off(u64 base, u64 *size)
+{
+   if (iommu_is_off) {
+   if (base >= 0x80000000ul)
+   return 1;
+   if ((base + (*size)) > 0x80000000ul)
+   (*size) = 0x80000000ul - base;
+   }
+   return 0;
+}
+#else
+#defineif_iommu_is_off(base, size) 0
+#endif
+
 static int __init early_parse_mem(char *p)
 {
if (!p)
@@ -444,23 +459,34 @@ static int __init early_init_dt_scan_chosen_ppc(unsigned long node,
 
 #ifdef CONFIG_PPC_PSERIES
 /*
- * Interpret the ibm,dynamic-memory property in the
- * /ibm,dynamic-reconfiguration-memory node.
+ * Retrieve and validate the ibm,lmb-size 
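
`read_drconf_cell_v2()`, declared above, walks a stream of big-endian cells and fills a `struct of_drconf_cell_v2`. A hedged host-side sketch of the same walk follows; the field order (sequence count, 64-bit base as two cells, drc_index, aa_index, flags) is assumed from the struct layout in the patch, and `ntohl()` stands in for the kernel's `be32_to_cpu()`:

```c
#include <assert.h>
#include <arpa/inet.h>  /* ntohl()/htonl() as stand-ins for be32 helpers */
#include <stdint.h>

/* Host-side mirror of struct of_drconf_cell_v2 from the patch. */
struct drconf_cell_v2 {
	uint32_t num_seq_lmbs;
	uint64_t base_addr;
	uint32_t drc_index;
	uint32_t aa_index;
	uint32_t flags;
};

/* Pull one set descriptor off a big-endian cell stream, advancing *cellp.
 * Cell order is an assumption matching the struct declaration order. */
static void read_cell_v2(struct drconf_cell_v2 *d, const uint32_t **cellp)
{
	const uint32_t *p = *cellp;

	d->num_seq_lmbs = ntohl(*p++);
	d->base_addr = (uint64_t)ntohl(*p++) << 32;
	d->base_addr |= ntohl(*p++);
	d->drc_index = ntohl(*p++);
	d->aa_index = ntohl(*p++);
	d->flags = ntohl(*p++);
	*cellp = p;
}
```

Advancing the caller's cursor through `*cellp` is what lets the counting loop in patch 4/5 consume one descriptor per iteration.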

[PATCH 1/5] powerpc/dynmemv2: Check arch.vec earlier during boot for memory features

2017-04-12 Thread Michael Bringmann
architecture.vec5 features: The boot-time memory management needs to
know the form of the "ibm,dynamic-memory-v2" property early during
scanning of the flattened device tree.  This patch moves execution of
the function pseries_probe_fw_features() early enough to be before
the scanning of the memory properties in the device tree to allow
recognition of the supported properties.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/kernel/prom.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f5d399e..bca4abd 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -679,6 +679,9 @@ void __init early_init_devtree(void *params)
 */
of_scan_flat_dt(early_init_dt_scan_chosen_ppc, boot_command_line);
 
+   /* Now try to figure out if we are running on LPAR and so on */
+   pseries_probe_fw_features();
+
/* Scan memory nodes and rebuild MEMBLOCKs */
of_scan_flat_dt(early_init_dt_scan_root, NULL);
of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
@@ -746,9 +749,6 @@ void __init early_init_devtree(void *params)
 #endif
epapr_paravirt_early_init();
 
-   /* Now try to figure out if we are running on LPAR and so on */
-   pseries_probe_fw_features();
-
 #ifdef CONFIG_PPC_PS3
/* Identify PS3 firmware */
if (of_flat_dt_is_compatible(of_get_flat_dt_root(), "sony,ps3"))



[PATCH 0/5] powerpc/devtree: Add support for 'ibm,dynamic-memory-v2' property

2017-04-12 Thread Michael Bringmann
"ibm,dynamic-memory-v2": This property replaces the "ibm,dynamic-memory"
node representation within the "ibm,dynamic-reconfiguration-memory"
property provided by the BMC.  This element format is intended to provide
a more compact representation of memory, especially, for systems with
massive amounts of RAM.  To simplify portability, this property is
converted to the "ibm,dynamic-memory" property during system boot.

"ibm,architecture.vec": Bidirectional communication mechanism between
the host system and the front end processor indicating what features
the host system supports and what features the front end processor will
actually provide.  In this case, we are indicating that the host system
can support the new device tree structure "ibm,dynamic-memory-v2".

Signed-off-by: Michael Bringmann 

Michael Bringmann (5):
  powerpc: Check arch.vec earlier during boot for memory features
  powerpc/memory: Parse new memory property to register blocks.
  powerpc/memory: Parse new memory property to initialize structures.
  pseries/hotplug init: Convert new DRC memory property for hotplug runtime
  powerpc: Enable support for new 'ibm,dynamic-memory-v2' devtree property

 arch/powerpc/include/asm/firmware.h |4 
 arch/powerpc/include/asm/prom.h |   33 +++-
 arch/powerpc/kernel/prom.c  |  131 +++---
 arch/powerpc/kernel/prom_init.c |2 
 arch/powerpc/mm/numa.c  |  213 ++-
 arch/powerpc/platforms/pseries/Makefile |4 
 arch/powerpc/platforms/pseries/firmware.c   |1 
 arch/powerpc/platforms/pseries/hotplug-memory.c |   96 ++
 8 files changed, 398 insertions(+), 86 deletions(-)

--



[PATCH][OPAL] cpufeatures: add base and POWER8, POWER9 /cpus/features dt

2017-04-12 Thread Nicholas Piggin
This is the skiboot patch, included here in case anyone wants to test or
review it. I've put more of the feature documentation into this patch, so I
haven't duplicated it on the Linux side -- firmware will be the canonical
definition.

---
 core/Makefile.inc  |   2 +-
 core/cpufeatures.c | 888 +
 core/device.c  |   7 +
 core/init.c|   1 +
 include/device.h   |   1 +
 include/skiboot.h  |   5 +
 6 files changed, 903 insertions(+), 1 deletion(-)
 create mode 100644 core/cpufeatures.c

diff --git a/core/Makefile.inc b/core/Makefile.inc
index b09c30c0..7c247836 100644
--- a/core/Makefile.inc
+++ b/core/Makefile.inc
@@ -8,7 +8,7 @@ CORE_OBJS += pci-opal.o fast-reboot.o device.o exceptions.o 
trace.o affinity.o
 CORE_OBJS += vpd.o hostservices.o platform.o nvram.o nvram-format.o hmi.o
 CORE_OBJS += console-log.o ipmi.o time-utils.o pel.o pool.o errorlog.o
 CORE_OBJS += timer.o i2c.o rtc.o flash.o sensor.o ipmi-opal.o
-CORE_OBJS += flash-subpartition.o bitmap.o buddy.o pci-quirk.o
+CORE_OBJS += flash-subpartition.o bitmap.o buddy.o pci-quirk.o cpufeatures.o
 
 ifeq ($(SKIBOOT_GCOV),1)
 CORE_OBJS += gcov-profiling.o
diff --git a/core/cpufeatures.c b/core/cpufeatures.c
new file mode 100644
index ..d717e4d7
--- /dev/null
+++ b/core/cpufeatures.c
@@ -0,0 +1,888 @@
+/* Copyright 2017 IBM Corp.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ * implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/*
+ * This file deals with setting up the /cpus/features device tree
+ * by discovering CPU hardware and populating feature nodes.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define DEBUG 1
+#ifdef DEBUG
+#define DBG(fmt, a...) prlog(PR_DEBUG, "CPUFT: " fmt, ##a)
+#else
+#define DBG(fmt, a...)
+#endif
+
+#define USABLE_PR  (1U << 0)
+#define USABLE_OS  (1U << 1)
+#define USABLE_HV  (1U << 2)
+
+#define HV_SUPPORT_NONE0
+#define HV_SUPPORT_CUSTOM  1
+#define HV_SUPPORT_HFSCR   2
+
+#define OS_SUPPORT_NONE0
+#define OS_SUPPORT_CUSTOM  1
+#define OS_SUPPORT_FSCR2
+
+/* CPU variant numbers */
+#define CPUFEATURES_CPU_P8_DD1 1 /* leave 0 unused */
+#define CPUFEATURES_CPU_P8_DD2 2
+#define CPUFEATURES_CPU_P9_DD1 3
+#define CPUFEATURES_CPU_P9_DD2 4
+
+/* Bitmasks for the match table */
+#define P8_DD1 (1U << CPUFEATURES_CPU_P8_DD1)
+#define P8_DD2 (1U << CPUFEATURES_CPU_P8_DD2)
+#define P9_DD1 (1U << CPUFEATURES_CPU_P9_DD1)
+#define P9_DD2 (1U << CPUFEATURES_CPU_P9_DD2)
+
+#define P8 (P8_DD1|P8_DD2)
+#define P9 (P9_DD1|P9_DD2)
+#define CPU_ALL(P8|P9)
+
+#define CPUFEATURES_ISA_V2_07B 2070
+#define CPUFEATURES_ISA_V3_0B  3000
+
+#define ISA_BASE   0
+#define ISA_V3_0B  CPUFEATURES_ISA_V3_0B
+
+struct cpu_feature {
+   const char *name;
+   uint32_t cpus_supported;
+   uint32_t isa;
+   uint32_t usable_mask;
+   uint32_t hv_support;
+   uint32_t os_support;
+   uint32_t hfscr_bit_nr;
+   uint32_t fscr_bit_nr;
+   uint32_t hwcap_bit_nr;
+   const char *dependencies_names; /* space-delimited names */
+};
+
+/*
+ * The base (or NULL) cpu feature set is the CPU features available
+ * when no child nodes of the /cpus/features node exist. The base feature
+ * set is POWER8 (ISAv2.07B), less features that are listed explicitly.
+ *
+ * There will be a /cpus/features/isa property that specifies the currently
+ * active ISA level. Those architected features without explicit nodes
+ * will match the current ISA level. A greater ISA level will imply some
+ * features are phased out.
+ *
+ * XXX: currently, the feature dependencies are not necessarily captured
+ * exactly or completely. This is somewhat acceptable because all
+ * implementations must be aware of all these features.
+ */
+static const struct cpu_feature cpu_features_table[] = {
+   /*
+* Big endian as in ISAv2.07B, MSR_LE=0
+*/
+   { "big-endian",
+   CPU_ALL,
+   ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
+   HV_SUPPORT_CUSTOM, OS_SUPPORT_CUSTOM,
+   -1, -1, -1,
+   NULL, },
+
+   /*
+* Little endian as in ISAv2.07B, MSR_LE=1.
+*
+* When both big and little endian are defined, there is an LPCR ILE
+* bit and implementation specific way to switch HILE mode, MSR_SLE,
+* etc.
+*/
+   { 

[PATCH 3/3] powerpc/64s: cpufeatures: add initial implementation for cpufeatures

2017-04-12 Thread Nicholas Piggin
The /cpus/features dt binding describes architected CPU features along
with some compatibility, privilege, and enablement properties that allow
flexibility with discovering and enabling capabilities.

Presence of this feature implies a base level of functionality, then
additional feature nodes advertise the presence of new features.

A given feature and its setup procedure is defined once and used by all
CPUs which are compatible by that feature. Features that follow a
supported "prescription" can be enabled by a hypervisor or OS that
does not understand them natively.

One difference with features after this patch is that PPC_FEATURE2_EBB
is set independent of PMU init. EBB facility is more general than PMU,
so I think this is reasonable.

Signed-off-by: Nicholas Piggin 

---
Since last post:
- Looked at XIVE patches compatibility. There is a dependency there now
  on LPCR setup (HEIC, LPES, etc).
- Split vector-scalar feature from vector
- Get cpu_name from cputable if exists, for /proc/cpuinfo compatibility.


Testing under mambo shows a few differences with POWER8, but they all
seem to be mambo related:
- HFSCR bit 54 and 57 are now clear (mambo sets at init)
- PMAO_BUG is set. This is due to mambo setting architected POWER8 mode
  and POWER8E PVR. Current kernels lose PMAO_BUG bit.
- CI_LARGE_PAGE is now set (mambo boot does not set it for some reason,
  haven't looked at why).

Under POWER9 I haven't found differences.

 .../devicetree/bindings/powerpc/cpufeatures.txt| 264 +
 arch/powerpc/Kconfig   |  16 +
 arch/powerpc/include/asm/cpu_has_feature.h |   4 +-
 arch/powerpc/include/asm/cpufeatures.h |  57 ++
 arch/powerpc/include/asm/cputable.h|   2 +
 arch/powerpc/include/asm/reg.h |   1 +
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/cpufeatures.c  | 631 +
 arch/powerpc/kernel/cputable.c |  37 +-
 arch/powerpc/kernel/prom.c | 333 ++-
 arch/powerpc/kernel/setup-common.c |   2 +-
 arch/powerpc/kernel/setup_64.c |  15 +-
 12 files changed, 1346 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/powerpc/cpufeatures.txt
 create mode 100644 arch/powerpc/include/asm/cpufeatures.h
 create mode 100644 arch/powerpc/kernel/cpufeatures.c

diff --git a/Documentation/devicetree/bindings/powerpc/cpufeatures.txt b/Documentation/devicetree/bindings/powerpc/cpufeatures.txt
new file mode 100644
index ..325b263f4cdf
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/cpufeatures.txt
@@ -0,0 +1,264 @@
+powerpc cpu features binding
+
+
+The device tree describes supported CPU features as nodes containing
+compatibility and enablement information as properties.
+
+The binding specifies features common to all CPUs in the system.
+Heterogeneous CPU features are not supported at present (such could be added
+by providing nodes with additional features and linking those to particular
+CPUs).
+
+This binding is intended to provide fine grained control of CPU features at
+all levels of the stack (firmware, hypervisor, OS, userspace), with the
+ability for new CPU features to be used by some components without all
+components being upgraded (e.g., a new floating point instruction could be
+used by userspace math library without upgrading kernel and hypervisor).
+
+The binding is passed to the hypervisor by firmware. The hypervisor must
+remove any features that require hypervisor enablement but that it does not
+enable. It must remove any features that depend on removed features. It may
+pass remaining features usable to the OS and PR to guests, depending on
+configuration policy (not specified here).
+
+The modified binding is passed to the guest by hypervisor, with HV bit
+cleared from the usable-mask and the hv-support and hfscr-bit properties
+removed. The guest must similarly remove features that require OS enablement
+that it does not enable. The OS may pass PR usable features to userspace via
+ELF AUX vectors AT_HWCAP, AT_HWCAP2, AT_HWCAP3, etc., or use some other
+method (outside the scope of this specification).
+
+The binding will specify a "base" level of features that will be present
+when the cpu features binding exists. Additional features will be explicitly
+specified.
+
+/cpus/features node binding
+---
+
+Node: features
+
+Description: Container of CPU feature nodes.
+
+The node name must be "features" and it must be a child of the node "/cpus".
+
+The node is optional but should be provided by new firmware.
+
+Each child node of cpufeatures represents an architected CPU feature (e.g.,
+a new set of vector instructions) or an important CPU performance
+characteristic (e.g., fast unaligned memory operations). The specification
+of each 

[PATCH 2/3] of/fdt: introduce of_scan_flat_dt_subnodes and of_get_flat_dt_phandle

2017-04-12 Thread Nicholas Piggin
Introduce primitives for FDT parsing. These will be used for powerpc
cpufeatures node scanning, which has quite complex structure but should
be processed early.

Acked-by: Rob Herring 
Signed-off-by: Nicholas Piggin 
---
 drivers/of/fdt.c   | 38 ++
 include/linux/of_fdt.h |  6 ++
 2 files changed, 44 insertions(+)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index e5ce4b59e162..961ca97072a9 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -754,6 +754,36 @@ int __init of_scan_flat_dt(int (*it)(unsigned long node,
 }
 
 /**
+ * of_scan_flat_dt_subnodes - scan sub-nodes of a node, calling a callback on each.
+ * @it: callback function
+ * @data: context data pointer
+ *
+ * This function is used to scan sub-nodes of a node.
+ */
+int __init of_scan_flat_dt_subnodes(unsigned long parent,
+   int (*it)(unsigned long node,
+ const char *uname,
+ void *data),
+   void *data)
+{
+   const void *blob = initial_boot_params;
+   int node;
+
+   fdt_for_each_subnode(node, blob, parent) {
+   const char *pathp;
+   int rc;
+
+   pathp = fdt_get_name(blob, node, NULL);
+   if (*pathp == '/')
+   pathp = kbasename(pathp);
+   rc = it(node, pathp, data);
+   if (rc)
+   return rc;
+   }
+   return 0;
+}
+
+/**
  * of_get_flat_dt_subnode_by_name - get the subnode by given name
  *
  * @node: the parent node
@@ -812,6 +842,14 @@ int __init of_flat_dt_match(unsigned long node, const char *const *compat)
return of_fdt_match(initial_boot_params, node, compat);
 }
 
+/**
+ * of_get_flat_dt_phandle - Given a node in the flat blob, return the phandle
+ */
+uint32_t __init of_get_flat_dt_phandle(unsigned long node)
+{
+   return fdt_get_phandle(initial_boot_params, node);
+}
+
 struct fdt_scan_status {
const char *name;
int namelen;
diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
index 271b3fdf0070..1dfbfd0d8040 100644
--- a/include/linux/of_fdt.h
+++ b/include/linux/of_fdt.h
@@ -54,6 +54,11 @@ extern char __dtb_end[];
 extern int of_scan_flat_dt(int (*it)(unsigned long node, const char *uname,
 int depth, void *data),
   void *data);
+extern int of_scan_flat_dt_subnodes(unsigned long node,
+   int (*it)(unsigned long node,
+ const char *uname,
+ void *data),
+   void *data);
 extern int of_get_flat_dt_subnode_by_name(unsigned long node,
  const char *uname);
 extern const void *of_get_flat_dt_prop(unsigned long node, const char *name,
@@ -62,6 +67,7 @@ extern int of_flat_dt_is_compatible(unsigned long node, const char *name);
 extern int of_flat_dt_match(unsigned long node, const char *const *matches);
 extern unsigned long of_get_flat_dt_root(void);
 extern int of_get_flat_dt_size(void);
+extern uint32_t of_get_flat_dt_phandle(unsigned long node);
 
 extern int early_init_dt_scan_chosen(unsigned long node, const char *uname,
 int depth, void *data);
-- 
2.11.0
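
The contract of `of_scan_flat_dt_subnodes()` above is: visit each subnode, hand its basename to the callback, and stop the scan as soon as the callback returns nonzero, propagating that value. A hedged mock of the same iterate-until-nonzero pattern, with plain strings standing in for FDT node offsets so it runs without libfdt:

```c
#include <assert.h>
#include <string.h>

/* Mock of the of_scan_flat_dt_subnodes() contract: visit each child name,
 * stop at and propagate the first nonzero callback return. */
static int scan_subnodes(const char *const *names, int n,
			 int (*it)(const char *uname, void *data), void *data)
{
	for (int i = 0; i < n; i++) {
		int rc = it(names[i], data);

		if (rc)
			return rc;
	}
	return 0;
}

/* Example callback: count visited nodes, end the scan on a wanted name. */
struct find_state {
	const char *want;
	int visited;
};

static int find_cb(const char *uname, void *data)
{
	struct find_state *st = data;

	st->visited++;
	return strcmp(uname, st->want) == 0; /* nonzero return ends the scan */
}
```

This early-exit shape is what lets the powerpc cpufeatures scan bail out as soon as an unusable feature node is found, instead of always walking every child.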



[PATCH 1/3] powerpc/64s: POWER9 no LPCR VRMASD bits

2017-04-12 Thread Nicholas Piggin
POWER9/ISAv3 has no VRMASD field in LPCR. Don't set reserved bits.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/cpu_setup_power.S | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index 7fe8c79e6937..3737685e1f54 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -29,7 +29,7 @@ _GLOBAL(__setup_cpu_power7)
li  r0,0
mtspr   SPRN_LPID,r0
mfspr   r3,SPRN_LPCR
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_tlb_power7
mtlrr11
blr
@@ -42,7 +42,7 @@ _GLOBAL(__restore_cpu_power7)
li  r0,0
mtspr   SPRN_LPID,r0
mfspr   r3,SPRN_LPCR
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_tlb_power7
mtlrr11
blr
@@ -59,7 +59,7 @@ _GLOBAL(__setup_cpu_power8)
mtspr   SPRN_LPID,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_HFSCR
bl  __init_tlb_power8
bl  __init_PMU_HV
@@ -80,7 +80,7 @@ _GLOBAL(__restore_cpu_power8)
mtspr   SPRN_LPID,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA206
bl  __init_HFSCR
bl  __init_tlb_power8
bl  __init_PMU_HV
@@ -103,7 +103,7 @@ _GLOBAL(__setup_cpu_power9)
or  r3, r3, r4
LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
andcr3, r3, r4
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA300
bl  __init_HFSCR
bl  __init_tlb_power9
bl  __init_PMU_HV
@@ -126,7 +126,7 @@ _GLOBAL(__restore_cpu_power9)
or  r3, r3, r4
LOAD_REG_IMMEDIATE(r4, LPCR_UPRT | LPCR_HR)
andcr3, r3, r4
-   bl  __init_LPCR
+   bl  __init_LPCR_ISA300
bl  __init_HFSCR
bl  __init_tlb_power9
bl  __init_PMU_HV
@@ -144,7 +144,7 @@ __init_hvmode_206:
std r5,CPU_SPEC_FEATURES(r4)
blr
 
-__init_LPCR:
+__init_LPCR_ISA206:
/* Setup a sane LPCR:
 *   Called with initial LPCR in R3
 *
@@ -157,6 +157,11 @@ __init_LPCR:
 *
 * Other bits untouched for now
 */
+   li  r5,0x10
+   rldimi  r3,r5, LPCR_VRMASD_SH, 64-LPCR_VRMASD_SH-5
+
+   /* POWER9 has no VRMASD */
+__init_LPCR_ISA300:
li  r5,1
rldimi  r3,r5, LPCR_LPES_SH, 64-LPCR_LPES_SH-2
ori r3,r3,(LPCR_PECE0|LPCR_PECE1|LPCR_PECE2)
@@ -165,8 +170,6 @@ __init_LPCR:
clrrdi  r3,r3,1 /* clear HDICE */
li  r5,4
rldimi  r3,r5, LPCR_VC_SH, 0
-   li  r5,0x10
-   rldimi  r3,r5, LPCR_VRMASD_SH, 64-LPCR_VRMASD_SH-5
mtspr   SPRN_LPCR,r3
isync
blr
-- 
2.11.0



[PATCH 0/3] cpufeatures merge candidate

2017-04-12 Thread Nicholas Piggin
I expect this will still require some more changes, but I think it's
getting close to polished. The intention is to make it default off and
unsupported at the initial merge, to give a few more weeks to test and
then freeze the format.

I included the firmware cpufeatures patch here as well.

Thanks,
Nick



[PATCH tip/core/rcu 02/40] rcu: Make arch select smp_mb__after_unlock_lock() strength

2017-04-12 Thread Paul E. McKenney
The definition of smp_mb__after_unlock_lock() is currently smp_mb()
for CONFIG_PPC and a no-op otherwise.  It would be better to instead
provide an architecture-selectable Kconfig option, and select the
strength of smp_mb__after_unlock_lock() based on that option.  This
commit therefore creates CONFIG_ARCH_WEAK_RELEASE_ACQUIRE, has PPC select it,
and bases the definition of smp_mb__after_unlock_lock() on this new
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE Kconfig option.

Reported-by: Ingo Molnar 
Signed-off-by: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Will Deacon 
Cc: Boqun Feng 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Acked-by: Michael Ellerman 
Cc: 
---
 arch/Kconfig | 3 +++
 arch/powerpc/Kconfig | 1 +
 include/linux/rcupdate.h | 6 +++---
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index cd211a14a88f..adefaf344239 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -320,6 +320,9 @@ config HAVE_CMPXCHG_LOCAL
 config HAVE_CMPXCHG_DOUBLE
bool
 
+config ARCH_WEAK_RELEASE_ACQUIRE
+   bool
+
 config ARCH_WANT_IPC_PARSE_VERSION
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 97a8bc8a095c..7a5c9b764cd2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -99,6 +99,7 @@ config PPC
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_WANT_IPC_PARSE_VERSION
+   select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_EXTABLE_SORT
select CLONE_BACKWARDS
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de88b33c0974..e6146d0074f8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1127,11 +1127,11 @@ do { \
  * if the UNLOCK and LOCK are executed by the same CPU or if the
  * UNLOCK and LOCK operate on the same lock variable.
  */
-#ifdef CONFIG_PPC
+#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE
 #define smp_mb__after_unlock_lock()smp_mb()  /* Full ordering for lock. */
-#else /* #ifdef CONFIG_PPC */
+#else /* #ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE */
 #define smp_mb__after_unlock_lock()do { } while (0)
-#endif /* #else #ifdef CONFIG_PPC */
+#endif /* #else #ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE */
 
 
 #endif /* __LINUX_RCUPDATE_H */
-- 
2.5.2
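The Kconfig-driven selection above can be illustrated with a small userspace sketch (purely illustrative: `__sync_synchronize()` stands in for the kernel's `smp_mb()`, and defining the config macro by hand stands in for Kconfig):

```c
/* Userspace model of the #ifdef switch: architectures that select
 * ARCH_WEAK_RELEASE_ACQUIRE (as powerpc does) get a full barrier,
 * everyone else gets a no-op. */
#define CONFIG_ARCH_WEAK_RELEASE_ACQUIRE 1

#ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE
#define smp_mb__after_unlock_lock()	__sync_synchronize()
#else
#define smp_mb__after_unlock_lock()	do { } while (0)
#endif

/* Typical call site: promote an unlock+lock pair to a full barrier
 * on weakly ordered architectures. */
static int read_after_unlock_lock(int *shared)
{
	/* spin_unlock(a); spin_lock(b); would precede this */
	smp_mb__after_unlock_lock();
	return *shared;
}
```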



Re: [v3,1/2] powerpc/mm: Dump linux pagetables

2017-04-12 Thread Christophe LEROY

Hi Rashmica,

On 17/11/2016 at 13:03, Michael Ellerman wrote:

On Fri, 2016-05-27 at 05:48:59 UTC, Rashmica Gupta wrote:

Useful to be able to dump the kernels page tables to check permissions
and memory types - derived from arm64's implementation.

Add a debugfs file to check the page tables. To use this the PPC_PTDUMP
config option must be selected.

Signed-off-by: Rashmica Gupta 


Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8eb07b187000d48152c4ef97f075bd


For your information, compilation fails on mpc885_ads_defconfig when we 
activate PPC_PTDUMP:


  CC  arch/powerpc/mm/dump_linuxpagetables.o
arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables':
arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' 
undeclared (first use in this function)
arch/powerpc/mm/dump_linuxpagetables.c:369:10: note: each undeclared 
identifier is reported only once for each function it appears in

arch/powerpc/mm/dump_linuxpagetables.c: In function 'populate_markers':
arch/powerpc/mm/dump_linuxpagetables.c:383:37: error: 'ISA_IO_BASE' 
undeclared (first use in this function)
arch/powerpc/mm/dump_linuxpagetables.c:384:37: error: 'ISA_IO_END' 
undeclared (first use in this function)
arch/powerpc/mm/dump_linuxpagetables.c:385:37: error: 'PHB_IO_BASE' 
undeclared (first use in this function)
arch/powerpc/mm/dump_linuxpagetables.c:386:37: error: 'PHB_IO_END' 
undeclared (first use in this function)
arch/powerpc/mm/dump_linuxpagetables.c:387:37: error: 'IOREMAP_BASE' 
undeclared (first use in this function)
arch/powerpc/mm/dump_linuxpagetables.c:388:37: error: 'IOREMAP_END' 
undeclared (first use in this function)
arch/powerpc/mm/dump_linuxpagetables.c:392:38: error: 'VMEMMAP_BASE' 
undeclared (first use in this function)

arch/powerpc/mm/dump_linuxpagetables.c: In function 'ptdump_show':
arch/powerpc/mm/dump_linuxpagetables.c:400:20: error: 'KERN_VIRT_START' 
undeclared (first use in this function)

make[1]: *** [arch/powerpc/mm/dump_linuxpagetables.o] Error 1
make: *** [arch/powerpc/mm] Error 2



Christophe


Re: [PATCH] powerpc: Avoid taking a data miss on every userspace instruction miss

2017-04-12 Thread Christophe LEROY

Hi Anton,

On 04/04/2017 at 00:00, Anton Blanchard wrote:

Hi Christophe,


-   if (user_mode(regs))
+   if (!is_exec && user_mode(regs))


Shouldn't it also check 'is_write'?
If it is a store, is_write should be set, shouldn't it?


Thanks, Ben had the same suggestion. I'll add that further optimisation
in a subsequent patch.

Anton



For your information, I made some benchmark test using 'perf stat' with 
your app on MPC8321 and MPC885, and I got the following results:


MPC8321 before the change:

 Performance counter stats for './fault 1000' (10 runs):

   4491.971466  cpu-clock (msec) 
  ( +-  0.03% )
 47386  faults 
  ( +-  0.02% )


   4.727864465 seconds time elapsed 
 ( +-  0.17% )



MPC8321 after your change:

 Performance counter stats for './fault 1000' (10 runs):

   4278.738845  cpu-clock (msec) 
  ( +-  0.02% )
 35181  faults 
  ( +-  0.02% )


   4.504443891 seconds time elapsed 
 ( +-  0.19% )



MPC8321 after changing !is_exec by is_write

 Performance counter stats for './fault 1000' (10 runs):

   4268.187261  cpu-clock (msec) 
  ( +-  0.03% )
 35181  faults 
  ( +-  0.01% )


   4.489207922 seconds time elapsed 
 ( +-  0.20% )






MPC885 before the change:

 Performance counter stats for './fault 500' (10 runs):

 726605854  cpu-cycles 
  ( +-  0.03% )
176067  dTLB-load-misses 
  ( +-  0.08% )
 52722  iTLB-load-misses 
  ( +-  0.01% )
 25718  faults 
  ( +-  0.03% )


   5.795924654 seconds time elapsed 
 ( +-  0.14% )



MPC885 after your change:

 Performance counter stats for './fault 500' (10 runs):

 711233251  cpu-cycles 
  ( +-  0.04% )
152462  dTLB-load-misses 
  ( +-  0.09% )
 52715  iTLB-load-misses 
  ( +-  0.01% )
 19611  faults 
  ( +-  0.02% )


   5.673784606 seconds time elapsed 
 ( +-  0.14% )



MPC885 after changing !is_exec by is_write

 Performance counter stats for './fault 500' (10 runs):

 710904083  cpu-cycles 
  ( +-  0.05% )
147162  dTLB-load-misses 
  ( +-  0.06% )
 52716  iTLB-load-misses 
  ( +-  0.01% )
 19610  faults 
  ( +-  0.02% )


   5.672091139 seconds time elapsed 
 ( +-  0.15% )




Christophe


Re: [PATCH v4 0/5] perf report: Show branch type

2017-04-12 Thread Jiri Olsa
On Wed, Apr 12, 2017 at 11:42:44PM +0800, Jin, Yao wrote:
> 
> 
> On 4/12/2017 10:26 PM, Jiri Olsa wrote:
> > On Wed, Apr 12, 2017 at 08:25:34PM +0800, Jin, Yao wrote:
> > 
> > SNIP
> > 
> > > > # Overhead  Command  Source Shared Object  Source Symbol
> > > > Target SymbolBasic Block Cycles
> > > > #   ...    
> > > > ...  
> > > > ...  ..
> > > > #
> > > >8.30%  perf
> > > > Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all.constprop.17  [k] 
> > > > native_write_msr -
> > > >7.91%  perf
> > > > Um  [kernel.vmlinux]  [k] intel_pmu_lbr_enable_all [k] 
> > > > __intel_pmu_enable_all.constprop.17  -
> > > >7.91%  perf
> > > > Um  [kernel.vmlinux]  [k] native_write_msr [k] 
> > > > intel_pmu_lbr_enable_all -
> > > >6.32%  kill libc-2.24.so  [.] _dl_addr   
> > > >   [.] _dl_addr -
> > > >5.93%  perf
> > > > Um  [kernel.vmlinux]  [k] perf_iterate_ctx [k] 
> > > > perf_iterate_ctx -
> > > >2.77%  kill libc-2.24.so  [.] malloc 
> > > >   [.] malloc   -
> > > >1.98%  kill libc-2.24.so  [.] _int_malloc
> > > >   [.] _int_malloc  -
> > > >1.58%  kill [kernel.vmlinux]  [k] __rb_insert_augmented  
> > > >   [k] __rb_insert_augmented-
> > > >1.58%  perf
> > > > Um  [kernel.vmlinux]  [k] perf_event_exec  [k] 
> > > > perf_event_exec  -
> > > >1.19%  kill [kernel.vmlinux]  [k] 
> > > > anon_vma_interval_tree_insert[k] anon_vma_interval_tree_insert  
> > > >   -
> > > >1.19%  kill [kernel.vmlinux]  [k] free_pgd_range 
> > > >   [k] free_pgd_range   -
> > > >1.19%  kill [kernel.vmlinux]  [k] n_tty_write
> > > >   [k] n_tty_write  -
> > > >1.19%  perf
> > > > Um  [kernel.vmlinux]  [k] native_sched_clock   [k] 
> > > > sched_clock  -
> > > > ...
> > > > SNIP
> > > > 
> > > > 
> > > > jirka
> > > Hi,
> > > 
> > > Thanks so much for trying this patch.
> > > 
> > > The branch statistics is printed at the end of perf report --stdio.
> > yep, but for some reason with your changes the head report
> > got changed as well, I haven't checked the details yet..
> > 
> > jirka
> 
> kill returns immediately with a missing-argument error. Could you try an
> application that runs for a while?
> 
> For example:
> perf record -j any,save_type top

sure, but it does not change the fact that the report output is broken,
we need to fix it even for the 'kill' record case

jirka


Re: [PATCH v4 0/5] perf report: Show branch type

2017-04-12 Thread Jin, Yao



On 4/12/2017 10:26 PM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 08:25:34PM +0800, Jin, Yao wrote:

SNIP


# Overhead  Command  Source Shared Object  Source Symbol
Target SymbolBasic Block Cycles
#   ...    
...  
...  ..
#
   8.30%  perf
Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all.constprop.17  [k] 
native_write_msr -
   7.91%  perf
Um  [kernel.vmlinux]  [k] intel_pmu_lbr_enable_all [k] 
__intel_pmu_enable_all.constprop.17  -
   7.91%  perf
Um  [kernel.vmlinux]  [k] native_write_msr [k] 
intel_pmu_lbr_enable_all -
   6.32%  kill libc-2.24.so  [.] _dl_addr   
  [.] _dl_addr -
   5.93%  perf
Um  [kernel.vmlinux]  [k] perf_iterate_ctx [k] 
perf_iterate_ctx -
   2.77%  kill libc-2.24.so  [.] malloc 
  [.] malloc   -
   1.98%  kill libc-2.24.so  [.] _int_malloc
  [.] _int_malloc  -
   1.58%  kill [kernel.vmlinux]  [k] __rb_insert_augmented  
  [k] __rb_insert_augmented-
   1.58%  perf
Um  [kernel.vmlinux]  [k] perf_event_exec  [k] 
perf_event_exec  -
   1.19%  kill [kernel.vmlinux]  [k] anon_vma_interval_tree_insert  
  [k] anon_vma_interval_tree_insert-
   1.19%  kill [kernel.vmlinux]  [k] free_pgd_range 
  [k] free_pgd_range   -
   1.19%  kill [kernel.vmlinux]  [k] n_tty_write
  [k] n_tty_write  -
   1.19%  perf
Um  [kernel.vmlinux]  [k] native_sched_clock   [k] 
sched_clock  -
...
SNIP


jirka

Hi,

Thanks so much for trying this patch.

The branch statistics is printed at the end of perf report --stdio.

yep, but for some reason with your changes the head report
got changed as well, I haven't checked the details yet..

jirka


kill returns immediately with a missing-argument error. Could you try an
application that runs for a while?


For example:
perf record -j any,save_type top

Thanks
Jin Yao




Re: [linuxppc-dev] Patch notification: 1 patch updated

2017-04-12 Thread Christophe LEROY

Hello Michael and Scott,

I see that the status of the below patch has been changed to 'Not 
Applicable' in the linuxppc-dev Patchwork.



About this series, David S. Miller said:

Subject: Re: [PATCH 0/2] get rid of immrbar_virt_to_phys()
Date: Wed, 08 Feb 2017 13:17:32 -0500 (EST)
From: David Miller 
To: christophe.le...@c-s.fr
Cc: le...@freescale.com, qiang.z...@nxp.com, o...@buserror.net, 
linux-ker...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, 
net...@vger.kernel.org, linux-arm-ker...@lists.infradead.org


From: Christophe Leroy 
Date: Tue,  7 Feb 2017 10:05:07 +0100 (CET)

> ucc_geth ethernet driver is the only driver using
> immrbar_virt_to_phys() and it uses it incorrectly.
>
> This patch fixes ucc_geth driver then removes immrbar_virt_to_phys()

Feel free to merge this via whatever tree handles that SOC fsl driver.

Acked-by: David S. Miller 




Therefore, who is going to commit this patch?

Regards
Christophe



On 07/04/2017 at 14:10, Patchwork wrote:

Hello,

The following patch (submitted by you) has been updated in patchwork:

 * linuxppc-dev: [1/2] net: ethernet: ucc_geth: fix MEM_PART_MURAM mode
 - http://patchwork.ozlabs.org/patch/725043/
 - for: Linux PPC development
was: New
now: Not Applicable

This email is a notification only - you do not need to respond.

Happy patchworking.

--

This is an automated mail sent by the patchwork system at
patchwork.ozlabs.org. To stop receiving these notifications, edit
your mail settings at:
  http://patchwork.ozlabs.org/mail/



Re: [PATCH 1/2] powerpc: string: implement optimized memset variants

2017-04-12 Thread Naveen N. Rao

Excerpts from PrasannaKumar Muralidharan's message of April 5, 2017 11:21:

On 30 March 2017 at 12:46, Naveen N. Rao
 wrote:

Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
are the results:
generic:0.245315533 seconds time elapsed( +-  1.83% )
optimized:  0.169282701 seconds time elapsed( +-  1.96% )


Wondering what makes gcc not produce efficient assembly code. Can
you please post the disassembly of the C implementation of memset64? Just
for info purposes.


It's largely the same as what Christophe posted for powerpc32.

Others will have better insights, but afaics, gcc only seems to be 
unrolling the loop with -funroll-loops (which we don't use).


As an aside, it looks like gcc recently picked up an optimization in v7 
that can also help (from https://gcc.gnu.org/gcc-7/changes.html):
"A new store merging pass has been added. It merges constant stores to 
adjacent memory locations into fewer, wider, stores. It is enabled by 
the -fstore-merging option and at the -O2 optimization level or higher 
(and -Os)."
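For context, the generic fallback being measured against is essentially a one-store-per-iteration loop; a sketch (assumption: this mirrors the kernel's generic C memset64, not the optimized powerpc assembly variant):

```c
#include <stdint.h>
#include <stddef.h>

/* Generic memset64 sketch: store a 64-bit pattern count times.
 * Without -funroll-loops (or gcc 7's store merging), the compiler
 * keeps this as a single store per iteration. */
static uint64_t *memset64_generic(uint64_t *s, uint64_t v, size_t count)
{
	uint64_t *p = s;

	while (count--)
		*p++ = v;
	return s;
}
```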



- Naveen




[PATCH V4 7/7 remix] cxl: Add psl9 specific code

2017-04-12 Thread Frederic Barrat
From: Christophe Lombard 

The new Coherent Accelerator Interface Architecture, level 2, for the
IBM POWER9 brings new content and features:
- POWER9 Service Layer
- Registers
- Radix mode
- Process element entry
- Dedicated-Shared Process Programming Model
- Translation Fault Handling
- CAPP
- Memory Context ID
If a valid mm_struct is found the memory context id is used for each
transaction associated with the process handle. The PSL uses the
context ID to find the corresponding process element.

Signed-off-by: Christophe Lombard 
Acked-by: Frederic Barrat 
---
 Documentation/powerpc/cxl.txt |  15 ++-
 drivers/misc/cxl/context.c|  16 ++-
 drivers/misc/cxl/cxl.h| 137 ---
 drivers/misc/cxl/debugfs.c|  19 
 drivers/misc/cxl/fault.c  |  64 +++
 drivers/misc/cxl/guest.c  |   8 +-
 drivers/misc/cxl/irq.c|  53 +
 drivers/misc/cxl/native.c | 223 +++--
 drivers/misc/cxl/pci.c| 251 +++---
 drivers/misc/cxl/trace.h  |  43 
 10 files changed, 753 insertions(+), 76 deletions(-)

diff --git a/Documentation/powerpc/cxl.txt b/Documentation/powerpc/cxl.txt
index d5506ba0..c5e8d50 100644
--- a/Documentation/powerpc/cxl.txt
+++ b/Documentation/powerpc/cxl.txt
@@ -21,7 +21,7 @@ Introduction
 Hardware overview
 =
 
-  POWER8   FPGA
+ POWER8/9 FPGA
+--++-+
|  || |
|   CPU||   AFU   |
@@ -34,7 +34,7 @@ Hardware overview
|   | CAPP |<-->| |
+---+--+  PCIE  +-+
 
-The POWER8 chip has a Coherently Attached Processor Proxy (CAPP)
+The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
 unit which is part of the PCIe Host Bridge (PHB). This is managed
 by Linux by calls into OPAL. Linux doesn't directly program the
 CAPP.
@@ -59,6 +59,17 @@ Hardware overview
 the fault. The context to which this fault is serviced is based on
 who owns that acceleration function.
 
+POWER8 <-> PSL Version 8 is compliant to the CAIA Version 1.0.
+POWER9 <-> PSL Version 9 is compliant to the CAIA Version 2.0.
+This PSL Version 9 provides new features such as:
+* Interaction with the nest MMU on the P9 chip.
+* Native DMA support.
+* Supports sending ASB_Notify messages for host thread wakeup.
+* Supports Atomic operations.
+* 
+
+Cards with a PSL9 won't work on a POWER8 system and cards with a
+PSL8 won't work on a POWER9 system.
 
 AFU Modes
 =
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index ac2531e..45363be 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -188,12 +188,24 @@ int cxl_context_iomap(struct cxl_context *ctx, struct 
vm_area_struct *vma)
if (ctx->afu->current_mode == CXL_MODE_DEDICATED) {
if (start + len > ctx->afu->adapter->ps_size)
return -EINVAL;
+
+   if (cxl_is_psl9(ctx->afu)) {
+   /* make sure there is a valid problem state
+* area space for this AFU
+*/
+   if (ctx->master && !ctx->afu->psa) {
+   pr_devel("AFU doesn't support mmio space\n");
+   return -EINVAL;
+   }
+
+   /* Can't mmap until the AFU is enabled */
+   if (!ctx->afu->enabled)
+   return -EBUSY;
+   }
} else {
if (start + len > ctx->psn_size)
return -EINVAL;
-   }
 
-   if (ctx->afu->current_mode != CXL_MODE_DEDICATED) {
/* make sure there is a valid per process space for this AFU */
if ((ctx->master && !ctx->afu->psa) || (!ctx->afu->pp_psa)) {
pr_devel("AFU doesn't support mmio space\n");
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 82335c0..df40e6e 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -63,7 +63,7 @@ typedef struct {
 /* Memory maps. Ref CXL Appendix A */
 
 /* PSL Privilege 1 Memory Map */
-/* Configuration and Control area */
+/* Configuration and Control area - CAIA 1&2 */
 static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x};
 static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008};
 static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010};
@@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase  = {0x0100};
 static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108};
 static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
 static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168};
+/* PSL registers - CAIA 2 */
+static 

Re: [PATCH v4 0/5] perf report: Show branch type

2017-04-12 Thread Jiri Olsa
On Wed, Apr 12, 2017 at 08:25:34PM +0800, Jin, Yao wrote:

SNIP

> > # Overhead  Command  Source Shared Object  Source Symbol
> > Target SymbolBasic Block Cycles
> > #   ...    
> > ...  
> > ...  ..
> > #
> >   8.30%  perf
> > Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all.constprop.17  [k] 
> > native_write_msr -
> >   7.91%  perf
> > Um  [kernel.vmlinux]  [k] intel_pmu_lbr_enable_all [k] 
> > __intel_pmu_enable_all.constprop.17  -
> >   7.91%  perf
> > Um  [kernel.vmlinux]  [k] native_write_msr [k] 
> > intel_pmu_lbr_enable_all -
> >   6.32%  kill libc-2.24.so  [.] _dl_addr
> >  [.] _dl_addr -
> >   5.93%  perf
> > Um  [kernel.vmlinux]  [k] perf_iterate_ctx [k] 
> > perf_iterate_ctx -
> >   2.77%  kill libc-2.24.so  [.] malloc  
> >  [.] malloc   -
> >   1.98%  kill libc-2.24.so  [.] _int_malloc 
> >  [.] _int_malloc  -
> >   1.58%  kill [kernel.vmlinux]  [k] __rb_insert_augmented   
> >  [k] __rb_insert_augmented-
> >   1.58%  perf
> > Um  [kernel.vmlinux]  [k] perf_event_exec  [k] 
> > perf_event_exec  -
> >   1.19%  kill [kernel.vmlinux]  [k] 
> > anon_vma_interval_tree_insert[k] anon_vma_interval_tree_insert  
> >   -
> >   1.19%  kill [kernel.vmlinux]  [k] free_pgd_range  
> >  [k] free_pgd_range   -
> >   1.19%  kill [kernel.vmlinux]  [k] n_tty_write 
> >  [k] n_tty_write  -
> >   1.19%  perf
> > Um  [kernel.vmlinux]  [k] native_sched_clock   [k] 
> > sched_clock  -
> > ...
> > SNIP
> > 
> > 
> > jirka
> 
> Hi,
> 
> Thanks so much for trying this patch.
> 
> The branch statistics is printed at the end of perf report --stdio.

yep, but for some reason with your changes the head report
got changed as well, I haven't checked the details yet..

jirka


Re: [PATCH] powerpc/64s: catch external interrupts going to host in POWER9

2017-04-12 Thread Nicholas Piggin
On Wed, 12 Apr 2017 23:45:42 +1000
Benjamin Herrenschmidt  wrote:

> On Wed, 2017-04-12 at 23:11 +1000, Nicholas Piggin wrote:
> > After setting LPES0 in the host on POWER9, the host external interrupt
> > handler no longer works correctly, because it's set to HV mode (HSRR)
> > for POWER7/8 with LPES0 clear. We don't expect to get any EE in the host
> > with XIVE, but it seems preferable to catch unexpected interrupts in case
> > there are bugs or unexpected behaviour.
> >   
> > > Signed-off-by: Nicholas Piggin   
> > ---  
> 
> No. Let's just get LPES back to P8 value in the host, we don't care as
> we don't get those EEs on normal systems. Then make sure KVM properly
> sets it the way we want when setting up the guest LPCR (which it should
> be doing with my patches).
> Much simpler patch...


Yeah sure that sounds good. How's this then?

---
 arch/powerpc/kernel/exceptions-64s.S | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 857bf7c5b946..c78165e5fb77 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -735,8 +735,20 @@ EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
 
 TRAMP_KVM(PACA_EXGEN, 0x500)
 TRAMP_KVM_HV(PACA_EXGEN, 0x500)
-EXC_COMMON_ASYNC(hardware_interrupt_common, 0x500, do_IRQ)
 
+EXC_COMMON_BEGIN(hardware_interrupt_common)
+BEGIN_FTR_SECTION
+   /*
+* The POWER9 XIVE interrupt controller should be configured to send
+* all interrupts to the host as HVI, even with the OPAL XICS
+* emulation, so HVMODE should never see a 0x500 interrupt. However we
+* catch it in case of a bug.
+*/
+   b   unknown_host_ee_common
+END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_300)
+   STD_EXCEPTION_COMMON_ASYNC(0x500, hardware_interrupt_common, do_IRQ)
+
+EXC_COMMON_ASYNC(unknown_host_ee_common, 0x500, unknown_exception)
 
 EXC_REAL(alignment, 0x600, 0x100)
 EXC_VIRT(alignment, 0x4600, 0x100, 0x600)
-- 
2.11.0



Re: EEH error in doing DMA with PEX 8619

2017-04-12 Thread Benjamin Herrenschmidt
On Wed, 2017-04-12 at 01:42 -0700, IanJiang wrote:
> 
> In my test, DMA buffers are allocated with  (bus 2, device 1, function 
> 0) in module Plx8000_NT, but DMA is issued by (bus 1 device 0 function 
> 1) in module Plx8000_DMA. And error of (bus 1 device 0 function 1) is 
> reported by EEH. 

This is going to break on other systems too. If you enable strict iommu
on x86 for example.

You need to ensure that DMA buffers are allocated for the same requester ID
that will be performing the transactions.

Cheers,
Ben.



Re: [PATCH] powerpc/64s: catch external interrupts going to host in POWER9

2017-04-12 Thread Benjamin Herrenschmidt
On Wed, 2017-04-12 at 23:11 +1000, Nicholas Piggin wrote:
> After setting LPES0 in the host on POWER9, the host external interrupt
> handler no longer works correctly, because it's set to HV mode (HSRR)
> for POWER7/8 with LPES0 clear. We don't expect to get any EE in the host
> with XIVE, but it seems preferable to catch unexpected interrupts in case
> there are bugs or unexpected behaviour.
> 
> > Signed-off-by: Nicholas Piggin 
> ---

No. Let's just get LPES back to P8 value in the host, we don't care as
we don't get those EEs on normal systems. Then make sure KVM properly
sets it the way we want when setting up the guest LPCR (which it should
be doing with my patches).

Much simpler patch...

Cheers,
Ben.


> Hi,
> 
> I was testing the LPES0 code on POWER9 under mambo, which exploded
> because I didn't use --enable-xive_interrupts so the host was getting
> EEs.
> 
> Errant 0x500 in the host will end up hrfid'ing to uninitialized HSRR[01]
> which ends up dying in interesting ways. Should we add this patch to
> Ben's xive topic branch that sets LPES0? (Or do you rebase topic branches?
> It could be rolled up with that particular patch if so).
> 
> Thanks,
> Nick
> 
>  arch/powerpc/kernel/exceptions-64s.S | 26 +++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 857bf7c5b946..2f26a0553a4a 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -718,9 +718,21 @@ hardware_interrupt_hv:
> >     _MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt_common,
> >     EXC_HV, SOFTEN_TEST_HV)
> >     FTR_SECTION_ELSE
> > +   /*
> > +    * The POWER9 XIVE interrupt controller should be configured
> > +    * to send all interrupts to the host as HVI, even with the
> > +    * OPAL XICS emulation, so HVMODE should never see a 0x500
> > +    * interrupt. However we catch it in case of a bug.
> > +    *
> > +    * POWER9 sets the LPES0 LPCR bit in the host, which
> > +    * delivers external interrupts to SRR[01] with MSR_HV
> > +    * unchanged (intended for guest delivery), so these need
> > +    * to be caught as EXC_STD interrupts in the host.
> > +    */
> >     _MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt_common,
> >     EXC_STD, SOFTEN_TEST_PR)
> > -   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
> > +   ALT_FTR_SECTION_END(CPU_FTR_HVMODE|CPU_FTR_ARCH_206|CPU_FTR_ARCH_300,
> > +   CPU_FTR_HVMODE|CPU_FTR_ARCH_206)
>  EXC_REAL_END(hardware_interrupt, 0x500, 0x100)
>  
>  EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x100)
> @@ -730,13 +742,21 @@ hardware_interrupt_relon_hv:
> >     _MASKABLE_RELON_EXCEPTION_PSERIES(0x500, 
> > hardware_interrupt_common, EXC_HV, SOFTEN_TEST_HV)
> >     FTR_SECTION_ELSE
> >     _MASKABLE_RELON_EXCEPTION_PSERIES(0x500, 
> > hardware_interrupt_common, EXC_STD, SOFTEN_TEST_PR)
> > -   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
> > +   ALT_FTR_SECTION_END(CPU_FTR_HVMODE|CPU_FTR_ARCH_206|CPU_FTR_ARCH_300,
> > +   CPU_FTR_HVMODE|CPU_FTR_ARCH_206)
>  EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
>  
>  TRAMP_KVM(PACA_EXGEN, 0x500)
>  TRAMP_KVM_HV(PACA_EXGEN, 0x500)
> -EXC_COMMON_ASYNC(hardware_interrupt_common, 0x500, do_IRQ)
>  
> +EXC_COMMON_BEGIN(hardware_interrupt_common)
> +BEGIN_FTR_SECTION
> > +   /* See POWER9 comment above */
> > > + b   unknown_host_ee_common
> +END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_300)
> > +   STD_EXCEPTION_COMMON_ASYNC(0x500, hardware_interrupt_common, do_IRQ)
> +
> +EXC_COMMON_ASYNC(unknown_host_ee_common, 0x500, unknown_exception)
>  
>  EXC_REAL(alignment, 0x600, 0x100)
>  EXC_VIRT(alignment, 0x4600, 0x100, 0x600)


Re: [1/2] powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there

2017-04-12 Thread Michael Ellerman
On Mon, 2017-04-10 at 22:48:26 UTC, Michael Ellerman wrote:
> powerpc_debugfs_root is the dentry representing the root of the
> "powerpc" directory tree in debugfs.
> 
> Currently it sits in asm/debug.h, along with some other things that
> have "debug" in the name, but are otherwise unrelated.
> 
> Pull it out into a separate header, which also includes linux/debugfs.h,
> and convert all the users to include debugfs.h instead of debug.h.
> 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/7644d5819cf8956d799a0a0e5dc75f

cheers


Re: [v3, 1/4] powernv: Move CPU-Offline idle state invocation from smp.c to idle.c

2017-04-12 Thread Michael Ellerman
On Wed, 2017-03-22 at 15:04:14 UTC, "Gautham R. Shenoy" wrote:
> From: "Gautham R. Shenoy" 
> 
> Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
> transitions the CPU to the deepest available platform idle state to a
> new function named pnv_cpu_offline() in powernv/idle.c. The rationale
> behind this code movement is that the data required to determine the
> deepest available platform state resides in powernv/idle.c.
> 
> Reviewed-by: Nicholas Piggin 
> Signed-off-by: Gautham R. Shenoy 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a7cd88da97040513e17cd77ae3e577

cheers


[PATCH] powerpc/64s: catch external interrupts going to host in POWER9

2017-04-12 Thread Nicholas Piggin
After setting LPES0 in the host on POWER9, the host external interrupt
handler no longer works correctly, because it's set to HV mode (HSRR)
for POWER7/8 with LPES0 clear. We don't expect to get any EE in the host
with XIVE, but it seems preferable to catch unexpected interrupts in case
there are bugs or unexpected behaviour.

Signed-off-by: Nicholas Piggin 
---

Hi,

I was testing the LPES0 code on POWER9 under mambo, which exploded
because I didn't use --enable-xive_interrupts so the host was getting
EEs.

Errant 0x500 in the host will end up hrfid'ing to uninitialized HSRR[01]
which ends up dying in interesting ways. Should we add this patch to
Ben's xive topic branch that sets LPES0? (Or do you rebase topic branches?
It could be rolled up with that particular patch if so).

Thanks,
Nick

 arch/powerpc/kernel/exceptions-64s.S | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 857bf7c5b946..2f26a0553a4a 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -718,9 +718,21 @@ hardware_interrupt_hv:
_MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt_common,
EXC_HV, SOFTEN_TEST_HV)
FTR_SECTION_ELSE
+   /*
+* The POWER9 XIVE interrupt controller should be configured
+* to send all interrupts to the host as HVI, even with the
+* OPAL XICS emulation, so HVMODE should never see a 0x500
+* interrupt. However we catch it in case of a bug.
+*
+* POWER9 sets the LPES0 LPCR bit in the host, which
+* delivers external interrupts to SRR[01] with MSR_HV
+* unchanged (intended for guest delivery), so these need
+* to be caught as EXC_STD interrupts in the host.
+*/
_MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt_common,
EXC_STD, SOFTEN_TEST_PR)
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
+   ALT_FTR_SECTION_END(CPU_FTR_HVMODE|CPU_FTR_ARCH_206|CPU_FTR_ARCH_300,
+   CPU_FTR_HVMODE|CPU_FTR_ARCH_206)
 EXC_REAL_END(hardware_interrupt, 0x500, 0x100)
 
 EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x100)
@@ -730,13 +742,21 @@ hardware_interrupt_relon_hv:
_MASKABLE_RELON_EXCEPTION_PSERIES(0x500, 
hardware_interrupt_common, EXC_HV, SOFTEN_TEST_HV)
FTR_SECTION_ELSE
_MASKABLE_RELON_EXCEPTION_PSERIES(0x500, 
hardware_interrupt_common, EXC_STD, SOFTEN_TEST_PR)
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
+   ALT_FTR_SECTION_END(CPU_FTR_HVMODE|CPU_FTR_ARCH_206|CPU_FTR_ARCH_300,
+   CPU_FTR_HVMODE|CPU_FTR_ARCH_206)
 EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
 
 TRAMP_KVM(PACA_EXGEN, 0x500)
 TRAMP_KVM_HV(PACA_EXGEN, 0x500)
-EXC_COMMON_ASYNC(hardware_interrupt_common, 0x500, do_IRQ)
 
+EXC_COMMON_BEGIN(hardware_interrupt_common)
+BEGIN_FTR_SECTION
+   /* See POWER9 comment above */
+   b   unknown_host_ee_common
+END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_300)
+   STD_EXCEPTION_COMMON_ASYNC(0x500, hardware_interrupt_common, do_IRQ)
+
+EXC_COMMON_ASYNC(unknown_host_ee_common, 0x500, unknown_exception)
 
 EXC_REAL(alignment, 0x600, 0x100)
 EXC_VIRT(alignment, 0x4600, 0x100, 0x600)
-- 
2.11.0



Re: [PATCH v4 0/5] perf report: Show branch type

2017-04-12 Thread Jin, Yao



On 4/12/2017 6:58 PM, Jiri Olsa wrote:

On Wed, Apr 12, 2017 at 06:21:01AM +0800, Jin Yao wrote:

SNIP


3. Use 2 bits in perf_branch_entry for "cross" metrics, checking
whether a branch crosses a 4K or 2M area. It's an approximate
computation of whether the branch crosses a 4K page or a 2MB page.

For example:

perf record -g --branch-filter any,save_type 

perf report --stdio

  JCC forward:  27.7%
 JCC backward:   9.8%
  JMP:   0.0%
  IND_JMP:   6.5%
 CALL:  26.6%
 IND_CALL:   0.0%
  RET:  29.3%
 IRET:   0.0%
 CROSS_4K:   0.0%
 CROSS_2M:  14.3%
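The approximate "cross" check described above reduces to comparing the naturally aligned region of the source and target addresses. A hedged C sketch (the helper names are illustrative, not the actual perf implementation):

```c
#include <stdint.h>
#include <stdbool.h>

/* A branch "crosses" a 4K page (or a 2M region) when its source and
 * target addresses fall in different naturally aligned regions. */
static bool branch_cross_4k(uint64_t from, uint64_t to)
{
	return (from >> 12) != (to >> 12);	/* 4K = 2^12 */
}

static bool branch_cross_2m(uint64_t from, uint64_t to)
{
	return (from >> 21) != (to >> 21);	/* 2M = 2^21 */
}
```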

got mangled perf report --stdio output for:


[root@ibm-x3650m4-02 perf]# ./perf record -j any,save_type kill
kill: not enough arguments
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data (18 samples) ]

[root@ibm-x3650m4-02 perf]# ./perf report --stdio -f | head -30
# To display the perf.data header info, please use --header/--header-only 
options.
#
#
# Total Lost Samples: 0
#
# Samples: 253  of event 'cycles'
# Event count (approx.): 253
#
# Overhead  Command  Source Shared Object  Source Symbol
Target SymbolBasic Block Cycles
#   ...    
...  
...  ..
#
  8.30%  perf
Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all.constprop.17  [k] 
native_write_msr -
  7.91%  perf
Um  [kernel.vmlinux]  [k] intel_pmu_lbr_enable_all [k] 
__intel_pmu_enable_all.constprop.17  -
  7.91%  perf
Um  [kernel.vmlinux]  [k] native_write_msr [k] 
intel_pmu_lbr_enable_all -
  6.32%  kill libc-2.24.so  [.] _dl_addr
 [.] _dl_addr -
  5.93%  perf
Um  [kernel.vmlinux]  [k] perf_iterate_ctx [k] 
perf_iterate_ctx -
  2.77%  kill libc-2.24.so  [.] malloc  
 [.] malloc   -
  1.98%  kill libc-2.24.so  [.] _int_malloc 
 [.] _int_malloc  -
  1.58%  kill [kernel.vmlinux]  [k] __rb_insert_augmented   
 [k] __rb_insert_augmented-
  1.58%  perf
Um  [kernel.vmlinux]  [k] perf_event_exec  [k] 
perf_event_exec  -
  1.19%  kill [kernel.vmlinux]  [k] anon_vma_interval_tree_insert   
 [k] anon_vma_interval_tree_insert-
  1.19%  kill [kernel.vmlinux]  [k] free_pgd_range  
 [k] free_pgd_range   -
  1.19%  kill [kernel.vmlinux]  [k] n_tty_write 
 [k] n_tty_write  -
  1.19%  perf
Um  [kernel.vmlinux]  [k] native_sched_clock   [k] 
sched_clock  -
...
SNIP


jirka


Hi,

Thanks so much for trying this patch.

The branch statistics are printed at the end of the perf report --stdio output.

For example, on my machine,

root@skl:/tmp# perf record -j any,save_type kill
. . . . . .

For more details see kill(1).
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (1 samples) ]

root@skl:/tmp# perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.

#
#
# Total Lost Samples: 0
#
# Samples: 3  of event 'cycles'
# Event count (approx.): 3
#
# Overhead  Command  Source Shared Object  Source Symbol                 Target Symbol                 Basic Block Cycles
# ........  .......  ....................  ............................  ............................  ..................
#
    33.33%  perf     [kernel.vmlinux]      [k] __intel_pmu_enable_all    [k] native_write_msr          10
    33.33%  perf     [kernel.vmlinux]      [k] intel_pmu_lbr_enable_all  [k] __intel_pmu_enable_all    4
    33.33%  perf     [kernel.vmlinux]      [k] native_write_msr          [k] intel_pmu_lbr_enable_all  -



#
# (Tip: Show current config key-value pairs: perf config --list)
#

#
# Branch Statistics:
#
CROSS_4K: 100.0%
CALL:  33.3%
 RET:  66.7%

Thanks
Jin Yao



Re: [PATCH V4 7/7] cxl: Add psl9 specific code

2017-04-12 Thread Frederic Barrat



Le 12/04/2017 à 09:52, Andrew Donnellan a écrit :

On 08/04/17 00:11, Christophe Lombard wrote:

+static u32 get_phb_index(struct device_node *np)
 {
 u32 phb_index;

 if (of_property_read_u32(np, "ibm,phb-index", &phb_index))
-return 0;
+return -ENODEV;


The function's return type is unsigned, so returning -ENODEV here won't do what you intend.



[Christophe is off till the end of the week, so I'm following up]

Michael: what's the easiest for you at this point? Shall I send a new 
version of the 7th patch with all changes consolidated (tab error + doc 
+ Andrew's remark above)?


  Fred



Re: [PATCH V4 7/7] cxl: Add psl9 specific code

2017-04-12 Thread Michael Ellerman
christophe lombard  writes:
> Le 12/04/2017 à 04:11, Michael Ellerman a écrit :
> Hi,
>
> Here is a new patch which updates the documentation based
> on the complete PATCH V4 7/7.
> Let me know if it suits you.

Fine by me, I'll wait for Fred's ack before I merge it all.

> Index: capi2_linux_prepare_patch_V4/Documentation/powerpc/cxl.txt
> ===
> --- capi2_linux_prepare_patch_V4.orig/Documentation/powerpc/cxl.txt
> +++ capi2_linux_prepare_patch_V4/Documentation/powerpc/cxl.txt
> @@ -62,6 +62,7 @@ Hardware overview
>   POWER8 <-> PSL Version 8 is compliant to the CAIA Version 1.0.
>   POWER9 <-> PSL Version 9 is compliant to the CAIA Version 2.0.
>   This PSL Version 9 provides new features as:
> +* Interaction with the nest MMU which resides within each P9 chip.
>   * Native DMA support.
>   * Supports sending ASB_Notify messages for host thread wakeup.
>   * Supports Atomic operations.

The patch didn't actually apply (the whitespace was messed up), but I
fixed it up.

cheers


[PATCH 2/3] powernv:idle: Decouple TB restore & Per-core SPRs restore

2017-04-12 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The idle-exit code assumes that if the timebase is not lost, then the
per-core hypervisor resources are not lost either. This was true on
POWER8, where fast-sleep lost only the timebase but not the per-core
resources, and winkle lost both.

This assumption does not hold on POWER9, however, since there can be
stop states which do not lose the timebase but do lose per-core SPRs.

Hence check if we need to restore the per-core hypervisor state even
if timebase is not lost.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/idle_book3s.S | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 9b747e9..6a9bd28 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -723,13 +723,14 @@ timebase_resync:
 * Use cr3 which indicates that we are waking up with atleast partial
 * hypervisor state loss to determine if TIMEBASE RESYNC is needed.
 */
-   ble cr3,clear_lock
+   ble cr3,.Ltb_resynced
/* Time base re-sync */
bl  opal_resync_timebase;
/*
-* If waking up from sleep, per core state is not lost, skip to
-* clear_lock.
+* If waking up from sleep (POWER8), per core state
+* is not lost, skip to clear_lock.
 */
+.Ltb_resynced:
blt cr4,clear_lock
 
/*
-- 
1.9.4



[PATCH 3/3] powernv:idle: Set LPCR_UPRT on wakeup from deep-stop

2017-04-12 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On wakeup from a deep-stop used for CPU-Hotplug, we invoke
cur_cpu_spec->cpu_restore() which would set sane default values to
various SPRs including LPCR.

On POWER9, the cpu_restore_power9() call would restore LPCR to a
sane value that is set at early boot time, thereby clearing LPCR_UPRT.

However, LPCR_UPRT is required to be set if we are running in Radix
mode. If it is not set, we will end up with a crash when we enable
IR,DR.

To fix this, after returning from cur_cpu_spec->cpu_restore() in the
idle exit path, set LPCR_UPRT if we are running in Radix mode.

Cc: Aneesh Kumar K.V 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/idle_book3s.S | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 6a9bd28..39a9b63 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -804,6 +804,19 @@ no_segments:
 #endif
mtctr   r12
bctrl
+/*
+ * cur_cpu_spec->cpu_restore would restore LPCR to a
+ * sane value that is set at early boot time,
+ * thereby clearing LPCR_UPRT.
+ * LPCR_UPRT is required if we are running in Radix mode.
+ * Set it here if that be the case.
+ */
+BEGIN_MMU_FTR_SECTION
+   mfspr   r3, SPRN_LPCR
+   LOAD_REG_IMMEDIATE(r4, LPCR_UPRT)
+   or  r3, r3, r4
+   mtspr   SPRN_LPCR, r3
+END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
 
 hypervisor_state_restored:
 
-- 
1.9.4



[PATCH 1/3] powernv:idle: Use correct IDLE_THREAD_BITS in POWER8/9

2017-04-12 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This patch ensures that POWER8 and POWER9 processors use the correct
value of IDLE_THREAD_BITS: POWER8 has 8 threads per core, so its
IDLE_THREAD_BITS should be 0xFF, while POWER9 has only 4 threads per
core, so its IDLE_THREAD_BITS should be 0xF.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/cpuidle.h| 3 ++-
 arch/powerpc/kernel/idle_book3s.S | 9 ++---
 arch/powerpc/platforms/powernv/idle.c | 5 -
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 52586f9..fece6ca 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -34,7 +34,8 @@
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT 8
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS   0xFF00
 
-#define PNV_CORE_IDLE_THREAD_BITS  0x00FF
+#define PNV_CORE_IDLE_4THREAD_BITS 0x000F
+#define PNV_CORE_IDLE_8THREAD_BITS 0x00FF
 
 /*
  *  NOTE =
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 2b13fe2..9b747e9 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -223,7 +223,7 @@ lwarx_loop1:
add r15,r15,r5  /* Add if winkle */
andcr15,r15,r7  /* Clear thread bit */
 
-   andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
+   andi.   r9,r15,PNV_CORE_IDLE_8THREAD_BITS
 
 /*
  * If cr0 = 0, then current thread is the last thread of the core entering
@@ -582,8 +582,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
stwcx.  r15,0,r14
bne-1b
isync
-
-   andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
+BEGIN_FTR_SECTION
+   andi.   r9,r15,PNV_CORE_IDLE_4THREAD_BITS
+FTR_SECTION_ELSE
+   andi.   r9,r15,PNV_CORE_IDLE_8THREAD_BITS
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
cmpwi   cr2,r9,0
 
/*
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 445f30a..d46920b 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -112,7 +112,10 @@ static void pnv_alloc_idle_core_states(void)
size_t paca_ptr_array_size;
 
core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
-   *core_idle_state = PNV_CORE_IDLE_THREAD_BITS;
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   *core_idle_state = PNV_CORE_IDLE_4THREAD_BITS;
+   else
+   *core_idle_state = PNV_CORE_IDLE_8THREAD_BITS;
paca_ptr_array_size = (threads_per_core *
   sizeof(struct paca_struct *));
 
-- 
1.9.4



[PATCH 0/3] powernv:stop: Some fixes for handling deep stop

2017-04-12 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This patchset contains three fixes required to get a deep stop state
that can lose the Hypervisor state to work correctly.

The first patch in the series uses the correct value for the
IDLE_THREAD_BITS on POWER8 which has 8 threads per core and on POWER9
which has 4 threads per core.

The second patch decouples restoring the timebase from restoring
per-core SPR state, as the current code assumes that if the timebase is
not lost then neither is per-core state. This was true on POWER8, but is
no longer true on POWER9.

The third patch in the series sets the UPRT bit in LPCR on wakeup from
a deep stop if we are running in radix mode, without which the kernel
crashes once we switch to virtual mode.

These patches are on top of the patches for fixing CPU-Hotplug on
POWER9 DD1.0 (https://lkml.org/lkml/2017/3/22/472) and Nicholas
Piggin's idle fixes and changes for POWER8 and POWER9
(https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-March/155608.html)

Gautham R. Shenoy (3):
  powernv:idle: Use correct IDLE_THREAD_BITS in POWER8 vs POWER9
  powernv:idle: Decouple TB restore & Per-core SPRs restore
  powernv:idle: Set LPCR_UPRT on wakeup from deep-stop

 arch/powerpc/include/asm/cpuidle.h|  3 ++-
 arch/powerpc/kernel/idle_book3s.S | 29 +++--
 arch/powerpc/platforms/powernv/idle.c |  5 -
 3 files changed, 29 insertions(+), 8 deletions(-)

-- 
1.9.4



Re: powerpc: Add XIVE related definitions to opal-api.h

2017-04-12 Thread Michael Ellerman
On Wed, 2017-04-05 at 23:01:33 UTC, Benjamin Herrenschmidt wrote:
> Signed-off-by: Benjamin Herrenschmidt 

Applied to topic/xive, thanks.

https://git.kernel.org/powerpc/c/eeea1a434ddedbb5aaeac1a8661445

cheers


Re: [v2,01/10] powerpc: Add more PPC bit conversion macros

2017-04-12 Thread Michael Ellerman
On Wed, 2017-04-05 at 07:54:47 UTC, Benjamin Herrenschmidt wrote:
> Add 32 and 8 bit variants
> 
> Signed-off-by: Benjamin Herrenschmidt 

Series applied to topic/xive, thanks.

https://git.kernel.org/powerpc/c/22bd64a621cc80beeb009abec3d3df

cheers


[PATCH v2] powerpc: kprobes: convert __kprobes to NOKPROBE_SYMBOL()

2017-04-12 Thread Naveen N. Rao
Along similar lines as commit 9326638cbee2 ("kprobes, x86: Use
NOKPROBE_SYMBOL() instead of __kprobes annotation"), convert __kprobes
annotation to either NOKPROBE_SYMBOL() or nokprobe_inline. The latter
forces inlining, in which case the caller needs to be added to
NOKPROBE_SYMBOL().

Also:
- blacklist arch_deref_entry_point, and
- convert a few regular inlines to nokprobe_inline in lib/sstep.c

A key benefit is the ability to detect such symbols as being
blacklisted. Before this patch:

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/blacklist | grep read_mem
  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe read_mem
  Failed to write event: Invalid argument
Error: Failed to add events.
  naveen@ubuntu:~/linux/tools/perf$ dmesg | tail -1
  [ 3736.112815] Could not insert probe at _text+10014968: -22

After patch:
  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/blacklist | grep read_mem
  0xc0072b50-0xc0072d20 read_mem
  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe read_mem
  read_mem is blacklisted function, skip it.
  Added new events:
(null):(null)(on read_mem)
probe:read_mem   (on read_mem)

  You can now use it in all perf tools, such as:

  perf record -e probe:read_mem -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo grep " read_mem" /proc/kallsyms
  c0072b50 t read_mem
  c05f3b40 t read_mem
  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c05f3b48  k  read_mem+0x8[DISABLED]

Acked-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
v2:
- rebased on top of powerpc/next along with related kprobes patches
- removed incorrect blacklist of kretprobe_trampoline.

 arch/powerpc/kernel/kprobes.c| 58 +---
 arch/powerpc/lib/code-patching.c |  4 +-
 arch/powerpc/lib/sstep.c | 82 +---
 3 files changed, 83 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 23d19678a56f..1983ed2c1544 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -113,7 +113,7 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
return addr;
 }
 
-int __kprobes arch_prepare_kprobe(struct kprobe *p)
+int arch_prepare_kprobe(struct kprobe *p)
 {
int ret = 0;
kprobe_opcode_t insn = *p->addr;
@@ -145,30 +145,34 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
p->ainsn.boostable = 0;
return ret;
 }
+NOKPROBE_SYMBOL(arch_prepare_kprobe);
 
-void __kprobes arch_arm_kprobe(struct kprobe *p)
+void arch_arm_kprobe(struct kprobe *p)
 {
*p->addr = BREAKPOINT_INSTRUCTION;
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
 }
+NOKPROBE_SYMBOL(arch_arm_kprobe);
 
-void __kprobes arch_disarm_kprobe(struct kprobe *p)
+void arch_disarm_kprobe(struct kprobe *p)
 {
*p->addr = p->opcode;
flush_icache_range((unsigned long) p->addr,
   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
 }
+NOKPROBE_SYMBOL(arch_disarm_kprobe);
 
-void __kprobes arch_remove_kprobe(struct kprobe *p)
+void arch_remove_kprobe(struct kprobe *p)
 {
if (p->ainsn.insn) {
free_insn_slot(p->ainsn.insn, 0);
p->ainsn.insn = NULL;
}
 }
+NOKPROBE_SYMBOL(arch_remove_kprobe);
 
-static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
+static nokprobe_inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
 {
enable_single_step(regs);
 
@@ -181,21 +185,21 @@ static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
regs->nip = (unsigned long)p->ainsn.insn;
 }
 
-static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb)
+static nokprobe_inline void save_previous_kprobe(struct kprobe_ctlblk *kcb)
 {
kcb->prev_kprobe.kp = kprobe_running();
kcb->prev_kprobe.status = kcb->kprobe_status;
kcb->prev_kprobe.saved_msr = kcb->kprobe_saved_msr;
 }
 
-static void __kprobes restore_previous_kprobe(struct kprobe_ctlblk *kcb)
+static nokprobe_inline void restore_previous_kprobe(struct kprobe_ctlblk *kcb)
 {
__this_cpu_write(current_kprobe, kcb->prev_kprobe.kp);
kcb->kprobe_status = kcb->prev_kprobe.status;
kcb->kprobe_saved_msr = kcb->prev_kprobe.saved_msr;
 }
 
-static void __kprobes set_current_kprobe(struct kprobe *p, struct pt_regs *regs,
+static nokprobe_inline void set_current_kprobe(struct kprobe *p, struct pt_regs *regs,
struct kprobe_ctlblk *kcb)
 {
__this_cpu_write(current_kprobe, p);
@@ -215,16 +219,16 @@ bool arch_function_offset_within_entry(unsigned long offset)
 #endif
 }
 
-void 

[PATCH v4 0/2] powerpc: split ftrace bits into a separate

2017-04-12 Thread Naveen N. Rao
v3:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg114669.html

For v4, this has been rebased on top of powerpc/next as well as the
KPROBES_ON_FTRACE series. No other changes.

- Naveen

Naveen N. Rao (2):
  powerpc: split ftrace bits into a separate file
  powerpc: ftrace_64: split further based on -mprofile-kernel

 arch/powerpc/kernel/Makefile   |   9 +-
 arch/powerpc/kernel/entry_32.S | 107 ---
 arch/powerpc/kernel/entry_64.S | 379 -
 arch/powerpc/kernel/trace/Makefile |  29 ++
 arch/powerpc/kernel/{ => trace}/ftrace.c   |   0
 arch/powerpc/kernel/trace/ftrace_32.S  | 118 
 arch/powerpc/kernel/trace/ftrace_64.S  |  85 ++
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 272 ++
 arch/powerpc/kernel/trace/ftrace_64_pg.S   |  69 +
 arch/powerpc/kernel/{ => trace}/trace_clock.c  |   0
 10 files changed, 574 insertions(+), 494 deletions(-)
 create mode 100644 arch/powerpc/kernel/trace/Makefile
 rename arch/powerpc/kernel/{ => trace}/ftrace.c (100%)
 create mode 100644 arch/powerpc/kernel/trace/ftrace_32.S
 create mode 100644 arch/powerpc/kernel/trace/ftrace_64.S
 create mode 100644 arch/powerpc/kernel/trace/ftrace_64_mprofile.S
 create mode 100644 arch/powerpc/kernel/trace/ftrace_64_pg.S
 rename arch/powerpc/kernel/{ => trace}/trace_clock.c (100%)

-- 
2.12.1



[PATCH v4 1/2] powerpc: split ftrace bits into a separate file

2017-04-12 Thread Naveen N. Rao
entry_*.S now includes a lot more than just kernel entry/exit code. As a
first step towards cleaning this up, let's split out the ftrace bits into
separate files. Also move all related tracing code into a new trace/
subdirectory.

No functional changes.

Suggested-by: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/Makefile  |   9 +-
 arch/powerpc/kernel/entry_32.S| 107 ---
 arch/powerpc/kernel/entry_64.S| 379 -
 arch/powerpc/kernel/trace/Makefile|  24 ++
 arch/powerpc/kernel/{ => trace}/ftrace.c  |   0
 arch/powerpc/kernel/trace/ftrace_32.S | 118 
 arch/powerpc/kernel/trace/ftrace_64.S | 390 ++
 arch/powerpc/kernel/{ => trace}/trace_clock.c |   0
 8 files changed, 533 insertions(+), 494 deletions(-)
 create mode 100644 arch/powerpc/kernel/trace/Makefile
 rename arch/powerpc/kernel/{ => trace}/ftrace.c (100%)
 create mode 100644 arch/powerpc/kernel/trace/ftrace_32.S
 create mode 100644 arch/powerpc/kernel/trace/ftrace_64.S
 rename arch/powerpc/kernel/{ => trace}/trace_clock.c (100%)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 3e461637b64d..b9db46ae545b 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -25,8 +25,6 @@ CFLAGS_REMOVE_cputable.o = -mno-sched-epilog $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_prom_init.o = -mno-sched-epilog $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_btext.o = -mno-sched-epilog $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_prom.o = -mno-sched-epilog $(CC_FLAGS_FTRACE)
-# do not trace tracer code
-CFLAGS_REMOVE_ftrace.o = -mno-sched-epilog $(CC_FLAGS_FTRACE)
 # timers used by tracing
 CFLAGS_REMOVE_time.o = -mno-sched-epilog $(CC_FLAGS_FTRACE)
 endif
@@ -119,10 +117,7 @@ obj64-$(CONFIG_AUDIT)  += compat_audit.o
 
 obj-$(CONFIG_PPC_IO_WORKAROUNDS)   += io-workarounds.o
 
-obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace.o
-obj-$(CONFIG_FUNCTION_GRAPH_TRACER)+= ftrace.o
-obj-$(CONFIG_FTRACE_SYSCALLS)  += ftrace.o
-obj-$(CONFIG_TRACING)  += trace_clock.o
+obj-y  += trace/
 
 ifneq ($(CONFIG_PPC_INDIRECT_PIO),y)
 obj-y  += iomap.o
@@ -143,8 +138,6 @@ obj-$(CONFIG_KVM_GUEST) += kvm.o kvm_emul.o
 # Disable GCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
 UBSAN_SANITIZE_prom_init.o := n
-GCOV_PROFILE_ftrace.o := n
-UBSAN_SANITIZE_ftrace.o := n
 GCOV_PROFILE_machine_kexec_64.o := n
 UBSAN_SANITIZE_machine_kexec_64.o := n
 GCOV_PROFILE_machine_kexec_32.o := n
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index a38600949f3a..8587059ad848 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -1315,109 +1314,3 @@ machine_check_in_rtas:
/* XXX load up BATs and panic */
 
 #endif /* CONFIG_PPC_RTAS */
-
-#ifdef CONFIG_FUNCTION_TRACER
-#ifdef CONFIG_DYNAMIC_FTRACE
-_GLOBAL(mcount)
-_GLOBAL(_mcount)
-   /*
-* It is required that _mcount on PPC32 must preserve the
-* link register. But we have r0 to play with. We use r0
-* to push the return address back to the caller of mcount
-* into the ctr register, restore the link register and
-* then jump back using the ctr register.
-*/
-   mflrr0
-   mtctr   r0
-   lwz r0, 4(r1)
-   mtlrr0
-   bctr
-
-_GLOBAL(ftrace_caller)
-   MCOUNT_SAVE_FRAME
-   /* r3 ends up with link register */
-   subir3, r3, MCOUNT_INSN_SIZE
-.globl ftrace_call
-ftrace_call:
-   bl  ftrace_stub
-   nop
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-.globl ftrace_graph_call
-ftrace_graph_call:
-   b   ftrace_graph_stub
-_GLOBAL(ftrace_graph_stub)
-#endif
-   MCOUNT_RESTORE_FRAME
-   /* old link register ends up in ctr reg */
-   bctr
-#else
-_GLOBAL(mcount)
-_GLOBAL(_mcount)
-
-   MCOUNT_SAVE_FRAME
-
-   subir3, r3, MCOUNT_INSN_SIZE
-   LOAD_REG_ADDR(r5, ftrace_trace_function)
-   lwz r5,0(r5)
-
-   mtctr   r5
-   bctrl
-   nop
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   b   ftrace_graph_caller
-#endif
-   MCOUNT_RESTORE_FRAME
-   bctr
-#endif
-EXPORT_SYMBOL(_mcount)
-
-_GLOBAL(ftrace_stub)
-   blr
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-_GLOBAL(ftrace_graph_caller)
-   /* load r4 with local address */
-   lwz r4, 44(r1)
-   subir4, r4, MCOUNT_INSN_SIZE
-
-   /* Grab the LR out of the caller stack frame */
-   lwz r3,52(r1)
-
-   bl  prepare_ftrace_return
-   nop
-
-/*
- * prepare_ftrace_return gives us the address we divert to.
- * Change the LR in the callers stack frame to this.
- */
-   stw r3,52(r1)
-
-   

[PATCH v4 2/2] powerpc: ftrace_64: split further based on -mprofile-kernel

2017-04-12 Thread Naveen N. Rao
Split ftrace_64.S further, retaining the core ftrace 64-bit aspects
in ftrace_64.S and moving ftrace_caller() and ftrace_graph_caller() into
separate files based on -mprofile-kernel. The livepatch routines are all
now contained within the mprofile file.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/trace/Makefile |   5 +
 arch/powerpc/kernel/trace/ftrace_64.S  | 307 +
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 272 ++
 arch/powerpc/kernel/trace/ftrace_64_pg.S   |  69 ++
 4 files changed, 347 insertions(+), 306 deletions(-)
 create mode 100644 arch/powerpc/kernel/trace/ftrace_64_mprofile.S
 create mode 100644 arch/powerpc/kernel/trace/ftrace_64_pg.S

diff --git a/arch/powerpc/kernel/trace/Makefile b/arch/powerpc/kernel/trace/Makefile
index 5f5a35254a9b..729dffc5f7bc 100644
--- a/arch/powerpc/kernel/trace/Makefile
+++ b/arch/powerpc/kernel/trace/Makefile
@@ -11,6 +11,11 @@ endif
 
 obj32-$(CONFIG_FUNCTION_TRACER)+= ftrace_32.o
 obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_64.o
+ifdef CONFIG_MPROFILE_KERNEL
+obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_64_mprofile.o
+else
+obj64-$(CONFIG_FUNCTION_TRACER)+= ftrace_64_pg.o
+endif
 obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER)+= ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)  += ftrace.o
diff --git a/arch/powerpc/kernel/trace/ftrace_64.S b/arch/powerpc/kernel/trace/ftrace_64.S
index 39dc44daa764..e5ccea19821e 100644
--- a/arch/powerpc/kernel/trace/ftrace_64.S
+++ b/arch/powerpc/kernel/trace/ftrace_64.S
@@ -23,233 +23,7 @@ EXPORT_SYMBOL(_mcount)
mtlrr0
bctr
 
-#ifndef CC_USING_MPROFILE_KERNEL
-_GLOBAL_TOC(ftrace_caller)
-   /* Taken from output of objdump from lib64/glibc */
-   mflrr3
-   ld  r11, 0(r1)
-   stdur1, -112(r1)
-   std r3, 128(r1)
-   ld  r4, 16(r11)
-   subir3, r3, MCOUNT_INSN_SIZE
-.globl ftrace_call
-ftrace_call:
-   bl  ftrace_stub
-   nop
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-.globl ftrace_graph_call
-ftrace_graph_call:
-   b   ftrace_graph_stub
-_GLOBAL(ftrace_graph_stub)
-#endif
-   ld  r0, 128(r1)
-   mtlrr0
-   addir1, r1, 112
-
-#else /* CC_USING_MPROFILE_KERNEL */
-/*
- *
- * ftrace_caller() is the function that replaces _mcount() when ftrace is
- * active.
- *
- * We arrive here after a function A calls function B, and we are the trace
- * function for B. When we enter r1 points to A's stack frame, B has not yet
- * had a chance to allocate one yet.
- *
- * Additionally r2 may point either to the TOC for A, or B, depending on
- * whether B did a TOC setup sequence before calling us.
- *
- * On entry the LR points back to the _mcount() call site, and r0 holds the
- * saved LR as it was on entry to B, ie. the original return address at the
- * call site in A.
- *
- * Our job is to save the register state into a struct pt_regs (on the stack)
- * and then arrange for the ftrace function to be called.
- */
-_GLOBAL(ftrace_caller)
-   /* Save the original return address in A's stack frame */
-   std r0,LRSAVE(r1)
-
-   /* Create our stack frame + pt_regs */
-   stdur1,-SWITCH_FRAME_SIZE(r1)
-
-   /* Save all gprs to pt_regs */
-   SAVE_8GPRS(0,r1)
-   SAVE_8GPRS(8,r1)
-   SAVE_8GPRS(16,r1)
-   SAVE_8GPRS(24,r1)
-
-   /* Load special regs for save below */
-   mfmsr   r8
-   mfctr   r9
-   mfxer   r10
-   mfcrr11
-
-   /* Get the _mcount() call site out of LR */
-   mflrr7
-   /* Save it as pt_regs->nip */
-   std r7, _NIP(r1)
-   /* Save the read LR in pt_regs->link */
-   std r0, _LINK(r1)
-
-   /* Save callee's TOC in the ABI compliant location */
-   std r2, 24(r1)
-   ld  r2,PACATOC(r13) /* get kernel TOC in r2 */
-
-   addis   r3,r2,function_trace_op@toc@ha
-   addir3,r3,function_trace_op@toc@l
-   ld  r5,0(r3)
-
-#ifdef CONFIG_LIVEPATCH
-   mr  r14,r7  /* remember old NIP */
-#endif
-   /* Calculate ip from nip-4 into r3 for call below */
-   subir3, r7, MCOUNT_INSN_SIZE
-
-   /* Put the original return address in r4 as parent_ip */
-   mr  r4, r0
-
-   /* Save special regs */
-   std r8, _MSR(r1)
-   std r9, _CTR(r1)
-   std r10, _XER(r1)
-   std r11, _CCR(r1)
-
-   /* Load _regs in r6 for call below */
-   addir6, r1 ,STACK_FRAME_OVERHEAD
-
-   /* ftrace_call(r3, r4, r5, r6) */
-.globl ftrace_call
-ftrace_call:
-   bl  ftrace_stub
-   nop
-
-   /* Load ctr with the possibly modified NIP */
-   ld  r3, _NIP(r1)
-   mtctr   r3
-#ifdef CONFIG_LIVEPATCH
-   cmpdr14,r3  /* has NIP been altered? */
-#endif
-
-   /* Restore gprs 

[PATCH v3 4/5] powerpc: kprobes: add support for KPROBES_ON_FTRACE

2017-04-12 Thread Naveen N. Rao
Allow kprobes to be placed on ftrace _mcount() call sites. This
optimization avoids the use of a trap by riding on the ftrace
infrastructure.

This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on
MPROFILE_KERNEL, which is only currently enabled on powerpc64le with
newer toolchains.

Based on the x86 code by Masami.

Signed-off-by: Naveen N. Rao 
---
 .../debug/kprobes-on-ftrace/arch-support.txt   |   2 +-
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/include/asm/kprobes.h |  10 ++
 arch/powerpc/kernel/Makefile   |   3 +
 arch/powerpc/kernel/kprobes-ftrace.c   | 104 +
 arch/powerpc/kernel/kprobes.c  |   8 +-
 6 files changed, 126 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kernel/kprobes-ftrace.c

diff --git a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt
index 40f44d041fb4..930430c6aef6 100644
--- a/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt
+++ b/Documentation/features/debug/kprobes-on-ftrace/arch-support.txt
@@ -27,7 +27,7 @@
 |   nios2: | TODO |
 |openrisc: | TODO |
 |  parisc: | TODO |
-| powerpc: | TODO |
+| powerpc: |  ok  |
 |s390: | TODO |
 |   score: | TODO |
 |  sh: | TODO |
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9ff731f50a29..a55a776a1a43 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -142,6 +142,7 @@ config PPC
select HAVE_IRQ_EXIT_ON_IRQ_STACK
select HAVE_KERNEL_GZIP
select HAVE_KPROBES
+   select HAVE_KPROBES_ON_FTRACE
select HAVE_KRETPROBES
select HAVE_LIVEPATCH   if HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_MEMBLOCK
diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
index a843884aafaf..a83821f33ea3 100644
--- a/arch/powerpc/include/asm/kprobes.h
+++ b/arch/powerpc/include/asm/kprobes.h
@@ -103,6 +103,16 @@ extern int kprobe_exceptions_notify(struct notifier_block *self,
 extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
 extern int kprobe_handler(struct pt_regs *regs);
 extern int kprobe_post_handler(struct pt_regs *regs);
+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
+  struct kprobe_ctlblk *kcb);
+#else
+static inline int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb)
+{
+   return 0;
+}
+#endif
 #else
 static inline int kprobe_handler(struct pt_regs *regs) { return 0; }
 static inline int kprobe_post_handler(struct pt_regs *regs) { return 0; }
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 811f441a125f..3e461637b64d 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_BOOTX_TEXT)  += btext.o
 obj-$(CONFIG_SMP)  += smp.o
 obj-$(CONFIG_KPROBES)  += kprobes.o
 obj-$(CONFIG_OPTPROBES)+= optprobes.o optprobes_head.o
+obj-$(CONFIG_KPROBES_ON_FTRACE)+= kprobes-ftrace.o
 obj-$(CONFIG_UPROBES)  += uprobes.o
 obj-$(CONFIG_PPC_UDBG_16550)   += legacy_serial.o udbg_16550.o
 obj-$(CONFIG_STACKTRACE)   += stacktrace.o
@@ -150,6 +151,8 @@ GCOV_PROFILE_machine_kexec_32.o := n
 UBSAN_SANITIZE_machine_kexec_32.o := n
 GCOV_PROFILE_kprobes.o := n
 UBSAN_SANITIZE_kprobes.o := n
+GCOV_PROFILE_kprobes-ftrace.o := n
+UBSAN_SANITIZE_kprobes-ftrace.o := n
 UBSAN_SANITIZE_vdso.o := n
 
 extra-$(CONFIG_PPC_FPU)+= fpu.o
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
new file mode 100644
index ..6c089d9757c9
--- /dev/null
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -0,0 +1,104 @@
+/*
+ * Dynamic Ftrace based Kprobes Optimization
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) Hitachi Ltd., 2012
+ * Copyright 2016 Naveen N. Rao 
+ *   IBM Corporation
+ */
+#include 
+#include 
+#include 
+#include 

[PATCH v3 3/5] kprobes: Skip preparing optprobe if the probe is ftrace-based

2017-04-12 Thread Naveen N. Rao
From: Masami Hiramatsu 

Skip preparing the optprobe if the probe is ftrace-based, since it must
not be optimized anyway (or is already optimized by ftrace).

Tested-by: Naveen N. Rao 
Signed-off-by: Masami Hiramatsu 
---
Though this patch is generic, it is needed for KPROBES_ON_FTRACE to work
on powerpc.

- Naveen


 kernel/kprobes.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 6a128f3a7ed1..406889889ce5 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -743,13 +743,20 @@ static void kill_optimized_kprobe(struct kprobe *p)
arch_remove_optimized_kprobe(op);
 }
 
+static inline
+void __prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)
+{
+   if (!kprobe_ftrace(p))
+   arch_prepare_optimized_kprobe(op, p);
+}
+
 /* Try to prepare optimized instructions */
 static void prepare_optimized_kprobe(struct kprobe *p)
 {
struct optimized_kprobe *op;
 
op = container_of(p, struct optimized_kprobe, kp);
-   arch_prepare_optimized_kprobe(op, p);
+   __prepare_optimized_kprobe(op, p);
 }
 
 /* Allocate new optimized_kprobe and try to prepare optimized instructions */
@@ -763,7 +770,7 @@ static struct kprobe *alloc_aggr_kprobe(struct kprobe *p)
 
	INIT_LIST_HEAD(&op->list);
op->kp.addr = p->addr;
-   arch_prepare_optimized_kprobe(op, p);
+   __prepare_optimized_kprobe(op, p);
 
	return &op->kp;
 }
-- 
2.12.1



[PATCH v3 2/5] powerpc: ftrace: restore LR from pt_regs

2017-04-12 Thread Naveen N. Rao
Pass the real LR to the ftrace handler. This is needed for
KPROBES_ON_FTRACE for the pre handlers.

Also, with KPROBES_ON_FTRACE, the link register may be updated by the
pre handlers or by a registered kretprobe. Honor the updated LR by restoring
it from pt_regs, rather than from the stack save area.

Live patch and function graph continue to work fine with this change.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 8fd8718722a1..744b2f91444a 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -1248,9 +1248,10 @@ _GLOBAL(ftrace_caller)
 
/* Get the _mcount() call site out of LR */
mflrr7
-   /* Save it as pt_regs->nip & pt_regs->link */
+   /* Save it as pt_regs->nip */
std r7, _NIP(r1)
-   std r7, _LINK(r1)
+   /* Save the real LR in pt_regs->link */
+   std r0, _LINK(r1)
 
/* Save callee's TOC in the ABI compliant location */
std r2, 24(r1)
@@ -1297,16 +1298,16 @@ ftrace_call:
REST_8GPRS(16,r1)
REST_8GPRS(24,r1)
 
+   /* Restore possibly modified LR */
+   ld  r0, _LINK(r1)
+   mtlrr0
+
/* Restore callee's TOC */
ld  r2, 24(r1)
 
/* Pop our stack frame */
addi r1, r1, SWITCH_FRAME_SIZE
 
-   /* Restore original LR for return to B */
-   ld  r0, LRSAVE(r1)
-   mtlrr0
-
 #ifdef CONFIG_LIVEPATCH
 /* Based on the cmpd above, if the NIP was altered handle livepatch */
bne-livepatch_handler
-- 
2.12.1



[PATCH v3 5/5] powerpc: kprobes: prefer ftrace when probing function entry

2017-04-12 Thread Naveen N. Rao
KPROBES_ON_FTRACE avoids much of the overhead with regular kprobes as it
eliminates the need for a trap, as well as the need to emulate or
single-step instructions.

Though OPTPROBES provides us with similar performance, we have limited
optprobes trampoline slots. As such, when asked to probe at a function
entry, default to using the ftrace infrastructure.

With:
# cd /sys/kernel/debug/tracing
# echo 'p _do_fork' > kprobe_events

before patch:
# cat ../kprobes/list
c00daf08  k  _do_fork+0x8[DISABLED]
c0044fc0  k  kretprobe_trampoline+0x0[OPTIMIZED]

and after patch:
# cat ../kprobes/list
c00d074c  k  _do_fork+0xc[DISABLED][FTRACE]
c00412b0  k  kretprobe_trampoline+0x0[OPTIMIZED]

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index b78b274e1d6e..23d19678a56f 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -49,8 +49,21 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
 #ifdef PPC64_ELF_ABI_v2
/* PPC64 ABIv2 needs local entry point */
addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);
-   if (addr && !offset)
-   addr = (kprobe_opcode_t *)ppc_function_entry(addr);
+   if (addr && !offset) {
+#ifdef CONFIG_KPROBES_ON_FTRACE
+   unsigned long faddr;
+   /*
+* Per livepatch.h, ftrace location is always within the first
+* 16 bytes of a function on powerpc with -mprofile-kernel.
+*/
+   faddr = ftrace_location_range((unsigned long)addr,
+ (unsigned long)addr + 16);
+   if (faddr)
+   addr = (kprobe_opcode_t *)faddr;
+   else
+#endif
+   addr = (kprobe_opcode_t *)ppc_function_entry(addr);
+   }
 #elif defined(PPC64_ELF_ABI_v1)
/*
 * 64bit powerpc ABIv1 uses function descriptors:
-- 
2.12.1



[PATCH v3 0/5] powerpc: add support for KPROBES_ON_FTRACE

2017-04-12 Thread Naveen N. Rao
v2:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg114659.html

For v3, this has only been rebased on top of powerpc/next and carries a
minor change to patch 4/5. No other changes.

Also, though patch 3/5 is generic, it needs to be carried in this
series as we crash on powerpc without that patch.


- Naveen


Masami Hiramatsu (1):
  kprobes: Skip preparing optprobe if the probe is ftrace-based

Naveen N. Rao (4):
  powerpc: ftrace: minor cleanup
  powerpc: ftrace: restore LR from pt_regs
  powerpc: kprobes: add support for KPROBES_ON_FTRACE
  powerpc: kprobes: prefer ftrace when probing function entry

 .../debug/kprobes-on-ftrace/arch-support.txt   |   2 +-
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/include/asm/kprobes.h |  10 ++
 arch/powerpc/kernel/Makefile   |   3 +
 arch/powerpc/kernel/entry_64.S |  19 ++--
 arch/powerpc/kernel/kprobes-ftrace.c   | 104 +
 arch/powerpc/kernel/kprobes.c  |  25 -
 kernel/kprobes.c   |  11 ++-
 8 files changed, 159 insertions(+), 16 deletions(-)
 create mode 100644 arch/powerpc/kernel/kprobes-ftrace.c

-- 
2.12.1



[PATCH v3 1/5] powerpc: ftrace: minor cleanup

2017-04-12 Thread Naveen N. Rao
Move the stack setup and teardown code to the ftrace_graph_caller().
This way, we don't incur the cost of setting it up unless function graph
is enabled for this function.

Also, remove the extraneous LR restore code after the function graph
stub. LR has previously been restored and neither livepatch_handler()
nor ftrace_graph_caller() return back here.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6432d4bf08c8..8fd8718722a1 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -1313,16 +1313,12 @@ ftrace_call:
 #endif
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   stdur1, -112(r1)
 .globl ftrace_graph_call
 ftrace_graph_call:
b   ftrace_graph_stub
 _GLOBAL(ftrace_graph_stub)
-   addir1, r1, 112
 #endif
 
-   ld  r0,LRSAVE(r1)   /* restore callee's lr at _mcount site */
-   mtlrr0
bctr/* jump after _mcount site */
 #endif /* CC_USING_MPROFILE_KERNEL */
 
@@ -1446,6 +1442,7 @@ _GLOBAL(ftrace_stub)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 #ifndef CC_USING_MPROFILE_KERNEL
 _GLOBAL(ftrace_graph_caller)
+   stdur1, -112(r1)
/* load r4 with local address */
ld  r4, 128(r1)
subir4, r4, MCOUNT_INSN_SIZE
@@ -1471,6 +1468,7 @@ _GLOBAL(ftrace_graph_caller)
 
 #else /* CC_USING_MPROFILE_KERNEL */
 _GLOBAL(ftrace_graph_caller)
+   stdur1, -112(r1)
/* with -mprofile-kernel, parameter regs are still alive at _mcount */
std r10, 104(r1)
std r9, 96(r1)
-- 
2.12.1



[PATCH v2 3/5] powerpc: introduce a new helper to obtain function entry points

2017-04-12 Thread Naveen N. Rao
kprobe_lookup_name() is specific to the kprobe subsystem and may not
always return the function entry point (in a subsequent patch for
KPROBES_ON_FTRACE). For looking up function entry points, introduce a
separate helper and use the same in optprobes.c

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/code-patching.h | 37 
 arch/powerpc/kernel/optprobes.c  |  6 +++---
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index 8ab937771068..3e994f404434 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -12,6 +12,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 /* Flags for create_branch:
  * "b"   == create_branch(addr, target, 0);
@@ -99,6 +101,41 @@ static inline unsigned long ppc_global_function_entry(void *func)
 #endif
 }
 
+/*
+ * Wrapper around kallsyms_lookup() to return function entry address:
+ * - For ABIv1, we lookup the dot variant.
+ * - For ABIv2, we return the local entry point.
+ */
+static inline unsigned long ppc_kallsyms_lookup_name(const char *name)
+{
+   unsigned long addr;
+#ifdef PPC64_ELF_ABI_v1
+   /* check for dot variant */
+   char dot_name[1 + KSYM_NAME_LEN];
+   bool dot_appended = false;
+   if (name[0] != '.') {
+   dot_name[0] = '.';
+   dot_name[1] = '\0';
+   strncat(dot_name, name, KSYM_NAME_LEN - 2);
+   dot_appended = true;
+   } else {
+   dot_name[0] = '\0';
+   strncat(dot_name, name, KSYM_NAME_LEN - 1);
+   }
+   addr = kallsyms_lookup_name(dot_name);
+   if (!addr && dot_appended)
+   /* Let's try the original non-dot symbol lookup */
+   addr = kallsyms_lookup_name(name);
+#elif defined(PPC64_ELF_ABI_v2)
+   addr = kallsyms_lookup_name(name);
+   if (addr)
+   addr = ppc_function_entry((void *)addr);
+#else
+   addr = kallsyms_lookup_name(name);
+#endif
+   return addr;
+}
+
 #ifdef CONFIG_PPC64
 /*
  * Some instruction encodings commonly used in dynamic ftracing
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index ce81a322251c..ec60ed0d4aad 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -243,10 +243,10 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)
/*
 * 2. branch to optimized_callback() and emulate_step()
 */
-   op_callback_addr = kprobe_lookup_name("optimized_callback", 0);
-   emulate_step_addr = kprobe_lookup_name("emulate_step", 0);
+   op_callback_addr = (kprobe_opcode_t *)ppc_kallsyms_lookup_name("optimized_callback");
+   emulate_step_addr = (kprobe_opcode_t *)ppc_kallsyms_lookup_name("emulate_step");
if (!op_callback_addr || !emulate_step_addr) {
-   WARN(1, "kprobe_lookup_name() failed\n");
+   WARN(1, "Unable to lookup optimized_callback()/emulate_step()\n");
goto error;
}
 
-- 
2.12.1



[PATCH v2 4/5] powerpc: kprobes: factor out code to emulate instruction into a helper

2017-04-12 Thread Naveen N. Rao
This helper will be used in a subsequent patch to emulate instructions
on re-entering the kprobe handler. No functional change.

Acked-by: Ananth N Mavinakayanahalli 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c | 52 ++-
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 0732a0291ace..8b48f7d046bd 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -207,6 +207,35 @@ void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
regs->link = (unsigned long)kretprobe_trampoline;
 }
 
+int __kprobes try_to_emulate(struct kprobe *p, struct pt_regs *regs)
+{
+   int ret;
+   unsigned int insn = *p->ainsn.insn;
+
+   /* regs->nip is also adjusted if emulate_step returns 1 */
+   ret = emulate_step(regs, insn);
+   if (ret > 0) {
+   /*
+* Once this instruction has been boosted
+* successfully, set the boostable flag
+*/
+   if (unlikely(p->ainsn.boostable == 0))
+   p->ainsn.boostable = 1;
+   } else if (ret < 0) {
+   /*
+* We don't allow kprobes on mtmsr(d)/rfi(d), etc.
+* So, we should never get here... but, its still
+* good to catch them, just in case...
+*/
+   printk("Can't step on instruction %x\n", insn);
+   BUG();
+   } else if (ret == 0)
+   /* This instruction can't be boosted */
+   p->ainsn.boostable = -1;
+
+   return ret;
+}
+
 int __kprobes kprobe_handler(struct pt_regs *regs)
 {
struct kprobe *p;
@@ -302,18 +331,9 @@ int __kprobes kprobe_handler(struct pt_regs *regs)
 
 ss_probe:
if (p->ainsn.boostable >= 0) {
-   unsigned int insn = *p->ainsn.insn;
+   ret = try_to_emulate(p, regs);
 
-   /* regs->nip is also adjusted if emulate_step returns 1 */
-   ret = emulate_step(regs, insn);
if (ret > 0) {
-   /*
-* Once this instruction has been boosted
-* successfully, set the boostable flag
-*/
-   if (unlikely(p->ainsn.boostable == 0))
-   p->ainsn.boostable = 1;
-
if (p->post_handler)
p->post_handler(p, regs, 0);
 
@@ -321,17 +341,7 @@ int __kprobes kprobe_handler(struct pt_regs *regs)
reset_current_kprobe();
preempt_enable_no_resched();
return 1;
-   } else if (ret < 0) {
-   /*
-* We don't allow kprobes on mtmsr(d)/rfi(d), etc.
-* So, we should never get here... but, its still
-* good to catch them, just in case...
-*/
-   printk("Can't step on instruction %x\n", insn);
-   BUG();
-   } else if (ret == 0)
-   /* This instruction can't be boosted */
-   p->ainsn.boostable = -1;
+   }
}
prepare_singlestep(p, regs);
kcb->kprobe_status = KPROBE_HIT_SS;
-- 
2.12.1



[PATCH v2 5/5] powerpc: kprobes: emulate instructions on kprobe handler re-entry

2017-04-12 Thread Naveen N. Rao
On kprobe handler re-entry, try to emulate the instruction rather than
always single-stepping.

As a related change, remove the duplicate saving of msr as that is
already done in set_current_kprobe()

Acked-by: Ananth N Mavinakayanahalli 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 8b48f7d046bd..005bd4a75902 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -273,10 +273,17 @@ int __kprobes kprobe_handler(struct pt_regs *regs)
 */
save_previous_kprobe(kcb);
set_current_kprobe(p, regs, kcb);
-   kcb->kprobe_saved_msr = regs->msr;
kprobes_inc_nmissed_count(p);
prepare_singlestep(p, regs);
kcb->kprobe_status = KPROBE_REENTER;
+   if (p->ainsn.boostable >= 0) {
+   ret = try_to_emulate(p, regs);
+
+   if (ret > 0) {
+   restore_previous_kprobe(kcb);
+   return 1;
+   }
+   }
return 1;
} else {
if (*addr != BREAKPOINT_INSTRUCTION) {
-- 
2.12.1



[PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a function

2017-04-12 Thread Naveen N. Rao
The macro is now pretty long and ugly on powerpc. In light of further
changes needed here, convert it to a __weak variant that can be
overridden with a nicer-looking function.

Suggested-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/kprobes.h | 53 --
 arch/powerpc/kernel/kprobes.c  | 58 ++
 arch/powerpc/kernel/optprobes.c|  4 +--
 include/linux/kprobes.h|  1 +
 kernel/kprobes.c   | 20 ++---
 5 files changed, 69 insertions(+), 67 deletions(-)

diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
index 0503c98b2117..a843884aafaf 100644
--- a/arch/powerpc/include/asm/kprobes.h
+++ b/arch/powerpc/include/asm/kprobes.h
@@ -61,59 +61,6 @@ extern kprobe_opcode_t optprobe_template_end[];
#define MAX_OPTINSN_SIZE   (optprobe_template_end - optprobe_template_entry)
 #define RELATIVEJUMP_SIZE  sizeof(kprobe_opcode_t) /* 4 bytes */
 
-#ifdef PPC64_ELF_ABI_v2
-/* PPC64 ABIv2 needs local entry point */
-#define kprobe_lookup_name(name, addr) \
-{  \
-   addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);   \
-   if (addr)   \
-   addr = (kprobe_opcode_t *)ppc_function_entry(addr); \
-}
-#elif defined(PPC64_ELF_ABI_v1)
-/*
- * 64bit powerpc ABIv1 uses function descriptors:
- * - Check for the dot variant of the symbol first.
- * - If that fails, try looking up the symbol provided.
- *
- * This ensures we always get to the actual symbol and not the descriptor.
- * Also handle  format.
- */
-#define kprobe_lookup_name(name, addr) \
-{  \
-   char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN]; \
-   const char *modsym;						\
-   bool dot_appended = false;  \
-   if ((modsym = strchr(name, ':')) != NULL) { \
-   modsym++;   \
-   if (*modsym != '\0' && *modsym != '.') {\
-   /* Convert to  */   \
-   strncpy(dot_name, name, modsym - name); \
-   dot_name[modsym - name] = '.';  \
-   dot_name[modsym - name + 1] = '\0'; \
-   strncat(dot_name, modsym,   \
-   sizeof(dot_name) - (modsym - name) - 2);\
-   dot_appended = true;\
-   } else {\
-   dot_name[0] = '\0'; \
-   strncat(dot_name, name, sizeof(dot_name) - 1);  \
-   }   \
-   } else if (name[0] != '.') {\
-   dot_name[0] = '.';  \
-   dot_name[1] = '\0'; \
-   strncat(dot_name, name, KSYM_NAME_LEN - 2); \
-   dot_appended = true;\
-   } else {\
-   dot_name[0] = '\0'; \
-   strncat(dot_name, name, KSYM_NAME_LEN - 1); \
-   }   \
-   addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);   \
-   if (!addr && dot_appended) {\
-   /* Let's try the original non-dot symbol lookup */  \
-   addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);   \
-   }   \
-}
-#endif
-
 #define flush_insn_slot(p) do { } while (0)
 #define kretprobe_blacklist_size 0
 
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 331751701fed..a7aa7394954d 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -42,6 +42,64 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 
 struct kretprobe_blackpoint kretprobe_blacklist[] = {{NULL, NULL}};
 
+kprobe_opcode_t *kprobe_lookup_name(const char *name)
+{
+   kprobe_opcode_t *addr;
+
+#ifdef PPC64_ELF_ABI_v2
+   /* PPC64 ABIv2 needs local entry point */
+   addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);
+   if (addr)
+   addr = (kprobe_opcode_t 

[PATCH v2 2/5] powerpc: kprobes: fix handling of function offsets on ABIv2

2017-04-12 Thread Naveen N. Rao
commit 239aeba76409 ("perf powerpc: Fix kprobe and kretprobe handling
with kallsyms on ppc64le") changed how we use the offset field in struct
kprobe on ABIv2. perf now offsets from the GEP (Global entry point) if an
offset is specified and otherwise chooses the LEP (Local entry point).

Fix the same in kernel for kprobe API users. We do this by extending
kprobe_lookup_name() to accept an additional parameter to indicate the
offset specified with the kprobe registration. If offset is 0, we return
the local function entry and return the global entry point otherwise.

With:
# cd /sys/kernel/debug/tracing/
# echo "p _do_fork" >> kprobe_events
# echo "p _do_fork+0x10" >> kprobe_events

before this patch:
# cat ../kprobes/list
c00d0748  k  _do_fork+0x8[DISABLED]
c00d0758  k  _do_fork+0x18[DISABLED]
c00412b0  k  kretprobe_trampoline+0x0[OPTIMIZED]

and after:
# cat ../kprobes/list
c00d04c8  k  _do_fork+0x8[DISABLED]
c00d04d0  k  _do_fork+0x10[DISABLED]
c00412b0  k  kretprobe_trampoline+0x0[OPTIMIZED]

Acked-by: Ananth N Mavinakayanahalli 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c   | 4 ++--
 arch/powerpc/kernel/optprobes.c | 4 ++--
 include/linux/kprobes.h | 2 +-
 kernel/kprobes.c| 7 ---
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index a7aa7394954d..0732a0291ace 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -42,14 +42,14 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 
 struct kretprobe_blackpoint kretprobe_blacklist[] = {{NULL, NULL}};
 
-kprobe_opcode_t *kprobe_lookup_name(const char *name)
+kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset)
 {
kprobe_opcode_t *addr;
 
 #ifdef PPC64_ELF_ABI_v2
/* PPC64 ABIv2 needs local entry point */
addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);
-   if (addr)
+   if (addr && !offset)
addr = (kprobe_opcode_t *)ppc_function_entry(addr);
 #elif defined(PPC64_ELF_ABI_v1)
/*
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index aefe076d00e0..ce81a322251c 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -243,8 +243,8 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *p)
/*
 * 2. branch to optimized_callback() and emulate_step()
 */
-   op_callback_addr = kprobe_lookup_name("optimized_callback");
-   emulate_step_addr = kprobe_lookup_name("emulate_step");
+   op_callback_addr = kprobe_lookup_name("optimized_callback", 0);
+   emulate_step_addr = kprobe_lookup_name("emulate_step", 0);
if (!op_callback_addr || !emulate_step_addr) {
WARN(1, "kprobe_lookup_name() failed\n");
goto error;
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index 16f153c84646..1f82a3db00b1 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -379,7 +379,7 @@ static inline struct kprobe_ctlblk *get_kprobe_ctlblk(void)
	return this_cpu_ptr(&kprobe_ctlblk);
 }
 
-kprobe_opcode_t *kprobe_lookup_name(const char *name);
+kprobe_opcode_t *kprobe_lookup_name(const char *name, unsigned int offset);
 int register_kprobe(struct kprobe *p);
 void unregister_kprobe(struct kprobe *p);
 int register_kprobes(struct kprobe **kps, int num);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index f3421b6b47a3..6a128f3a7ed1 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -72,7 +72,8 @@ static struct {
raw_spinlock_t lock cacheline_aligned_in_smp;
 } kretprobe_table_locks[KPROBE_TABLE_SIZE];
 
-kprobe_opcode_t * __weak kprobe_lookup_name(const char *name)
+kprobe_opcode_t * __weak kprobe_lookup_name(const char *name,
+   unsigned int __unused)
 {
return ((kprobe_opcode_t *)(kallsyms_lookup_name(name)));
 }
@@ -1396,7 +1397,7 @@ static kprobe_opcode_t *kprobe_addr(struct kprobe *p)
goto invalid;
 
if (p->symbol_name) {
-   addr = kprobe_lookup_name(p->symbol_name);
+   addr = kprobe_lookup_name(p->symbol_name, p->offset);
if (!addr)
return ERR_PTR(-ENOENT);
}
@@ -2189,7 +2190,7 @@ static int __init init_kprobes(void)
/* lookup the function address from its name */
for (i = 0; kretprobe_blacklist[i].name != NULL; i++) {
kretprobe_blacklist[i].addr =
-   kprobe_lookup_name(kretprobe_blacklist[i].name);
+   kprobe_lookup_name(kretprobe_blacklist[i].name, 0);

[PATCH v2 0/5] powerpc: a few kprobe fixes and refactoring

2017-04-12 Thread Naveen N. Rao
v1:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1334843.html

For v2, this series has been re-ordered and rebased on top of
powerpc/next so as to make it easier to resolve conflicts with -tip. No
other changes.

- Naveen


Naveen N. Rao (5):
  kprobes: convert kprobe_lookup_name() to a function
  powerpc: kprobes: fix handling of function offsets on ABIv2
  powerpc: introduce a new helper to obtain function entry points
  powerpc: kprobes: factor out code to emulate instruction into a helper
  powerpc: kprobes: emulate instructions on kprobe handler re-entry

 arch/powerpc/include/asm/code-patching.h |  37 ++
 arch/powerpc/include/asm/kprobes.h   |  53 --
 arch/powerpc/kernel/kprobes.c| 119 +--
 arch/powerpc/kernel/optprobes.c  |   6 +-
 include/linux/kprobes.h  |   1 +
 kernel/kprobes.c |  21 +++---
 6 files changed, 147 insertions(+), 90 deletions(-)

-- 
2.12.1



Re: [PATCH v4 0/5] perf report: Show branch type

2017-04-12 Thread Jiri Olsa
On Wed, Apr 12, 2017 at 06:21:01AM +0800, Jin Yao wrote:

SNIP

> 
> 3. Use 2 bits in perf_branch_entry for a "cross" metrics checking
>for branch cross 4K or 2M area. It's an approximate computing
>for checking if the branch cross 4K page or 2MB page.
> 
> For example:
> 
> perf record -g --branch-filter any,save_type 
> 
> perf report --stdio
> 
>  JCC forward:  27.7%
> JCC backward:   9.8%
>  JMP:   0.0%
>  IND_JMP:   6.5%
> CALL:  26.6%
> IND_CALL:   0.0%
>  RET:  29.3%
> IRET:   0.0%
> CROSS_4K:   0.0%
> CROSS_2M:  14.3%

got mangled perf report --stdio output for:


[root@ibm-x3650m4-02 perf]# ./perf record -j any,save_type kill
kill: not enough arguments
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data (18 samples) ]

[root@ibm-x3650m4-02 perf]# ./perf report --stdio -f | head -30
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 253  of event 'cycles'
# Event count (approx.): 253
#
# Overhead  Command  Source Shared Object  Source Symbol
Target SymbolBasic Block Cycles
#   ...    
...  
...  ..
#
 8.30%  perf
Um  [kernel.vmlinux]  [k] __intel_pmu_enable_all.constprop.17  [k] 
native_write_msr - 
 7.91%  perf
Um  [kernel.vmlinux]  [k] intel_pmu_lbr_enable_all [k] 
__intel_pmu_enable_all.constprop.17  - 
 7.91%  perf
Um  [kernel.vmlinux]  [k] native_write_msr [k] 
intel_pmu_lbr_enable_all - 
 6.32%  kill libc-2.24.so  [.] _dl_addr 
[.] _dl_addr - 
 5.93%  perf
Um  [kernel.vmlinux]  [k] perf_iterate_ctx [k] 
perf_iterate_ctx - 
 2.77%  kill libc-2.24.so  [.] malloc   
[.] malloc   - 
 1.98%  kill libc-2.24.so  [.] _int_malloc  
[.] _int_malloc  - 
 1.58%  kill [kernel.vmlinux]  [k] __rb_insert_augmented
[k] __rb_insert_augmented- 
 1.58%  perf
Um  [kernel.vmlinux]  [k] perf_event_exec  [k] 
perf_event_exec  - 
 1.19%  kill [kernel.vmlinux]  [k] anon_vma_interval_tree_insert
[k] anon_vma_interval_tree_insert- 
 1.19%  kill [kernel.vmlinux]  [k] free_pgd_range   
[k] free_pgd_range   - 
 1.19%  kill [kernel.vmlinux]  [k] n_tty_write  
[k] n_tty_write  - 
 1.19%  perf
Um  [kernel.vmlinux]  [k] native_sched_clock   [k] 
sched_clock  - 
...
SNIP


jirka

