Re: [PATCH] powerpc/mm/radix: GR field got removed in ISA 3.0B

2017-06-21 Thread Paul Mackerras
On Wed, Jun 21, 2017 at 10:50:12AM +0530, Aneesh Kumar K.V wrote:
> The bit position is now marked reserved. Hence don't set the bit to 1.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/mmu.h | 1 -
>  arch/powerpc/kvm/book3s_hv.c | 6 +-
>  arch/powerpc/mm/pgtable-radix.c  | 2 +-
>  3 files changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 77529a3e3811..e28ce2793e7d 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -55,7 +55,6 @@ extern struct patb_entry *partition_tb;
>  #define RPDS_MASK	0x1f	/* root page dir. size field */
>  
>  /* Bits in patb1 field */
> -#define PATB_GR	(1UL << 63)	/* guest uses radix; must match HR */
>  #define PRTS_MASK	0x1f	/* process table size field */
>  #define PRTB_MASK	0x0ffffffffffff000UL
>  
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 42b7a4fd57d9..657729a433f9 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3185,7 +3185,7 @@ static void kvmppc_setup_partition_table(struct kvm *kvm)
>   } else {
>   dw0 = PATB_HR | radix__get_tree_size() |
>   __pa(kvm->arch.pgtable) | RADIX_PGD_INDEX_SIZE;
> - dw1 = PATB_GR | kvm->arch.process_table;
> + dw1 = kvm->arch.process_table;
>   }
>  
>   mmu_partition_table_set_entry(kvm->arch.lpid, dw0, dw1);
> @@ -3840,10 +3840,6 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct kvm_ppc_mmuv3_cfg *cfg)
>   if (radix != kvm_is_radix(kvm))
>   return -EINVAL;
>  
> - /* GR (guest radix) bit in process_table field must match */
> - if (!!(cfg->process_table & PATB_GR) != radix)
> - return -EINVAL;
> -

It's OK to take out the check, but we should also clear the GR bit,
and preferably any other reserved bits, before we put the value into
the partition table.

Paul.


Re: [PATCH V4 2/2] powerpc/powernv : Add support for OPAL-OCC command/response interface

2017-06-21 Thread Shilpasri G Bhat
Hi Cyril,

On 06/22/2017 06:28 AM, Cyril Bur wrote:
> On Wed, 2017-06-21 at 13:36 +0530, Shilpasri G Bhat wrote:
>> In P9, OCC (On-Chip-Controller) supports shared memory based
>> command-response interface. Within the shared memory there is an OPAL
>> command buffer and OCC response buffer that can be used to send
>> inband commands to OCC. This patch adds a platform driver to support
>> the command/response interface between OCC and the host.
>>
> 
> Sorry I probably should have pointed out earlier that I don't really
> understand the first patch or exactly what problem you're trying to
> solve. I've left it ignored, feel free to explain what the idea is
> there or hopefully someone who can see what you're trying to do can
> step in.

Thanks for reviewing this patch.

Regarding the first patch: OCC expects a different request_id in the command
interface every time OPAL requests a new command.
'opal_async_get_token_interruptible()' returns a free token from the
'opal_async_complete_map', which does not satisfy the above OCC requirement as
we may end up getting the same token again. Thus the first patch tries to get
a new token while excluding the token that was used for the last command.


> 
> As for this patch, just one thing.
> 
> 
>> Signed-off-by: Shilpasri G Bhat 
>> ---
>> - Hold occ->cmd_in_progress in read()
>> - Reset occ->rsp_consumed if copy_to_user() fails
>>
>>  arch/powerpc/include/asm/opal-api.h|  41 +++-
>>  arch/powerpc/include/asm/opal.h|   3 +
>>  arch/powerpc/platforms/powernv/Makefile|   2 +-
>>  arch/powerpc/platforms/powernv/opal-occ.c  | 313 
>> +
>>  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>>  arch/powerpc/platforms/powernv/opal.c  |   8 +
>>  6 files changed, 366 insertions(+), 2 deletions(-)
>>  create mode 100644 arch/powerpc/platforms/powernv/opal-occ.c
>>
> 
> [snip]
> 
>> +
>> +static ssize_t opal_occ_read(struct file *file, char __user *buf,
>> +			     size_t count, loff_t *ppos)
>> +{
>> +	struct miscdevice *dev = file->private_data;
>> +	struct occ *occ = container_of(dev, struct occ, dev);
>> +	int rc;
>> +
>> +	if (count < sizeof(*occ->rsp) + occ->rsp->size)
>> +		return -EINVAL;
>> +
>> +	if (!atomic_cmpxchg(&occ->rsp_consumed, 1, 0))
>> +		return -EBUSY;
>> +
>> +	if (atomic_cmpxchg(&occ->cmd_in_progress, 0, 1))
>> +		return -EBUSY;
>> +
> 
> Personally I would have done these two checks the other way around, it
> doesn't really matter which one you do first but what does matter is
> that you undo the change you did in the first cmpxchg if the second
> cmpxchg causes you do return.
> 
> In this case if cmd_in_progress then you'll have marked the response as
> consumed...

Here, if cmd_in_progress is set by some other thread doing a write(), then that
thread will set 'rsp_consumed' to valid on successful command completion. If
the write() fails, then not setting 'rsp_consumed' is the right behaviour, as
the user will not be able to read the previous command's response.

Thanks and Regards,
Shilpa

> 
>> +	rc = copy_to_user((void __user *)buf, occ->rsp,
>> +			  sizeof(occ->rsp) + occ->rsp->size);
>> +	if (rc) {
>> +		atomic_set(&occ->rsp_consumed, 1);
>> +		atomic_set(&occ->cmd_in_progress, 0);
>> +		pr_err("Failed to copy OCC response data to user\n");
>> +		return rc;
>> +	}
>> +
>> +	atomic_set(&occ->cmd_in_progress, 0);
>> +	return sizeof(*occ->rsp) + occ->rsp->size;
>> +}
>> +
> 
> [snip]
> 



Re: 1M hugepage size being registered on Linux

2017-06-21 Thread Michael Ellerman
Hi Victor,

Someone refreshed my memory on this, coffee was involved ...

victora  writes:
> Hi Alistair/Jeremy,
>
> I am working on a bug related to 1M hugepage size being registered on 
> Linux (Power 8 Baremetal - Garrison).

On those machines the property in the device tree comes straight from
hostboot, and it includes 1M:

# lsprop ibm,segment-page-sizes 
ibm,segment-page-sizes
 000c    0003  000c
 baseshift slbenclpnum shift
   0010  0007  0018
 penc  shift penc  shift
 0038  0010  0110  0002
 penc  baseshift slbenclpnum
 0010  0001  0018  0008
 shift penc  shift penc
 0014  0130  0001  0014 <--- 1MB = 2^0x14
 baseshift slbenclpnum shift
 0002  0018  0100  0001
 penc  baseshift slbenclpnum
 0018    0022  0120
 shift penc  baseshift slbenc
 0001  0022  0003
 lpnum shift penc


> I was checking dmesg and it seems that 1M page size is coming from 
> firmware to Linux.
>
> [0.00] base_shift=20: shift=20, sllp=0x0130, avpnm=0x, 
> tlbiel=0, penc=2
> [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0 pages

Which is why you see that message.

> Should Linux support this page size? As far as I know, this was an 
> unsupported page size in the past, wasn't it? If this should be supported 
> now, is there any specific reason for that?

It's unsupported in Linux because it doesn't match the page table
geometry.

We merged a patch from Aneesh to filter it out in 4.12-rc1:

  a525108cf1cc ("powerpc/mm/hugetlb: Filter out hugepage size not supported by 
page table layout")

I guess we should probably send that patch to stable et al.

cheers


Re: [PATCH 1/1] futex: remove duplicated code and fix UB

2017-06-21 Thread Darren Hart
On Wed, Jun 21, 2017 at 01:53:18PM +0200, Jiri Slaby wrote:
> There is code duplicated over all architecture's headers for
> futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
> and comparison of the result.
> 
> Remove this duplication and leave up to the arches only the needed
> assembly which is now in arch_futex_atomic_op_inuser.
> 
> This effectively distributes Will Deacon's arm64 fix for undefined
> behaviour reported by UBSAN to all architectures. The fix was done in
> commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
> FUTEX_OP_OPARG_SHIFT usage).  Look there for an example dump.
> 
> Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
> remove pointless access_ok() checks") as access_ok there returns true.
> We introduce it back to the helper for the sake of simplicity (it gets
> optimized away anyway).
> 

This required a minor manual merge for ARM on the tip of Linus' tree today. The
reduced duplication is a welcome improvement.

Reviewed-by: Darren Hart (VMware) 

-- 
Darren Hart
VMware Open Source Technology Center


Re: [PATCH v3 6/6] powerpc/64s: Blacklist rtas entry/exit from kprobes

2017-06-21 Thread Nicholas Piggin
On Thu, 22 Jun 2017 00:08:42 +0530
"Naveen N. Rao"  wrote:

> We can't take traps with relocation off, so blacklist enter_rtas() and
> rtas_return_loc(). However, instead of blacklisting all of enter_rtas(),
> introduce a new symbol __enter_rtas from where on we can't take a trap
> and blacklist that.
> 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/entry_64.S | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index d376f07153d7..49c35450f399 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -1076,6 +1076,8 @@ _GLOBAL(enter_rtas)
>  rldicr  r9,r9,MSR_SF_LG,(63-MSR_SF_LG)
>   ori r9,r9,MSR_IR|MSR_DR|MSR_FE0|MSR_FE1|MSR_FP|MSR_RI|MSR_LE
>   andcr6,r0,r9
> +
> +__enter_rtas:
>   sync/* disable interrupts so SRR0/1 */
>   mtmsrd  r0  /* don't get trashed */

Along the lines of the system call patch... For consistency, could we
put the __enter_rtas label right after the mtmsrd? And I wonder if we
should come up with a common prefix or postfix naming convention for
such labels used to control probing?

How do opal calls avoid tracing?

Thanks,
Nick


Re: [PATCH v3 5/6] powerpc/64s: Blacklist functions invoked on a trap

2017-06-21 Thread Nicholas Piggin
On Thu, 22 Jun 2017 00:08:41 +0530
"Naveen N. Rao"  wrote:

> Blacklist all functions involved while handling a trap. We:
> - convert some of the symbols into private symbols,
> - remove the duplicate 'restore' symbol, and
> - blacklist most functions involved while handling a trap.

I'm not sure removing "restore" makes it better. 
fast_exc_return_irq is a relatively specialised case...
I think all these names could be reworked and made a bit
more consistent and descriptive, but for this patch could
you just leave restore in there?

Otherwise it seems okay to me, but I haven't gone through
all the functions involved with trap yet and verified.

Thanks,
Nick


Re: [PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes

2017-06-21 Thread Nicholas Piggin
On Thu, 22 Jun 2017 00:08:40 +0530
"Naveen N. Rao"  wrote:

> It is actually safe to probe system_call() in entry_64.S, but only till
> we unset MSR_RI. To allow this, add a new symbol system_call_exit()
> after the mtmsrd and blacklist that. Though the mtmsrd instruction
> itself is now whitelisted, we won't be allowed to probe on it as we
> don't allow probing on rfi and mtmsr instructions (checked for in
> arch_prepare_kprobe()).

Can you add a little comment to say probes aren't allowed, and it's
located after the mtmsr in order to avoid contaminating traces?

Also I wonder if a slightly different name would be more instructive?
I don't normally care, but the system_call_common code isn't trivial
to follow. system_call_exit might give the impression that it is the
entire exit path (which would pair with system_call for entry).

Perhaps system_call_exit_notrace? No that sucks too :(

Thanks,
Nick


> 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/entry_64.S | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index ef8e6615b8ba..feeeadc9aa71 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -204,6 +204,7 @@ system_call:	/* label this so stack traces look sane */
>   mtmsrd  r11,1
>  #endif /* CONFIG_PPC_BOOK3E */
>  
> +system_call_exit:
>   ld  r9,TI_FLAGS(r12)
>   li  r11,-MAX_ERRNO
>   andi.   
> r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> @@ -412,7 +413,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>   b   .   /* prevent speculative execution */
>  #endif
>  _ASM_NOKPROBE_SYMBOL(system_call_common);
> -_ASM_NOKPROBE_SYMBOL(system_call);
> +_ASM_NOKPROBE_SYMBOL(system_call_exit);
>  
>  /* Save non-volatile GPRs, if not already saved. */
>  _GLOBAL(save_nvgprs)



Re: [PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes

2017-06-21 Thread Nicholas Piggin
On Thu, 22 Jun 2017 00:08:39 +0530
"Naveen N. Rao"  wrote:

> Convert some of the symbols into private symbols and blacklist
> system_call_common() and system_call() from kprobes. We can't take a
> trap at parts of these functions as either MSR_RI is unset or the kernel
> stack pointer is not yet setup.
> 
> Reviewed-by: Masami Hiramatsu 
> Signed-off-by: Naveen N. Rao 

I don't have a problem with this bunch of system call labels
going private. They've never added much for me in profiles.

Reviewed-by: Nicholas Piggin 

Semi-related question, why is system_call: where it is? Should we
move it up to right after the mtmsrd / wrteei instruction?
(obviously for another patch). It's pretty common to get PMU
interrupts coming in right after mtmsr and this makes profiles split
the syscall into two which is annoying.

Thanks,
Nick


Re: [PATCH v3 2/6] powerpc/64s: Convert .L__replay_interrupt_return to a local label

2017-06-21 Thread Nicholas Piggin
On Thu, 22 Jun 2017 00:08:38 +0530
"Naveen N. Rao"  wrote:

> Commit b48bbb82e2b835 ("powerpc/64s: Don't unbalance the return branch
> predictor in __replay_interrupt()") introduced __replay_interrupt_return
> symbol with '.L' prefix in hopes of keeping it private. However, due to
> the use of LOAD_REG_ADDR(), the assembler kept this symbol visible. Fix
> the same by instead using the local label '1'.
> 
> Fixes: b48bbb82e2b835 ("powerpc/64s: Don't unbalance the return branch
> predictor in __replay_interrupt()")
> Suggested-by: Nicholas Piggin 

Thanks, good catch.

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/kernel/exceptions-64s.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 07b79c2c70f8..2df6d7b3070f 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1629,7 +1629,7 @@ _GLOBAL(__replay_interrupt)
>* we don't give a damn about, so we don't bother storing them.
>*/
>   mfmsr   r12
> - LOAD_REG_ADDR(r11, .L__replay_interrupt_return)
> + LOAD_REG_ADDR(r11, 1f)
>   mfcrr9
>   ori r12,r12,MSR_EE
>   cmpwi   r3,0x900
> @@ -1647,6 +1647,6 @@ FTR_SECTION_ELSE
>   cmpwi   r3,0xa00
>   beq doorbell_super_common_msgclr
>  ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
> -.L__replay_interrupt_return:
> +1:
>   blr
>  



Re: [PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor

2017-06-21 Thread Nicholas Piggin
On Thu, 22 Jun 2017 00:08:37 +0530
"Naveen N. Rao"  wrote:

> Currently, we assume that the function pointer we receive in
> ppc_function_entry() points to a function descriptor. However, this is
> not always the case. In particular, assembly symbols without the right
> annotation do not have an associated function descriptor. Some of these
> symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().
> When such addresses are subsequently processed through
> arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
> below errors during bootup:
> [0.663963] Failed to find blacklist at 7d9b02a648029b6c
> [0.663970] Failed to find blacklist at a14d03d0394a0001
> [0.663972] Failed to find blacklist at 7d5302a6f94d0388
> [0.663973] Failed to find blacklist at 48027d11e8610178
> [0.663974] Failed to find blacklist at f8010070f8410080
> [0.663976] Failed to find blacklist at 386100704801f89d
> [0.663977] Failed to find blacklist at 7d5302a6f94d00b0
> 
> Fix this by checking if the address in the function descriptor is
> actually a valid kernel address. In the case of assembly symbols, this
> will almost always fail as this ends up being powerpc instructions. In
> that case, return pointer to the address we received, rather than the
> dereferenced value.
> 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/include/asm/code-patching.h | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
> index abef812de7f8..ec54050be585 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void *func)
>* On PPC64 ABIv1 the function pointer actually points to the
>* function's descriptor. The first entry in the descriptor is the
>* address of the function text.
> +  *
> +  * However, we may have received a pointer to an assembly symbol
> +  * that may not be a function descriptor. Validate that the entry
> +  * points to a valid kernel address and if not, return the pointer
> +  * we received as is.
>*/
> - return ((func_descr_t *)func)->entry;
> + if (kernel_text_address(((func_descr_t *)func)->entry))
> + return ((func_descr_t *)func)->entry;
> + else
> + return (unsigned long)func;

What if "func" is a text section label (bare asm function)?
Won't func->entry load the random instruction located there
and compare it with a kernel address?

I don't know too much about the v1 ABI, but should we check for
func belonging in the .opd section first and base the check on
that? Alternatively, if "func" is a kernel text address, we can
recognize it's not in the .opd section... right?

Thanks,
Nick


[PATCH 17/17] cxlflash: Update TMF command processing

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Currently, the SCSI command presented to the device reset handler is used
to send TMFs to the AFU for a device reset. This behavior is incorrect as
the command presented is an actual command and not a special notification.
As such, it should only be used for reference and not be acted upon.

Additionally, the existing TMF transmission routine does not account for
actual errors from the hardware, only reflecting failure when a timeout
occurs. This can lead to a condition where the device reset handler is
presented with a false 'success'.

Update send_tmf() to dynamically allocate a private command for sending
the TMF command and properly reflect failure when the completed command
indicates an error or was aborted. Detect TMF commands during response
processing and avoid scsi_done() for these types of commands. Lastly,
update comments in the TMF processing paths to describe the new behavior.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 65 ++--
 1 file changed, 44 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 4338982..7a787b6 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -155,9 +155,10 @@ static void process_cmd_err(struct afu_cmd *cmd, struct scsi_cmnd *scp)
  * cmd_complete() - command completion handler
  * @cmd:   AFU command that has completed.
  *
- * Prepares and submits command that has either completed or timed out to
- * the SCSI stack. Checks AFU command back into command pool for non-internal
- * (cmd->scp populated) commands.
+ * For SCSI commands this routine prepares and submits commands that have
+ * either completed or timed out to the SCSI stack. For internal commands
+ * (TMF or AFU), this routine simply notifies the originator that the
+ * command has completed.
  */
 static void cmd_complete(struct afu_cmd *cmd)
 {
@@ -167,7 +168,6 @@ static void cmd_complete(struct afu_cmd *cmd)
struct cxlflash_cfg *cfg = afu->parent;
struct device *dev = &cfg->dev->dev;
struct hwq *hwq = get_hwq(afu, cmd->hwq_index);
-   bool cmd_is_tmf;
 
spin_lock_irqsave(&hwq->hsq_slock, lock_flags);
list_del(&cmd->list);
@@ -180,19 +180,14 @@ static void cmd_complete(struct afu_cmd *cmd)
else
scp->result = (DID_OK << 16);
 
-   cmd_is_tmf = cmd->cmd_tmf;
-
dev_dbg_ratelimited(dev, "%s:scp=%p result=%08x ioasc=%08x\n",
__func__, scp, scp->result, cmd->sa.ioasc);
-
scp->scsi_done(scp);
-
-   if (cmd_is_tmf) {
-   spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
-   cfg->tmf_active = false;
-   wake_up_all_locked(&cfg->tmf_waitq);
-   spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);
-   }
+   } else if (cmd->cmd_tmf) {
+   spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
+   cfg->tmf_active = false;
+   wake_up_all_locked(&cfg->tmf_waitq);
+   spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);
} else
complete(&cmd->cevent);
 }
@@ -206,8 +201,10 @@ static void cmd_complete(struct afu_cmd *cmd)
  */
 static void flush_pending_cmds(struct hwq *hwq)
 {
+   struct cxlflash_cfg *cfg = hwq->afu->parent;
struct afu_cmd *cmd, *tmp;
struct scsi_cmnd *scp;
+   ulong lock_flags;
 
list_for_each_entry_safe(cmd, tmp, &hwq->pending_cmds, list) {
/* Bypass command when on a doneq, cmd_complete() will handle */
@@ -222,7 +219,15 @@ static void flush_pending_cmds(struct hwq *hwq)
scp->scsi_done(scp);
} else {
cmd->cmd_aborted = true;
-   complete(&cmd->cevent);
+
+   if (cmd->cmd_tmf) {
+   spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
+   cfg->tmf_active = false;
+   wake_up_all_locked(&cfg->tmf_waitq);
+   spin_unlock_irqrestore(&cfg->tmf_slock,
+  lock_flags);
+   } else
+   complete(&cmd->cevent);
}
}
 }
@@ -455,24 +460,35 @@ static u32 cmd_to_target_hwq(struct Scsi_Host *host, struct scsi_cmnd *scp,
 /**
  * send_tmf() - sends a Task Management Function (TMF)
  * @afu:   AFU to checkout from.
- * @scp:   SCSI command from stack.
+ * @scp:   SCSI command from stack describing target.
  * @tmfcmd:TMF command to send.
  *
  * Return:
- * 0 on success, SCSI_MLQUEUE_HOST_BUSY on failure
+ * 0 on success, SCSI_MLQUEUE_HOST_BUSY or -errno on failure
  */
 static int 

[PATCH 16/17] cxlflash: Remove zeroing of private command data

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

The SCSI core now zeroes the per-command private data area prior to
calling into the LLD. Replace the clearing operation that takes place
when the private command data reference is obtained with a routine that
performs common initializations. The zeroing that takes place in the
device reset path remains intact as the private command data associated
with the specified SCSI command is not guaranteed to be cleared.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h | 11 +--
 drivers/scsi/cxlflash/main.c   |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index e95e5a5..6d95e8e 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -178,15 +178,22 @@ static inline struct afu_cmd *sc_to_afuc(struct scsi_cmnd *sc)
return PTR_ALIGN(scsi_cmd_priv(sc), __alignof__(struct afu_cmd));
 }
 
-static inline struct afu_cmd *sc_to_afucz(struct scsi_cmnd *sc)
+static inline struct afu_cmd *sc_to_afuci(struct scsi_cmnd *sc)
 {
struct afu_cmd *afuc = sc_to_afuc(sc);
 
-   memset(afuc, 0, sizeof(*afuc));
INIT_LIST_HEAD(&afuc->queue);
return afuc;
 }
 
+static inline struct afu_cmd *sc_to_afucz(struct scsi_cmnd *sc)
+{
+   struct afu_cmd *afuc = sc_to_afuc(sc);
+
+   memset(afuc, 0, sizeof(*afuc));
+   return sc_to_afuci(sc);
+}
+
 struct hwq {
/* Stuff requiring alignment go first. */
struct sisl_ioarcb sq[NUM_SQ_ENTRY];/* 16K SQ */
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 0cce442..4338982 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -543,7 +543,7 @@ static int cxlflash_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scp)
struct cxlflash_cfg *cfg = shost_priv(host);
struct afu *afu = cfg->afu;
struct device *dev = &cfg->dev->dev;
-   struct afu_cmd *cmd = sc_to_afucz(scp);
+   struct afu_cmd *cmd = sc_to_afuci(scp);
struct scatterlist *sg = scsi_sglist(scp);
int hwq_index = cmd_to_target_hwq(host, scp, afu);
struct hwq *hwq = get_hwq(afu, hwq_index);
-- 
2.1.0



[PATCH 15/17] cxlflash: Support WS16 unmap

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

The cxlflash driver supports performing a write-same16 to scrub virtual
luns when they are released by a user. To date, AFUs for adapters that
are supported by cxlflash do not have the capability to unmap as part of
the WS operation. This can lead to fragmented flash devices which results
in performance degradation.

Future AFUs can optionally support unmap write-same commands and reflect
this support via the context control register. This provides userspace
applications with direct visibility such that they need not depend on a
host API.

Detect unmap support during cxlflash initialization by reading the context
control register associated with the primary hardware queue. Update the
existing write_same16() routine to set the unmap bit in the CDB when unmap
is supported by the host.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h  |  1 +
 drivers/scsi/cxlflash/main.c| 12 
 drivers/scsi/cxlflash/sislite.h |  1 +
 drivers/scsi/cxlflash/vlun.c|  1 +
 4 files changed, 15 insertions(+)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index a91151c..e95e5a5 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -147,6 +147,7 @@ struct cxlflash_cfg {
wait_queue_head_t tmf_waitq;
spinlock_t tmf_slock;
bool tmf_active;
+   bool ws_unmap;  /* Write-same unmap supported */
wait_queue_head_t reset_waitq;
enum cxlflash_state state;
async_cookie_t async_reset_cookie;
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index d3ad52e..0cce442 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -1812,6 +1812,18 @@ static int init_global(struct cxlflash_cfg *cfg)
SISL_CTX_CAP_AFU_CMD | SISL_CTX_CAP_GSCSI_CMD),
&hwq->ctrl_map->ctx_cap);
}
+
+   /*
+* Determine write-same unmap support for host by evaluating the unmap
+* sector support bit of the context control register associated with
+* the primary hardware queue. Note that while this status is reflected
+* in a context register, the outcome can be assumed to be host-wide.
+*/
+   hwq = get_hwq(afu, PRIMARY_HWQ);
+   reg = readq_be(&hwq->host_map->ctx_ctrl);
+   if (reg & SISL_CTX_CTRL_UNMAP_SECTOR)
+   cfg->ws_unmap = true;
+
/* Initialize heartbeat */
afu->hb = readq_be(&afu->afu_map->global.regs.afu_hb);
 out:
diff --git a/drivers/scsi/cxlflash/sislite.h b/drivers/scsi/cxlflash/sislite.h
index d671fae..09daa86 100644
--- a/drivers/scsi/cxlflash/sislite.h
+++ b/drivers/scsi/cxlflash/sislite.h
@@ -283,6 +283,7 @@ struct sisl_host_map {
__be64 rrq_end; /* write sequence: start followed by end */
__be64 cmd_room;
__be64 ctx_ctrl;/* least significant byte or b56:63 is LISN# */
+#define SISL_CTX_CTRL_UNMAP_SECTOR 0x8000ULL /* b0 */
__be64 mbox_w;  /* restricted use */
__be64 sq_start;/* Submission Queue (R/W): write sequence and */
__be64 sq_end;  /* inclusion semantics are the same as RRQ*/
diff --git a/drivers/scsi/cxlflash/vlun.c b/drivers/scsi/cxlflash/vlun.c
index 0800bcb..bdfb930 100644
--- a/drivers/scsi/cxlflash/vlun.c
+++ b/drivers/scsi/cxlflash/vlun.c
@@ -446,6 +446,7 @@ static int write_same16(struct scsi_device *sdev,
while (left > 0) {
 
scsi_cmd[0] = WRITE_SAME_16;
+   scsi_cmd[1] = cfg->ws_unmap ? 0x8 : 0;
put_unaligned_be64(offset, &scsi_cmd[2]);
put_unaligned_be32(ws_limit < left ? ws_limit : left,
   &scsi_cmd[10]);
-- 
2.1.0



[PATCH 14/17] cxlflash: Support AFU debug

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Adopt the SISLite AFU debug capability to allow future CXL Flash
adapters the ability to better debug AFU issues. Update the SISLite
header with the changes necessary to support AFU debug operations
and create a host ioctl interface for user debug software. Also
update the cxlflash documentation to describe this new host ioctl.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 Documentation/powerpc/cxlflash.txt | 14 ++
 drivers/scsi/cxlflash/common.h |  5 ++
 drivers/scsi/cxlflash/main.c   | 96 ++
 drivers/scsi/cxlflash/main.h   |  1 +
 drivers/scsi/cxlflash/sislite.h|  2 +
 include/uapi/scsi/cxlflash_ioctl.h | 37 +--
 6 files changed, 152 insertions(+), 3 deletions(-)

diff --git a/Documentation/powerpc/cxlflash.txt b/Documentation/powerpc/cxlflash.txt
index 2d6297b..a64bdaa 100644
--- a/Documentation/powerpc/cxlflash.txt
+++ b/Documentation/powerpc/cxlflash.txt
@@ -413,3 +413,17 @@ HT_CXLFLASH_LUN_PROVISION
 
 With this information, the number of available LUNs and capacity can
 be calculated.
+
+HT_CXLFLASH_AFU_DEBUG
+-
+This ioctl is used to debug AFUs by supporting a command pass-through
+interface. It is only valid when used with AFUs that support the AFU
+debug capability.
+
+With the exception of buffer management, AFU debug commands are opaque
+to cxlflash and treated as pass-through. For debug commands that do require
+data transfer, the user supplies an adequately sized data buffer and must
+specify the data transfer direction with respect to the host. There is a
+maximum transfer size of 256K imposed. Note that partial read completions
+are not supported - when errors are experienced with a host read data
+transfer, the data buffer is not copied back to the user.
diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 5810724..a91151c 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -262,6 +262,11 @@ static inline bool afu_has_cap(struct afu *afu, u64 cap)
return afu_cap & cap;
 }
 
+static inline bool afu_is_afu_debug(struct afu *afu)
+{
+   return afu_has_cap(afu, SISL_INTVER_CAP_AFU_DEBUG);
+}
+
 static inline bool afu_is_lun_provision(struct afu *afu)
 {
return afu_has_cap(afu, SISL_INTVER_CAP_LUN_PROVISION);
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 1279293..d3ad52e 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -3326,6 +3326,99 @@ static int cxlflash_lun_provision(struct cxlflash_cfg *cfg,
 }
 
 /**
+ * cxlflash_afu_debug() - host AFU debug handler
+ * @cfg:   Internal structure associated with the host.
+ * @arg:   Kernel copy of userspace ioctl data structure.
+ *
+ * For debug requests requiring a data buffer, always provide an aligned
+ * (cache line) buffer to the AFU to appease any alignment requirements.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int cxlflash_afu_debug(struct cxlflash_cfg *cfg,
+ struct ht_cxlflash_afu_debug *afu_dbg)
+{
+   struct afu *afu = cfg->afu;
+   struct device *dev = &cfg->dev->dev;
+   struct sisl_ioarcb rcb;
+   struct sisl_ioasa asa;
+   char *buf = NULL;
+   char *kbuf = NULL;
+   void __user *ubuf = (__force void __user *)afu_dbg->data_ea;
+   u16 req_flags = SISL_REQ_FLAGS_AFU_CMD;
+   u32 ulen = afu_dbg->data_len;
+   bool is_write = afu_dbg->hdr.flags & HT_CXLFLASH_HOST_WRITE;
+   int rc = 0;
+
+   if (!afu_is_afu_debug(afu)) {
+   rc = -ENOTSUPP;
+   goto out;
+   }
+
+   if (ulen) {
+   req_flags |= SISL_REQ_FLAGS_SUP_UNDERRUN;
+
+   if (ulen > HT_CXLFLASH_AFU_DEBUG_MAX_DATA_LEN) {
+   rc = -EINVAL;
+   goto out;
+   }
+
+   if (unlikely(!access_ok(is_write ? VERIFY_READ : VERIFY_WRITE,
+   ubuf, ulen))) {
+   rc = -EFAULT;
+   goto out;
+   }
+
+   buf = kmalloc(ulen + cache_line_size() - 1, GFP_KERNEL);
+   if (unlikely(!buf)) {
+   rc = -ENOMEM;
+   goto out;
+   }
+
+   kbuf = PTR_ALIGN(buf, cache_line_size());
+
+   if (is_write) {
+   req_flags |= SISL_REQ_FLAGS_HOST_WRITE;
+
+   rc = copy_from_user(kbuf, ubuf, ulen);
+   if (unlikely(rc))
+   goto out;
+   }
+   }
+
+   memset(&rcb, 0, sizeof(rcb));
+   memset(&asa, 0, sizeof(asa));
+
+   rcb.req_flags = req_flags;
+   rcb.msi = SISL_MSI_RRQ_UPDATED;
+   rcb.timeout 

[PATCH 13/17] cxlflash: Support LUN provisioning

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

Adopt the SISLite AFU LUN provisioning capability to allow future CXL
Flash adapters to better manage storage. Update the SISLite
header with the changes necessary to support LUN provision operations
and create a host ioctl interface for user LUN management software. Also
update the cxlflash documentation to describe this new host ioctl.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 Documentation/powerpc/cxlflash.txt |  31 +++
 drivers/scsi/cxlflash/common.h |   5 ++
 drivers/scsi/cxlflash/main.c   | 107 -
 drivers/scsi/cxlflash/main.h   |   5 ++
 drivers/scsi/cxlflash/sislite.h|  22 +++-
 include/uapi/scsi/cxlflash_ioctl.h |  31 +--
 6 files changed, 192 insertions(+), 9 deletions(-)

diff --git a/Documentation/powerpc/cxlflash.txt b/Documentation/powerpc/cxlflash.txt
index ee67021..2d6297b 100644
--- a/Documentation/powerpc/cxlflash.txt
+++ b/Documentation/powerpc/cxlflash.txt
@@ -382,3 +382,34 @@ CXL Flash Driver Host IOCTLs
 
 The structure definitions for these IOCTLs are available in:
 uapi/scsi/cxlflash_ioctl.h
+
+HT_CXLFLASH_LUN_PROVISION
+-------------------------
+This ioctl is used to create and delete persistent LUNs on cxlflash
+devices that lack an external LUN management interface. It is only
+valid when used with AFUs that support the LUN provision capability.
+
+When sufficient space is available, LUNs can be created by specifying
+the target port to host the LUN and a desired size in 4K blocks. Upon
+success, the LUN ID and WWID of the created LUN will be returned and
+the SCSI bus can be scanned to detect the change in LUN topology. Note
+that partial allocations are not supported. Should a creation fail due
+to a space issue, the target port can be queried for its current LUN
+geometry.
+
+To remove a LUN, the device must first be disassociated from the Linux
+SCSI subsystem. The LUN deletion can then be initiated by specifying a
+target port and LUN ID. Upon success, the LUN geometry associated with
+the port will be updated to reflect new number of provisioned LUNs and
+available capacity.
+
+To query the LUN geometry of a port, the target port is specified and
+upon success, the following information is presented:
+
+- Maximum number of provisioned LUNs allowed for the port
+- Current number of provisioned LUNs for the port
+- Maximum total capacity of provisioned LUNs for the port (4K blocks)
+- Current total capacity of provisioned LUNs for the port (4K blocks)
+
+With this information, the number of available LUNs and capacity can be
+calculated.
diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index c96526e..5810724 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -262,6 +262,11 @@ static inline bool afu_has_cap(struct afu *afu, u64 cap)
return afu_cap & cap;
 }
 
+static inline bool afu_is_lun_provision(struct afu *afu)
+{
+   return afu_has_cap(afu, SISL_INTVER_CAP_LUN_PROVISION);
+}
+
 static inline bool afu_is_sq_cmd_mode(struct afu *afu)
 {
return afu_has_cap(afu, SISL_INTVER_CAP_SQ_CMD_MODE);
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index be468ed1..1279293 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -3227,14 +3227,105 @@ static int cxlflash_chr_open(struct inode *inode, struct file *file)
 static char *decode_hioctl(int cmd)
 {
switch (cmd) {
-   default:
-   return "UNKNOWN";
+   case HT_CXLFLASH_LUN_PROVISION:
+   return __stringify_1(HT_CXLFLASH_LUN_PROVISION);
}
 
return "UNKNOWN";
 }
 
 /**
+ * cxlflash_lun_provision() - host LUN provisioning handler
+ * @cfg:   Internal structure associated with the host.
+ * @arg:   Kernel copy of userspace ioctl data structure.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int cxlflash_lun_provision(struct cxlflash_cfg *cfg,
+ struct ht_cxlflash_lun_provision *lunprov)
+{
+   struct afu *afu = cfg->afu;
+   struct device *dev = &cfg->dev->dev;
+   struct sisl_ioarcb rcb;
+   struct sisl_ioasa asa;
+   __be64 __iomem *fc_port_regs;
+   u16 port = lunprov->port;
+   u16 scmd = lunprov->hdr.subcmd;
+   u16 type;
+   u64 reg;
+   u64 size;
+   u64 lun_id;
+   int rc = 0;
+
+   if (!afu_is_lun_provision(afu)) {
+   rc = -ENOTSUPP;
+   goto out;
+   }
+
+   if (port >= cfg->num_fc_ports) {
+   rc = -EINVAL;
+   goto out;
+   }
+
+   switch (scmd) {
+   case HT_CXLFLASH_LUN_PROVISION_SUBCMD_CREATE_LUN:
+   type = 

[PATCH 12/17] cxlflash: Refactor AFU capability checking

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

The existing AFU capability checking infrastructure is closely tied to
the command mode capability bits. In order to support new capabilities,
refactor the existing infrastructure to be more generic.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index cbc0eb7..c96526e 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -255,21 +255,21 @@ static inline bool afu_is_irqpoll_enabled(struct afu *afu)
return !!afu->irqpoll_weight;
 }
 
-static inline bool afu_is_cmd_mode(struct afu *afu, u64 cmd_mode)
+static inline bool afu_has_cap(struct afu *afu, u64 cap)
 {
u64 afu_cap = afu->interface_version >> SISL_INTVER_CAP_SHIFT;
 
-   return afu_cap & cmd_mode;
+   return afu_cap & cap;
 }
 
 static inline bool afu_is_sq_cmd_mode(struct afu *afu)
 {
-   return afu_is_cmd_mode(afu, SISL_INTVER_CAP_SQ_CMD_MODE);
+   return afu_has_cap(afu, SISL_INTVER_CAP_SQ_CMD_MODE);
 }
 
 static inline bool afu_is_ioarrin_cmd_mode(struct afu *afu)
 {
-   return afu_is_cmd_mode(afu, SISL_INTVER_CAP_IOARRIN_CMD_MODE);
+   return afu_has_cap(afu, SISL_INTVER_CAP_IOARRIN_CMD_MODE);
 }
 
 static inline u64 lun_to_lunid(u64 lun)
-- 
2.1.0



[PATCH 11/17] cxlflash: Introduce host ioctl support

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

As staging for supporting various host management functions, add a host
ioctl infrastructure to filter ioctl commands and perform operations that
are common for all host ioctls. Also update the cxlflash documentation to
create a new section for documenting host ioctls.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 Documentation/ioctl/ioctl-number.txt |   2 +-
 Documentation/powerpc/cxlflash.txt   |  19 +-
 drivers/scsi/cxlflash/main.c | 121 ++-
 include/uapi/scsi/cxlflash_ioctl.h   |  31 -
 4 files changed, 168 insertions(+), 5 deletions(-)

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 08244be..6b6cc4c 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -324,7 +324,7 @@ Code  Seq#(hex) Include File            Comments
 0xB5   00-0F   uapi/linux/rpmsg.h  

 0xC0   00-0F   linux/usb/iowarrior.h
 0xCA   00-0F   uapi/misc/cxl.h
-0xCA   80-8F   uapi/scsi/cxlflash_ioctl.h
+0xCA   80-BF   uapi/scsi/cxlflash_ioctl.h
 0xCB   00-1F   CBM serial IEC bus  in development:


 0xCD   01  linux/reiserfs_fs.h
diff --git a/Documentation/powerpc/cxlflash.txt b/Documentation/powerpc/cxlflash.txt
index f9036cb..ee67021 100644
--- a/Documentation/powerpc/cxlflash.txt
+++ b/Documentation/powerpc/cxlflash.txt
@@ -124,8 +124,8 @@ Block library API
 http://github.com/open-power/capiflash
 
 
-CXL Flash Driver IOCTLs
-=======================
+CXL Flash Driver LUN IOCTLs
+===========================
 
 Users, such as the block library, that wish to interface with a flash
 device (LUN) via user space access need to use the services provided
@@ -367,3 +367,18 @@ DK_CXLFLASH_MANAGE_LUN
 exclusive user space access (superpipe). In case a LUN is visible
 across multiple ports and adapters, this ioctl is used to uniquely
 identify each LUN by its World Wide Node Name (WWNN).
+
+
+CXL Flash Driver Host IOCTLs
+============================
+
+Each host adapter instance that is supported by the cxlflash driver
+has a special character device associated with it to enable a set of
+host management functions. These character devices are hosted in a
+class dedicated for cxlflash and can be accessed via /dev/cxlflash/*.
+
+Applications can be written to perform various functions using the
+host ioctl APIs below.
+
+The structure definitions for these IOCTLs are available in:
+uapi/scsi/cxlflash_ioctl.h
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 7732dfc..be468ed1 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -2693,7 +2693,14 @@ static ssize_t lun_mode_store(struct device *dev,
 static ssize_t ioctl_version_show(struct device *dev,
  struct device_attribute *attr, char *buf)
 {
-   return scnprintf(buf, PAGE_SIZE, "%u\n", DK_CXLFLASH_VERSION_0);
+   ssize_t bytes = 0;
+
+   bytes = scnprintf(buf, PAGE_SIZE,
+ "disk: %u\n", DK_CXLFLASH_VERSION_0);
+   bytes += scnprintf(buf + bytes, PAGE_SIZE - bytes,
+  "host: %u\n", HT_CXLFLASH_VERSION_0);
+
+   return bytes;
 }
 
 /**
@@ -3211,12 +3218,124 @@ static int cxlflash_chr_open(struct inode *inode, struct file *file)
return 0;
 }
 
+/**
+ * decode_hioctl() - translates encoded host ioctl to easily identifiable string
+ * @cmd:   The host ioctl command to decode.
+ *
+ * Return: A string identifying the decoded host ioctl.
+ */
+static char *decode_hioctl(int cmd)
+{
+   switch (cmd) {
+   default:
+   return "UNKNOWN";
+   }
+
+   return "UNKNOWN";
+}
+
+/**
+ * cxlflash_chr_ioctl() - character device IOCTL handler
+ * @file:  File pointer for this device.
+ * @cmd:   IOCTL command.
+ * @arg:   Userspace ioctl data structure.
+ *
+ * A read/write semaphore is used to implement a 'drain' of currently
+ * running ioctls. The read semaphore is taken at the beginning of each
+ * ioctl thread and released upon concluding execution. Additionally the
+ * semaphore should be released and then reacquired in any ioctl execution
+ * path which will wait for an event to occur that is outside the scope of
+ * the ioctl (i.e. an adapter reset). To drain the ioctls currently running,
+ * a thread simply needs to acquire the write semaphore.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static long cxlflash_chr_ioctl(struct file *file, unsigned int cmd,
+  unsigned long arg)
+{
+   typedef int (*hioctl) (struct cxlflash_cfg *, void *);
+
+   struct cxlflash_cfg *cfg = file->private_data;
+

[PATCH 10/17] cxlflash: Separate AFU internal command handling from AFU sync specifics

2017-06-21 Thread Uma Krishnan
From: "Matthew R. Ochs" 

To date the only supported internal AFU command is AFU sync. The logic
to send an internal AFU command is embedded in the specific AFU sync
handler and would need to be duplicated for new internal AFU commands.

In order to support new internal AFU commands, separate code that is
common for AFU internal commands into a generic transmission routine
and support passing back command status through an IOASA structure.
The first user of this new routine is the existing AFU sync command.
As a cleanup, use a descriptive name for the AFU sync command instead
of a magic number.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c| 79 ++---
 drivers/scsi/cxlflash/sislite.h |  2 ++
 2 files changed, 53 insertions(+), 28 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 0656dd2..7732dfc 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -2212,28 +2212,22 @@ static void cxlflash_schedule_async_reset(struct cxlflash_cfg *cfg)
 }
 
 /**
- * cxlflash_afu_sync() - builds and sends an AFU sync command
+ * send_afu_cmd() - builds and sends an internal AFU command
  * @afu:   AFU associated with the host.
- * @ctx_hndl_u:Identifies context requesting sync.
- * @res_hndl_u:Identifies resource requesting sync.
- * @mode:  Type of sync to issue (lightweight, heavyweight, global).
+ * @rcb:   Pre-populated IOARCB describing command to send.
  *
- * The AFU can only take 1 sync command at a time. This routine enforces this
- * limitation by using a mutex to provide exclusive access to the AFU during
- * the sync. This design point requires calling threads to not be on interrupt
- * context due to the possibility of sleeping during concurrent sync operations.
+ * The AFU can only take one internal AFU command at a time. This limitation is
+ * enforced by using a mutex to provide exclusive access to the AFU during the
+ * operation. This design point requires calling threads to not be on interrupt
+ * context due to the possibility of sleeping during concurrent AFU operations.
  *
- * AFU sync operations are only necessary and allowed when the device is
- * operating normally. When not operating normally, sync requests can occur as
- * part of cleaning up resources associated with an adapter prior to removal.
- * In this scenario, these requests are simply ignored (safe due to the AFU
- * going away).
+ * The command status is optionally passed back to the caller when the caller
+ * populates the IOASA field of the IOARCB with a pointer to an IOASA structure.
  *
  * Return:
  * 0 on success, -errno on failure
  */
-int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
- res_hndl_t res_hndl_u, u8 mode)
+static int send_afu_cmd(struct afu *afu, struct sisl_ioarcb *rcb)
 {
struct cxlflash_cfg *cfg = afu->parent;
struct device *dev = &cfg->dev->dev;
@@ -2263,25 +2257,15 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
 
 retry:
memset(cmd, 0, sizeof(*cmd));
+   memcpy(&cmd->rcb, rcb, sizeof(*rcb));
INIT_LIST_HEAD(&cmd->queue);
init_completion(&cmd->cevent);
cmd->parent = afu;
cmd->hwq_index = hwq->index;
-
-   dev_dbg(dev, "%s: afu=%p cmd=%p ctx=%d nretry=%d\n",
-   __func__, afu, cmd, ctx_hndl_u, nretry);
-
-   cmd->rcb.req_flags = SISL_REQ_FLAGS_AFU_CMD;
cmd->rcb.ctx_id = hwq->ctx_hndl;
-   cmd->rcb.msi = SISL_MSI_RRQ_UPDATED;
-   cmd->rcb.timeout = MC_AFU_SYNC_TIMEOUT;
-
-   cmd->rcb.cdb[0] = 0xC0; /* AFU Sync */
-   cmd->rcb.cdb[1] = mode;
 
-   /* The cdb is aligned, no unaligned accessors required */
-   *((__be16 *)&cmd->rcb.cdb[2]) = cpu_to_be16(ctx_hndl_u);
-   *((__be32 *)&cmd->rcb.cdb[4]) = cpu_to_be32(res_hndl_u);
+   dev_dbg(dev, "%s: afu=%p cmd=%p type=%02x nretry=%d\n",
+   __func__, afu, cmd, cmd->rcb.cdb[0], nretry);
 
rc = afu->send_cmd(afu, cmd);
if (unlikely(rc)) {
@@ -2306,6 +2290,8 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
break;
}
 
+   if (rcb->ioasa)
+   *rcb->ioasa = cmd->sa;
 out:
atomic_dec(&afu->cmds_active);
mutex_unlock(&sync_active);
@@ -2315,6 +2301,43 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
 }
 
 /**
+ * cxlflash_afu_sync() - builds and sends an AFU sync command
+ * @afu:   AFU associated with the host.
+ * @ctx:   Identifies context requesting sync.
+ * @res:   Identifies resource requesting sync.
+ * @mode:  Type of sync to issue (lightweight, heavyweight, global).
+ *
+ * AFU sync operations are only necessary and allowed when the device is
+ * operating normally. When not operating normally, sync requests can occur as
+ * part of 

[PATCH 09/17] cxlflash: Create character device to provide host management interface

2017-06-21 Thread Uma Krishnan
The cxlflash driver currently lacks a host management interface. Future
devices supported by cxlflash will provide a variety of host-wide
management functions. Examples include LUN provisioning, hardware debug
support, and firmware download.

In order to provide a way to manage the device, a character device will
be created during probe of each adapter. This device will support a set of
ioctls defined in the SISLite specification from which administrators can
manage the adapter.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |   6 +-
 drivers/scsi/cxlflash/main.c   | 207 -
 drivers/scsi/cxlflash/main.h   |   1 +
 3 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 11a5b0a..cbc0eb7 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -16,6 +16,7 @@
 #define _CXLFLASH_COMMON_H
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -86,7 +87,8 @@ enum cxlflash_init_state {
INIT_STATE_NONE,
INIT_STATE_PCI,
INIT_STATE_AFU,
-   INIT_STATE_SCSI
+   INIT_STATE_SCSI,
+   INIT_STATE_CDEV
 };
 
 enum cxlflash_state {
@@ -116,6 +118,8 @@ struct cxlflash_cfg {
struct pci_device_id *dev_id;
struct Scsi_Host *host;
int num_fc_ports;
+   struct cdev cdev;
+   struct device *chardev;
 
ulong cxlflash_regs_pci;
 
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index ceb247b..0656dd2 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -34,6 +34,10 @@ MODULE_AUTHOR("Manoj N. Kumar ");
 MODULE_AUTHOR("Matthew R. Ochs ");
 MODULE_LICENSE("GPL");
 
+static struct class *cxlflash_class;
+static u32 cxlflash_major;
+static DECLARE_BITMAP(cxlflash_minor, CXLFLASH_MAX_ADAPTERS);
+
 /**
  * process_cmd_err() - command error handler
  * @cmd:   AFU command that experienced the error.
@@ -863,6 +867,47 @@ static void notify_shutdown(struct cxlflash_cfg *cfg, bool wait)
 }
 
 /**
+ * cxlflash_get_minor() - gets the first available minor number
+ *
+ * Return: Unique minor number that can be used to create the character device.
+ */
+static int cxlflash_get_minor(void)
+{
+   int minor;
+   long bit;
+
+   bit = find_first_zero_bit(cxlflash_minor, CXLFLASH_MAX_ADAPTERS);
+   if (bit >= CXLFLASH_MAX_ADAPTERS)
+   return -1;
+
+   minor = bit & MINORMASK;
+   set_bit(minor, cxlflash_minor);
+   return minor;
+}
+
+/**
+ * cxlflash_put_minor() - releases the minor number
+ * @minor: Minor number that is no longer needed.
+ */
+static void cxlflash_put_minor(int minor)
+{
+   clear_bit(minor, cxlflash_minor);
+}
+
+/**
+ * cxlflash_release_chrdev() - release the character device for the host
+ * @cfg:   Internal structure associated with the host.
+ */
+static void cxlflash_release_chrdev(struct cxlflash_cfg *cfg)
+{
+   put_device(cfg->chardev);
+   device_unregister(cfg->chardev);
+   cfg->chardev = NULL;
+   cdev_del(&cfg->cdev);
+   cxlflash_put_minor(MINOR(cfg->cdev.dev));
+}
+
+/**
  * cxlflash_remove() - PCI entry point to tear down host
  * @pdev:  PCI device associated with the host.
  *
@@ -897,6 +942,8 @@ static void cxlflash_remove(struct pci_dev *pdev)
cxlflash_stop_term_user_contexts(cfg);
 
switch (cfg->init_state) {
+   case INIT_STATE_CDEV:
+   cxlflash_release_chrdev(cfg);
case INIT_STATE_SCSI:
cxlflash_term_local_luns(cfg);
scsi_remove_host(cfg->host);
@@ -3120,6 +3167,86 @@ static void cxlflash_worker_thread(struct work_struct *work)
 }
 
 /**
+ * cxlflash_chr_open() - character device open handler
+ * @inode: Device inode associated with this character device.
+ * @file:  File pointer for this device.
+ *
+ * Only users with admin privileges are allowed to open the character device.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int cxlflash_chr_open(struct inode *inode, struct file *file)
+{
+   struct cxlflash_cfg *cfg;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+
+   cfg = container_of(inode->i_cdev, struct cxlflash_cfg, cdev);
+   file->private_data = cfg;
+
+   return 0;
+}
+
+/*
+ * Character device file operations
+ */
+static const struct file_operations cxlflash_chr_fops = {
+   .owner  = THIS_MODULE,
+   .open   = cxlflash_chr_open,
+};
+
+/**
+ * init_chrdev() - initialize the character device for the host
+ * @cfg:   Internal structure associated with the host.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int init_chrdev(struct cxlflash_cfg *cfg)
+{
+   struct device *dev = &cfg->dev->dev;
+   struct device *char_dev;
+   dev_t devno;
+   int minor;
+   int rc 

[PATCH 08/17] cxlflash: Add scsi command abort handler

2017-06-21 Thread Uma Krishnan
To date, CXL flash devices do not support a single command abort operation.
Instead, the SISLite specification provides a context reset operation to
cleanup all pending commands for a given context.

When a context reset is successful, it is guaranteed that the AFU has
aborted all currently pending I/O. This sequence is less invasive than a
device or host reset and can be executed to support scsi command abort
requests. Add eh_abort_handler callback support to process command timeouts
and abort requests.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 61 
 1 file changed, 61 insertions(+)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 0a3de42..ceb247b 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -228,6 +228,10 @@ static void flush_pending_cmds(struct hwq *hwq)
  * @hwq:   Hardware queue owning the context to be reset.
  * @reset_reg: MMIO register to perform reset.
  *
+ * When the reset is successful, the SISLite specification guarantees that
+ * the AFU has aborted all currently pending I/O. Accordingly, these commands
+ * must be flushed.
+ *
  * Return: 0 on success, -errno on failure
  */
 static int context_reset(struct hwq *hwq, __be64 __iomem *reset_reg)
@@ -237,9 +241,12 @@ static int context_reset(struct hwq *hwq, __be64 __iomem *reset_reg)
int rc = -ETIMEDOUT;
int nretry = 0;
u64 val = 0x1;
+   ulong lock_flags;
 
dev_dbg(dev, "%s: hwq=%p\n", __func__, hwq);
 
+   spin_lock_irqsave(&hwq->hsq_slock, lock_flags);
+
writeq_be(val, reset_reg);
do {
val = readq_be(reset_reg);
@@ -252,6 +259,11 @@ static int context_reset(struct hwq *hwq, __be64 __iomem *reset_reg)
udelay(1 << nretry);
} while (nretry++ < MC_ROOM_RETRY_CNT);
 
+   if (!rc)
+   flush_pending_cmds(hwq);
+
+   spin_unlock_irqrestore(&hwq->hsq_slock, lock_flags);
+
dev_dbg(dev, "%s: returning rc=%d, val=%016llx nretry=%d\n",
__func__, rc, val, nretry);
return rc;
@@ -2256,6 +2268,54 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
 }
 
 /**
+ * cxlflash_eh_abort_handler() - abort a SCSI command
+ * @scp:   SCSI command to abort.
+ *
+ * CXL Flash devices do not support a single command abort. Reset the context
+ * as per SISLite specification. Flush any pending commands in the hardware
+ * queue before the reset.
+ *
+ * Return: SUCCESS/FAILED as defined in scsi/scsi.h
+ */
+static int cxlflash_eh_abort_handler(struct scsi_cmnd *scp)
+{
+   int rc = FAILED;
+   struct Scsi_Host *host = scp->device->host;
+   struct cxlflash_cfg *cfg = shost_priv(host);
+   struct afu_cmd *cmd = sc_to_afuc(scp);
+   struct device *dev = &cfg->dev->dev;
+   struct afu *afu = cfg->afu;
+   struct hwq *hwq = get_hwq(afu, cmd->hwq_index);
+
+   dev_dbg(dev, "%s: (scp=%p) %d/%d/%d/%llu "
+   "cdb=(%08x-%08x-%08x-%08x)\n", __func__, scp, host->host_no,
+   scp->device->channel, scp->device->id, scp->device->lun,
+   get_unaligned_be32(&((u32 *)scp->cmnd)[0]),
+   get_unaligned_be32(&((u32 *)scp->cmnd)[1]),
+   get_unaligned_be32(&((u32 *)scp->cmnd)[2]),
+   get_unaligned_be32(&((u32 *)scp->cmnd)[3]));
+
+   /* When the state is not normal, another reset/reload is in progress.
+* Return failed and the mid-layer will invoke host reset handler.
+*/
+   if (cfg->state != STATE_NORMAL) {
+   dev_dbg(dev, "%s: Invalid state for abort, state=%d\n",
+   __func__, cfg->state);
+   goto out;
+   }
+
+   rc = afu->context_reset(hwq);
+   if (unlikely(rc))
+   goto out;
+
+   rc = SUCCESS;
+
+out:
+   dev_dbg(dev, "%s: returning rc=%d\n", __func__, rc);
+   return rc;
+}
+
+/**
  * cxlflash_eh_device_reset_handler() - reset a single LUN
  * @scp:   SCSI command to send.
  *
@@ -2969,6 +3029,7 @@ static struct scsi_host_template driver_template = {
.ioctl = cxlflash_ioctl,
.proc_name = CXLFLASH_NAME,
.queuecommand = cxlflash_queuecommand,
+   .eh_abort_handler = cxlflash_eh_abort_handler,
.eh_device_reset_handler = cxlflash_eh_device_reset_handler,
.eh_host_reset_handler = cxlflash_eh_host_reset_handler,
.change_queue_depth = cxlflash_change_queue_depth,
-- 
2.1.0



[PATCH 07/17] cxlflash: Flush pending commands in cleanup path

2017-06-21 Thread Uma Krishnan
When the AFU is reset in an error path, pending scsi commands can be
silently dropped without completion or a formal abort. This puts the onus
on the cxlflash driver to notify the mid-layer that the commands
can be retried.

Once the card has been quiesced, the hardware send queue lock is acquired
to prevent any data movement while the pending commands are processed.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |  5 +++-
 drivers/scsi/cxlflash/main.c   | 57 +++---
 2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 3eaa3be..11a5b0a 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -157,7 +157,9 @@ struct afu_cmd {
struct list_head queue;
u32 hwq_index;
 
-   u8 cmd_tmf:1;
+   u8 cmd_tmf:1,
+  cmd_aborted:1;
+
struct list_head list;  /* Pending commands link */
 
/* As per the SISLITE spec the IOARCB EA has to be 16-byte aligned.
@@ -176,6 +178,7 @@ static inline struct afu_cmd *sc_to_afucz(struct scsi_cmnd *sc)
struct afu_cmd *afuc = sc_to_afuc(sc);
 
memset(afuc, 0, sizeof(*afuc));
+   INIT_LIST_HEAD(&afuc->queue);
return afuc;
 }
 
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 1446fab..0a3de42 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -194,6 +194,36 @@ static void cmd_complete(struct afu_cmd *cmd)
 }
 
 /**
+ * flush_pending_cmds() - flush all pending commands on this hardware queue
+ * @hwq:   Hardware queue to flush.
+ *
+ * The hardware send queue lock associated with this hardware queue must be
+ * held when calling this routine.
+ */
+static void flush_pending_cmds(struct hwq *hwq)
+{
+   struct afu_cmd *cmd, *tmp;
+   struct scsi_cmnd *scp;
+
+   list_for_each_entry_safe(cmd, tmp, &hwq->pending_cmds, list) {
+   /* Bypass command when on a doneq, cmd_complete() will handle */
+   if (!list_empty(&cmd->queue))
+   continue;
+
+   list_del(&cmd->list);
+
+   if (cmd->scp) {
+   scp = cmd->scp;
+   scp->result = (DID_IMM_RETRY << 16);
+   scp->scsi_done(scp);
+   } else {
+   cmd->cmd_aborted = true;
+   complete(&cmd->cevent);
+   }
+   }
+}
+
+/**
  * context_reset() - reset context via specified register
  * @hwq:   Hardware queue owning the context to be reset.
  * @reset_reg: MMIO register to perform reset.
@@ -357,6 +387,9 @@ static int wait_resp(struct afu *afu, struct afu_cmd *cmd)
if (!timeout)
rc = -ETIMEDOUT;
 
+   if (cmd->cmd_aborted)
+   rc = -EAGAIN;
+
if (unlikely(cmd->sa.ioasc != 0)) {
dev_err(dev, "%s: cmd %02x failed, ioasc=%08x\n",
__func__, cmd->rcb.cdb[0], cmd->sa.ioasc);
@@ -702,6 +735,7 @@ static void term_mc(struct cxlflash_cfg *cfg, u32 index)
struct afu *afu = cfg->afu;
struct device *dev = &cfg->dev->dev;
struct hwq *hwq;
+   ulong lock_flags;
 
if (!afu) {
dev_err(dev, "%s: returning with NULL afu\n", __func__);
@@ -719,6 +753,10 @@ static void term_mc(struct cxlflash_cfg *cfg, u32 index)
if (index != PRIMARY_HWQ)
WARN_ON(cxl_release_context(hwq->ctx));
hwq->ctx = NULL;
+
+   spin_lock_irqsave(&hwq->hsq_slock, lock_flags);
+   flush_pending_cmds(hwq);
+   spin_unlock_irqrestore(&hwq->hsq_slock, lock_flags);
 }
 
 /**
@@ -2155,7 +2193,7 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
 
mutex_lock(&sync_active);
atomic_inc(&afu->cmds_active);
-   buf = kzalloc(sizeof(*cmd) + __alignof__(*cmd) - 1, GFP_KERNEL);
+   buf = kmalloc(sizeof(*cmd) + __alignof__(*cmd) - 1, GFP_KERNEL);
if (unlikely(!buf)) {
dev_err(dev, "%s: no memory for command\n", __func__);
rc = -ENOMEM;
@@ -2165,6 +2203,8 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
cmd = (struct afu_cmd *)PTR_ALIGN(buf, __alignof__(*cmd));
 
 retry:
+   memset(cmd, 0, sizeof(*cmd));
+   INIT_LIST_HEAD(&cmd->queue);
init_completion(&cmd->cevent);
cmd->parent = afu;
cmd->hwq_index = hwq->index;
@@ -2191,11 +2231,20 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
}
 
rc = wait_resp(afu, cmd);
-   if (rc == -ETIMEDOUT) {
+   switch (rc) {
+   case -ETIMEDOUT:
rc = afu->context_reset(hwq);
-   if (!rc && ++nretry < 2)
+   if (rc) {
+   cxlflash_schedule_async_reset(cfg);
+   break;
+   }
+   /* fall through to retry */
+   case -EAGAIN:
+   

[PATCH 06/17] cxlflash: Track pending scsi commands in each hardware queue

2017-06-21 Thread Uma Krishnan
Currently, there is no bookkeeping of the pending scsi commands in the
cxlflash driver. This lack of tracking of in-flight requests is too
restrictive and requires a heavy-hammer reset each time an adapter error is
encountered. Additionally, it does not allow for commands to be properly
retried.

In order to avoid this problem and to better handle error path command
cleanup, introduce a linked list for each hardware queue that tracks
pending commands.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h | 2 ++
 drivers/scsi/cxlflash/main.c   | 9 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index e9b6108..3eaa3be 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -158,6 +158,7 @@ struct afu_cmd {
u32 hwq_index;
 
u8 cmd_tmf:1;
+   struct list_head list;  /* Pending commands link */
 
/* As per the SISLITE spec the IOARCB EA has to be 16-byte aligned.
 * However for performance reasons the IOARCB/IOASA should be
@@ -193,6 +194,7 @@ struct hwq {
struct sisl_ctrl_map __iomem *ctrl_map; /* MC control map */
ctx_hndl_t ctx_hndl;/* master's context handle */
u32 index;  /* Index of this hwq */
+   struct list_head pending_cmds;  /* Commands pending completion */
 
atomic_t hsq_credits;
spinlock_t hsq_slock;   /* Hardware send queue lock */
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 20c2c5e..1446fab 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -162,8 +162,13 @@ static void cmd_complete(struct afu_cmd *cmd)
struct afu *afu = cmd->parent;
struct cxlflash_cfg *cfg = afu->parent;
struct device *dev = &cfg->dev->dev;
+   struct hwq *hwq = get_hwq(afu, cmd->hwq_index);
bool cmd_is_tmf;
+   ulong lock_flags;
 
+   spin_lock_irqsave(&hwq->hsq_slock, lock_flags);
+   list_del(&cmd->list);
+   spin_unlock_irqrestore(&hwq->hsq_slock, lock_flags);
+
if (cmd->scp) {
scp = cmd->scp;
if (unlikely(cmd->sa.ioasc))
@@ -279,6 +284,7 @@ static int send_cmd_ioarrin(struct afu *afu, struct afu_cmd *cmd)
hwq->room = room - 1;
}
 
+   list_add(&cmd->list, &hwq->pending_cmds);
writeq_be((u64)&cmd->rcb, &hwq->host_map->ioarrin);
 out:
spin_unlock_irqrestore(>hsq_slock, lock_flags);
@@ -319,6 +325,8 @@ static int send_cmd_sq(struct afu *afu, struct afu_cmd *cmd)
hwq->hsq_curr++;
else
hwq->hsq_curr = hwq->hsq_start;
+
+   list_add(&cmd->list, &hwq->pending_cmds);
writeq_be((u64)hwq->hsq_curr, &hwq->host_map->sq_tail);
 
spin_unlock_irqrestore(&hwq->hsq_slock, lock_flags);
@@ -1840,6 +1848,7 @@ static int init_mc(struct cxlflash_cfg *cfg, u32 index)
 
hwq->afu = cfg->afu;
hwq->index = index;
+   INIT_LIST_HEAD(&hwq->pending_cmds);
 
if (index == PRIMARY_HWQ)
ctx = cxl_get_context(cfg->dev);
-- 
2.1.0



[PATCH 05/17] cxlflash: Handle AFU sync failures

2017-06-21 Thread Uma Krishnan
AFU sync operations are not currently evaluated for failure. This is
acceptable for paths where there is not a dependency on the AFU being
consistent with the host. Examples include link reset events and LUN
cleanup operations. On paths where there is a dependency, such as a LUN
open, a sync failure should be acted upon.

In the event of AFU sync failures, either log or cleanup as appropriate for
operations that are dependent on a successful sync completion.

Update documentation to reflect behavior in the event of an AFU sync
failure.

Signed-off-by: Uma Krishnan 
---
 Documentation/powerpc/cxlflash.txt | 12 ++
 drivers/scsi/cxlflash/superpipe.c  | 34 +--
 drivers/scsi/cxlflash/vlun.c   | 88 +++---
 3 files changed, 107 insertions(+), 27 deletions(-)

diff --git a/Documentation/powerpc/cxlflash.txt 
b/Documentation/powerpc/cxlflash.txt
index 66b4496..f9036cb 100644
--- a/Documentation/powerpc/cxlflash.txt
+++ b/Documentation/powerpc/cxlflash.txt
@@ -257,6 +257,12 @@ DK_CXLFLASH_VLUN_RESIZE
 operating in the virtual mode and used to program a LUN translation
 table that the AFU references when provided with a resource handle.
 
+This ioctl can return -EAGAIN if an AFU sync operation takes too long.
+In addition to returning a failure to user, cxlflash will also schedule
+an asynchronous AFU reset. Should the user choose to retry the operation,
+it is expected to succeed. If this ioctl fails with -EAGAIN, the user
+can either retry the operation or treat it as a failure.
+
 DK_CXLFLASH_RELEASE
 ---
 This ioctl is responsible for releasing a previously obtained
@@ -309,6 +315,12 @@ DK_CXLFLASH_VLUN_CLONE
 clone. This is to avoid a stale entry in the file descriptor table of the
 child process.
 
+This ioctl can return -EAGAIN if an AFU sync operation takes too long.
+In addition to returning a failure to user, cxlflash will also schedule
+an asynchronous AFU reset. Should the user choose to retry the operation,
+it is expected to succeed. If this ioctl fails with -EAGAIN, the user
+can either retry the operation or treat it as a failure.
+
 DK_CXLFLASH_VERIFY
 --
 This ioctl is used to detect various changes such as the capacity of
diff --git a/drivers/scsi/cxlflash/superpipe.c 
b/drivers/scsi/cxlflash/superpipe.c
index fe9f17a..ad0f996 100644
--- a/drivers/scsi/cxlflash/superpipe.c
+++ b/drivers/scsi/cxlflash/superpipe.c
@@ -57,6 +57,19 @@ static void marshal_det_to_rele(struct dk_cxlflash_detach 
*detach,
 }
 
 /**
+ * marshal_udir_to_rele() - translate udirect to release structure
+ * @udirect:   Source structure from which to translate/copy.
+ * @release:   Destination structure for the translate/copy.
+ */
+static void marshal_udir_to_rele(struct dk_cxlflash_udirect *udirect,
+struct dk_cxlflash_release *release)
+{
+   release->hdr = udirect->hdr;
+   release->context_id = udirect->context_id;
+   release->rsrc_handle = udirect->rsrc_handle;
+}
+
+/**
  * cxlflash_free_errpage() - frees resources associated with global error page
  */
 void cxlflash_free_errpage(void)
@@ -622,6 +635,7 @@ int _cxlflash_disk_release(struct scsi_device *sdev,
res_hndl_t rhndl = release->rsrc_handle;
 
int rc = 0;
+   int rcr = 0;
u64 ctxid = DECODE_CTXID(release->context_id),
rctxid = release->context_id;
 
@@ -686,8 +700,12 @@ int _cxlflash_disk_release(struct scsi_device *sdev,
rhte_f1->dw = 0;
dma_wmb(); /* Make RHT entry bottom-half clearing visible */
 
-   if (!ctxi->err_recovery_active)
-   cxlflash_afu_sync(afu, ctxid, rhndl, AFU_HW_SYNC);
+   if (!ctxi->err_recovery_active) {
+   rcr = cxlflash_afu_sync(afu, ctxid, rhndl, AFU_HW_SYNC);
+   if (unlikely(rcr))
+   dev_dbg(dev, "%s: AFU sync failed rc=%d\n",
+   __func__, rcr);
+   }
break;
default:
WARN(1, "Unsupported LUN mode!");
@@ -1929,6 +1947,7 @@ static int cxlflash_disk_direct_open(struct scsi_device 
*sdev, void *arg)
struct afu *afu = cfg->afu;
struct llun_info *lli = sdev->hostdata;
struct glun_info *gli = lli->parent;
+   struct dk_cxlflash_release rel = { { 0 }, 0 };
 
struct dk_cxlflash_udirect *pphys = (struct dk_cxlflash_udirect *)arg;
 
@@ -1970,13 +1989,18 @@ static int cxlflash_disk_direct_open(struct scsi_device 
*sdev, void *arg)
rsrc_handle = (rhte - ctxi->rht_start);
 
rht_format1(rhte, lli->lun_id[sdev->channel], ctxi->rht_perms, port);
-   cxlflash_afu_sync(afu, ctxid, rsrc_handle, AFU_LW_SYNC);
 
last_lba = gli->max_lba;
pphys->hdr.return_flags = 0;
pphys->last_lba = last_lba;
 

[PATCH 04/17] cxlflash: Schedule asynchronous reset of the host

2017-06-21 Thread Uma Krishnan
A context reset failure indicates the AFU is in a bad state. At present,
when such a situation occurs, no further action is taken. This leaves the
adapter in an unusable state with no recoverable actions.

To avoid this situation, context reset failures will be escalated to a host
reset operation. This will be done asynchronously to allow the acting
thread to return to the user with a failure.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |   2 +
 drivers/scsi/cxlflash/main.c   | 137 ++---
 2 files changed, 104 insertions(+), 35 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 75decf6..e9b6108 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -15,6 +15,7 @@
 #ifndef _CXLFLASH_COMMON_H
 #define _CXLFLASH_COMMON_H
 
+#include 
 #include 
 #include 
 #include 
@@ -144,6 +145,7 @@ struct cxlflash_cfg {
bool tmf_active;
wait_queue_head_t reset_waitq;
enum cxlflash_state state;
+   async_cookie_t async_reset_cookie;
 };
 
 struct afu_cmd {
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index b8dc379..20c2c5e 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -586,6 +586,20 @@ static void free_mem(struct cxlflash_cfg *cfg)
 }
 
 /**
+ * cxlflash_reset_sync() - synchronizing point for asynchronous resets
+ * @cfg:   Internal structure associated with the host.
+ */
+static void cxlflash_reset_sync(struct cxlflash_cfg *cfg)
+{
+   if (cfg->async_reset_cookie == 0)
+   return;
+
+   /* Wait until all async calls prior to this cookie have completed */
+   async_synchronize_cookie(cfg->async_reset_cookie + 1);
+   cfg->async_reset_cookie = 0;
+}
+
+/**
  * stop_afu() - stops the AFU command timers and unmaps the MMIO space
  * @cfg:   Internal structure associated with the host.
  *
@@ -601,6 +615,8 @@ static void stop_afu(struct cxlflash_cfg *cfg)
int i;
 
cancel_work_sync(&cfg->work_q);
+   if (!current_is_async())
+   cxlflash_reset_sync(cfg);
 
if (likely(afu)) {
while (atomic_read(&afu->cmds_active))
@@ -2005,6 +2021,91 @@ static int init_afu(struct cxlflash_cfg *cfg)
 }
 
 /**
+ * afu_reset() - resets the AFU
+ * @cfg:   Internal structure associated with the host.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+static int afu_reset(struct cxlflash_cfg *cfg)
+{
+   struct device *dev = &cfg->dev->dev;
+   int rc = 0;
+
+   /* Stop the context before the reset. Since the context is
+* no longer available, restart it after the reset is complete
+*/
+   term_afu(cfg);
+
+   rc = init_afu(cfg);
+
+   dev_dbg(dev, "%s: returning rc=%d\n", __func__, rc);
+   return rc;
+}
+
+/**
+ * drain_ioctls() - wait until all currently executing ioctls have completed
+ * @cfg:   Internal structure associated with the host.
+ *
+ * Obtain write access to read/write semaphore that wraps ioctl
+ * handling to 'drain' ioctls currently executing.
+ */
+static void drain_ioctls(struct cxlflash_cfg *cfg)
+{
+   down_write(&cfg->ioctl_rwsem);
+   up_write(&cfg->ioctl_rwsem);
+}
+
+/**
+ * cxlflash_async_reset_host() - asynchronous host reset handler
+ * @data:  Private data provided while scheduling reset.
+ * @cookie:Cookie that can be used for checkpointing.
+ */
+static void cxlflash_async_reset_host(void *data, async_cookie_t cookie)
+{
+   struct cxlflash_cfg *cfg = data;
+   struct device *dev = &cfg->dev->dev;
+   int rc = 0;
+
+   if (cfg->state != STATE_RESET) {
+   dev_dbg(dev, "%s: Not performing a reset, state=%d\n",
+   __func__, cfg->state);
+   goto out;
+   }
+
+   drain_ioctls(cfg);
+   cxlflash_mark_contexts_error(cfg);
+   rc = afu_reset(cfg);
+   if (rc)
+   cfg->state = STATE_FAILTERM;
+   else
+   cfg->state = STATE_NORMAL;
+   wake_up_all(&cfg->reset_waitq);
+
+out:
+   scsi_unblock_requests(cfg->host);
+}
+
+/**
+ * cxlflash_schedule_async_reset() - schedule an asynchronous host reset
+ * @cfg:   Internal structure associated with the host.
+ */
+static void cxlflash_schedule_async_reset(struct cxlflash_cfg *cfg)
+{
+   struct device *dev = &cfg->dev->dev;
+
+   if (cfg->state != STATE_NORMAL) {
+   dev_dbg(dev, "%s: Not performing reset state=%d\n",
+   __func__, cfg->state);
+   return;
+   }
+
+   cfg->state = STATE_RESET;
+   scsi_block_requests(cfg->host);
+   cfg->async_reset_cookie = async_schedule(cxlflash_async_reset_host,
+cfg);
+}
+
+/**
  * cxlflash_afu_sync() - builds and sends an AFU sync command
  * @afu:   AFU associated with the host.
  * @ctx_hndl_u:Identifies context requesting sync.

[PATCH 03/17] cxlflash: Reset hardware queue context via specified register

2017-06-21 Thread Uma Krishnan
Per the SISLite specification, context_reset() writes 0x1 to the LSB of the
reset register. When the AFU processes this reset request, it is expected
to clear the bit after reset is complete. The current implementation simply
checks that the entire value read back is not 1, instead of masking off the
LSB and evaluating it for a change to 0. Should the AFU manipulate other
bits during the reset (reading back a value of 0xF for example), successful
completion will be prematurely indicated given the existing logic.

Additionally, in the event that the context reset operation fails, there
does not currently exist a way to provide feedback to the initiator of the
reset. This poses a problem for the rare case that a context reset fails as
the caller will proceed on the assumption that all is well.

To remedy these issues, refactor the context reset routine to only mask off
the LSB when evaluating for success and return status to the caller. Also
update the context reset handler parameters to pass a hardware queue
reference instead of a single command to better reflect that the entire
queue associated with the context is impacted by the reset.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h |  2 +-
 drivers/scsi/cxlflash/main.c   | 83 +++---
 2 files changed, 47 insertions(+), 38 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 6fc32cfc..75decf6 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -211,7 +211,7 @@ struct hwq {
 struct afu {
struct hwq hwqs[CXLFLASH_MAX_HWQS];
int (*send_cmd)(struct afu *, struct afu_cmd *);
-   void (*context_reset)(struct afu_cmd *);
+   int (*context_reset)(struct hwq *);
 
/* AFU HW */
struct cxlflash_afu_map __iomem *afu_map;   /* entire MMIO map */
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 815d04b..b8dc379 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -189,55 +189,59 @@ static void cmd_complete(struct afu_cmd *cmd)
 }
 
 /**
- * context_reset() - reset command owner context via specified register
- * @cmd:   AFU command that timed out.
+ * context_reset() - reset context via specified register
+ * @hwq:   Hardware queue owning the context to be reset.
  * @reset_reg: MMIO register to perform reset.
+ *
+ * Return: 0 on success, -errno on failure
  */
-static void context_reset(struct afu_cmd *cmd, __be64 __iomem *reset_reg)
+static int context_reset(struct hwq *hwq, __be64 __iomem *reset_reg)
 {
-   int nretry = 0;
-   u64 rrin = 0x1;
-   struct afu *afu = cmd->parent;
-   struct cxlflash_cfg *cfg = afu->parent;
+   struct cxlflash_cfg *cfg = hwq->afu->parent;
struct device *dev = &cfg->dev->dev;
+   int rc = -ETIMEDOUT;
+   int nretry = 0;
+   u64 val = 0x1;
 
-   dev_dbg(dev, "%s: cmd=%p\n", __func__, cmd);
+   dev_dbg(dev, "%s: hwq=%p\n", __func__, hwq);
 
-   writeq_be(rrin, reset_reg);
+   writeq_be(val, reset_reg);
do {
-   rrin = readq_be(reset_reg);
-   if (rrin != 0x1)
+   val = readq_be(reset_reg);
+   if ((val & 0x1) == 0x0) {
+   rc = 0;
break;
+   }
+
/* Double delay each time */
udelay(1 << nretry);
} while (nretry++ < MC_ROOM_RETRY_CNT);
 
-   dev_dbg(dev, "%s: returning rrin=%016llx nretry=%d\n",
-   __func__, rrin, nretry);
+   dev_dbg(dev, "%s: returning rc=%d, val=%016llx nretry=%d\n",
+   __func__, rc, val, nretry);
+   return rc;
 }
 
 /**
- * context_reset_ioarrin() - reset command owner context via IOARRIN register
- * @cmd:   AFU command that timed out.
+ * context_reset_ioarrin() - reset context via IOARRIN register
+ * @hwq:   Hardware queue owning the context to be reset.
+ *
+ * Return: 0 on success, -errno on failure
  */
-static void context_reset_ioarrin(struct afu_cmd *cmd)
+static int context_reset_ioarrin(struct hwq *hwq)
 {
-   struct afu *afu = cmd->parent;
-   struct hwq *hwq = get_hwq(afu, cmd->hwq_index);
-
-   context_reset(cmd, &hwq->host_map->ioarrin);
+   return context_reset(hwq, &hwq->host_map->ioarrin);
 }
 
 /**
- * context_reset_sq() - reset command owner context w/ SQ Context Reset 
register
- * @cmd:   AFU command that timed out.
+ * context_reset_sq() - reset context via SQ_CONTEXT_RESET register
+ * @hwq:   Hardware queue owning the context to be reset.
+ *
+ * Return: 0 on success, -errno on failure
  */
-static void context_reset_sq(struct afu_cmd *cmd)
+static int context_reset_sq(struct hwq *hwq)
 {
-   struct afu *afu = cmd->parent;
-   struct hwq *hwq = get_hwq(afu, cmd->hwq_index);
-
-   context_reset(cmd, &hwq->host_map->sq_ctx_reset);
+   return context_reset(hwq, &hwq->host_map->sq_ctx_reset);

[PATCH 02/17] cxlflash: Update cxlflash_afu_sync() to return errno

2017-06-21 Thread Uma Krishnan
The cxlflash_afu_sync() routine returns a negative one to indicate any kind
of failure. This makes it impossible to establish why the error occurred.

Update the return codes to clearly indicate the failure cause to the
caller.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 64ea597ca..815d04b 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -2022,8 +2022,7 @@ static int init_afu(struct cxlflash_cfg *cfg)
  * going away).
  *
  * Return:
- * 0 on success
- * -1 on failure
+ * 0 on success, -errno on failure
  */
 int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t ctx_hndl_u,
  res_hndl_t res_hndl_u, u8 mode)
@@ -2047,7 +2046,7 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t 
ctx_hndl_u,
buf = kzalloc(sizeof(*cmd) + __alignof__(*cmd) - 1, GFP_KERNEL);
if (unlikely(!buf)) {
dev_err(dev, "%s: no memory for command\n", __func__);
-   rc = -1;
+   rc = -ENOMEM;
goto out;
}
 
@@ -2071,12 +2070,14 @@ int cxlflash_afu_sync(struct afu *afu, ctx_hndl_t 
ctx_hndl_u,
*((__be32 *)&cmd->rcb.cdb[4]) = cpu_to_be32(res_hndl_u);
 
rc = afu->send_cmd(afu, cmd);
-   if (unlikely(rc))
+   if (unlikely(rc)) {
+   rc = -ENOBUFS;
goto out;
+   }
 
rc = wait_resp(afu, cmd);
if (unlikely(rc))
-   rc = -1;
+   rc = -EIO;
 out:
atomic_dec(&afu->cmds_active);
mutex_unlock(&sync_active);
-- 
2.1.0



[PATCH 01/17] cxlflash: Combine the send queue locks

2017-06-21 Thread Uma Krishnan
Currently there are separate spin locks for the two supported I/O queueing
models. This makes it difficult to serialize with paths outside the enqueue
path.

As a design simplification and to support serialization with enqueue
operations, move to only a single lock that is used for enqueueing
regardless of the queueing model.

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/common.h | 3 +--
 drivers/scsi/cxlflash/main.c   | 9 +
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/cxlflash/common.h b/drivers/scsi/cxlflash/common.h
index 256af81..6fc32cfc 100644
--- a/drivers/scsi/cxlflash/common.h
+++ b/drivers/scsi/cxlflash/common.h
@@ -193,7 +193,7 @@ struct hwq {
u32 index;  /* Index of this hwq */
 
atomic_t hsq_credits;
-   spinlock_t hsq_slock;
+   spinlock_t hsq_slock;   /* Hardware send queue lock */
struct sisl_ioarcb *hsq_start;
struct sisl_ioarcb *hsq_end;
struct sisl_ioarcb *hsq_curr;
@@ -204,7 +204,6 @@ struct hwq {
bool toggle;
 
s64 room;
-   spinlock_t rrin_slock; /* Lock to rrin queuing and cmd_room updates */
 
struct irq_poll irqpoll;
 } __aligned(cache_line_size());
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index a7d57c3..64ea597ca 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -261,7 +261,7 @@ static int send_cmd_ioarrin(struct afu *afu, struct afu_cmd 
*cmd)
 * To avoid the performance penalty of MMIO, spread the update of
 * 'room' over multiple commands.
 */
-   spin_lock_irqsave(&hwq->rrin_slock, lock_flags);
+   spin_lock_irqsave(&hwq->hsq_slock, lock_flags);
if (--hwq->room < 0) {
room = readq_be(&hwq->host_map->cmd_room);
if (room <= 0) {
@@ -277,7 +277,7 @@ static int send_cmd_ioarrin(struct afu *afu, struct afu_cmd 
*cmd)
 
writeq_be((u64)&cmd->rcb, &hwq->host_map->ioarrin);
 out:
-   spin_unlock_irqrestore(&hwq->rrin_slock, lock_flags);
+   spin_unlock_irqrestore(&hwq->hsq_slock, lock_flags);
dev_dbg(dev, "%s: cmd=%p len=%u ea=%016llx rc=%d\n", __func__,
cmd, cmd->rcb.data_len, cmd->rcb.data_ea, rc);
return rc;
@@ -1722,7 +1722,10 @@ static int start_afu(struct cxlflash_cfg *cfg)
hwq->hrrq_end = &hwq->rrq_entry[NUM_RRQ_ENTRY - 1];
hwq->hrrq_curr = hwq->hrrq_start;
hwq->toggle = 1;
+
+   /* Initialize spin locks */
spin_lock_init(&hwq->hrrq_slock);
+   spin_lock_init(&hwq->hsq_slock);
 
/* Initialize SQ */
if (afu_is_sq_cmd_mode(afu)) {
@@ -1731,7 +1734,6 @@ static int start_afu(struct cxlflash_cfg *cfg)
hwq->hsq_end = &hwq->sq[NUM_SQ_ENTRY - 1];
hwq->hsq_curr = hwq->hsq_start;
 
-   spin_lock_init(&hwq->hsq_slock);
atomic_set(&hwq->hsq_credits, NUM_SQ_ENTRY - 1);
}
 
@@ -1984,7 +1986,6 @@ static int init_afu(struct cxlflash_cfg *cfg)
for (i = 0; i < afu->num_hwqs; i++) {
hwq = get_hwq(afu, i);
 
-   spin_lock_init(&hwq->rrin_slock);
hwq->room = readq_be(&hwq->host_map->cmd_room);
}
 
-- 
2.1.0



[PATCH 00/17] cxlflash: LUN provisioning support and miscellaneous fixes

2017-06-21 Thread Uma Krishnan
This patch series contains miscellaneous fixes and several enhancements
such as LUN provisioning support, WS16 unmap and AFU debug capabilities.

This series is intended for 4.13 and is bisectable.

Matthew R. Ochs (8):
  cxlflash: Separate AFU internal command handling from AFU sync
specifics
  cxlflash: Introduce host ioctl support
  cxlflash: Refactor AFU capability checking
  cxlflash: Support LUN provisioning
  cxlflash: Support AFU debug
  cxlflash: Support WS16 unmap
  cxlflash: Remove zeroing of private command data
  cxlflash: Update TMF command processing

Uma Krishnan (9):
  cxlflash: Combine the send queue locks
  cxlflash: Update cxlflash_afu_sync() to return errno
  cxlflash: Reset hardware queue context via specified register
  cxlflash: Schedule asynchronous reset of the host
  cxlflash: Handle AFU sync failures
  cxlflash: Track pending scsi commands in each hardware queue
  cxlflash: Flush pending commands in cleanup path
  cxlflash: Add scsi command abort handler
  cxlflash: Create character device to provide host management interface

 Documentation/ioctl/ioctl-number.txt |2 +-
 Documentation/powerpc/cxlflash.txt   |   76 ++-
 drivers/scsi/cxlflash/common.h   |   48 +-
 drivers/scsi/cxlflash/main.c | 1008 ++
 drivers/scsi/cxlflash/main.h |7 +
 drivers/scsi/cxlflash/sislite.h  |   27 +-
 drivers/scsi/cxlflash/superpipe.c|   34 +-
 drivers/scsi/cxlflash/vlun.c |   89 ++-
 include/uapi/scsi/cxlflash_ioctl.h   |   85 ++-
 9 files changed, 1217 insertions(+), 159 deletions(-)

-- 
2.1.0



[RFC v3 23/23] procfs: display the protection-key number associated with a vma

2017-06-21 Thread Ram Pai
Display the pkey number associated with the vma in smaps of a task.
The key will be seen as below:

VmFlags: rd wr mr mw me dw ac key=0

Signed-off-by: Ram Pai 
---
 Documentation/filesystems/proc.txt |  3 ++-
 fs/proc/task_mmu.c | 22 +++---
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 4cddbce..a8c74aa 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -423,7 +423,7 @@ SwapPss:   0 kB
 KernelPageSize:4 kB
 MMUPageSize:   4 kB
 Locked:0 kB
-VmFlags: rd ex mr mw me dw
+VmFlags: rd ex mr mw me dw key=
 
 the first of these lines shows the same information as is displayed for the
 mapping in /proc/PID/maps.  The remaining lines show the size of the mapping
@@ -491,6 +491,7 @@ manner. The codes are the following:
 hg  - huge page advise flag
 nh  - no-huge page advise flag
 mg  - mergable advise flag
+key= - the memory protection key number
 
 Note that there is no guarantee that every flag and associated mnemonic will
 be present in all further kernel releases. Things get changed, the flags may
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2ddc298..d2eb096 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1,4 +1,6 @@
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -666,22 +668,20 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
[ilog2(VM_MERGEABLE)]   = "mg",
[ilog2(VM_UFFD_MISSING)]= "um",
[ilog2(VM_UFFD_WP)] = "uw",
-#ifdef CONFIG_ARCH_HAS_PKEYS
-   /* These come out via ProtectionKey: */
-   [ilog2(VM_PKEY_BIT0)]   = "",
-   [ilog2(VM_PKEY_BIT1)]   = "",
-   [ilog2(VM_PKEY_BIT2)]   = "",
-   [ilog2(VM_PKEY_BIT3)]   = "",
-#endif /* CONFIG_ARCH_HAS_PKEYS */
-#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
-   /* Additional bit in ProtectionKey: */
-   [ilog2(VM_PKEY_BIT4)]   = "",
-#endif
};
size_t i;
 
seq_puts(m, "VmFlags: ");
for (i = 0; i < BITS_PER_LONG; i++) {
+#ifdef CONFIG_ARCH_HAS_PKEYS
+   if (i == ilog2(VM_PKEY_BIT0)) {
+   int keyvalue = vma_pkey(vma);
+
+   i += ilog2(arch_max_pkey())-1;
+   seq_printf(m, "key=%d ", keyvalue);
+   continue;
+   }
+#endif /* CONFIG_ARCH_HAS_PKEYS */
if (!mnemonics[i][0])
continue;
if (vma->vm_flags & (1UL << i)) {
-- 
1.8.3.1



[RFC v3 22/23] Documentation: PowerPC specific updates to memory protection keys

2017-06-21 Thread Ram Pai
Add documentation updates that capture PowerPC specific changes.

Signed-off-by: Ram Pai 
---
 Documentation/vm/protection-keys.txt | 65 +---
 1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/Documentation/vm/protection-keys.txt 
b/Documentation/vm/protection-keys.txt
index b643045..965ad75 100644
--- a/Documentation/vm/protection-keys.txt
+++ b/Documentation/vm/protection-keys.txt
@@ -1,21 +1,46 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature found in
+new generations of Intel CPUs and on PowerPC 7 and higher CPUs.
 
 Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
-
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register.  The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs.  These
-permissions are enforced on data access only and have no effect on
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+
+On Intel:
+
+   It works by dedicating 4 previously ignored bits in each page table
+   entry to a "protection key", giving 16 possible keys.
+
+   There is also a new user-accessible register (PKRU) with two separate
+   bits (Access Disable and Write Disable) for each key.  Being a CPU
+   register, PKRU is inherently thread-local, potentially giving each
+   thread a different set of protections from every other thread.
+
+   There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+   to the new register.  The feature is only available in 64-bit mode,
+   even though there is theoretically space in the PAE PTEs.  These
+   permissions are enforced on data access only and have no effect on
+   instruction fetches.
+
+
+On PowerPC:
+
+   It works by dedicating 5 page table entry bits to a "protection key",
+   giving 32 possible keys.
+
+   There is a user-accessible register (AMR) with two separate bits
+   (Access Disable and Write Disable) for each key.  Being a CPU
+   register, AMR is inherently thread-local, potentially giving each
+   thread a different set of protections from every other thread.  NOTE:
+   Disabling read permission does not disable write and vice-versa.
+
+   The feature is available in 64-bit HPTE mode only.
+   'mfspr mem, 0xd' reads the AMR register.
+   'mtspr 0xd, mem' writes into the AMR register.
+
+
+
+Permissions are enforced on data access only and have no effect on
 instruction fetches.
 
 === Syscalls ===
@@ -28,9 +53,9 @@ There are 3 system calls which directly interact with pkeys:
  unsigned long prot, int pkey);
 
 Before a pkey can be used, it must first be allocated with
-pkey_alloc().  An application calls the WRPKRU instruction
+pkey_alloc().  An application calls the WRPKRU/AMR instruction
 directly in order to change access permissions to memory covered
-with a key.  In this example WRPKRU is wrapped by a C function
+with a key.  In this example WRPKRU/AMR is wrapped by a C function
 called pkey_set().
 
int real_prot = PROT_READ|PROT_WRITE;
@@ -52,11 +77,11 @@ is no longer in use:
munmap(ptr, PAGE_SIZE);
pkey_free(pkey);
 
-(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+(Note: pkey_set() is a wrapper for the RDPKRU/WRPKRU instructions or AMR accesses.
  An example implementation can be found in
  tools/testing/selftests/x86/protection_keys.c)
 
-=== Behavior ===
+=== Behavior =
 
 The kernel attempts to make protection keys consistent with the
 behavior of a plain mprotect().  For instance if you do this:
-- 
1.8.3.1



[RFC v3 21/23] Documentation: Move protecton key documentation to arch neutral directory

2017-06-21 Thread Ram Pai
Since PowerPC and Intel both support memory protection keys, move the
documentation to an arch-neutral directory.

Signed-off-by: Ram Pai 
---
 Documentation/vm/protection-keys.txt  | 85 +++
 Documentation/x86/protection-keys.txt | 85 ---
 2 files changed, 85 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/vm/protection-keys.txt
 delete mode 100644 Documentation/x86/protection-keys.txt

diff --git a/Documentation/vm/protection-keys.txt 
b/Documentation/vm/protection-keys.txt
new file mode 100644
index 000..b643045
--- /dev/null
+++ b/Documentation/vm/protection-keys.txt
@@ -0,0 +1,85 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
+which will be found on future Intel CPUs.
+
+Memory Protection Keys provides a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.  It works by
+dedicating 4 previously ignored bits in each page table entry to a
+"protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two separate
+bits (Access Disable and Write Disable) for each key.  Being a CPU
+register, PKRU is inherently thread-local, potentially giving each
+thread a different set of protections from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and writing
+to the new register.  The feature is only available in 64-bit mode,
+even though there is theoretically space in the PAE PTEs.  These
+permissions are enforced on data access only and have no effect on
+instruction fetches.
+
+=== Syscalls ===
+
+There are 3 system calls which directly interact with pkeys:
+
+   int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
+   int pkey_free(int pkey);
+   int pkey_mprotect(unsigned long start, size_t len,
+ unsigned long prot, int pkey);
+
+Before a pkey can be used, it must first be allocated with
+pkey_alloc().  An application calls the WRPKRU instruction
+directly in order to change access permissions to memory covered
+with a key.  In this example WRPKRU is wrapped by a C function
+called pkey_set().
+
+   int real_prot = PROT_READ|PROT_WRITE;
+   pkey = pkey_alloc(0, PKEY_DENY_WRITE);
+   ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 
0);
+   ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
+   ... application runs here
+
+Now, if the application needs to update the data at 'ptr', it can
+gain access, do the update, then remove its write access:
+
+   pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
+   *ptr = foo; // assign something
+   pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
+
+Now when it frees the memory, it will also free the pkey since it
+is no longer in use:
+
+   munmap(ptr, PAGE_SIZE);
+   pkey_free(pkey);
+
+(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
+ An example implementation can be found in
+ tools/testing/selftests/x86/protection_keys.c)
+
+=== Behavior ===
+
+The kernel attempts to make protection keys consistent with the
+behavior of a plain mprotect().  For instance if you do this:
+
+   mprotect(ptr, size, PROT_NONE);
+   something(ptr);
+
+you can expect the same effects with protection keys when doing this:
+
+   pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ);
+   pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey);
+   something(ptr);
+
+That should be true whether something() is a direct access to 'ptr'
+like:
+
+   *ptr = foo;
+
+or when the kernel does the access on the application's behalf like
+with a read():
+
+   read(fd, ptr, 1);
+
+The kernel will send a SIGSEGV in both cases, but si_code will be set
+to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
+the plain mprotect() permissions are violated.
diff --git a/Documentation/x86/protection-keys.txt 
b/Documentation/x86/protection-keys.txt
deleted file mode 100644
index b643045..000
--- a/Documentation/x86/protection-keys.txt
+++ /dev/null
@@ -1,85 +0,0 @@
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature
-which will be found on future Intel CPUs.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of 

[RFC v3 20/23] selftest: PowerPC specific test updates to memory protection keys

2017-06-21 Thread Ram Pai
Abstracted out the arch specific code into the header file, and
added powerpc specific changes.

a) added 4k-backed hpte, memory allocator, powerpc specific.
b) added three test cases where the key is associated after the page is
accessed/allocated/mapped.
c) cleaned up the code to make checkpatch.pl happy

Signed-off-by: Ram Pai 
---
 tools/testing/selftests/vm/pkey-helpers.h| 230 +--
 tools/testing/selftests/vm/protection_keys.c | 562 ---
 2 files changed, 513 insertions(+), 279 deletions(-)

diff --git a/tools/testing/selftests/vm/pkey-helpers.h 
b/tools/testing/selftests/vm/pkey-helpers.h
index b202939..69bfa89 100644
--- a/tools/testing/selftests/vm/pkey-helpers.h
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -12,13 +12,72 @@
 #include 
 #include 
 
-#define NR_PKEYS 16
-#define PKRU_BITS_PER_PKEY 2
+/* Define some kernel-like types */
+#define  u8 uint8_t
+#define u16 uint16_t
+#define u32 uint32_t
+#define u64 uint64_t
+
+#ifdef __i386__ /* arch */
+
+#define SYS_mprotect_key 380
+#define SYS_pkey_alloc  381
+#define SYS_pkey_free   382
+#define REG_IP_IDX REG_EIP
+#define si_pkey_offset 0x14
+
+#define NR_PKEYS   16
+#define NR_RESERVED_PKEYS  1
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS0x1
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<21)
+
+#define INIT_PRKU 0x0UL
+
+#elif __powerpc64__ /* arch */
+
+#define SYS_mprotect_key 386
+#define SYS_pkey_alloc  384
+#define SYS_pkey_free   385
+#define si_pkey_offset 0x20
+#define REG_IP_IDX PT_NIP
+#define REG_TRAPNO PT_TRAP
+#define REG_AMR45
+#define gregs gp_regs
+#define fpregs fp_regs
+
+#define NR_PKEYS   32
+#define NR_RESERVED_PKEYS  3
+#define PKRU_BITS_PER_PKEY 2
+#define PKEY_DISABLE_ACCESS0x3  /* disable read and write */
+#define PKEY_DISABLE_WRITE 0x2
+#define HPAGE_SIZE (1UL<<24)
+
+#define INIT_PRKU 0x3UL
+#else /* arch */
+
+   NOT SUPPORTED
+
+#endif /* arch */
+
 
 #ifndef DEBUG_LEVEL
 #define DEBUG_LEVEL 0
 #endif
 #define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+
+
+static inline u32 pkey_to_shift(int pkey)
+{
+#ifdef __i386__ /* arch */
+   return pkey * PKRU_BITS_PER_PKEY;
+#elif __powerpc64__ /* arch */
+   return (NR_PKEYS - pkey - 1) * PKRU_BITS_PER_PKEY;
+#endif /* arch */
+}
+
+
 extern int dprint_in_signal;
 extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
 static inline void sigsafe_printf(const char *format, ...)
@@ -53,53 +112,76 @@ static inline void sigsafe_printf(const char *format, ...)
 #define dprintf3(args...) dprintf_level(3, args)
 #define dprintf4(args...) dprintf_level(4, args)
 
-extern unsigned int shadow_pkru;
-static inline unsigned int __rdpkru(void)
+extern u64 shadow_pkey_reg;
+
+static inline u64 __rdpkey_reg(void)
 {
+#ifdef __i386__ /* arch */
unsigned int eax, edx;
unsigned int ecx = 0;
-   unsigned int pkru;
+   unsigned int pkey_reg;
 
asm volatile(".byte 0x0f,0x01,0xee\n\t"
 : "=a" (eax), "=d" (edx)
 : "c" (ecx));
-   pkru = eax;
-   return pkru;
+#elif __powerpc64__ /* arch */
+   u64 eax;
+   u64 pkey_reg;
+
+   asm volatile("mfspr %0, 0xd" : "=r" ((u64)(eax)));
+#endif /* arch */
+   pkey_reg = (u64)eax;
+   return pkey_reg;
 }
 
-static inline unsigned int _rdpkru(int line)
+static inline u64 _rdpkey_reg(int line)
 {
-   unsigned int pkru = __rdpkru();
+   u64 pkey_reg = __rdpkey_reg();
 
-   dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
-   line, pkru, shadow_pkru);
-   assert(pkru == shadow_pkru);
+   dprintf4("rdpkey_reg(line=%d) pkey_reg: %lx shadow: %lx\n",
+   line, pkey_reg, shadow_pkey_reg);
+   assert(pkey_reg == shadow_pkey_reg);
 
-   return pkru;
+   return pkey_reg;
 }
 
-#define rdpkru() _rdpkru(__LINE__)
+#define rdpkey_reg() _rdpkey_reg(__LINE__)
 
-static inline void __wrpkru(unsigned int pkru)
+static inline void __wrpkey_reg(u64 pkey_reg)
 {
-   unsigned int eax = pkru;
+#ifdef __i386__ /* arch */
+   unsigned int eax = pkey_reg;
unsigned int ecx = 0;
unsigned int edx = 0;
 
-   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   dprintf4("%s() changing %lx to %lx\n",
+__func__, __rdpkey_reg(), pkey_reg);
asm volatile(".byte 0x0f,0x01,0xef\n\t"
 : : "a" (eax), "c" (ecx), "d" (edx));
-   assert(pkru == __rdpkru());
+   dprintf4("%s() PKRUP after changing %lx to %lx\n",
+   __func__, __rdpkey_reg(), pkey_reg);
+#else /* arch */
+   u64 eax = pkey_reg;
+
+   dprintf4("%s() changing %llx to %llx\n",
+__func__, __rdpkey_reg(), pkey_reg);
+   asm volatile("mtspr 0xd, %0" : : "r" ((unsigned long)(eax)) : "memory");
+   

[RFC v3 19/23] selftest: Move protection key selftest to arch neutral directory

2017-06-21 Thread Ram Pai
Signed-off-by: Ram Pai 
---
 tools/testing/selftests/vm/Makefile   |1 +
 tools/testing/selftests/vm/pkey-helpers.h |  219 
 tools/testing/selftests/vm/protection_keys.c  | 1395 +
 tools/testing/selftests/x86/Makefile  |2 +-
 tools/testing/selftests/x86/pkey-helpers.h|  219 
 tools/testing/selftests/x86/protection_keys.c | 1395 -
 6 files changed, 1616 insertions(+), 1615 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
 create mode 100644 tools/testing/selftests/vm/protection_keys.c
 delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
 delete mode 100644 tools/testing/selftests/x86/protection_keys.c

diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index cbb29e4..1d32f78 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -17,6 +17,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += protection_keys
 
 TEST_PROGS := run_vmtests
 
diff --git a/tools/testing/selftests/vm/pkey-helpers.h 
b/tools/testing/selftests/vm/pkey-helpers.h
new file mode 100644
index 000..b202939
--- /dev/null
+++ b/tools/testing/selftests/vm/pkey-helpers.h
@@ -0,0 +1,219 @@
+#ifndef _PKEYS_HELPER_H
+#define _PKEYS_HELPER_H
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NR_PKEYS 16
+#define PKRU_BITS_PER_PKEY 2
+
+#ifndef DEBUG_LEVEL
+#define DEBUG_LEVEL 0
+#endif
+#define DPRINT_IN_SIGNAL_BUF_SIZE 4096
+extern int dprint_in_signal;
+extern char dprint_in_signal_buffer[DPRINT_IN_SIGNAL_BUF_SIZE];
+static inline void sigsafe_printf(const char *format, ...)
+{
+   va_list ap;
+
+   va_start(ap, format);
+   if (!dprint_in_signal) {
+   vprintf(format, ap);
+   } else {
+   int len = vsnprintf(dprint_in_signal_buffer,
+   DPRINT_IN_SIGNAL_BUF_SIZE,
+   format, ap);
+   /*
+* len is amount that would have been printed,
+* but actual write is truncated at BUF_SIZE.
+*/
+   if (len > DPRINT_IN_SIGNAL_BUF_SIZE)
+   len = DPRINT_IN_SIGNAL_BUF_SIZE;
+   write(1, dprint_in_signal_buffer, len);
+   }
+   va_end(ap);
+}
+#define dprintf_level(level, args...) do { \
+   if (level <= DEBUG_LEVEL)   \
+   sigsafe_printf(args);   \
+   fflush(NULL);   \
+} while (0)
+#define dprintf0(args...) dprintf_level(0, args)
+#define dprintf1(args...) dprintf_level(1, args)
+#define dprintf2(args...) dprintf_level(2, args)
+#define dprintf3(args...) dprintf_level(3, args)
+#define dprintf4(args...) dprintf_level(4, args)
+
+extern unsigned int shadow_pkru;
+static inline unsigned int __rdpkru(void)
+{
+   unsigned int eax, edx;
+   unsigned int ecx = 0;
+   unsigned int pkru;
+
+   asm volatile(".byte 0x0f,0x01,0xee\n\t"
+: "=a" (eax), "=d" (edx)
+: "c" (ecx));
+   pkru = eax;
+   return pkru;
+}
+
+static inline unsigned int _rdpkru(int line)
+{
+   unsigned int pkru = __rdpkru();
+
+   dprintf4("rdpkru(line=%d) pkru: %x shadow: %x\n",
+   line, pkru, shadow_pkru);
+   assert(pkru == shadow_pkru);
+
+   return pkru;
+}
+
+#define rdpkru() _rdpkru(__LINE__)
+
+static inline void __wrpkru(unsigned int pkru)
+{
+   unsigned int eax = pkru;
+   unsigned int ecx = 0;
+   unsigned int edx = 0;
+
+   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   asm volatile(".byte 0x0f,0x01,0xef\n\t"
+: : "a" (eax), "c" (ecx), "d" (edx));
+   assert(pkru == __rdpkru());
+}
+
+static inline void wrpkru(unsigned int pkru)
+{
+   dprintf4("%s() changing %08x to %08x\n", __func__, __rdpkru(), pkru);
+   /* will do the shadow check for us: */
+   rdpkru();
+   __wrpkru(pkru);
+   shadow_pkru = pkru;
+   dprintf4("%s(%08x) pkru: %08x\n", __func__, pkru, __rdpkru());
+}
+
+/*
+ * These are technically racy, since something could
+ * change PKRU between the read and the write.
+ */
+static inline void __pkey_access_allow(int pkey, int do_allow)
+{
+   unsigned int pkru = rdpkru();
+   int bit = pkey * 2;
+
+   if (do_allow)
+   pkru &= (1<

[RFC v3 18/23] powerpc: Deliver SEGV signal on pkey violation

2017-06-21 Thread Ram Pai
The value of the AMR register at the time of the exception
is made available in gp_regs[PT_AMR] of the siginfo.

This field can be used to reprogram the permission bits of
any valid pkey.

Similarly, the value of the pkey whose protection got violated
is made available in the si_pkey field of the siginfo structure.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/paca.h|  1 +
 arch/powerpc/include/uapi/asm/ptrace.h |  3 ++-
 arch/powerpc/kernel/asm-offsets.c  |  5 
 arch/powerpc/kernel/exceptions-64s.S   | 16 +--
 arch/powerpc/kernel/signal_32.c| 14 ++
 arch/powerpc/kernel/signal_64.c| 14 ++
 arch/powerpc/kernel/traps.c| 49 ++
 arch/powerpc/mm/fault.c|  2 ++
 8 files changed, 101 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 1c09f8f..a41afd3 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -92,6 +92,7 @@ struct paca_struct {
struct dtl_entry *dispatch_log_end;
 #endif /* CONFIG_PPC_STD_MMU_64 */
u64 dscr_default;   /* per-CPU default DSCR */
+   u64 paca_amr;   /* value of amr at exception */
 
 #ifdef CONFIG_PPC_STD_MMU_64
/*
diff --git a/arch/powerpc/include/uapi/asm/ptrace.h 
b/arch/powerpc/include/uapi/asm/ptrace.h
index 8036b38..7ec2428 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -108,8 +108,9 @@ struct pt_regs {
 #define PT_DAR 41
 #define PT_DSISR 42
 #define PT_RESULT 43
-#define PT_DSCR 44
 #define PT_REGS_COUNT 44
+#define PT_DSCR 44
+#define PT_AMR 45
 
 #define PT_FPR048  /* each FP reg occupies 2 slots in this space */
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 709e234..17f5d8a 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -241,6 +241,11 @@ int main(void)
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   OFFSET(PACA_AMR, paca_struct, paca_amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
OFFSET(ACCOUNT_STARTTIME, paca_struct, accounting.starttime);
OFFSET(ACCOUNT_STARTTIME_USER, paca_struct, accounting.starttime_user);
OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 3fd0528..a4de1b4 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,9 +493,15 @@ EXC_COMMON_BEGIN(data_access_common)
ld  r12,_MSR(r1)
ld  r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
-   li  r5,0x300
std r3,_DAR(r1)
std r4,_DSISR(r1)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+   beq+1f
+   mfspr   r5,SPRN_AMR
+   std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+1: li  r5,0x300
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
@@ -561,9 +567,15 @@ EXC_COMMON_BEGIN(instruction_access_common)
ld  r12,_MSR(r1)
ld  r3,_NIP(r1)
andis.  r4,r12,0x5820
-   li  r5,0x400
std r3,_DAR(r1)
std r4,_DSISR(r1)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   andis.  r0,r4,DSISR_KEYFAULT@h /* save AMR only if its a key fault */
+   beq+1f
+   mfspr   r5,SPRN_AMR
+   std r5,PACA_AMR(r13)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+1: li  r5,0x400
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 97bb138..059766a 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -500,6 +500,11 @@ static int save_user_regs(struct pt_regs *regs, struct 
mcontext __user *frame,
   (unsigned long) >tramp[2]);
}
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (__put_user(get_paca()->paca_amr, >mc_gregs[PT_AMR]))
+   return 1;
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
return 0;
 }
 
@@ -661,6 +666,9 @@ static long restore_user_regs(struct pt_regs *regs,
long err;
unsigned int save_r2 = 0;
unsigned long msr;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #ifdef CONFIG_VSX
int i;
 #endif
@@ -750,6 +758,12 @@ static 

[RFC v3 17/23] powerpc: Handle exceptions caused by violation of pkey protection

2017-06-21 Thread Ram Pai
Handle Data and Instruction exceptions caused by memory
protection keys.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mmu_context.h | 12 +
 arch/powerpc/include/asm/pkeys.h   |  9 
 arch/powerpc/include/asm/reg.h |  2 +-
 arch/powerpc/mm/fault.c| 20 
 arch/powerpc/mm/pkeys.c| 90 ++
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index da7e943..71fffe0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_pte_access_permitted(pte_t pte, bool write);
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+   bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+   /* by default, allow everything */
+   return true;
+}
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
 {
/* by default, allow everything */
return true;
 }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index af3882f..a83722e 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,15 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)
 
+static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
+{
+   return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
+   ((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
+}
+
 #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\
((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index ba110dd..6e2a860 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -286,7 +286,7 @@
 #define   DSISR_SET_RC 0x0004  /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
 #define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \
-   DSISR_BADACCESS | DSISR_BIT43)
+   DSISR_BADACCESS | DSISR_KEYFAULT | DSISR_BIT43)
 #define SPRN_TBRL  0x10C   /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR   0x11B   /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580..3d71984 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -261,6 +261,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
}
 #endif
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (error_code & DSISR_KEYFAULT) {
+   code = SEGV_PKUERR;
+   goto bad_area_nosemaphore;
+   }
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
/* We restore the interrupt state now */
if (!arch_irq_disabled_regs(regs))
local_irq_enable();
@@ -441,6 +448,19 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
address,
WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
 #endif /* CONFIG_PPC_STD_MMU */
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+   is_exec, 0)) {
+   code = SEGV_PKUERR;
+   goto bad_area;
+   }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+   /* handle_mm_fault() needs to know if it's an instruction access
+* fault.
+*/
+   if (is_exec)
+   flags |= FAULT_FLAG_INSTRUCTION;
/*
 * If for any reason at all we couldn't handle the fault,
 * make sure we exit gracefully rather than endlessly redo
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 11a32b3..439241a 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
 }
 
+static inline bool pkey_allows_read(int pkey)
+{
+   int pkey_shift = (arch_max_pkey()-pkey-1) 

[RFC v3 16/23] powerpc: Macro the mask used for checking DSI exception

2017-06-21 Thread Ram Pai
Replace the magic number used to check for a DSI exception
with a meaningful value.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/reg.h   | 7 ++-
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7e50e47..ba110dd 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -272,16 +272,21 @@
 #define SPRN_DAR   0x013   /* Data Address Register */
 #define SPRN_DBCR  0x136   /* e300 Data Breakpoint Control Reg */
 #define SPRN_DSISR 0x012   /* Data Storage Interrupt Status Register */
+#define   DSISR_BIT32  0x8000  /* not defined */
 #define   DSISR_NOHPTE 0x4000  /* no translation found */
+#define   DSISR_PAGEATTR_CONFLT0x2000  /* page attribute 
conflict */
+#define   DSISR_BIT35  0x1000  /* not defined */
 #define   DSISR_PROTFAULT  0x0800  /* protection fault */
 #define   DSISR_BADACCESS  0x0400  /* bad access to CI or G */
 #define   DSISR_ISSTORE0x0200  /* access was a store */
 #define   DSISR_DABRMATCH  0x0040  /* hit data breakpoint */
-#define   DSISR_NOSEGMENT  0x0020  /* SLB miss */
 #define   DSISR_KEYFAULT   0x0020  /* Key fault */
+#define   DSISR_BIT43  0x0010  /* not defined */
 #define   DSISR_UNSUPP_MMU 0x0008  /* Unsupported MMU config */
 #define   DSISR_SET_RC 0x0004  /* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT  0x0002  /* Fault on page directory */
+#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | DSISR_PAGEATTR_CONFLT | \
+   DSISR_BADACCESS | DSISR_BIT43)
 #define SPRN_TBRL  0x10C   /* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR   0x11B   /* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index ae418b8..3fd0528 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
.balign IFETCH_ALIGN_BYTES
 do_hash_page:
 #ifdef CONFIG_PPC_STD_MMU_64
-   andis.  r0,r4,0xa410/* weird error? */
+   andis.  r0,r4,DSISR_PAGE_FAULT_MASK@h
bne-handle_page_fault   /* if not, try to insert a HPTE */
andis.  r0,r4,DSISR_DABRMATCH@h
bne-handle_dabr_fault
-- 
1.8.3.1



[RFC v3 15/23] powerpc: Program HPTE key protection bits

2017-06-21 Thread Ram Pai
Map the PTE protection key bits to the HPTE key protection bits
while creating HPTE entries.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +
 arch/powerpc/include/asm/pkeys.h  | 7 +++
 arch/powerpc/mm/hash_utils_64.c   | 5 +
 3 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52..f7a6ed3 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
 #define HPTE_R_PP0 ASM_CONST(0x8000)
 #define HPTE_R_TS  ASM_CONST(0x4000)
 #define HPTE_R_KEY_HI  ASM_CONST(0x3000)
+#define HPTE_R_KEY_BIT0ASM_CONST(0x2000)
+#define HPTE_R_KEY_BIT1ASM_CONST(0x1000)
 #define HPTE_R_RPN_SHIFT   12
 #define HPTE_R_RPN ASM_CONST(0x0000)
 #define HPTE_R_RPN_3_0 ASM_CONST(0x01fff000)
@@ -104,6 +106,9 @@
 #define HPTE_R_C   ASM_CONST(0x0080)
 #define HPTE_R_R   ASM_CONST(0x0100)
 #define HPTE_R_KEY_LO  ASM_CONST(0x0e00)
+#define HPTE_R_KEY_BIT2ASM_CONST(0x0800)
+#define HPTE_R_KEY_BIT3ASM_CONST(0x0400)
+#define HPTE_R_KEY_BIT4ASM_CONST(0x0200)
 
 #define HPTE_V_1TB_SEG ASM_CONST(0x4000)
 #define HPTE_V_VRMA_MASK   ASM_CONST(0x4001ff00)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0f3dca8..af3882f 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -27,6 +27,13 @@
((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
 
+#define pte_to_hpte_pkey_bits(pteflags)\
+   (((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |\
+   ((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) | \
+   ((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) | \
+   ((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) | \
+   ((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index b3bc5d6..34bc94c 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -230,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long 
pteflags)
 */
rflags |= HPTE_R_M;
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   rflags |= pte_to_hpte_pkey_bits(pteflags);
+#endif
+
return rflags;
 }
 
-- 
1.8.3.1



[RFC v3 14/23] powerpc: Implementation for sys_mprotect_pkey() system call

2017-06-21 Thread Ram Pai
This system call associates the pkey with the PTEs of all
pages covering the given address range.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 22 ++-
 arch/powerpc/include/asm/mman.h  | 14 -
 arch/powerpc/include/asm/pkeys.h | 21 ++-
 arch/powerpc/include/asm/systbl.h|  1 +
 arch/powerpc/include/asm/unistd.h|  4 +-
 arch/powerpc/include/uapi/asm/unistd.h   |  1 +
 arch/powerpc/mm/pkeys.c  | 93 +++-
 7 files changed, 148 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 87e9a89..bc845cd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -37,6 +37,7 @@
 #define _RPAGE_RSV20x0800UL
 #define _RPAGE_RSV30x0400UL
 #define _RPAGE_RSV40x0200UL
+#define _RPAGE_RSV50x00040UL
 
 #define _PAGE_PTE  0x4000UL/* distinguishes PTEs 
from pointers */
 #define _PAGE_PRESENT  0x8000UL/* pte contains a 
translation */
@@ -56,6 +57,20 @@
 /* Max physical address bit as per radix table */
 #define _RPAGE_PA_MAX  57
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#define H_PAGE_PKEY_BIT0   _RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1   _RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2   _RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3   _RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4   _RPAGE_RSV5
+#else /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+#define H_PAGE_PKEY_BIT0   0
+#define H_PAGE_PKEY_BIT1   0
+#define H_PAGE_PKEY_BIT2   0
+#define H_PAGE_PKEY_BIT3   0
+#define H_PAGE_PKEY_BIT4   0
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 /*
  * Max physical address bit we will use for now.
  *
@@ -122,7 +137,12 @@
 #define PAGE_PROT_BITS  (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \
 H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \
 _PAGE_READ | _PAGE_WRITE |  _PAGE_DIRTY | _PAGE_EXEC | 
\
-_PAGE_SOFT_DIRTY)
+_PAGE_SOFT_DIRTY | \
+H_PAGE_PKEY_BIT0 | \
+H_PAGE_PKEY_BIT1 | \
+H_PAGE_PKEY_BIT2 | \
+H_PAGE_PKEY_BIT3 | \
+H_PAGE_PKEY_BIT4)
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
  * cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..624f6a2 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,6 +13,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -22,13 +23,24 @@
 static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
unsigned long pkey)
 {
-   return (prot & PROT_SAO) ? VM_SAO : 0;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   return (((prot & PROT_SAO) ? VM_SAO : 0) |
+   pkey_to_vmflag_bits(pkey));
+#else
+   return ((prot & PROT_SAO) ? VM_SAO : 0);
+#endif
 }
 #define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
 
 static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 {
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   return (vm_flags & VM_SAO) ?
+   __pgprot(_PAGE_SAO | vmflag_to_page_pkey_bits(vm_flags)) :
+   __pgprot(0 | vmflag_to_page_pkey_bits(vm_flags));
+#else
return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
+#endif
 }
 #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
 
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7bc8746..0f3dca8 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,19 @@
VM_PKEY_BIT3 | \
VM_PKEY_BIT4)
 
+#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
+   ((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |\
+   ((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |\
+   ((key & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) |\
+   ((key & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL))
+
+#define vmflag_to_page_pkey_bits(vm_flags)  \
+   (((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL)| \
+   ((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) | \
+   ((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) | \
+   ((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) | \
+   ((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 

[RFC v3 13/23] powerpc: store and restore the pkey state across context switches

2017-06-21 Thread Ram Pai
Store and restore the AMR, IAMR and UAMOR register state of the task
before scheduling out and after scheduling in, respectively.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/processor.h |  5 +
 arch/powerpc/kernel/process.c| 18 ++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index a2123f2..1f714df 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -310,6 +310,11 @@ struct thread_struct {
struct thread_vr_state ckvr_state; /* Checkpointed VR state */
unsigned long   ckvrsave; /* Checkpointed VRSAVE */
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   unsigned long   amr;
+   unsigned long   iamr;
+   unsigned long   uamor;
+#endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
void*   kvm_shadow_vcpu; /* KVM internal data */
 #endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index baae104..37d001a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t)
t->tar = mfspr(SPRN_TAR);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   t->amr = mfspr(SPRN_AMR);
+   t->iamr = mfspr(SPRN_IAMR);
+   t->uamor = mfspr(SPRN_UAMOR);
+#endif
 }
 
 static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct 
*old_thread,
mtspr(SPRN_TAR, new_thread->tar);
}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   if (old_thread->amr != new_thread->amr)
+   mtspr(SPRN_AMR, new_thread->amr);
+   if (old_thread->iamr != new_thread->iamr)
+   mtspr(SPRN_IAMR, new_thread->iamr);
+   if (old_thread->uamor != new_thread->uamor)
+   mtspr(SPRN_UAMOR, new_thread->uamor);
+#endif
 }
 
 struct task_struct *__switch_to(struct task_struct *prev,
@@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long 
start, unsigned long sp)
current->thread.tm_texasr = 0;
current->thread.tm_tfiar = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   current->thread.amr   = 0x0ul;
+   current->thread.iamr  = 0x0ul;
+   current->thread.uamor = 0x0ul;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL(start_thread);
 
-- 
1.8.3.1



[RFC v3 12/23] powerpc: Implement sys_pkey_alloc and sys_pkey_free system call

2017-06-21 Thread Ram Pai
sys_pkey_alloc() allocates and returns an available pkey.
sys_pkey_free() frees up the pkey.

A total of 32 keys are supported on powerpc. However, pkeys 0, 1 and 31
are reserved, so effectively we have 29 pkeys.

Each key can be initialized to disable read, write and execute
permissions. On powerpc a key can also be initialized to disable execute.

Signed-off-by: Ram Pai 
---
 arch/powerpc/Kconfig |  15 
 arch/powerpc/include/asm/book3s/64/mmu.h |  10 +++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  62 ++
 arch/powerpc/include/asm/pkeys.h | 124 +++
 arch/powerpc/include/asm/systbl.h|   2 +
 arch/powerpc/include/asm/unistd.h|   4 +-
 arch/powerpc/include/uapi/asm/unistd.h   |   2 +
 arch/powerpc/mm/Makefile |   1 +
 arch/powerpc/mm/mmu_context_book3s64.c   |   5 ++
 arch/powerpc/mm/pkeys.c  |  88 +++
 10 files changed, 310 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7c8f99..b6960617 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -871,6 +871,21 @@ config SECCOMP
 
  If unsure, say Y. Only embedded should say N here.
 
+config PPC64_MEMORY_PROTECTION_KEYS
+   prompt "PowerPC Memory Protection Keys"
+   def_bool y
+   # Note: only available in 64-bit mode
+   depends on PPC64 && PPC_64K_PAGES
+   select ARCH_USES_HIGH_VMA_FLAGS
+   select ARCH_HAS_PKEYS
+   ---help---
+ Memory Protection Keys provides a mechanism for enforcing
+ page-based protections, but without requiring modification of the
+ page tables when an application changes protection domains.
+
+ For details, see Documentation/powerpc/protection-keys.txt
+
+ If unsure, say y.
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3..0c0a2a8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -108,6 +108,16 @@ struct patb_entry {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
struct list_head iommu_group_mem_list;
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /*
+* Each bit represents one protection key.
+* bit set   -> key allocated
+* bit unset -> key available for allocation
+*/
+   u32 pkey_allocation_map;
+   s16 execute_only_pkey; /* key holding execute-only protection */
+#endif
 } mm_context_t;
 
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc987..87e9a89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct 
mm_struct *mm,
pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
 }
 
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
+#include 
+static inline u64 read_amr(void)
+{
+   return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+   mtspr(SPRN_AMR, value);
+}
+static inline u64 read_iamr(void)
+{
+   return mfspr(SPRN_IAMR);
+}
+static inline void write_iamr(u64 value)
+{
+   mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+   return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+   mtspr(SPRN_UAMOR, value);
+}
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+static inline u64 read_amr(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_amr(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_uamor(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_uamor(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_iamr(void)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+   return -1;
+}
+static inline void write_iamr(u64 value)
+{
+   WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
   unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 000..7bc8746
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,124 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+
+#define arch_max_pkey()  32

[RFC v3 11/23] x86: key creation with PKEY_DISABLE_EXECUTE is disallowed

2017-06-21 Thread Ram Pai
x86 does not support disabling execute permissions on a pkey.

Signed-off-by: Ram Pai 
---
 arch/x86/kernel/fpu/xstate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c24ac1e..d582631 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -900,6 +900,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
if (!boot_cpu_has(X86_FEATURE_OSPKE))
return -EINVAL;
 
+   if (init_val & PKEY_DISABLE_EXECUTE)
+   return -EINVAL;
+
/* Set the bits we need in PKRU:  */
if (init_val & PKEY_DISABLE_ACCESS)
new_pkru_bits |= PKRU_AD_BIT;
-- 
1.8.3.1



[RFC v3 10/23] mm: provide the ability to disable execute on a key at creation

2017-06-21 Thread Ram Pai
Currently sys_pkey_alloc() provides the ability to disable read
and write permission on the key at creation. powerpc has the
hardware support to disable execute on a pkey as well. This patch
enhances the interface to allow disabling execute at key creation
time. x86 does not allow this, hence the next patch will add the
ability in x86 to return an error if PKEY_DISABLE_EXECUTE is
specified.

Signed-off-by: Ram Pai 
---
 include/uapi/asm-generic/mman-common.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index 8c27db0..bf4fa07 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,7 +74,9 @@
 
 #define PKEY_DISABLE_ACCESS0x1
 #define PKEY_DISABLE_WRITE 0x2
+#define PKEY_DISABLE_EXECUTE   0x4
 #define PKEY_ACCESS_MASK   (PKEY_DISABLE_ACCESS |\
-PKEY_DISABLE_WRITE)
+PKEY_DISABLE_WRITE  |\
+PKEY_DISABLE_EXECUTE)
 
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
-- 
1.8.3.1



[RFC v3 09/23] mm: introduce an additional vma bit for powerpc pkey

2017-06-21 Thread Ram Pai
Currently there are only 4 bits in the vma flags to support 16 keys
on x86. powerpc supports 32 keys, which needs 5 bits. This patch
introduces an additional bit in the vma flags.

Signed-off-by: Ram Pai 
---
 fs/proc/task_mmu.c |  6 +-
 include/linux/mm.h | 18 +-
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33..2ddc298 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
[ilog2(VM_MERGEABLE)]   = "mg",
[ilog2(VM_UFFD_MISSING)]= "um",
[ilog2(VM_UFFD_WP)] = "uw",
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_ARCH_HAS_PKEYS
/* These come out via ProtectionKey: */
[ilog2(VM_PKEY_BIT0)]   = "",
[ilog2(VM_PKEY_BIT1)]   = "",
[ilog2(VM_PKEY_BIT2)]   = "",
[ilog2(VM_PKEY_BIT3)]   = "",
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+   /* Additional bit in ProtectionKey: */
+   [ilog2(VM_PKEY_BIT4)]   = "",
 #endif
};
size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..3d35bcc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,21 +208,29 @@ extern int overcommit_kbytes_handler(struct ctl_table *, 
int, void __user *,
 #define VM_HIGH_ARCH_BIT_1 33  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_2 34  /* bit only usable on 64-bit 
architectures */
 #define VM_HIGH_ARCH_BIT_3 35  /* bit only usable on 64-bit 
architectures */
+#define VM_HIGH_ARCH_BIT_4 36  /* bit only usable on 64-bit arch */
 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
-#if defined(CONFIG_X86)
-# define VM_PATVM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+#ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0  VM_HIGH_ARCH_0  /* A protection key is a 4-bit value */
+# define VM_PKEY_BIT0  VM_HIGH_ARCH_0
 # define VM_PKEY_BIT1  VM_HIGH_ARCH_1
 # define VM_PKEY_BIT2  VM_HIGH_ARCH_2
 # define VM_PKEY_BIT3  VM_HIGH_ARCH_3
-#endif
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+
+#if defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
+# define VM_PKEY_BIT4  VM_HIGH_ARCH_4 /* additional key bit used on ppc64 */
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
+#if defined(CONFIG_X86)
+# define VM_PATVM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
 #elif defined(CONFIG_PPC)
 # define VM_SAOVM_ARCH_1   /* Strong Access Ordering 
(powerpc) */
 #elif defined(CONFIG_PARISC)
-- 
1.8.3.1



[RFC v3 08/23] powerpc: use helper functions in flush_hash_page()

2017-06-21 Thread Ram Pai
Replace redundant code in flush_hash_page() with the helper functions
get_hidx_gslot() and set_hidx_slot().

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/hash_utils_64.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 99f97754c..b3bc5d6 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1611,23 +1611,18 @@ unsigned long get_hidx_gslot(unsigned long vpn, 
unsigned long shift,
 void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
 unsigned long flags)
 {
-   unsigned long hash, index, shift, hidx, slot;
+   unsigned long index, shift, gslot;
int local = flags & HPTE_LOCAL_UPDATE;
 
DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
-   hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(pte, index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-   DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
+   gslot = get_hidx_gslot(vpn, shift, ssize, pte, index);
+   DBG_LOW(" sub %ld: gslot=%lx\n", index, gslot);
/*
 * We use same base page size and actual psize, because we don't
 * use these functions for hugepage
 */
-   mmu_hash_ops.hpte_invalidate(slot, vpn, psize, psize,
+   mmu_hash_ops.hpte_invalidate(gslot, vpn, psize, psize,
 ssize, local);
} pte_iterate_hashed_end();
 
-- 
1.8.3.1



[RFC v3 07/23] powerpc: use helper functions in __hash_page_4K() for 4K PTE

2017-06-21 Thread Ram Pai
Replace redundant code with the helper functions
get_hidx_gslot() and set_hidx_slot().

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/hash64_4k.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..c673829 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
   pte_t *ptep, unsigned long trap, unsigned long flags,
   int ssize, int subpg_prot)
 {
+   real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * need to add in 0x1 if it's a read-only user page
 */
rflags = htab_convert_pte_flags(new_pte);
+   rpte = __real_pte(__pte(old_pte), ptep);
 
if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -64,13 +66,10 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
/*
 * There MIGHT be an HPTE for this pte
 */
-   hash = hpt_hash(vpn, shift, ssize);
-   if (old_pte & H_PAGE_F_SECOND)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
+   unsigned long gslot = get_hidx_gslot(vpn, shift,
+   ssize, rpte, 0);
 
-   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
+   if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, MMU_PAGE_4K,
   MMU_PAGE_4K, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
}
@@ -118,8 +117,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
return -1;
}
new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
-   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
-   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+   new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
}
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
-- 
1.8.3.1



[RFC v3 06/23] powerpc: use helper functions in __hash_page_4K() for 64K PTE

2017-06-21 Thread Ram Pai
Replace redundant code in __hash_page_4K() with the helper
functions get_hidx_gslot() and set_hidx_slot().

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/hash64_64k.c | 24 ++--
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 5cbdaa9..cb48a60 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -103,18 +103,12 @@ int __hash_page_4K(unsigned long ea, unsigned long 
access, unsigned long vsid,
if (__rpte_sub_valid(rpte, subpg_index)) {
int ret;
 
-   hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(rpte, subpg_index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-
-   ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
+   gslot = get_hidx_gslot(vpn, shift, ssize, rpte, subpg_index);
+   ret = mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn,
 MMU_PAGE_4K, MMU_PAGE_4K,
 ssize, flags);
/*
-*if we failed because typically the HPTE wasn't really here
+* if we failed because typically the HPTE wasn't really here
 * we try an insertion.
 */
if (ret == -1)
@@ -214,15 +208,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * Since we have H_PAGE_BUSY set on ptep, we can be sure
 * nobody is undating hidx.
 */
-   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-   rpte.hidx &= ~(0xfUL << (subpg_index << 2));
-   *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
-   new_pte = mark_subptegroup_valid(new_pte, subpg_index);
-   new_pte |=  H_PAGE_HASHPTE;
-   /*
-* check __real_pte for details on matching smp_rmb()
-*/
-   smp_wmb();
+   new_pte |= H_PAGE_HASHPTE;
+   new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
+
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
 }
-- 
1.8.3.1



[RFC v3 05/23] powerpc: capture the PTE format changes in the dump pte report

2017-06-21 Thread Ram Pai
H_PAGE_F_SECOND and H_PAGE_F_GIX are no longer in the main 64K PTE.
Capture these changes in the PTE dump report.

Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/dump_linuxpagetables.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c 
b/arch/powerpc/mm/dump_linuxpagetables.c
index 44fe483..5627edd 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -213,7 +213,7 @@ struct flag_info {
.val= H_PAGE_4K_PFN,
.set= "4K_pfn",
}, {
-#endif
+#else /* CONFIG_PPC_64K_PAGES */
.mask   = H_PAGE_F_GIX,
.val= H_PAGE_F_GIX,
.set= "f_gix",
@@ -224,6 +224,7 @@ struct flag_info {
.val= H_PAGE_F_SECOND,
.set= "f_second",
}, {
+#endif /* CONFIG_PPC_64K_PAGES */
 #endif
.mask   = _PAGE_SPECIAL,
.val= _PAGE_SPECIAL,
-- 
1.8.3.1



[RFC v3 04/23] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages

2017-06-21 Thread Ram Pai
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 64K backed HPTE pages. This, along with the earlier
patch, will entirely free up those four bits in the 64K PTE.
The bit numbers are big-endian, as defined in ISA 3.0.

This patch does the following change to 64K PTE that is
backed by 64K HPTE.

H_PAGE_F_SECOND which occupied bit 4 moves to the second part
of the pte.
H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
second part of the pte.

Since bit 7 is now freed up, we move H_PAGE_BUSY from bit 9
to bit 7, minimizing gaps so that contiguous bits can be
allocated if needed in the future.

The second part of the PTE will hold
(H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.

The above PTE changes apply to hugetlb pages as well.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 28 +--
 arch/powerpc/mm/hash64_64k.c  | 17 
 arch/powerpc/mm/hugetlbpage-hash64.c  | 16 ++-
 3 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 4bac70a..7b5dbf3 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,11 +12,8 @@
  */
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
-#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_F_GIX_SHIFT 56
 
-#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY_RPAGE_RPN44 /* software: PTE & hash are busy */
 #define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */
 
 /*
@@ -26,8 +23,7 @@
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
 /* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
-H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
 /*
  * we support 16 fragments per PTE page of 64K size.
  */
@@ -55,24 +51,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
unsigned long *hidxp;
 
rpte.pte = pte;
-   rpte.hidx = 0;
-   if (pte_val(pte) & H_PAGE_COMBO) {
-   /*
-* Make sure we order the hidx load against the H_PAGE_COMBO
-* check. The store side ordering is done in __hash_page_4K
-*/
-   smp_rmb();
-   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-   rpte.hidx = *hidxp;
-   }
+   /*
+* The store side ordering is done in set_hidx_slot()
+*/
+   smp_rmb();
+   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+   rpte.hidx = *hidxp;
return rpte;
 }
 
 static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long 
index)
 {
-   if ((pte_val(rpte.pte) & H_PAGE_COMBO))
-   return (rpte.hidx >> (index<<2)) & 0xf;
-   return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
+   return ((rpte.hidx >> (index<<2)) & 0xfUL);
 }
 
 static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index a16cd28..5cbdaa9 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -231,6 +231,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
unsigned long vsid, pte_t *ptep, unsigned long trap,
unsigned long flags, int ssize)
 {
+   real_pte_t rpte;
unsigned long hpte_group;
unsigned long rflags, pa;
unsigned long old_pte, new_pte;
@@ -267,6 +268,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
 
rflags = htab_convert_pte_flags(new_pte);
+   rpte = __real_pte(__pte(old_pte), ptep);
 
if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -274,16 +276,13 @@ int __hash_page_64K(unsigned long ea, unsigned long 
access,
 
vpn  = hpt_vpn(ea, vsid, ssize);
if (unlikely(old_pte & H_PAGE_HASHPTE)) {
+   unsigned long gslot;
+
/*
 * There MIGHT be an HPTE for this pte
 */
-   hash = hpt_hash(vpn, shift, ssize);
-   if (old_pte & H_PAGE_F_SECOND)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
-   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
+   gslot = 

[RFC v3 03/23] powerpc: introduce get_hidx_gslot helper

2017-06-21 Thread Ram Pai
Introduce get_hidx_gslot() which returns the slot number of the HPTE
in the global hash table.

This function will come in handy as we work towards re-arranging the
PTE bits in the later patches.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash.h |  3 +++
 arch/powerpc/mm/hash_utils_64.c   | 14 ++
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index ac049de..e7cf03a 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -162,6 +162,9 @@ static inline bool hpte_soft_invalid(unsigned long slot)
return ((slot & 0xfUL) == 0xfUL);
 }
 
+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+   int ssize, real_pte_t rpte, unsigned int subpg_index);
+
 /* This low level function performs the actual PTE insertion
  * Setting the PTE depends on the MMU type and other factors. It's
  * an horrible mess that I'm not going to try to clean up now but
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 1b494d0..99f97754c 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1591,6 +1591,20 @@ static inline void tm_flush_hash_page(int local)
 }
 #endif
 
+unsigned long get_hidx_gslot(unsigned long vpn, unsigned long shift,
+   int ssize, real_pte_t rpte, unsigned int subpg_index)
+{
+   unsigned long hash, slot, hidx;
+
+   hash = hpt_hash(vpn, shift, ssize);
+   hidx = __rpte_to_hidx(rpte, subpg_index);
+   if (hidx & _PTEIDX_SECONDARY)
+   hash = ~hash;
+   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+   slot += hidx & _PTEIDX_GROUP_IX;
+   return slot;
+}
+
 /* WARNING: This is called from hash_low_64.S, if you change this prototype,
  *  do not forget to update the assembly call site !
  */
-- 
1.8.3.1



[RFC v3 02/23] powerpc: introduce set_hidx_slot helper

2017-06-21 Thread Ram Pai
Introduce set_hidx_slot(), which sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX)
bits at the appropriate location in a 4K PTE. In the case of a 64K PTE,
it sets the bits in the second part of the PTE. Though the implementation
for the former needs only the slot parameter, it takes some additional
parameters to keep the prototype consistent.

This function will come in handy as we work towards re-arranging the
bits in later patches.

Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  7 +++
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 16 
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 9c2c8f1..cef644c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -55,6 +55,13 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
 }
 #endif
 
+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+   unsigned int subpg_index, unsigned long slot)
+{
+   return (slot << H_PAGE_F_GIX_SHIFT) &
+   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 static inline char *get_hpte_slot_array(pmd_t *pmdp)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 3f49941..4bac70a 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -75,6 +75,22 @@ static inline unsigned long __rpte_to_hidx(real_pte_t rpte, 
unsigned long index)
return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
 }
 
+static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+   unsigned int subpg_index, unsigned long slot)
+{
+   unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+
+   rpte.hidx &= ~(0xfUL << (subpg_index << 2));
+   *hidxp = rpte.hidx  | (slot << (subpg_index << 2));
+   /*
+* Avoid race with __real_pte()
+* hidx must be committed to memory before committing
+* the pte.
+*/
+   smp_wmb();
+   return 0x0UL;
+}
+
 #define __rpte_to_pte(r)   ((r).pte)
 extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
 /*
-- 
1.8.3.1



[RFC v3 01/23] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-06-21 Thread Ram Pai
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 4K backed HPTE pages. These bits continue to be used
for 64K backed HPTE pages in this patch, but will be freed
up in the next patch. The bit numbers are big-endian, as
defined in ISA 3.0.

The patch does the following change to the 64K PTE format

H_PAGE_BUSY moves from bit 3 to bit 9
H_PAGE_F_SECOND which occupied bit 4 moves to the second part
of the pte.
H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
second part of the pte.

The four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
are initialized to 0xF, indicating an invalid slot. If an HPTE
gets cached in a 0xF slot (i.e. the 7th slot of the secondary HPTEG),
it is released immediately. In other words, even though 0xF is a
valid slot, we discard it and consider it invalid; see
hpte_soft_invalid(). This gives us an opportunity to not
depend on a bit in the primary PTE in order to determine the
validity of a slot.

When we release an HPTE in the 0xF slot, we also release a
legitimate primary slot and unmap that entry. This is to
ensure that we do get a legitimate non-0xF slot the next time we
retry for a slot.

Though treating the 0xF slot as invalid reduces the number of
available slots and may have an effect on performance, the
probability of hitting a 0xF slot is extremely low.

Compared  to the current scheme, the above described scheme reduces
the number of false hash table updates  significantly  and  has the
added  advantage  of  releasing  four  valuable  PTE bits for other
purpose.

This idea was jointly developed by Paul Mackerras, Aneesh, Michael
Ellerman and myself.

The 4K PTE format remains unchanged for now.

Signed-off-by: Ram Pai 

Conflicts:
arch/powerpc/include/asm/book3s/64/hash.h
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  7 +++
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 17 ---
 arch/powerpc/include/asm/book3s/64/hash.h | 12 +++--
 arch/powerpc/mm/hash64_64k.c  | 70 +++
 arch/powerpc/mm/hash_utils_64.c   |  4 +-
 5 files changed, 66 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b..9c2c8f1 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,13 @@
 #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
+#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */
+
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
 H_PAGE_F_SECOND | H_PAGE_F_GIX)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..3f49941 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -10,20 +10,21 @@
  * 64k aligned address free up few of the lower bits of RPN for us
  * We steal that here. For more deatils look at pte_pfn/pfn_pte()
  */
-#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
-#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
+#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT 56
+
+#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */
+
 /*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
-/*
- * Used to track subpage group valid if H_PAGE_COMBO is set
- * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
- */
-#define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
-
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..ac049de 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -8,11 +8,8 @@
  *
  */
 #define H_PTE_NONE_MASK_PAGE_HPTEFLAGS

[RFC v3 00/23] powerpc: Memory Protection Keys

2017-06-21 Thread Ram Pai
Memory protection keys enable an application to protect its
address space from inadvertent access or corruption by
itself.

The overall idea:

 A process allocates a key and associates it with an
 address range within its address space. The process
 can then dynamically set read/write permissions on
 the key without involving the kernel. Any code that
 violates the permissions of the address space, as
 defined by its associated key, will receive a
 segmentation fault.

This patch series enables the feature on PPC64.
It is enabled on the HPTE 64K-page platform.

ISA 3.0 section 5.7.13 describes the detailed specifications.


Testing:
This patch series has passed all the protection key
tests available in  the selftests directory.
The tests are updated to work on both x86 and powerpc.

version v3:
(1) split the patches into smaller consumable
patches.
(2) added the ability to disable execute permission
on a key at creation.
(3) rename  calc_pte_to_hpte_pkey_bits() to
pte_to_hpte_pkey_bits() -- suggested by Anshuman
(4) some code optimization and clarity in
do_page_fault()  
(5) A bug fix while invalidating a hpte slot in 
__hash_page_4K() -- noticed by Aneesh


version v2:
(1) documentation and selftest added
(2) fixed a bug in 4k hpte backed 64k pte where page
invalidation was not done correctly, and 
initialization of second-part-of-the-pte was not
done correctly if the pte was not yet Hashed
with a hpte.  Reported by Aneesh.
(3) Fixed ABI breakage caused in siginfo structure.
Reported by Anshuman.

Outstanding known issue:
Calls to sys_swapcontext with a made-up context will end 
up with a crap AMR if done by code that didn't know about
that register. -- Reported by Ben.

version v1: Initial version

Thanks-to: Dave Hansen, Aneesh, Paul Mackerras,
   Michael Ellerman


Ram Pai (23):
  powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
  powerpc: introduce set_hidx_slot helper
  powerpc: introduce get_hidx_gslot helper
  powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
  powerpc: capture the PTE format changes in the dump pte report
  powerpc: use helper functions in __hash_page_4K() for 64K PTE
  powerpc: use helper functions in __hash_page_4K() for 4K PTE
  powerpc: use helper functions in flush_hash_page()
  mm: introduce an additional vma bit for powerpc pkey
  mm: provide the ability to disable execute on a key at creation
  x86: key creation with PKEY_DISABLE_EXECUTE is disallowed
  powerpc: Implement sys_pkey_alloc and sys_pkey_free system call
  powerpc: store and restore the pkey state across context switches
  powerpc: Implementation for sys_mprotect_pkey() system call
  powerpc: Program HPTE key protection bits
  powerpc: Macro the mask used for checking DSI exception
  powerpc: Handle exceptions caused by violation of pkey protection
  powerpc: Deliver SEGV signal on pkey violation
  selftest: Move protecton key selftest to arch neutral directory
  selftest: PowerPC specific test updates to memory protection keys
  Documentation: Move protecton key documentation to arch neutral
directory
  Documentation: PowerPC specific updates to memory protection keys
  procfs: display the protection-key number associated with a vma

 Documentation/filesystems/proc.txt|3 +-
 Documentation/vm/protection-keys.txt  |  110 ++
 Documentation/x86/protection-keys.txt |   85 --
 arch/powerpc/Kconfig  |   15 +
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   14 +
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   53 +-
 arch/powerpc/include/asm/book3s/64/hash.h |   15 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |5 +
 arch/powerpc/include/asm/book3s/64/mmu.h  |   10 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   84 +-
 arch/powerpc/include/asm/mman.h   |   14 +-
 arch/powerpc/include/asm/mmu_context.h|   12 +
 arch/powerpc/include/asm/paca.h   |1 +
 arch/powerpc/include/asm/pkeys.h  |  159 +++
 arch/powerpc/include/asm/processor.h  |5 +
 arch/powerpc/include/asm/reg.h|7 +-
 arch/powerpc/include/asm/systbl.h |3 +
 arch/powerpc/include/asm/unistd.h |6 +-
 arch/powerpc/include/uapi/asm/ptrace.h|3 +-
 arch/powerpc/include/uapi/asm/unistd.h|3 +
 arch/powerpc/kernel/asm-offsets.c |5 +
 arch/powerpc/kernel/exceptions-64s.S  |   18 +-
 arch/powerpc/kernel/process.c |   18 +
 arch/powerpc/kernel/signal_32.c   |   14 +
 arch/powerpc/kernel/signal_64.c   |   14 +
 arch/powerpc/kernel/traps.c   |   49 +
 arch/powerpc/mm/Makefile

Re: [PATCH v2 6/6] ima: Support module-style appended signatures for appraisal

2017-06-21 Thread Mimi Zohar
On Wed, 2017-06-21 at 14:45 -0300, Thiago Jung Bauermann wrote:
> Hello Mimi,
> 
> Thanks for your review, and for queuing the other patches in this series.
> 
> Mimi Zohar  writes:
> > On Wed, 2017-06-07 at 22:49 -0300, Thiago Jung Bauermann wrote:
> >> This patch introduces the modsig keyword to the IMA policy syntax to
> >> specify that a given hook should expect the file to have the IMA signature
> >> appended to it.
> >
> > Thank you, Thiago. Appended signatures seem to be working proper now
> > with multiple keys on the IMA keyring.
> 
> Great news!
> 
> > The length of this patch description is a good indication that this
> > patch needs to be broken up for easier review. A few
> > comments/suggestions inline below.
> 
> Ok, I will try to break it up, and also patch 5 as you suggested.
> 
> >> diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
> >> index 06554c448dce..9190c9058f4f 100644
> >> --- a/security/integrity/digsig.c
> >> +++ b/security/integrity/digsig.c
> >> @@ -48,11 +48,10 @@ static bool init_keyring __initdata;
> >>  #define restrict_link_to_ima restrict_link_by_builtin_trusted
> >>  #endif
> >> 
> >> -int integrity_digsig_verify(const unsigned int id, const char *sig, int 
> >> siglen,
> >> -  const char *digest, int digestlen)
> >> +struct key *integrity_keyring_from_id(const unsigned int id)
> >>  {
> >> -  if (id >= INTEGRITY_KEYRING_MAX || siglen < 2)
> >> -  return -EINVAL;
> >> +  if (id >= INTEGRITY_KEYRING_MAX)
> >> +  return ERR_PTR(-EINVAL);
> >> 
> >
> > When splitting up this patch, the addition of this new function could
> > be a separate patch. The patch description would explain the need for
> > a new function.
> 
> Ok, will do for v3.
> 
> >> @@ -229,10 +234,14 @@ int ima_appraise_measurement(enum ima_hooks func,
> >>goto out;
> >>}
> >> 
> >> -  status = evm_verifyxattr(dentry, XATTR_NAME_IMA, xattr_value, rc, iint);
> >> -  if ((status != INTEGRITY_PASS) && (status != INTEGRITY_UNKNOWN)) {
> >> -  if ((status == INTEGRITY_NOLABEL)
> >> -  || (status == INTEGRITY_NOXATTRS))
> >> +  /* Appended signatures aren't protected by EVM. */
> >> +  status = evm_verifyxattr(dentry, XATTR_NAME_IMA,
> >> +   xattr_value->type == IMA_MODSIG ?
> >> +   NULL : xattr_value, rc, iint);
> >> +  if (status != INTEGRITY_PASS && status != INTEGRITY_UNKNOWN &&
> >> +  !(xattr_value->type == IMA_MODSIG &&
> >> +(status == INTEGRITY_NOLABEL || status == INTEGRITY_NOXATTRS))) {
> >
> > This was messy to begin with, and now it is even more messy. For
> > appended signatures, we're only interested in INTEGRITY_FAIL. Maybe
> > leave the existing "if" clause alone and define a new "if" clause.
> 
> Ok, is this what you had in mind?
> 
> @@ -229,8 +237,14 @@ int ima_appraise_measurement(enum ima_hooks func,
>   goto out;
>   }
> 
> - status = evm_verifyxattr(dentry, XATTR_NAME_IMA, xattr_value, rc, iint);
> - if ((status != INTEGRITY_PASS) && (status != INTEGRITY_UNKNOWN)) {
> + /* Appended signatures aren't protected by EVM. */
> + status = evm_verifyxattr(dentry, XATTR_NAME_IMA,
> +  xattr_value->type == IMA_MODSIG ?
> +  NULL : xattr_value, rc, iint);

Yes, maybe add a comment here indicating that we are only verifying the
other security xattrs, if they exist.

> + if (xattr_value->type == IMA_MODSIG && status == INTEGRITY_FAIL) {
> + cause = "invalid-HMAC";
> + goto out;
> + } else if (status != INTEGRITY_PASS && status != INTEGRITY_UNKNOWN) {
>   if ((status == INTEGRITY_NOLABEL)
>   || (status == INTEGRITY_NOXATTRS))
>   cause = "missing-HMAC";

> 
> >> @@ -267,11 +276,18 @@ int ima_appraise_measurement(enum ima_hooks func,
> >>status = INTEGRITY_PASS;
> >>break;
> >>case EVM_IMA_XATTR_DIGSIG:
> >> +  case IMA_MODSIG:
> >>iint->flags |= IMA_DIGSIG;
> >> -  rc = integrity_digsig_verify(INTEGRITY_KEYRING_IMA,
> >> -   (const char *)xattr_value, rc,
> >> -   iint->ima_hash->digest,
> >> -   iint->ima_hash->length);
> >> +
> >> +  if (xattr_value->type == EVM_IMA_XATTR_DIGSIG)
> >> +  rc = integrity_digsig_verify(INTEGRITY_KEYRING_IMA,
> >> +   (const char *)xattr_value,
> >> +   rc, iint->ima_hash->digest,
> >> +   iint->ima_hash->length);
> >> +  else
> >> +  rc = ima_modsig_verify(INTEGRITY_KEYRING_IMA,
> >> + xattr_value);
> >> +
> >
> > Perhaps allowing IMA_MODSIG to flow into EVM_IMA_XATTR_DIGSIG 

Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd

2017-06-21 Thread Thiago Jung Bauermann

Michael Ellerman  writes:
> Thiago Jung Bauermann  writes:
>
>> Calling arch_update_cpu_topology from a CPU hotplug state machine callback
>> hits a deadlock because the function tries to get a read lock on
>> cpu_hotplug_lock while the state machine still holds a write lock on it.
>>
>> Since all callers of arch_update_cpu_topology except rtasd already hold
>> cpu_hotplug_lock, this patch changes the function to use
>> stop_machine_cpuslocked and creates a separate function for rtasd which
>> still tries to obtain the lock.
>>
>> Michael Bringmann investigated the bug and provided a detailed analysis
>> of the deadlock on this previous RFC for an alternate solution:
>>
>> https://patchwork.ozlabs.org/patch/771293/
>
> Do we know when this broke? Or has it never worked?

It's been broken since at least v4.4, I think. I don't know about
earlier versions.

> Should it go to stable? (can't in its current form AFAICS)

It's not hard to backport both this patch and commit fe5595c07400
("stop_machine: Provide stop_machine_cpuslocked()") from branch
smp/hotplug in tip.git for stable.

Since rtasd only started calling arch_update_cpu_topology in v4.11,
for earlier versions this patch can be simplified to making that
function call stop_machine_cpuslocked unconditionally instead of
defining a separate function.

>> Signed-off-by: Thiago Jung Bauermann 
>> ---
>>
>> Notes:
>> This patch applies on tip/smp/hotplug, it should probably be carried 
>> there.
>
> stop_machine_cpuslocked() doesn't exist in mainline so I think it has to
> be carried there right?

Yes. I said "probably" because I don't know if you want to wait
until that branch is merged so that you can carry this patch in your
tree.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH V4 2/2] powerpc/powernv : Add support for OPAL-OCC command/response interface

2017-06-21 Thread Cyril Bur
On Wed, 2017-06-21 at 13:36 +0530, Shilpasri G Bhat wrote:
> In P9, OCC (On-Chip-Controller) supports shared memory based
> command-response interface. Within the shared memory there is an OPAL
> command buffer and OCC response buffer that can be used to send
> inband commands to OCC. This patch adds a platform driver to support
> the command/response interface between OCC and the host.
> 

Sorry I probably should have pointed out earlier that I don't really
understand the first patch or exactly what problem you're trying to
solve, so I've left it ignored; feel free to explain the idea there, or
hopefully someone who can see what you're trying to do can step in.

As for this patch, just one thing.


> Signed-off-by: Shilpasri G Bhat 
> ---
> - Hold occ->cmd_in_progress in read()
> - Reset occ->rsp_consumed if copy_to_user() fails
> 
>  arch/powerpc/include/asm/opal-api.h|  41 +++-
>  arch/powerpc/include/asm/opal.h|   3 +
>  arch/powerpc/platforms/powernv/Makefile|   2 +-
>  arch/powerpc/platforms/powernv/opal-occ.c  | 313 
> +
>  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>  arch/powerpc/platforms/powernv/opal.c  |   8 +
>  6 files changed, 366 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/opal-occ.c
> 

[snip]

> +
> +static ssize_t opal_occ_read(struct file *file, char __user *buf,
> +  size_t count, loff_t *ppos)
> +{
> + struct miscdevice *dev = file->private_data;
> + struct occ *occ = container_of(dev, struct occ, dev);
> + int rc;
> +
> + if (count < sizeof(*occ->rsp) + occ->rsp->size)
> + return -EINVAL;
> +
> + if (!atomic_cmpxchg(&occ->rsp_consumed, 1, 0))
> + return -EBUSY;
> +
> + if (atomic_cmpxchg(&occ->cmd_in_progress, 0, 1))
> + return -EBUSY;
> +

Personally I would have done these two checks the other way around. It
doesn't really matter which one you do first, but what does matter is
that you undo the change made by the first cmpxchg if the second
cmpxchg causes you to return.

In this case if cmd_in_progress then you'll have marked the response as
consumed...

> + rc = copy_to_user((void __user *)buf, occ->rsp,
> +   sizeof(occ->rsp) + occ->rsp->size);
> + if (rc) {
> + atomic_set(&occ->rsp_consumed, 1);
> + atomic_set(&occ->cmd_in_progress, 0);
> + pr_err("Failed to copy OCC response data to user\n");
> + return rc;
> + }
> +
> + atomic_set(&occ->cmd_in_progress, 0);
> + return sizeof(*occ->rsp) + occ->rsp->size;
> +}
> +

[snip]


Re: [next-20170609] Oops while running CPU off-on (cpuset.c/cpuset_can_attach)

2017-06-21 Thread Stephen Rothwell
Hi all,

On Tue, 13 Jun 2017 09:56:41 -0400 Tejun Heo  wrote:
>
> (forwarding to Li w/ full body)
> 
> Li, can you please take a look at this?
> 
> Thanks.
> 
> On Mon, Jun 12, 2017 at 04:53:42PM +0530, Abdul Haleem wrote:
> > Hi,
> > 
> > linux-next kernel crashed while running CPU offline and online.
> > 
> > Machine: Power 8 LPAR
> > Kernel : 4.12.0-rc4-next-20170609
> > gcc : version 5.2.1
> > config: attached
> > testcase: CPU off/on
> > 
> > for i in $(seq 100);do 
> > for j in $(seq 0 15);do 
> > echo 0 >  /sys/devices/system/cpu/cpu$j/online
> > sleep 5
> > echo 1 > /sys/devices/system/cpu/cpu$j/online
> > done
> > done
> > 
> > kernel trace:
> > --
> > Unable to handle kernel paging request for data at address 0x0960
> > Faulting instruction address: 0xc01d6868
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > SMP NR_CPUS=2048
> > NUMA
> > pSeries
> > Modules linked in: dlci mpls_router af_key 8021q garp mrp nfc af_alg
> > caif_socket caif pn_pep phonet fcrypt pcbc rxrpc hidp hid cmtp
> > kernelcapi bnep rfcomm bluetooth ecdh_generic can_bcm can_raw can pptp
> > gre l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppoe
> > pppox irda xfrm_user xfrm_algo nfnetlink scsi_transport_iscsi dn_rtmsg
> > llc2 dccp_ipv6 atm appletalk ipx p8023 p8022 psnap sctp dccp_ipv4 dccp
> > xt_addrtype xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4
> > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter
> > ip_tables x_tables nf_nat nf_conntrack bridge stp llc dm_thin_pool
> > dm_persistent_data dm_bio_prison dm_bufio libcrc32c rtc_generic
> > vmx_crypto pseries_rng autofs4
> > CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: GW
> > 4.12.0-rc4-next-20170609 #2
> > Workqueue: events cpuset_hotplug_workfn
> > task: cca60580 task.stack: cc728000
> > NIP: c01d6868 LR: c01d6858 CTR: c01d6810
> > REGS: cc72b720 TRAP: 0300   Tainted: GW
> > (4.12.0-rc4-next-20170609)
> > MSR: 80009033 
> >   CR: 44722422  XER: 2000  
> > CFAR: c0008710 DAR: 0960 DSISR: 4000 SOFTE: 1 
> > GPR00: c01d6858 cc72b9a0 c1536e00
> >  
> > GPR04: cc72b9c0  cc72bad0
> > c00766367678 
> > GPR08: c00766366d10 cc72b958 c1736e00
> >  
> > GPR12: c01d6810 ce749300 c0123ef8
> > c00775af4180 
> > GPR16:   c0075480e9c0
> > c0075480e9e0 
> > GPR20: c0075480e8c0 0001 
> > cc72ba20 
> > GPR24: cc72baa0 cc72bac0 c1407248
> > cc72ba20 
> > GPR28: c141fc80 cc72bac0 cc6bc790
> >  
> > NIP [c01d6868] cpuset_can_attach+0x58/0x1b0
> > LR [c01d6858] cpuset_can_attach+0x48/0x1b0
> > Call Trace:
> > [cc72b9a0] [c01d6858] cpuset_can_attach+0x48/0x1b0
> > (unreliable)
> > [cc72ba00] [c01cbe80] cgroup_migrate_execute+0xb0/0x450
> > [cc72ba80] [c01d3754] cgroup_transfer_tasks+0x1c4/0x360
> > [cc72bba0] [c01d923c] cpuset_hotplug_workfn+0x86c/0xa20
> > [cc72bca0] [c011aa44] process_one_work+0x1e4/0x580
> > [cc72bd30] [c011ae78] worker_thread+0x98/0x5c0
> > [cc72bdc0] [c0124058] kthread+0x168/0x1b0
> > [cc72be30] [c000b2e8] ret_from_kernel_thread+0x5c/0x74
> > Instruction dump:
> > f821ffa1 7c7d1b78 6000 6000 38810020 7fa3eb78 3f42ffed 4bff4c25 
> > 6000 3b5a0448 3d420020 eb610020  7f43d378 e929
> > f92af200 
> > ---[ end trace dcaaf98fb36d9e64 ]---

Has there been any progress on this?
-- 
Cheers,
Stephen Rothwell


Re: [PATCH v6 1/4] of: remove *phandle properties from expanded device tree

2017-06-21 Thread Frank Rowand
adding Ben and Paul.

Hi Michael,

On 06/20/17 21:57, Michael Ellerman wrote:
> Hi Frank,
> 
> frowand.l...@gmail.com writes:
>> From: Frank Rowand 
>>
>> Remove "phandle", "linux,phandle", and "ibm,phandle" properties from
>> the internal device tree.  The phandle will still be in the struct
>> device_node phandle field and will still be displayed as if it is
>> a property in /proc/device_tree.
>>
>> This is to resolve the issue found by Stephen Boyd [1] when he changed
>> the type of struct property.value from void * to const void *.  As
>> a result of the type change, the overlay code had compile errors
>> where the resolver updates phandle values.
>>
>>   [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
>>
>> - Add sysfs infrastructure to report np->phandle, as if it was a property.
>> - Do not create "phandle" "ibm,phandle", and "linux,phandle" properties
>>   in the expanded device tree.
>> - Remove phandle properties in of_attach_node(), for nodes dynamically
>>   attached to the live tree.  Add the phandle sysfs entry for these nodes.
>> - When creating an overlay changeset, duplicate the node phandle in
>>   __of_node_dup().
>> - Remove no longer needed checks to exclude "phandle" and "linux,phandle"
>>   properties in several locations.
>> - A side effect of these changes is that the obsolete "linux,phandle" and
>>   "ibm,phandle" properties will no longer appear in /proc/device-tree (they
>>   will appear as "phandle").
> 
> Sorry but I don't think that can work for us.
> 
> Our DLPAR (ie. CPU/memory/device hotplug) stuff on PowerVM uses
> "ibm,phandle", and it's not the same thing as "phandle" /
> "linux,phandle".
> 
> I don't know the code well myself, but the spec (PAPR) says:

This is the LoPAPR, section 2.1.4, R1-2.1.4-3

   https://members.openpowerfoundation.org/document/dl/469


>   Note: If the “ibm,phandle” property exists, there are two “phandle”
>   namespaces which must be kept separate. One is that actually used by
>   the OF client interface, the other is properties in the device tree
>   making reference to device tree nodes. These requirements are written
>   to maintain backward compatibility with older FW versions predating
>   these requirements; if the “ibm,phandle” property is not present, the
>   OS may assume that any device tree properties which refer to this node
>   will have a phandle value matching that returned by client interface
>   services.
> 
> I have systems here that still use "ibm,phandle". I also see at least
> some of the userspace code that looks for "ibm,phandle", and nothing
> else.
> 
> The note above actually implies that the current Linux code is wrong,
> when it uses "ibm,phandle" as the value of np->phandle.

My interpretation of the LoPAPR R1-2.1.4-1 and R1-2.1.4-2 is that the
ibm,phandle property is the node's phandle value that other nodes may
refer to.  Thus this is the value that should be placed in np->phandle,
which is the value that will be used to find a node based on its
phandle value.  Which is the way the drivers/of/fdt.c currently works:

/* We accept flattened tree phandles either in
 * ePAPR-style "phandle" properties, or the
 * legacy "linux,phandle" properties.  If both
 * appear and have different values, things
 * will get weird. Don't do that.
 */
if (!strcmp(pname, "phandle") ||
!strcmp(pname, "linux,phandle")) {
if (!np->phandle)
np->phandle = be32_to_cpup(val);
}

/* And we process the "ibm,phandle" property
 * used in pSeries dynamic device tree
 * stuff
 */
if (!strcmp(pname, "ibm,phandle"))
np->phandle = be32_to_cpup(val);

My interpretation of R1-2.1.4-1 through R1-2.1.4-3 is that the
"ibm,phandle" property is relevant to the contents of the Linux
kernel device tree and that the "phandle returned by a client
interface service" is not relevant to the Linux kernel device
tree.  I would not expect the powerpc code to expose the
device tree code to a "phandle returned by a client
interface service".  Is that correct?

The current code which chooses which value potentially ends up in
np->phandle seems to involve a little bit of cargo cult coding.
This code has been adapted and combined from several locations,
see commits:

   dfbd4c6eff35f1b1065cca046003cc9d7ff27222

then earlier:

   04b954a673dd02f585a2769c4945a43880faa989
   6016a363f6b56b46b24655bcfc0499b715851cf3
   e6a6928c3ea1d0195ed75a091e345696b916c09b
   bbd33931a08362f78266a4016211a35947b91041

I would like for the code that sets the value of np->phandle
to simply say:

   if name is "phandle", "linux,phandle", or "ibm,phandle" then
  np->phandle = the value

Does anyone know if the additional logic in the current code is

Re: [next-20170609] WARNING: CPU: 3 PID: 71167 at lib/idr.c:157 idr_replace

2017-06-21 Thread Chris Wilson
Quoting Tejun Heo (2017-06-13 14:58:49)
> Cc'ing David Airlie.
> 
> This is from drm driver calling in idr_replace() w/ a negative id.
> Probably a silly bug in error handling path?

No, this is the validation of an invalid userspace handle. The drm ABI
for handles is supposed to be a full u32 range with 0 reserved for an
invalid handle (constrained by the idr_alloc ofc). The WARN was
introduced by 0a835c4f090a.
-Chris


Re: [PATCH v6 2/4] of: make __of_attach_node() static

2017-06-21 Thread Stephen Boyd
On 06/20, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> __of_attach_node() is not used outside of drivers/of/dynamic.c.  Make
> it static and remove it from drivers/of/of_private.h.
> 
> Signed-off-by: Frank Rowand 
> ---

Reviewed-by: Stephen Boyd 

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: Network TX Stall on 440EP Processor

2017-06-21 Thread Benjamin Herrenschmidt
On Tue, 2017-06-20 at 14:17 -0700, Thomas Besemer wrote:
> I'm working on a project that is derived from the Yosemite
> PPC 440EP board.  It's a legacy project that was running the
> 2.6.24 Kernel, and network traffic was stalling due to transmission
> halting without an understandable error (in this error condition, the various
> status registers of network interface showed no issues), other 
> than TX stalling due to Buffer Descriptor Ring becoming full.

This is my emac driver? I haven't looked at (or touched) that thing in
eons :-)

Cheers,
Ben.

> In order to see if the problem has been resolved, the Kernel
> has been updated to 4.9.13, compiled with gcc version 5.4.0
> (Buildroot 2017.02.2).  Although the frequency of the
> problem is decreased, it still does show up.
> 
> The test case is the Linux Target running idle, no application
> code.  From a Linux host on a directly connected network, 30
> flood pings are started.  After a period of several minutes to
> perhaps hours, the transmit aspect of the network controller
> ceases to transmit packets (Buffer Descriptor ring becomes full). 
> RX still works.  In the 2.6.24 Kernel, the problem happens
> within seconds, so it has improved with the new Kernel.
> 
> Below is the output from the Kernel when this happens.
> 
> Has anybody seen this problem before?  I can't find any
> errata on it, nor can I find any reports of it.
> 
> The orginal problem is rooted in the Embedded Application
> running, and after a period of time of heavy network
> traffic, the TX side of network stalls.  The flood ping
> test is used simply to force the problem to happen.
> 
> [ 3127.143572] NETDEV WATCHDOG: eth0 (emac): transmit queue 0 timed out
> [ 3127.150172] [ cut here ]
> [ 3127.154778] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 
> dev_watchdog+0x23c/0x244
> [ 3127.162965] Modules linked in:
> [ 3127.166013] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.13 #9
> [ 3127.171707] task: c0e67300 task.stack: c0f0
> [ 3127.176192] NIP: c068e734 LR: c068e734 CTR: c04672f4
> [ 3127.181107] REGS: c0f01c90 TRAP: 0700   Not tainted  (4.9.13)
> [ 3127.186793] MSR: 00029000 [ 3127.190241]   CR: 2812  XER: 
> 
> [ 3127.194210]
> GPR00: c068e734 c0f01d40 c0e67300 0038 d1006301 00df c04683e4 00df
> GPR08: 00df c0eff4b0 c0eff4b0 0004 24122424 00b960f0  c0e8
> GPR16: 000ac8c1 c07b8618 c098bddc c0e69000 000a c0ee c0e73f20 c0f0
> GPR24: c100e4e8 c0ee c0e77d60 c3128000 c068e4f8 c0e8  c3128000
> NIP [c068e734] dev_watchdog+0x23c/0x244
> [ 3127.227680] LR [c068e734] dev_watchdog+0x23c/0x244
> [ 3127.232427] Call Trace:
> [ 3127.234857] [c0f01d40] [c068e734] dev_watchdog+0x23c/0x244 (unreliable)
> [ 3127.241447] [c0f01d60] [c00805e8] call_timer_fn+0x40/0x118
> [ 3127.246889] [c0f01d80] [c00808e8] expire_timers.isra.13+0xbc/0x114
> [ 3127.253032] [c0f01db0] [c0080a94] run_timer_softirq+0x90/0xf0
> [ 3127.258753] [c0f01e00] [c07b31b4] __do_softirq+0x114/0x2b0
> [ 3127.264202] [c0f01e60] [c002a158] irq_exit+0xe8/0xec
> [ 3127.269144] [c0f01e70] [c0008c98] timer_interrupt+0x34/0x4c
> [ 3127.274684] [c0f01e80] [c000ec94] ret_from_except+0x0/0x18
> [ 3127.280151] --- interrupt: 901 at cpm_idle+0x3c/0x70
> [ 3127.280151]     LR = arch_cpu_idle+0x30/0x68
> [ 3127.289300] [c0f01f40] [c0f058e4] cpu_idle_force_poll+0x0/0x4 (unreliable)
> [ 3127.296146] [c0f01f50] [c00073e4] arch_cpu_idle+0x30/0x68
> [ 3127.301509] [c0f01f60] [c005bce8] cpu_startup_entry+0x184/0x1bc
> [ 3127.307392] [c0f01fb0] [c0a76a1c] start_kernel+0x3d4/0x3e8
> [ 3127.312843] [c0f01ff0] [c0b4] _start+0xb4/0xf8
> [ 3127.317599] Instruction dump:
> [ 3127.320557] 811f0284 4b78 3921 7fe3fb78 99281966 4bfd9cd5 7c651b78 
> 3c60c0a1
> [ 3127.328359] 7fc6f378 7fe4fb78 3863357c 48125319 <0fe0> 4bb8 
> 7c0802a6 90010004
> [ 3127.336327] ---[ end trace c31dfe4772ff0e8f ]---
> 



Re: [next-20170609] WARNING: CPU: 3 PID: 71167 at lib/idr.c:157 idr_replace

2017-06-21 Thread Dave Airlie

Cc'ing dri-devel.

Dave.

On Tue, 13 Jun 2017, Tejun Heo wrote:

> Cc'ing David Airlie.
> 
> This is from drm driver calling in idr_replace() w/ a negative id.
> Probably a silly bug in error handling path?
> 
> Thanks.
> 
> On Mon, Jun 12, 2017 at 08:10:54PM +0530, Abdul Haleem wrote:
> > Hi,
> > 
> > WARN_ON_ONCE is being called from idr_replace() function in file
> > lib/idr.c at line 157
> > 
> > struct radix_tree_node *node;
> > void __rcu **slot = NULL;
> > void *entry;
> > 
> > if (WARN_ON_ONCE(id < 0))
> > return ERR_PTR(-EINVAL);
> > if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr)))
> > return ERR_PTR(-EINVAL);
> > 
> > entry = __radix_tree_lookup(&idr->idr_rt, id, &node, &slot);
> > 
> > 
> > Test: Trinity (https://github.com/kernelslacker/trinity)
> > Machine : Power 8 PowerVM LPAR
> > Kernel : 4.12.0-rc4-next-20170606
> > gcc : version 5.2.1
> > config : attached
> > 
> > trace logs:
> > [ cut here ]
> > WARNING: CPU: 3 PID: 71167 at lib/idr.c:157 idr_replace+0x100/0x110
> > Modules linked in: xts(E) ip_set(E) ipmi_powernv(E) ipmi_devintf(E)
> > shpchp(E) ibmpowernv(E) ofpart(E) uio_pdrv_genirq(E) sg(E) ses(E)
> > at24(E) tg3(E) bnx2x(E) ahci(E) loop(E) xt_CHECKSUM(E) ipt_MASQUERADE(E)
> > nf_nat_masquerade_ipv4(E) tun(E) kvm_hv(E) kvm_pr(E) kvm(E)
> > ip6t_rpfilter(E) ipt_REJECT(E) nf_reject_ipv4(E) ip6t_REJECT(E)
> > nf_reject_ipv6(E) xt_conntrack(E) nfnetlink(E) ebtable_nat(E)
> > ebtable_broute(E) bridge(E) stp(E) llc(E) ip6table_nat(E)
> > nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_nat_ipv6(E) ip6table_mangle(E)
> > ip6table_security(E) ip6table_raw(E) iptable_nat(E) nf_conntrack_ipv4(E)
> > nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E)
> > iptable_mangle(E) iptable_security(E) iptable_raw(E) ebtable_filter(E)
> > ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E)
> > i2c_dev(E)
> > [29316.280682]  ghash_generic(E) gf128mul(E) vmx_crypto(E) enclosure(E)
> > scsi_transport_sas(E) nvmem_core(E) opal_prd(E) ipmi_msghandler(E)
> > powernv_rng(E) powernv_flash(E) uio(E) rtc_opal(E) mtd(E) i2c_opal(E)
> > nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E)
> > ip_tables(E) ext4(E) jbd2(E) fscrypto(E) mbcache(E) sd_mod(E) mdio(E)
> > libcrc32c(E) ptp(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E)
> > syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E)
> > aacraid(E) libahci(E) libata(E) i2c_core(E) pps_core(E) dm_mirror(E)
> > dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: xts]
> > CPU: 3 PID: 71167 Comm: trinity-c43 Tainted: GE
> > 4.12.0-rc4-next-20170609-autotest #1
> > task: c03bd0799500 task.stack: c011e81f
> > NIP: c04d20a0 LR: dfc16a98 CTR: c04d1fa0
> > REGS: c011e81f38d0 TRAP: 0700   Tainted: GE
> > (4.12.0-rc4-next-20170609-autotest)
> > MSR: 90029033 
> >   CR: 28002428  XER: 2000  
> > CFAR: c04d1fd4 SOFTE: 1 
> > GPR00: dfc16a98 c011e81f3b50 c106d800 c0334c89de38 
> > GPR04:  d7d7d7d7 d7d7d7d7  
> > GPR08: c011e81f4000  8003 dfc47760 
> > GPR12: c04d1fa0 cfac1f80  10030d70 
> > GPR16: 10030f38  dfc17150 0008 
> > GPR20: dfc4f4e0 7fff7996 0009  
> > GPR24:  c011e81f3c50 0008 dfc61958 
> > GPR28: c0334c89de50 c0334c89de38 d7d7d7d7 d7d7d7d7 
> > NIP [c04d20a0] idr_replace+0x100/0x110
> > LR [dfc16a98] drm_gem_handle_delete+0x58/0x120 [drm]
> > Call Trace:
> > [c011e81f3b50] [c011e81f3bf0] 0xc011e81f3bf0 (unreliable)
> > [c011e81f3ba0] [dfc16a98] drm_gem_handle_delete+0x58/0x120 [drm]
> > [c011e81f3bf0] [dfc17e80] drm_ioctl+0x270/0x4e0 [drm]
> > [c011e81f3d40] [c0344108] do_vfs_ioctl+0xc8/0x8c0
> > [c011e81f3de0] [c03449c4] SyS_ioctl+0xc4/0xe0
> > [c011e81f3e30] [c000af84] system_call+0x38/0xe0
> > Instruction dump:
> > 38210050 7f83e378 e8010010 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6
> > 4e800020 0fe0 3860ffea 4b94 <0fe0> 3860ffea 4b88 6042
> > ---[ end trace 5158244f52496ab9 ]---
> > _exception: 47 callbacks suppressed
> > 
> > 
> > -- 
> > Regard's
> > 
> > Abdul Haleem
> > IBM Linux Technology Centre
> > 
> > 
> 
> > #
> > # Automatically generated file; DO NOT EDIT.
> > # Linux/powerpc 4.11.0-rc7 Kernel Configuration
> > #
> > CONFIG_PPC64=y
> > 
> > #
> > # Processor support
> > #
> > CONFIG_PPC_BOOK3S_64=y
> > # CONFIG_PPC_BOOK3E_64 is not set
> > # CONFIG_POWER7_CPU is not set
> > CONFIG_POWER8_CPU=y
> > CONFIG_PPC_BOOK3S=y
> > CONFIG_PPC_FPU=y
> > CONFIG_ALTIVEC=y
> > CONFIG_VSX=y
> > # CONFIG_PPC_ICSWX is not set
> > CONFIG_PPC_STD_MMU=y
> > 

Re: Network TX Stall on 440EP Processor

2017-06-21 Thread Thomas Besemer
Hi Michael -

>
> Thomas Besemer  writes:
> > I'm working on a project that is derived from the Yosemite
> > PPC 440EP board.  It's a legacy project that was running the
> > 2.6.24 Kernel, and network traffic was stalling due to transmission
> > halting without an understandable error (in this error condition, the
> > various
> > status registers of network interface showed no issues), other
> > than TX stalling due to Buffer Descriptor Ring becoming full.
>
> I'm not really familiar with these boards, and I'm not a network guy
> either, so hopefully someone else will have some ideas :)
>
> This is the EMAC driver you're using, which is old but still used so
> shouldn't have completely bit rotted.
>
> I think the "Buffer Descriptor Ring becoming full" indicates the
> hardware has stopped sending packets that the kernel has put in the
> ring?
>
> So did the driver get the ring handling wrong somehow and the device
> thinks the ring is empty but we think it's full?
>

Thanks for the feedback.  I'm continuing to look into it, but I should add
to this discussion that when TX stalls, the Ready bit (bit 0) is set in the
TX Status/Control field of all the Buffer Descriptors.  This is what is
perplexing, as TX is enabled, and all BD's are marked as having
valid data.

I've looked to see if there are PLB errors, but cannot see any, and the
MAL/EMAC registers all seem valid.  It simply appears that it stops
sending data for no reason.


Re: clean up and modularize arch dma_mapping interface V2

2017-06-21 Thread tndave



On 06/16/2017 11:10 AM, Christoph Hellwig wrote:

Hi all,

for a while we have a generic implementation of the dma mapping routines
that call into per-arch or per-device operations.  But right now there
still are various bits in the interfaces where don't clearly operate
on these ops.  This series tries to clean up a lot of those (but not all
yet, but the series is big enough).  It gets rid of the DMA_ERROR_CODE
way of signaling failures of the mapping routines from the
implementations to the generic code (and cleans up various drivers that
were incorrectly using it), and gets rid of the ->set_dma_mask routine
in favor of relying on the ->dma_capable method that can be used in
the same way, but which requires less code duplication.

Chris,

Thanks for doing this.
So archs can still have their own definition for dma_set_mask() if 
HAVE_ARCH_DMA_SET_MASK is y?
(and similarly for dma_set_coherent_mask() when 
CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK is y)

Any plan to change these?

I'm in the process of making some changes to the SPARC iommu, so it
would be good to know. Thanks.


-Tushar



I've got a good number of reviews last time, but a few are still missing.
I'd love to not have to re-spam everyone with this patchbomb, so early
ACKs (or complaints) are welcome.

I plan to create a new dma-mapping tree to collect all this work.
Any volunteers for co-maintainers, especially from the iommu gang?

The whole series is also available in git:

 git://git.infradead.org/users/hch/misc.git dma-map

Gitweb:

 http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-map

Changes since V1:
  - remove two lines of code from arm dmabounce
  - a few commit message tweaks
  - lots of ACKs



Re: [RFC v2 02/12] powerpc: Free up four 64K PTE bits in 64K backed hpte pages.

2017-06-21 Thread Ram Pai
On Wed, Jun 21, 2017 at 12:24:34PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai  writes:
> 
> 
> 
> > diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c 
> > b/arch/powerpc/mm/hugetlbpage-hash64.c
> > index a84bb44..239ca86 100644
> > --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> > +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> > @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long 
> > access, unsigned long vsid,
> >  pte_t *ptep, unsigned long trap, unsigned long flags,
> >  int ssize, unsigned int shift, unsigned int mmu_psize)
> >  {
> > +   real_pte_t rpte;
> > unsigned long vpn;
> > unsigned long old_pte, new_pte;
> > unsigned long rflags, pa, sz;
> > @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long 
> > access, unsigned long vsid,
> > } while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
> >
> > rflags = htab_convert_pte_flags(new_pte);
> > +   rpte = __real_pte(__pte(old_pte), ptep);
> >
> > sz = ((1UL) << shift);
> > if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -71,15 +73,10 @@ int __hash_page_huge(unsigned long ea, unsigned long 
> > access, unsigned long vsid,
> > /* Check if pte already has an hpte (case 2) */
> > if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> > /* There MIGHT be an HPTE for this pte */
> > -   unsigned long hash, slot;
> > +   unsigned long gslot;
> >
> > -   hash = hpt_hash(vpn, shift, ssize);
> > -   if (old_pte & H_PAGE_F_SECOND)
> > -   hash = ~hash;
> > -   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > -   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > -
> > -   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> > +   gslot = get_hidx_gslot(vpn, shift, ssize, rpte, 0);
> > +   if (mmu_hash_ops.hpte_updatepp(gslot, rflags, vpn, mmu_psize,
> >mmu_psize, ssize, flags) == -1)
> > old_pte &= ~_PAGE_HPTEFLAGS;
> > }
> > @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long 
> > access, unsigned long vsid,
> > return -1;
> > }
> >
> > -   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > -   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +   new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> 
> We don't really need rpte here. We are just need to track one entry
> here. May be it becomes simpler if use different helpers for 4k hpte and
> others ?

Actually, we do need rpte here: the hidx for these 64K-hpte-backed PTEs
is now stored in the second half of the pte.
I have abstracted the helpers so that the caller need not
know the location of the hidx. It comes in really handy.

RP



Re: [PATCH v6 1/4] of: remove *phandle properties from expanded device tree

2017-06-21 Thread Frank Rowand
On 06/20/17 23:18, Frank Rowand wrote:
> Hi Rob,
> 
> Michael has an issue that means this patch series is not OK in the
> current form.  I will work on a v7 to see if I can resolve the
> issue.
> 
> -Frank

< snip >

Hi Rob,

The issue is in patch 1.  Patches 2 - 4 are small independent patches
that are not dependent on patch 1, so I just sent them as individual
patches.  Version 7 of this series will be just patch 1.

-Frank


Re: [PATCH] powerpc: Convert VDSO update function to use new update_vsyscall interface

2017-06-21 Thread John Stultz
On Sat, May 27, 2017 at 1:04 AM, Paul Mackerras  wrote:
> This converts the powerpc VDSO time update function to use the new
> interface introduced in commit 576094b7f0aa ("time: Introduce new
> GENERIC_TIME_VSYSCALL", 2012-09-11).  Where the old interface gave
> us the time as of the last update in seconds and whole nanoseconds,
> with the new interface we get the nanoseconds part effectively in
> a binary fixed-point format with tk->tkr_mono.shift bits to the
> right of the binary point.
>
> With the old interface, the fractional nanoseconds got truncated,
> meaning that the value returned by the VDSO clock_gettime function
> would have about 1ns of jitter in it compared to the value computed
> by the generic timekeeping code in the kernel.
>
> The powerpc VDSO time functions (clock_gettime and gettimeofday)
> already work in units of 2^-32 seconds, or 0.23283 ns, because that
> makes it simple to split the result into seconds and fractional
> seconds, and represent the fractional seconds in either microseconds
> or nanoseconds.  This is good enough accuracy for now, so this patch
> avoids changing how the VDSO works or the interface in the VDSO data
> page.
>
> This patch converts the powerpc update_vsyscall_old to be called
> update_vsyscall and use the new interface.  We convert the fractional
> second to units of 2^-32 seconds without truncating to whole nanoseconds.
> (There is still a conversion to whole nanoseconds for any legacy users
> of the vdso_data/systemcfg stamp_xtime field.)
>
> In addition, this improves the accuracy of the computation of tb_to_xs
> for those systems with high-frequency timebase clocks (>= 268.5 MHz)
> by doing the right shift in two parts, one before the multiplication and
> one after, rather than doing the right shift before the multiplication.
> (We can't do all of the right shift after the multiplication unless we
> use 128-bit arithmetic.)
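The conversion described above can be sketched in a few lines. This is an illustrative userspace model (the function name is made up, and the kernel avoids the 128-bit division by using shifts and a precomputed multiplier instead):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a timekeeper-style fractional nanosecond count, i.e. a value
 * in units of 2^-shift ns (like tk->tkr_mono's xtime_nsec), into units
 * of 2^-32 seconds without first truncating to whole nanoseconds. */
static uint64_t ns_fixed_to_frac32(uint64_t xtime_nsec, unsigned int shift)
{
	/* frac32 = ns * 2^32 / 10^9, with ns = xtime_nsec / 2^shift.
	 * Divide before the final right shift so the sub-nanosecond
	 * bits survive into the result. */
	unsigned __int128 t = (unsigned __int128)xtime_nsec << 32;

	return (uint64_t)((t / 1000000000u) >> shift);
}
```

For example, half a second (ns = 500000000) with shift = 8 comes out as exactly 2^31, i.e. half of the 2^32-per-second scale.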
>
> Signed-off-by: Paul Mackerras 

Apologies again for missing this earlier.

So no objections from me. I can't say I really worked the whole thing
out, but you're handling the xtime_nsec field properly and the rest
looks reasonable and is well documented.

So for what it's worth:
Acked-by: John Stultz 

Thanks again for making this update!
-john


[PATCH 2/2] selftests/ftrace: Update multiple kprobes test for powerpc

2017-06-21 Thread Naveen N. Rao
KPROBES_ON_FTRACE is only available on powerpc64le. Update comment to
clarify this.

Also, we should use an offset of 8 to ensure that the probe does not
fall on the ftrace location. The current offset of 4 will fall before the
function's local entry point and won't fire, while an offset of 12 or 16
will fall on the ftrace location. Offset 8 is currently guaranteed not to
be the ftrace location.

Finally, do not filter out symbols with a dot. powerpc ELFv1 uses a dot
prefix for all functions, and this prevents us from testing some of those
symbols. Furthermore, with the patch to derive event names properly in
the presence of ':' and '.', such names are accepted by kprobe_events
and constitute a good test for those symbols.

Signed-off-by: Naveen N. Rao 
---
 tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc 
b/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc
index f4d1ff785d67..d209c071b2c0 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc
@@ -2,16 +2,16 @@
 # description: Register/unregister many kprobe events
 
 # ftrace fentry skip size depends on the machine architecture.
-# Currently HAVE_KPROBES_ON_FTRACE defined on x86 and powerpc
+# Currently HAVE_KPROBES_ON_FTRACE defined on x86 and powerpc64le
 case `uname -m` in
   x86_64|i[3456]86) OFFS=5;;
-  ppc*) OFFS=4;;
+  ppc64le) OFFS=8;;
   *) OFFS=0;;
 esac
 
 echo "Setup up to 256 kprobes"
-grep t /proc/kallsyms | cut -f3 -d" " | grep -v .*\\..* | \
-head -n 256 | while read i; do echo p ${i}+${OFFS} ; done > kprobe_events ||:
+grep t /proc/kallsyms | cut -f3 -d" " | head -n 256 | \
+while read i; do echo p ${i}+${OFFS} ; done > kprobe_events ||:
 
 echo 1 > events/kprobes/enable
 echo 0 > events/kprobes/enable
-- 
2.13.1



[PATCH 1/2] trace/kprobes: Sanitize derived event names

2017-06-21 Thread Naveen N. Rao
When we derive event names, convert some expected symbols (such as ':'
used to specify module:name and '.' present in some symbols) into
underscores so that the event name is not rejected.

Before this patch:
# echo 'p kobject_example:foo_store' > kprobe_events
trace_kprobe: Failed to allocate trace_probe.(-22)
-sh: write error: Invalid argument

After this patch:
# echo 'p kobject_example:foo_store' > kprobe_events
# cat kprobe_events
p:kprobes/p_kobject_example_foo_store_0 kobject_example:foo_store

Signed-off-by: Naveen N. Rao 
---
 kernel/trace/trace_kprobe.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c129fca6ec99..44fd819aa33d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -598,6 +598,14 @@ static struct notifier_block trace_kprobe_module_nb = {
.priority = 1   /* Invoked after kprobe module callback */
 };
 
+/* Convert certain expected symbols into '_' when generating event names */
+static inline void sanitize_event_name(char *name)
+{
+   while (*name++ != '\0')
+   if (*name == ':' || *name == '.')
+   *name = '_';
+}
+
 static int create_trace_kprobe(int argc, char **argv)
 {
/*
@@ -740,6 +748,7 @@ static int create_trace_kprobe(int argc, char **argv)
else
snprintf(buf, MAX_EVENT_NAME_LEN, "%c_0x%p",
 is_return ? 'r' : 'p', addr);
+   sanitize_event_name(buf);
event = buf;
}
tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive,
-- 
2.13.1



[PATCH 0/2] A couple of small updates/fixes for kprobes tracer

2017-06-21 Thread Naveen N. Rao
Two simple updates for kprobes tracer:
- the first patch is a convenience and allows probing module symbols
  as well as any dot symbols (necessary on powerpc64 ELFv1) without
  having to provide a name for the probepoint.
- the second patch updates the newly added multiple_kprobes.tc test
  case for powerpc.

Thanks,
Naveen

Naveen N. Rao (2):
  trace/kprobes: Sanitize derived event names
  selftests/ftrace: Update multiple kprobes test for powerpc

 kernel/trace/trace_kprobe.c  | 9 +
 tools/testing/selftests/ftrace/test.d/kprobe/multiple_kprobes.tc | 8 
 2 files changed, 13 insertions(+), 4 deletions(-)

-- 
2.13.1



[PATCH v3 6/6] powerpc/64s: Blacklist rtas entry/exit from kprobes

2017-06-21 Thread Naveen N. Rao
We can't take traps with relocation off, so blacklist enter_rtas() and
rtas_return_loc(). However, instead of blacklisting all of enter_rtas(),
introduce a new symbol __enter_rtas at the point beyond which we can't
take a trap, and blacklist that.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index d376f07153d7..49c35450f399 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -1076,6 +1076,8 @@ _GLOBAL(enter_rtas)
 rldicr  r9,r9,MSR_SF_LG,(63-MSR_SF_LG)
ori r9,r9,MSR_IR|MSR_DR|MSR_FE0|MSR_FE1|MSR_FP|MSR_RI|MSR_LE
andcr6,r0,r9
+
+__enter_rtas:
sync/* disable interrupts so SRR0/1 */
mtmsrd  r0  /* don't get trashed */
 
@@ -1112,6 +1114,8 @@ rtas_return_loc:
mtspr   SPRN_SRR1,r4
rfid
b   .   /* prevent speculative execution */
+_ASM_NOKPROBE_SYMBOL(__enter_rtas)
+_ASM_NOKPROBE_SYMBOL(rtas_return_loc)
 
.align  3
 1: .llong  rtas_restore_regs
-- 
2.13.1



[PATCH v3 5/6] powerpc/64s: Blacklist functions invoked on a trap

2017-06-21 Thread Naveen N. Rao
Blacklist all functions involved while handling a trap. We:
- convert some of the symbols into private symbols,
- remove the duplicate 'restore' symbol, and
- blacklist most functions involved while handling a trap.

Reviewed-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S   | 47 +---
 arch/powerpc/kernel/exceptions-64s.S |  2 ++
 arch/powerpc/kernel/traps.c  |  3 +++
 3 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index feeeadc9aa71..d376f07153d7 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -184,7 +184,7 @@ system_call:/* label this so stack 
traces look sane */
 #ifdef CONFIG_PPC_BOOK3S
/* No MSR:RI on BookE */
andi.   r10,r8,MSR_RI
-   beq-unrecov_restore
+   beq-.Lunrecov_restore
 #endif
/*
 * Disable interrupts so current_thread_info()->flags can't change,
@@ -424,6 +424,7 @@ _GLOBAL(save_nvgprs)
clrrdi  r0,r11,1
std r0,_TRAP(r1)
blr
+_ASM_NOKPROBE_SYMBOL(save_nvgprs);
 

 /*
@@ -672,18 +673,18 @@ _GLOBAL(ret_from_except_lite)
 * Use the internal debug mode bit to do this.
 */
andis.  r0,r3,DBCR0_IDM@h
-   beq restore
+   beq fast_exc_return_irq
mfmsr   r0
rlwinm  r0,r0,0,~MSR_DE /* Clear MSR.DE */
mtmsr   r0
mtspr   SPRN_DBCR0,r3
li  r10, -1
mtspr   SPRN_DBSR,r10
-   b   restore
+   b   fast_exc_return_irq
 #else
addir3,r1,STACK_FRAME_OVERHEAD
bl  restore_math
-   b   restore
+   b   fast_exc_return_irq
 #endif
 1: andi.   r0,r4,_TIF_NEED_RESCHED
beq 2f
@@ -696,7 +697,7 @@ _GLOBAL(ret_from_except_lite)
bne 3f  /* only restore TM if nothing else to do */
addir3,r1,STACK_FRAME_OVERHEAD
bl  restore_tm_state
-   b   restore
+   b   fast_exc_return_irq
 3:
 #endif
bl  save_nvgprs
@@ -748,14 +749,14 @@ resume_kernel:
 #ifdef CONFIG_PREEMPT
/* Check if we need to preempt */
andi.   r0,r4,_TIF_NEED_RESCHED
-   beq+restore
+   beq+fast_exc_return_irq
/* Check that preempt_count() == 0 and interrupts are enabled */
lwz r8,TI_PREEMPT(r9)
cmpwi   cr1,r8,0
ld  r0,SOFTE(r1)
cmpdi   r0,0
crandc  eq,cr1*4+eq,eq
-   bne restore
+   bne fast_exc_return_irq
 
/*
 * Here we are preempting the current task. We want to make
@@ -786,7 +787,6 @@ resume_kernel:
 
.globl  fast_exc_return_irq
 fast_exc_return_irq:
-restore:
/*
 * This is the main kernel exit path. First we check if we
 * are about to re-enable interrupts
@@ -794,11 +794,11 @@ restore:
ld  r5,SOFTE(r1)
lbz r6,PACASOFTIRQEN(r13)
cmpwi   cr0,r5,0
-   beq restore_irq_off
+   beq .Lrestore_irq_off
 
/* We are enabling, were we already enabled ? Yes, just return */
cmpwi   cr0,r6,1
-   beq cr0,do_restore
+   beq cr0,.Ldo_restore
 
/*
 * We are about to soft-enable interrupts (we are hard disabled
@@ -807,14 +807,14 @@ restore:
 */
lbz r0,PACAIRQHAPPENED(r13)
cmpwi   cr0,r0,0
-   bne-restore_check_irq_replay
+   bne-.Lrestore_check_irq_replay
 
/*
 * Get here when nothing happened while soft-disabled, just
 * soft-enable and move-on. We will hard-enable as a side
 * effect of rfi
 */
-restore_no_replay:
+.Lrestore_no_replay:
TRACE_ENABLE_INTS
li  r0,1
stb r0,PACASOFTIRQEN(r13);
@@ -822,7 +822,7 @@ restore_no_replay:
/*
 * Final return path. BookE is handled in a different file
 */
-do_restore:
+.Ldo_restore:
 #ifdef CONFIG_PPC_BOOK3E
b   exception_return_book3e
 #else
@@ -856,7 +856,7 @@ fast_exception_return:
REST_8GPRS(5, r1)
 
andi.   r0,r3,MSR_RI
-   beq-unrecov_restore
+   beq-.Lunrecov_restore
 
/* Load PPR from thread struct before we clear MSR:RI */
 BEGIN_FTR_SECTION
@@ -914,7 +914,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 * make sure that in this case, we also clear PACA_IRQ_HARD_DIS
 * or that bit can get out of sync and bad things will happen
 */
-restore_irq_off:
+.Lrestore_irq_off:
ld  r3,_MSR(r1)
lbz r7,PACAIRQHAPPENED(r13)
andi.   r0,r3,MSR_EE
@@ -924,13 +924,13 @@ restore_irq_off:
 1: li  r0,0
stb r0,PACASOFTIRQEN(r13);
TRACE_DISABLE_INTS
-   b   do_restore
+   b   .Ldo_restore
 
/*
 * Something did 

[PATCH v3 4/6] powerpc/64s: Un-blacklist system_call() from kprobes

2017-06-21 Thread Naveen N. Rao
It is actually safe to probe system_call() in entry_64.S, but only until
we unset MSR_RI. To allow this, add a new symbol system_call_exit()
after the mtmsrd and blacklist that. Though the mtmsrd instruction
itself is now whitelisted, we won't be allowed to probe on it since we
don't allow probing on rfi and mtmsr instructions (checked for in
arch_prepare_kprobe()).

Suggested-by: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index ef8e6615b8ba..feeeadc9aa71 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -204,6 +204,7 @@ system_call:/* label this so stack 
traces look sane */
mtmsrd  r11,1
 #endif /* CONFIG_PPC_BOOK3E */
 
+system_call_exit:
ld  r9,TI_FLAGS(r12)
li  r11,-MAX_ERRNO
andi.   
r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
@@ -412,7 +413,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
b   .   /* prevent speculative execution */
 #endif
 _ASM_NOKPROBE_SYMBOL(system_call_common);
-_ASM_NOKPROBE_SYMBOL(system_call);
+_ASM_NOKPROBE_SYMBOL(system_call_exit);
 
 /* Save non-volatile GPRs, if not already saved. */
 _GLOBAL(save_nvgprs)
-- 
2.13.1



[PATCH v3 3/6] powerpc/64s: Blacklist system_call() and system_call_common() from kprobes

2017-06-21 Thread Naveen N. Rao
Convert some of the symbols into private symbols and blacklist
system_call_common() and system_call() from kprobes. We can't take a
trap at parts of these functions as either MSR_RI is unset or the kernel
stack pointer is not yet setup.

Reviewed-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index da9486e2fd89..ef8e6615b8ba 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -52,12 +52,11 @@ exception_marker:
.section".text"
.align 7
 
-   .globl system_call_common
-system_call_common:
+_GLOBAL(system_call_common)
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 BEGIN_FTR_SECTION
extrdi. r10, r12, 1, (63-MSR_TS_T_LG) /* transaction active? */
-   bne tabort_syscall
+   bne .Ltabort_syscall
 END_FTR_SECTION_IFSET(CPU_FTR_TM)
 #endif
andi.   r10,r12,MSR_PR
@@ -152,9 +151,9 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
CURRENT_THREAD_INFO(r11, r1)
ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
-   bne syscall_dotrace /* does not return */
+   bne .Lsyscall_dotrace   /* does not return */
cmpldi  0,r0,NR_syscalls
-   bge-syscall_enosys
+   bge-.Lsyscall_enosys
 
 system_call:   /* label this so stack traces look sane */
 /*
@@ -208,7 +207,7 @@ system_call:/* label this so stack 
traces look sane */
ld  r9,TI_FLAGS(r12)
li  r11,-MAX_ERRNO
andi.   
r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
-   bne-syscall_exit_work
+   bne-.Lsyscall_exit_work
 
/* If MSR_FP and MSR_VEC are set in user msr, then no need to restore */
li  r7,MSR_FP
@@ -217,12 +216,12 @@ system_call:  /* label this so stack 
traces look sane */
 #endif
and r0,r8,r7
cmpdr0,r7
-   bne syscall_restore_math
+   bne .Lsyscall_restore_math
 .Lsyscall_restore_math_cont:
 
cmpld   r3,r11
ld  r5,_CCR(r1)
-   bge-syscall_error
+   bge-.Lsyscall_error
 .Lsyscall_error_cont:
ld  r7,_NIP(r1)
 BEGIN_FTR_SECTION
@@ -248,13 +247,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
RFI
b   .   /* prevent speculative execution */
 
-syscall_error: 
+.Lsyscall_error:
orisr5,r5,0x1000/* Set SO bit in CR */
neg r3,r3
std r5,_CCR(r1)
b   .Lsyscall_error_cont
 
-syscall_restore_math:
+.Lsyscall_restore_math:
/*
 * Some initial tests from restore_math to avoid the heavyweight
 * C code entry and MSR manipulations.
@@ -289,7 +288,7 @@ syscall_restore_math:
b   .Lsyscall_restore_math_cont
 
 /* Traced system call support */
-syscall_dotrace:
+.Lsyscall_dotrace:
bl  save_nvgprs
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_syscall_trace_enter
@@ -322,11 +321,11 @@ syscall_dotrace:
b   .Lsyscall_exit
 
 
-syscall_enosys:
+.Lsyscall_enosys:
li  r3,-ENOSYS
b   .Lsyscall_exit

-syscall_exit_work:
+.Lsyscall_exit_work:
 #ifdef CONFIG_PPC_BOOK3S
li  r10,MSR_RI
mtmsrd  r10,1   /* Restore RI */
@@ -386,7 +385,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
b   ret_from_except
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-tabort_syscall:
+.Ltabort_syscall:
/* Firstly we need to enable TM in the kernel */
mfmsr   r10
li  r9, 1
@@ -412,6 +411,8 @@ tabort_syscall:
rfid
b   .   /* prevent speculative execution */
 #endif
+_ASM_NOKPROBE_SYMBOL(system_call_common);
+_ASM_NOKPROBE_SYMBOL(system_call);
 
 /* Save non-volatile GPRs, if not already saved. */
 _GLOBAL(save_nvgprs)
-- 
2.13.1



[PATCH v3 2/6] powerpc/64s: Convert .L__replay_interrupt_return to a local label

2017-06-21 Thread Naveen N. Rao
Commit b48bbb82e2b835 ("powerpc/64s: Don't unbalance the return branch
predictor in __replay_interrupt()") introduced the __replay_interrupt_return
symbol with a '.L' prefix in hopes of keeping it private. However, due to
the use of LOAD_REG_ADDR(), the assembler keeps the symbol visible. Fix
this by using the local label '1' instead.

Fixes: b48bbb82e2b835 ("powerpc/64s: Don't unbalance the return branch predictor in __replay_interrupt()")
Suggested-by: Nicholas Piggin 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/exceptions-64s.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 07b79c2c70f8..2df6d7b3070f 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1629,7 +1629,7 @@ _GLOBAL(__replay_interrupt)
 * we don't give a damn about, so we don't bother storing them.
 */
mfmsr   r12
-   LOAD_REG_ADDR(r11, .L__replay_interrupt_return)
+   LOAD_REG_ADDR(r11, 1f)
mfcrr9
ori r12,r12,MSR_EE
cmpwi   r3,0x900
@@ -1647,6 +1647,6 @@ FTR_SECTION_ELSE
cmpwi   r3,0xa00
beq doorbell_super_common_msgclr
 ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
-.L__replay_interrupt_return:
+1:
blr
 
-- 
2.13.1



[PATCH v3 1/6] powerpc64/elfv1: Validate function pointer address in the function descriptor

2017-06-21 Thread Naveen N. Rao
Currently, we assume that the function pointer we receive in
ppc_function_entry() points to a function descriptor. However, this is
not always the case. In particular, assembly symbols without the right
annotation do not have an associated function descriptor. Some of these
symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().
When such addresses are subsequently processed through
arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
below errors during bootup:
[0.663963] Failed to find blacklist at 7d9b02a648029b6c
[0.663970] Failed to find blacklist at a14d03d0394a0001
[0.663972] Failed to find blacklist at 7d5302a6f94d0388
[0.663973] Failed to find blacklist at 48027d11e8610178
[0.663974] Failed to find blacklist at f8010070f8410080
[0.663976] Failed to find blacklist at 386100704801f89d
[0.663977] Failed to find blacklist at 7d5302a6f94d00b0

Fix this by checking whether the address in the function descriptor is
actually a valid kernel text address. For assembly symbols, this check
will almost always fail, since the dereferenced value is powerpc
instructions rather than an address. In that case, return the pointer we
received as-is, rather than the dereferenced value.
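To make the mechanism concrete, here is a rough userspace model of the check. The address range standing in for kernel_text_address() is invented for this sketch; func_descr_t mirrors the ELFv1 descriptor layout:

```c
#include <assert.h>
#include <stdint.h>

/* ELFv1 (ABIv1) function descriptor: a function pointer refers to this
 * record in the .opd section, not to the code itself. */
typedef struct {
	uint64_t entry;	/* address of the function text */
	uint64_t toc;	/* TOC base (r2) for the function */
	uint64_t env;	/* environment pointer (unused by C) */
} func_descr_t;

/* Stand-in for kernel_text_address(); the range is made up. */
static int text_address(uint64_t addr)
{
	return addr >= 0xc000000000000000ULL &&
	       addr < 0xc000000002000000ULL;
}

/* Model of the patched logic: dereference only if the first doubleword
 * looks like a text address. For a raw assembly symbol, that doubleword
 * is instruction words, so the check fails and we return the pointer
 * unchanged. */
static uint64_t function_entry(void *func)
{
	func_descr_t *fd = func;

	if (text_address(fd->entry))
		return fd->entry;
	return (uint64_t)(uintptr_t)func;
}
```

A real descriptor passes the check and yields its entry address; a buffer of instruction words fails it and is returned as-is.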

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/code-patching.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index abef812de7f8..ec54050be585 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -83,8 +83,16 @@ static inline unsigned long ppc_function_entry(void *func)
 * On PPC64 ABIv1 the function pointer actually points to the
 * function's descriptor. The first entry in the descriptor is the
 * address of the function text.
+*
+* However, we may have received a pointer to an assembly symbol
+* that may not be a function descriptor. Validate that the entry
+* points to a valid kernel address and if not, return the pointer
+* we received as is.
 */
-   return ((func_descr_t *)func)->entry;
+   if (kernel_text_address(((func_descr_t *)func)->entry))
+   return ((func_descr_t *)func)->entry;
+   else
+   return (unsigned long)func;
 #else
return (unsigned long)func;
 #endif
-- 
2.13.1



[PATCH v3 0/6] powerpc: build out kprobes blacklist -- series 3

2017-06-21 Thread Naveen N. Rao
This is the third in the series of patches to build out an appropriate
kprobes blacklist for powerpc. Since posting the second series (*),
there have been related changes to the code and I have brought that
series forward to account for those changes. As such, all patches from
the second series are included in this patchset.

This patchset now ensures that the newly added multiple kprobes test in
the ftrace testsuite passes on powerpc64. Tested on both ELFv1 and
ELFv2.

Changes since series 2 v2:
  - Patches 1, 2 and 6 are new.
  - Patch 3 now additionally converts syscall_restore_math() to a local
symbol.
  - Patch 5 additionally blacklists __replay_interrupt.

(*)
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg117562.html


- Naveen


Naveen N. Rao (6):
  powerpc64/elfv1: Validate function pointer address in the function
descriptor
  powerpc/64s: Convert .L__replay_interrupt_return to a local label
  powerpc/64s: Blacklist system_call() and system_call_common() from
kprobes
  powerpc/64s: Un-blacklist system_call() from kprobes
  powerpc/64s: Blacklist functions invoked on a trap
  powerpc/64s: Blacklist rtas entry/exit from kprobes

 arch/powerpc/include/asm/code-patching.h | 10 +++-
 arch/powerpc/kernel/entry_64.S   | 81 ++--
 arch/powerpc/kernel/exceptions-64s.S |  6 ++-
 arch/powerpc/kernel/traps.c  |  3 ++
 4 files changed, 63 insertions(+), 37 deletions(-)

-- 
2.13.1



Re: [GIT PULL 00/25] perf/core improvements and fixes

2017-06-21 Thread Ingo Molnar

* Arnaldo Carvalho de Melo <a...@kernel.org> wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> The following changes since commit 007b811b4041989ec2dc91b9614aa2c41332723e:
> 
>   Merge tag 'perf-core-for-mingo-4.13-20170719' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-06-20 10:49:08 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.13-20170621
> 
> for you to fetch changes up to 701516ae3dec801084bc913d21e03fce15c61a0b:
> 
>   perf script: Fix message because field list option is -F not -f (2017-06-21 
> 11:35:53 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> New features:
> 
> - Add support to measure SMI cost in 'perf stat' (Kan Liang)
> 
> - Add support for unwinding callchains in powerpc with libdw (Paolo Bonzini)
> 
> Fixes:
> 
> - Fix message: cpu list option is -C not -c (Adrian Hunter)
> 
> - Fix 'perf script' message: field list option is -F not -f (Adrian Hunter)
> 
> - Intel PT fixes: (Adrian Hunter)
> 
>   o Fix missing stack clear
>   o Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
>   o Fix last_ip usage
>   o Ensure never to set 'last_ip' when packet 'count' is zero
>   o Clear FUP flag on error
>   o Fix transactions_sample_type
> 
> Infrastructure:
> 
> - Intel PT cleanups/refactorings (Adrian Hunter)
> 
>   o Use FUP always when scanning for an IP
>   o Add missing __fallthrough
>   o Remove redundant initial_skip checks
>   o Allow decoding with branch tracing disabled
>   o Add default config for pass-through branch enable
>   o Add documentation for new config terms
>   o Add decoder support for ptwrite and power event packets
>   o Add reserved byte to CBR packet payload
>   o Add decoder support for CBR events
> 
> - Move find_process() to the only place that uses it, skimming some
>   more fat from util.[ch] (Arnaldo Carvalho de Melo)
> 
> - Do parameter validation earlier on fetch_kernel_version() (Arnaldo Carvalho 
> de Melo)
> 
> - Remove unused _ALL_SOURCE define (Arnaldo Carvalho de Melo)
> 
> - Add sysfs__write_int function (Kan Liang)
> 
> Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
> 
> 
> Adrian Hunter (19):
>   perf intel-pt: Move decoder error setting into one condition
>   perf intel-pt: Improve sample timestamp
>   perf intel-pt: Fix missing stack clear
>   perf intel-pt: Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
>   perf intel-pt: Fix last_ip usage
>   perf intel-pt: Ensure never to set 'last_ip' when packet 'count' is zero
>   perf intel-pt: Use FUP always when scanning for an IP
>   perf intel-pt: Clear FUP flag on error
>   perf intel-pt: Add missing __fallthrough
>   perf intel-pt: Allow decoding with branch tracing disabled
>   perf intel-pt: Add default config for pass-through branch enable
>   perf intel-pt: Add documentation for new config terms
>   perf intel-pt: Add decoder support for ptwrite and power event packets
>   perf intel-pt: Add reserved byte to CBR packet payload
>   perf intel-pt: Add decoder support for CBR events
>   perf intel-pt: Remove redundant initial_skip checks
>   perf intel-pt: Fix transactions_sample_type
>   perf tools: Fix message because cpu list option is -C not -c
>   perf script: Fix message because field list option is -F not -f
> 
> Arnaldo Carvalho de Melo (3):
>   perf evsel: Adopt find_process()
>   perf tools: Do parameter validation earlier on fetch_kernel_version()
>   perf tools: Remove unused _ALL_SOURCE define
> 
> Kan Liang (2):
>   tools lib api fs: Add sysfs__write_int function
>   perf stat: Add support to measure SMI cost
> 
> Paolo Bonzini (1):
>   perf unwind: Support for powerpc
> 
>  tools/lib/api/fs/fs.c  |  30 +++
>  tools/lib/api/fs/fs.h  |   4 +
>  tools/perf/Documentation/intel-pt.txt  |  36 +++
>  tools/perf/Documentation/perf-stat.txt |  14 +
>  tools/perf/Makefile.config |   2 +-
>  tools/perf/arch/powerpc/util/Build |   2 +
>  tools/perf/arch/powerpc/util/unwind-libdw.c|  73 ++
>  tools/perf/arch/x86/util/intel-pt.c|   5 +
>  tools/perf/builtin-script.c|   2 +-
>  tools/perf/builti

[PATCH 06/25] perf unwind: Support for powerpc

2017-06-21 Thread Arnaldo Carvalho de Melo
From: Paolo Bonzini 

Porting PPC to libdw only needs an architecture-specific hook to move
the register state from perf to libdw.

The ARM and x86 architectures already use libdw, and it is useful to
have as much common code for the unwinder as possible.  Mark Wielaard
has contributed a frame-based unwinder to libdw, so that unwinding works
even for binaries that do not have CFI information.  In addition,
libunwind is always preferred to libdw by the build machinery so this
cannot introduce regressions on machines that have both libunwind and
libdw installed.

Signed-off-by: Paolo Bonzini 
Acked-by: Jiri Olsa 
Acked-by: Milian Wolff 
Acked-by: Ravi Bangoria 
Cc: Naveen N. Rao 
Cc: linuxppc-dev@lists.ozlabs.org
Link: 
http://lkml.kernel.org/r/1496312681-20133-1-git-send-email-pbonz...@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/Makefile.config  |  2 +-
 tools/perf/arch/powerpc/util/Build  |  2 +
 tools/perf/arch/powerpc/util/unwind-libdw.c | 73 +
 3 files changed, 76 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/arch/powerpc/util/unwind-libdw.c

diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index 1f4fbc9a3292..bdf0e87f9b29 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -61,7 +61,7 @@ endif
 # Disable it on all other architectures in case libdw unwind
 # support is detected in system. Add supported architectures
 # to the check.
-ifneq ($(SRCARCH),$(filter $(SRCARCH),x86 arm))
+ifneq ($(SRCARCH),$(filter $(SRCARCH),x86 arm powerpc))
   NO_LIBDW_DWARF_UNWIND := 1
 endif
 
diff --git a/tools/perf/arch/powerpc/util/Build 
b/tools/perf/arch/powerpc/util/Build
index 90ad64b231cd..2e6595310420 100644
--- a/tools/perf/arch/powerpc/util/Build
+++ b/tools/perf/arch/powerpc/util/Build
@@ -5,4 +5,6 @@ libperf-y += perf_regs.o
 
 libperf-$(CONFIG_DWARF) += dwarf-regs.o
 libperf-$(CONFIG_DWARF) += skip-callchain-idx.o
+
 libperf-$(CONFIG_LIBUNWIND) += unwind-libunwind.o
+libperf-$(CONFIG_LIBDW_DWARF_UNWIND) += unwind-libdw.o
diff --git a/tools/perf/arch/powerpc/util/unwind-libdw.c 
b/tools/perf/arch/powerpc/util/unwind-libdw.c
new file mode 100644
index ..3a24b3c43273
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/unwind-libdw.c
@@ -0,0 +1,73 @@
+#include <elfutils/libdwfl.h>
+#include "../../util/unwind-libdw.h"
+#include "../../util/perf_regs.h"
+#include "../../util/event.h"
+
+/* See backends/ppc_initreg.c and backends/ppc_regs.c in elfutils.  */
+static const int special_regs[3][2] = {
+   { 65, PERF_REG_POWERPC_LINK },
+   { 101, PERF_REG_POWERPC_XER },
+   { 109, PERF_REG_POWERPC_CTR },
+};
+
+bool libdw__arch_set_initial_registers(Dwfl_Thread *thread, void *arg)
+{
+   struct unwind_info *ui = arg;
+   struct regs_dump *user_regs = &ui->sample->user_regs;
+   Dwarf_Word dwarf_regs[32], dwarf_nip;
+   size_t i;
+
+#define REG(r) ({  \
+   Dwarf_Word val = 0; \
+   perf_reg_value(&val, user_regs, PERF_REG_POWERPC_##r);  \
+   val;\
+})
+
+   dwarf_regs[0]  = REG(R0);
+   dwarf_regs[1]  = REG(R1);
+   dwarf_regs[2]  = REG(R2);
+   dwarf_regs[3]  = REG(R3);
+   dwarf_regs[4]  = REG(R4);
+   dwarf_regs[5]  = REG(R5);
+   dwarf_regs[6]  = REG(R6);
+   dwarf_regs[7]  = REG(R7);
+   dwarf_regs[8]  = REG(R8);
+   dwarf_regs[9]  = REG(R9);
+   dwarf_regs[10] = REG(R10);
+   dwarf_regs[11] = REG(R11);
+   dwarf_regs[12] = REG(R12);
+   dwarf_regs[13] = REG(R13);
+   dwarf_regs[14] = REG(R14);
+   dwarf_regs[15] = REG(R15);
+   dwarf_regs[16] = REG(R16);
+   dwarf_regs[17] = REG(R17);
+   dwarf_regs[18] = REG(R18);
+   dwarf_regs[19] = REG(R19);
+   dwarf_regs[20] = REG(R20);
+   dwarf_regs[21] = REG(R21);
+   dwarf_regs[22] = REG(R22);
+   dwarf_regs[23] = REG(R23);
+   dwarf_regs[24] = REG(R24);
+   dwarf_regs[25] = REG(R25);
+   dwarf_regs[26] = REG(R26);
+   dwarf_regs[27] = REG(R27);
+   dwarf_regs[28] = REG(R28);
+   dwarf_regs[29] = REG(R29);
+   dwarf_regs[30] = REG(R30);
+   dwarf_regs[31] = REG(R31);
+   if (!dwfl_thread_state_registers(thread, 0, 32, dwarf_regs))
+   return false;
+
+   dwarf_nip = REG(NIP);
+   dwfl_thread_state_register_pc(thread, dwarf_nip);
+   for (i = 0; i < ARRAY_SIZE(special_regs); i++) {
+   Dwarf_Word val = 0;
+   perf_reg_value(&val, user_regs, special_regs[i][1]);
+   if (!dwfl_thread_state_registers(thread,
+special_regs[i][0], 1,
+&val))
+   

[GIT PULL 00/25] perf/core improvements and fixes

2017-06-21 Thread Arnaldo Carvalho de Melo
Hi Ingo,

Please consider pulling,

- Arnaldo

Test results at the end of this message, as usual.

The following changes since commit 007b811b4041989ec2dc91b9614aa2c41332723e:

  Merge tag 'perf-core-for-mingo-4.13-20170719' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
(2017-06-20 10:49:08 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
tags/perf-core-for-mingo-4.13-20170621

for you to fetch changes up to 701516ae3dec801084bc913d21e03fce15c61a0b:

  perf script: Fix message because field list option is -F not -f (2017-06-21 
11:35:53 -0300)


perf/core improvements and fixes:

New features:

- Add support to measure SMI cost in 'perf stat' (Kan Liang)

- Add support for unwinding callchains in powerpc with libdw (Paolo Bonzini)

Fixes:

- Fix message: cpu list option is -C not -c (Adrian Hunter)

- Fix 'perf script' message: field list option is -F not -f (Adrian Hunter)

- Intel PT fixes: (Adrian Hunter)

  o Fix missing stack clear
  o Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
  o Fix last_ip usage
  o Ensure never to set 'last_ip' when packet 'count' is zero
  o Clear FUP flag on error
  o Fix transactions_sample_type

Infrastructure:

- Intel PT cleanups/refactorings (Adrian Hunter)

  o Use FUP always when scanning for an IP
  o Add missing __fallthrough
  o Remove redundant initial_skip checks
  o Allow decoding with branch tracing disabled
  o Add default config for pass-through branch enable
  o Add documentation for new config terms
  o Add decoder support for ptwrite and power event packets
  o Add reserved byte to CBR packet payload
  o Add decoder support for CBR events

- Move find_process() to the only place that uses it, skimming some
  more fat from util.[ch] (Arnaldo Carvalho de Melo)

- Do parameter validation earlier on fetch_kernel_version() (Arnaldo Carvalho 
de Melo)

- Remove unused _ALL_SOURCE define (Arnaldo Carvalho de Melo)

- Add sysfs__write_int function (Kan Liang)

Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>


Adrian Hunter (19):
  perf intel-pt: Move decoder error setting into one condition
  perf intel-pt: Improve sample timestamp
  perf intel-pt: Fix missing stack clear
  perf intel-pt: Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
  perf intel-pt: Fix last_ip usage
  perf intel-pt: Ensure never to set 'last_ip' when packet 'count' is zero
  perf intel-pt: Use FUP always when scanning for an IP
  perf intel-pt: Clear FUP flag on error
  perf intel-pt: Add missing __fallthrough
  perf intel-pt: Allow decoding with branch tracing disabled
  perf intel-pt: Add default config for pass-through branch enable
  perf intel-pt: Add documentation for new config terms
  perf intel-pt: Add decoder support for ptwrite and power event packets
  perf intel-pt: Add reserved byte to CBR packet payload
  perf intel-pt: Add decoder support for CBR events
  perf intel-pt: Remove redundant initial_skip checks
  perf intel-pt: Fix transactions_sample_type
  perf tools: Fix message because cpu list option is -C not -c
  perf script: Fix message because field list option is -F not -f

Arnaldo Carvalho de Melo (3):
  perf evsel: Adopt find_process()
  perf tools: Do parameter validation earlier on fetch_kernel_version()
  perf tools: Remove unused _ALL_SOURCE define

Kan Liang (2):
  tools lib api fs: Add sysfs__write_int function
  perf stat: Add support to measure SMI cost

Paolo Bonzini (1):
  perf unwind: Support for powerpc

 tools/lib/api/fs/fs.c  |  30 +++
 tools/lib/api/fs/fs.h  |   4 +
 tools/perf/Documentation/intel-pt.txt  |  36 +++
 tools/perf/Documentation/perf-stat.txt |  14 +
 tools/perf/Makefile.config |   2 +-
 tools/perf/arch/powerpc/util/Build |   2 +
 tools/perf/arch/powerpc/util/unwind-libdw.c|  73 ++
 tools/perf/arch/x86/util/intel-pt.c|   5 +
 tools/perf/builtin-script.c|   2 +-
 tools/perf/builtin-stat.c  |  49 
 tools/perf/util/evsel.c|  39 +++
 .../perf/util/intel-pt-decoder/intel-pt-decoder.c  | 290 +++--
 .../perf/util/intel-pt-decoder/intel-pt-decoder.h  |  13 +
 .../util/intel-pt-decoder/intel-pt-pkt-decoder.c   | 110 +++-
 .../util/intel-pt-decoder/intel-pt-pkt-decoder.h   |   7 +
 tools/perf/util/intel-pt.c |  23 +-
 tools/perf/util/session.c  |   2 +-
 tools/perf/util/stat-shadow.c  |  33 +++
 tools/perf/util/stat.c |   2 +
 tools/perf/util/

Re: [PATCH v2 6/6] ima: Support module-style appended signatures for appraisal

2017-06-21 Thread Thiago Jung Bauermann

Hello Mimi,

Thanks for your review, and for queuing the other patches in this series.

Mimi Zohar  writes:
> On Wed, 2017-06-07 at 22:49 -0300, Thiago Jung Bauermann wrote:
>> This patch introduces the modsig keyword to the IMA policy syntax to
>> specify that a given hook should expect the file to have the IMA signature
>> appended to it.
>
> Thank you, Thiago. Appended signatures seem to be working properly now
> with multiple keys on the IMA keyring.

Great news!

> The length of this patch description is a good indication that this
> patch needs to be broken up for easier review. A few
> comments/suggestions inline below.

Ok, I will try to break it up, and also patch 5 as you suggested.

>> diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
>> index 06554c448dce..9190c9058f4f 100644
>> --- a/security/integrity/digsig.c
>> +++ b/security/integrity/digsig.c
>> @@ -48,11 +48,10 @@ static bool init_keyring __initdata;
>>  #define restrict_link_to_ima restrict_link_by_builtin_trusted
>>  #endif
>> 
>> -int integrity_digsig_verify(const unsigned int id, const char *sig, int 
>> siglen,
>> -const char *digest, int digestlen)
>> +struct key *integrity_keyring_from_id(const unsigned int id)
>>  {
>> -if (id >= INTEGRITY_KEYRING_MAX || siglen < 2)
>> -return -EINVAL;
>> +if (id >= INTEGRITY_KEYRING_MAX)
>> +return ERR_PTR(-EINVAL);
>> 
>
> When splitting up this patch, the addition of this new function could
> be a separate patch. The patch description would explain the need for
> a new function.

Ok, will do for v3.

>> @@ -229,10 +234,14 @@ int ima_appraise_measurement(enum ima_hooks func,
>>  goto out;
>>  }
>> 
>> -status = evm_verifyxattr(dentry, XATTR_NAME_IMA, xattr_value, rc, iint);
>> -if ((status != INTEGRITY_PASS) && (status != INTEGRITY_UNKNOWN)) {
>> -if ((status == INTEGRITY_NOLABEL)
>> -|| (status == INTEGRITY_NOXATTRS))
>> +/* Appended signatures aren't protected by EVM. */
>> +status = evm_verifyxattr(dentry, XATTR_NAME_IMA,
>> + xattr_value->type == IMA_MODSIG ?
>> + NULL : xattr_value, rc, iint);
>> +if (status != INTEGRITY_PASS && status != INTEGRITY_UNKNOWN &&
>> +!(xattr_value->type == IMA_MODSIG &&
>> +  (status == INTEGRITY_NOLABEL || status == INTEGRITY_NOXATTRS))) {
>
> This was messy to begin with, and now it is even more messy. For
> appended signatures, we're only interested in INTEGRITY_FAIL. Maybe
> leave the existing "if" clause alone and define a new "if" clause.

Ok, is this what you had in mind?

@@ -229,8 +237,14 @@ int ima_appraise_measurement(enum ima_hooks func,
goto out;
}
 
-   status = evm_verifyxattr(dentry, XATTR_NAME_IMA, xattr_value, rc, iint);
-   if ((status != INTEGRITY_PASS) && (status != INTEGRITY_UNKNOWN)) {
+   /* Appended signatures aren't protected by EVM. */
+   status = evm_verifyxattr(dentry, XATTR_NAME_IMA,
+xattr_value->type == IMA_MODSIG ?
+NULL : xattr_value, rc, iint);
+   if (xattr_value->type == IMA_MODSIG && status == INTEGRITY_FAIL) {
+   cause = "invalid-HMAC";
+   goto out;
+   } else if (status != INTEGRITY_PASS && status != INTEGRITY_UNKNOWN) {
if ((status == INTEGRITY_NOLABEL)
|| (status == INTEGRITY_NOXATTRS))
cause = "missing-HMAC";

>> @@ -267,11 +276,18 @@ int ima_appraise_measurement(enum ima_hooks func,
>>  status = INTEGRITY_PASS;
>>  break;
>>  case EVM_IMA_XATTR_DIGSIG:
>> +case IMA_MODSIG:
>>  iint->flags |= IMA_DIGSIG;
>> -rc = integrity_digsig_verify(INTEGRITY_KEYRING_IMA,
>> - (const char *)xattr_value, rc,
>> - iint->ima_hash->digest,
>> - iint->ima_hash->length);
>> +
>> +if (xattr_value->type == EVM_IMA_XATTR_DIGSIG)
>> +rc = integrity_digsig_verify(INTEGRITY_KEYRING_IMA,
>> + (const char *)xattr_value,
>> + rc, iint->ima_hash->digest,
>> + iint->ima_hash->length);
>> +else
>> +rc = ima_modsig_verify(INTEGRITY_KEYRING_IMA,
>> +   xattr_value);
>> +
>
> Perhaps allowing IMA_MODSIG to flow into EVM_IMA_XATTR_DIGSIG on
> failure, would help restore process_measurements() to the way it was.
> Further explanation below.

It's not possible to simply flow into EVM_IMA_XATTR_DIGSIG on failure
because after calling ima_read_xattr we need to run again all the logic
before the switch 

Re: 1M hugepage size being registered on Linux

2017-06-21 Thread Mauricio Faria de Oliveira

On 06/21/2017 07:33 AM, Michael Ellerman wrote:
>> I am working on a bug related to 1M hugepage size being registered on
>> Linux (Power 8 Baremetal - Garrison).
>
> Wasn't that caused by a firmware bug?

Ben/Stewart, does that ring a bell, something new, intended or not? :-)

Thanks,

>> I was checking dmesg and it seems that 1M page size is coming from
>> firmware to Linux.
>>
>> [0.00] base_shift=20: shift=20, sllp=0x0130, avpnm=0x,
>> tlbiel=0, penc=2
>> [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0 pages
>>
>> Should Linux support this page size?
>
> Does it work? :)
>
> The user manual says it's a supported size, but I thought it didn't work
> (in hardware) for some reason.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH v2] perf: libdw support for powerpc [ping]

2017-06-21 Thread Arnaldo Carvalho de Melo
Em Wed, Jun 21, 2017 at 04:19:11PM +0200, Milian Wolff escreveu:
> On Mittwoch, 21. Juni 2017 14:48:29 CEST Arnaldo Carvalho de Melo wrote:
> > Em Wed, Jun 21, 2017 at 10:16:56AM +0200, Milian Wolff escreveu:
> > > On Mittwoch, 21. Juni 2017 03:07:39 CEST Arnaldo Carvalho de Melo wrote:
> > > > Hi Millian, can I take this as an Acked-by or Tested-by?
> > > 
> > > I have no access to any PowerPC hardware. In principle the code looks
> > > fine, but that's all I can say here.
> > 
> > Ok, that would count as an Acked-by, i.e. from
> > Documentation/process/submitting-patches.rst:
> > 
> > -
> > 
> > Acked-by: is not as formal as Signed-off-by:.  It is a record that the acker
> > has at least reviewed the patch and has indicated acceptance.  Hence patch
> > mergers will sometimes manually convert an acker's "yep, looks good to me"
> > into an Acked-by: (but note that it is usually better to ask for an
> > explicit ack).
> > 
> > -
> > 
> > If you had a ppc machine _and_ had applied and tested the patch, that
> > would allow us to use a Tested-by tag.
> 
> I see, I'm still unfamiliar with this process. But yes, do consider it an 
> `Acked-by` from my side then.

Right, then there is another tag there that is relevant to this
discussion:

Link: 
http://lkml.kernel.org/r/1496312681-20133-1-git-send-email-pbonz...@redhat.com

which has the Message-ID of the message with this patch, embedded
in a URL that when clicked will bring you to the thread where the patch
was submitted and the acks, tested-by, reviewed-by, etc were provided,
so that we can go back and check the history of the patch.

- Arnaldo


Re: [PATCH V6 1/2] powerpc/hotplug: Ensure enough nodes avail for operations

2017-06-21 Thread Michael Bringmann


On 06/21/2017 04:52 AM, Michael Ellerman wrote:
> Michael Bringmann  writes:
> 
>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU
>> or memory resources, it may occur that the new resources are to be
>> inserted into nodes that were not used for these resources at bootup.
>> In the kernel, any node that is used must be defined and initialized
>> at boot.  In order to meet both needs, this patch adds a new kernel
>> command line option (numnodes=) for use by the PowerPC architecture-
> 
> Sorry, that's a hack.

It is an intermediate step pending the provision of the firmware properties
under discussion that were mentioned by Nathan Fontenot last week.

> I thought you were going to use firmware properties to find the set of
> possible nodes. Did that not work?

Inference based on the current set of firmware properties for associativity
is insufficient.  That is partly the reason for the properties mentioned by
Nathan last week.  The current firmware properties only cover what is known
at boot time.  They do not cover expansions from DLPAR / hot-add operations
which can add up to everything else on the system.

> cheers

Regards,

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH v2] perf: libdw support for powerpc [ping]

2017-06-21 Thread Milian Wolff
On Mittwoch, 21. Juni 2017 14:48:29 CEST Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 21, 2017 at 10:16:56AM +0200, Milian Wolff escreveu:
> > On Mittwoch, 21. Juni 2017 03:07:39 CEST Arnaldo Carvalho de Melo wrote:
> > > Hi Millian, can I take this as an Acked-by or Tested-by?
> > 
> > I have no access to any PowerPC hardware. In principle the code looks
> > fine, but that's all I can say here.
> 
> Ok, that would count as an Acked-by, i.e. from
> Documentation/process/submitting-patches.rst:
> 
> -
> 
> Acked-by: is not as formal as Signed-off-by:.  It is a record that the acker
> has at least reviewed the patch and has indicated acceptance.  Hence patch
> mergers will sometimes manually convert an acker's "yep, looks good to me"
> into an Acked-by: (but note that it is usually better to ask for an
> explicit ack).
> 
> -
> 
> If you had a ppc machine _and_ had applied and tested the patch, that
> would allow us to use a Tested-by tag.

I see, I'm still unfamiliar with this process. But yes, do consider it an 
`Acked-by` from my side then.

Cheers

-- 
Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


Re: [PATCH V6 0/2] powerpc/dlpar: Correct display of hot-add/hot-remove CPUs and memory

2017-06-21 Thread Michael Bringmann
One of the patches was duplicated and sent twice yesterday.
Will bump the version number in future regardless.

On 06/21/2017 04:54 AM, Michael Ellerman wrote:
> Michael Bringmann  writes:
> 
>> On Power systems with shared configurations of CPUs and memory, there
>> are some issues with association of additional CPUs and memory to nodes
>> when hot-adding resources.  These patches address some of those problems.
>>
>> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU
>> or memory resources, it may occur that the new resources are to be
>> inserted into nodes that were not used for these resources at bootup.
>> In the kernel, any node that is used must be defined and initialized
>> at boot.  In order to meet both needs, this patch adds a new kernel
>> command line option (numnodes=) for use by the PowerPC
>> architecture-specific code that defines the maximum number of nodes
>> that the kernel will ever need in its current hardware environment.
>> The boot code that initializes nodes for PowerPC will read this value
>> and use it to ensure that all of the desired nodes are setup in the
>> 'node_possible_map', and elsewhere.
>>
>> powerpc/numa: Correct the currently broken capability to set the
>> topology for shared CPUs in LPARs.  At boot time for shared CPU
>> lpars, the topology for each shared CPU is set to node zero, however,
>> this is now updated correctly using the Virtual Processor Home Node
>> (VPHN) capabilities information provided by the pHyp. The VPHN handling
>> in Linux is disabled, if PRRN handling is present.
>>
>> Signed-off-by: Michael Bringmann 
>>
>> Michael Bringmann (2):
>>   powerpc/hotplug: Add option to define max nodes allowing dynamic
>>   growth of resources.
>>   powerpc/numa: Update CPU topology when VPHN enabled
>> ---
>> Changes in V6:
>>   -- Reorder some code to better eliminate unused functions in
>>conditional builds.
> 
> What changed between yesterday's V6 and this V6?
> 
> If you're going to resend, please bump the version number, we have tools
> that parse the subject and version, and resending multiple times with
> the same number breaks those.
> 
> cheers
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: new dma-mapping tree, was Re: clean up and modularize arch dma_mapping interface V2

2017-06-21 Thread Marek Szyprowski

Hi Christoph,

On 2017-06-20 15:16, Christoph Hellwig wrote:
> On Tue, Jun 20, 2017 at 11:04:00PM +1000, Stephen Rothwell wrote:
>> git://git.linaro.org/people/mszyprowski/linux-dma-mapping.git#dma-mapping-next
>>
>> Contacts: Marek Szyprowski and Kyungmin Park (cc'd)
>
> I have called your tree dma-mapping-hch for now.  The other tree has
> not been updated since 4.9-rc1 and I am not sure how general it is.
> Marek, Kyungmin, any comments?
>
> I'd be happy to join efforts - co-maintainers and reviewers are always
> welcome.

I did some dma-mapping unification work in the past and my tree in
linux-next was a side effect of that. I think that for now it can be
dropped in favor of Christoph's tree. I can also do some review and help
with maintainer work if needed, although I was recently busy with other
stuff.

Christoph: Could you add me to your MAINTAINERS patch, so that further
dma-mapping related patches hopefully will also be CC'd to me?

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland



Re: [PATCH v2] perf: libdw support for powerpc [ping]

2017-06-21 Thread Arnaldo Carvalho de Melo
Em Wed, Jun 21, 2017 at 10:16:56AM +0200, Milian Wolff escreveu:
> On Mittwoch, 21. Juni 2017 03:07:39 CEST Arnaldo Carvalho de Melo wrote:
> > Hi Millian, can I take this as an Acked-by or Tested-by?
 
> I have no access to any PowerPC hardware. In principle the code looks
> fine, but that's all I can say here.

Ok, that would count as an Acked-by, i.e. from
Documentation/process/submitting-patches.rst:

-

Acked-by: is not as formal as Signed-off-by:.  It is a record that the acker
has at least reviewed the patch and has indicated acceptance.  Hence patch
mergers will sometimes manually convert an acker's "yep, looks good to me"
into an Acked-by: (but note that it is usually better to ask for an
explicit ack).

-

If you had a ppc machine _and_ had applied and tested the patch, that
would allow us to use a Tested-by tag.

Ok?

- Arnaldo


[PATCH 1/1] futex: remove duplicated code and fix UB

2017-06-21 Thread Jiri Slaby
There is code duplicated over all architectures' headers for
futex_atomic_op_inuser: namely op decoding, the access_ok check for
uaddr, and comparison of the result.

Remove this duplication and leave to the arches only the needed
assembly, which is now in arch_futex_atomic_op_inuser.

This effectively distributes Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage).  Look there for an example dump.

Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).

Signed-off-by: Jiri Slaby 
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Vineet Gupta 
Acked-by: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Richard Kuo 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Michal Simek 
Cc: Ralf Baechle 
Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Acked-by: Michael Ellerman  (powerpc)
Cc: Martin Schwidefsky 
Acked-by: Heiko Carstens  [s390]
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: "David S. Miller" 
Acked-by: Chris Metcalf  [for tile]
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Arnd Bergmann 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
---
 arch/alpha/include/asm/futex.h  | 26 ---
 arch/arc/include/asm/futex.h| 40 -
 arch/arm/include/asm/futex.h| 26 +++
 arch/arm64/include/asm/futex.h  | 26 +++
 arch/frv/include/asm/futex.h|  3 ++-
 arch/frv/kernel/futex.c | 27 +++-
 arch/hexagon/include/asm/futex.h| 38 +++-
 arch/ia64/include/asm/futex.h   | 25 +++
 arch/microblaze/include/asm/futex.h | 38 +++-
 arch/mips/include/asm/futex.h   | 25 +++
 arch/openrisc/include/asm/futex.h   | 39 +++--
 arch/parisc/include/asm/futex.h | 26 +++
 arch/powerpc/include/asm/futex.h| 26 ---
 arch/s390/include/asm/futex.h   | 23 -
 arch/sh/include/asm/futex.h | 26 +++
 arch/sparc/include/asm/futex_64.h   | 26 ---
 arch/tile/include/asm/futex.h   | 40 -
 arch/x86/include/asm/futex.h| 40 -
 arch/xtensa/include/asm/futex.h | 27 
 include/asm-generic/futex.h | 50 +++--
 kernel/futex.c  | 36 ++
 21 files changed, 127 insertions(+), 506 deletions(-)

diff --git a/arch/alpha/include/asm/futex.h b/arch/alpha/include/asm/futex.h
index fb01dfb760c2..05a70edd57b6 100644
--- a/arch/alpha/include/asm/futex.h
+++ b/arch/alpha/include/asm/futex.h
@@ -25,18 +25,10 @@
:   "r" (uaddr), "r"(oparg) \
:   "memory")
 
-static inline int futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+   u32 __user *uaddr)
 {
-   int op = (encoded_op >> 28) & 7;
-   int cmp = (encoded_op >> 24) & 15;
-   int oparg = (encoded_op << 8) >> 20;
-   int cmparg = (encoded_op << 20) >> 20;
int oldval = 0, ret;
-   if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-   oparg = 1 << oparg;
-
-   if 

Re: 1M hugepage size being registered on Linux

2017-06-21 Thread Michael Ellerman
victora  writes:

> Hi Alistair/Jeremy,
>
> I am working on a bug related to 1M hugepage size being registered on 
> Linux (Power 8 Baremetal - Garrison).

Wasn't that caused by a firmware bug?

> I was checking dmesg and it seems that 1M page size is coming from 
> firmware to Linux.
>
> [0.00] base_shift=20: shift=20, sllp=0x0130, avpnm=0x, 
> tlbiel=0, penc=2
> [1.528867] HugeTLB registered 1 MB page size, pre-allocated 0 pages
>
> Should Linux support this page size?

Does it work? :)

The user manual says it's a supported size, but I thought it didn't work
(in hardware) for some reason.

cheers


Re: Network TX Stall on 440EP Processor

2017-06-21 Thread Michael Ellerman
Hi Thomas,

Thomas Besemer  writes:
> I'm working on a project that is derived from the Yosemite
> PPC 440EP board.  It's a legacy project that was running the
> 2.6.24 kernel, and network traffic was stalling because transmission
> halted without an understandable error (in this error condition, the
> various status registers of the network interface showed no issues),
> other than TX stalling due to the Buffer Descriptor Ring becoming full.

I'm not really familiar with these boards, and I'm not a network guy
either, so hopefully someone else will have some ideas :)

This is the EMAC driver you're using, which is old but still used so
shouldn't have completely bit rotted.

I think the "Buffer Descriptor Ring becoming full" indicates the
hardware has stopped sending packets that the kernel has put in the
ring?

So did the driver get the ring handling wrong somehow and the device
thinks the ring is empty but we think it's full?

cheers


Re: [PATCH] powerpc: Only obtain cpu_hotplug_lock if called by rtasd

2017-06-21 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> Calling arch_update_cpu_topology from a CPU hotplug state machine callback
> hits a deadlock because the function tries to get a read lock on
> cpu_hotplug_lock while the state machine still holds a write lock on it.
>
> Since all callers of arch_update_cpu_topology except rtasd already hold
> cpu_hotplug_lock, this patch changes the function to use
> stop_machine_cpuslocked and creates a separate function for rtasd which
> still tries to obtain the lock.
>
> Michael Bringmann investigated the bug and provided a detailed analysis
> of the deadlock on this previous RFC for an alternate solution:
>
> https://patchwork.ozlabs.org/patch/771293/

Do we know when this broke? Or has it never worked?

Should it go to stable? (can't in its current form AFAICS)

> Signed-off-by: Thiago Jung Bauermann 
> ---
>
> Notes:
> This patch applies on tip/smp/hotplug, it should probably be carried 
> there.

stop_machine_cpuslocked() doesn't exist in mainline so I think it has to
be carried there right?

cheers


[PATCH V3] cxl: Export library to support IBM XSL

2017-06-21 Thread Christophe Lombard
This patch exports an in-kernel 'library' API which can be called by
other drivers to help with interacting with an IBM XSL on a POWER9 system.

The XSL (Translation Service Layer) is a stripped-down version of the
PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
Like the PSL, it implements the CAIA architecture, but has a number
of differences, mostly in its implementation-dependent registers.

The XSL also uses a special DMA cxl mode, which uses a slightly
different init sequence for the CAPP and PHB.

Signed-off-by: Christophe Lombard 

---
Changelog[v3]
 - Rebase to latest upstream.
 - cxl_handle_mm_fault() is now exported. Remove kernel context
   parameter
 - Update comments

Changelog[v2]
 - Rebase to latest upstream.
 - Return -EFAULT in case of NULL pointer in cxllib_handle_fault().
 - Reverse parameters when copro_handle_mm_fault() is called.
---
 arch/powerpc/include/asm/opal-api.h |   1 +
 drivers/misc/cxl/Kconfig|   5 +
 drivers/misc/cxl/Makefile   |   2 +-
 drivers/misc/cxl/cxl.h  |   6 +
 drivers/misc/cxl/cxllib.c   | 246 
 drivers/misc/cxl/fault.c|  28 ++--
 drivers/misc/cxl/native.c   |  16 ++-
 drivers/misc/cxl/pci.c  |  41 --
 include/misc/cxllib.h   | 133 +++
 9 files changed, 450 insertions(+), 28 deletions(-)
 create mode 100644 drivers/misc/cxl/cxllib.c
 create mode 100644 include/misc/cxllib.h

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index cb3e624..3e0be78 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -877,6 +877,7 @@ enum {
OPAL_PHB_CAPI_MODE_SNOOP_OFF= 2,
OPAL_PHB_CAPI_MODE_SNOOP_ON = 3,
OPAL_PHB_CAPI_MODE_DMA  = 4,
+   OPAL_PHB_CAPI_MODE_DMA_TVT1 = 5,
 };
 
 /* OPAL I2C request */
diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
index b75cf83..93397cb 100644
--- a/drivers/misc/cxl/Kconfig
+++ b/drivers/misc/cxl/Kconfig
@@ -11,11 +11,16 @@ config CXL_AFU_DRIVER_OPS
bool
default n
 
+config CXL_LIB
+   bool
+   default n
+
 config CXL
tristate "Support for IBM Coherent Accelerators (CXL)"
depends on PPC_POWERNV && PCI_MSI && EEH
select CXL_BASE
select CXL_AFU_DRIVER_OPS
+   select CXL_LIB
default m
help
  Select this option to enable driver support for IBM Coherent
diff --git a/drivers/misc/cxl/Makefile b/drivers/misc/cxl/Makefile
index c14fd6b..0b5fd74 100644
--- a/drivers/misc/cxl/Makefile
+++ b/drivers/misc/cxl/Makefile
@@ -3,7 +3,7 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
 
 cxl-y  += main.o file.o irq.o fault.o native.o
 cxl-y  += context.o sysfs.o pci.o trace.o
-cxl-y  += vphb.o phb.o api.o
+cxl-y  += vphb.o phb.o api.o cxllib.o
 cxl-$(CONFIG_PPC_PSERIES)  += flash.o guest.o of.o hcalls.o
 cxl-$(CONFIG_DEBUG_FS) += debugfs.o
 obj-$(CONFIG_CXL)  += cxl.o
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index a03f8e7..b1afecc 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -1010,6 +1010,7 @@ static inline void cxl_debugfs_add_afu_regs_psl8(struct 
cxl_afu *afu, struct den
 
 void cxl_handle_fault(struct work_struct *work);
 void cxl_prefault(struct cxl_context *ctx, u64 wed);
+int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar);
 
 struct cxl *get_cxl_adapter(int num);
 int cxl_alloc_sst(struct cxl_context *ctx);
@@ -1061,6 +1062,11 @@ int cxl_afu_slbia(struct cxl_afu *afu);
 int cxl_data_cache_flush(struct cxl *adapter);
 int cxl_afu_disable(struct cxl_afu *afu);
 int cxl_psl_purge(struct cxl_afu *afu);
+int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
+ u32 *phb_index, u64 *capp_unit_id);
+int cxl_slot_is_switched(struct pci_dev *dev);
+int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg);
+u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9);
 
 void cxl_native_irq_dump_regs_psl9(struct cxl_context *ctx);
 void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx);
diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
new file mode 100644
index 000..4f4c5ca
--- /dev/null
+++ b/drivers/misc/cxl/cxllib.c
@@ -0,0 +1,246 @@
+/*
+ * Copyright 2017 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "cxl.h"
+
+#define CXL_INVALID_DRA ~0ull
+#define CXL_DUMMY_READ_SIZE 128
+#define CXL_DUMMY_READ_ALIGN 8

Re: [PATCH V6 0/2] powerpc/dlpar: Correct display of hot-add/hot-remove CPUs and memory

2017-06-21 Thread Michael Ellerman
Michael Bringmann  writes:

> On Power systems with shared configurations of CPUs and memory, there
> are some issues with association of additional CPUs and memory to nodes
> when hot-adding resources.  These patches address some of those problems.
>
> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU
> or memory resources, it may occur that the new resources are to be
> inserted into nodes that were not used for these resources at bootup.
> In the kernel, any node that is used must be defined and initialized
> at boot.  In order to meet both needs, this patch adds a new kernel
> command line option (numnodes=) for use by the PowerPC
> architecture-specific code that defines the maximum number of nodes
> that the kernel will ever need in its current hardware environment.
> The boot code that initializes nodes for PowerPC will read this value
> and use it to ensure that all of the desired nodes are setup in the
> 'node_possible_map', and elsewhere.
>
> powerpc/numa: Correct the currently broken capability to set the
> topology for shared CPUs in LPARs.  At boot time for shared CPU
> lpars, the topology for each shared CPU is set to node zero, however,
> this is now updated correctly using the Virtual Processor Home Node
> (VPHN) capabilities information provided by the pHyp. The VPHN handling
> in Linux is disabled, if PRRN handling is present.
>
> Signed-off-by: Michael Bringmann 
>
> Michael Bringmann (2):
>   powerpc/hotplug: Add option to define max nodes allowing dynamic
>   growth of resources.
>   powerpc/numa: Update CPU topology when VPHN enabled
> ---
> Changes in V6:
>   -- Reorder some code to better eliminate unused functions in
>conditional builds.

What changed between yesterday's V6 and this V6?

If you're going to resend, please bump the version number, we have tools
that parse the subject and version, and resending multiple times with
the same number breaks those.

cheers


Re: [PATCH V6 1/2] powerpc/hotplug: Ensure enough nodes avail for operations

2017-06-21 Thread Michael Ellerman
Michael Bringmann  writes:

> powerpc/hotplug: On systems like PowerPC which allow 'hot-add' of CPU
> or memory resources, it may occur that the new resources are to be
> inserted into nodes that were not used for these resources at bootup.
> In the kernel, any node that is used must be defined and initialized
> at boot.  In order to meet both needs, this patch adds a new kernel
> command line option (numnodes=) for use by the PowerPC architecture-

Sorry, that's a hack.

I thought you were going to use firmware properties to find the set of
possible nodes. Did that not work?

cheers


Re: [PATCH] powerpc: dts: use #include "..." to include local DT

2017-06-21 Thread Michael Ellerman
Masahiro Yamada  writes:
> 2017-06-14 15:45 GMT+09:00 Michael Ellerman :
>>
>> Acked-by: Michael Ellerman 
>
> I have not seen it in linux-next yet.
>
> Who will pick it up?

In the original patch you said:

  Fix them to remove -I$(srctree)/arch/$(SRCARCH)/boot/dts path from
  dtc_cpp_flags.

So I assumed there was a series somewhere that did that and included
this patch.

But if there isn't then I can just merge it.

cheers


Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

2017-06-21 Thread Ram Pai
On Wed, Jun 21, 2017 at 12:11:32PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai  writes:
> 
> > Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6
> > in the 4K backed hpte pages. These bits continue to be used
> > for 64K backed hpte pages in this patch, but will be freed
> > up in the next patch.
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 9
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > of the pte.
> > H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> > second part of the pte.
> >
> > the four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > are initialized to 0xF, indicating an invalid slot. If a hpte
> > gets cached in a 0xF slot (i.e. the 7th slot of the secondary), it is
> > released immediately. In other words, even though 0xF is a
> > valid slot we discard it and consider it an invalid
> > slot; i.e. hpte_soft_invalid(). This gives us an opportunity to not
> > depend on a bit in the primary PTE in order to determine the
> > validity of a slot.
> >
> > When we release a hpte in the 0xF slot we also release a
> > legitimate primary slot and unmap that entry. This is to
> > ensure that we do get a legitimate non-0xF slot the next time we
> > retry for a slot.
> >
> > Though treating the 0xF slot as invalid reduces the number of available
> > slots and may have an effect on performance, the probability
> > of hitting a 0xF slot is extremely low.
> >
> > Compared to the current scheme, the above described scheme reduces
> > the number of false hash table updates significantly and has the
> > added advantage of releasing four valuable PTE bits for other
> > purposes.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellerman and myself.
> >
> > The 4K PTE format remains unchanged currently.
> >
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 20 +++
> >  arch/powerpc/include/asm/book3s/64/hash-64k.h | 32 +++
> >  arch/powerpc/include/asm/book3s/64/hash.h | 15 +++--
> >  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
> >  arch/powerpc/mm/dump_linuxpagetables.c|  3 +-
> >  arch/powerpc/mm/hash64_4k.c   | 14 ++---
> >  arch/powerpc/mm/hash64_64k.c  | 81 
> > ---
> >  arch/powerpc/mm/hash_utils_64.c   | 30 +++---
> >  8 files changed, 122 insertions(+), 78 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
> > b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > index b4b5e6b..5ef1d81 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > @@ -16,6 +16,18 @@
> >  #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> >  #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >
> > +
> > +/*
> > + * Only supported by 4k linux page size
> > + */
> > +#define H_PAGE_F_SECOND    _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT 56
> > +
> > +#define H_PAGE_BUSY    _RPAGE_RSV1 /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE _RPAGE_RPN43    /* PTE has associated HPTE */
> > +
> > +
> >  /* PTE flags to conserve for HPTE identification */
> >  #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> >  H_PAGE_F_SECOND | H_PAGE_F_GIX)
> > @@ -48,6 +60,14 @@ static inline int hash__hugepd_ok(hugepd_t hpd)
> >  }
> >  #endif
> >
> > +static inline unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > +   unsigned int subpg_index, unsigned long slot)
> > +{
> > +   return (slot << H_PAGE_F_GIX_SHIFT) &
> > +   (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +}
> > +
> > +
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >
> >  static inline char *get_hpte_slot_array(pmd_t *pmdp)
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
> > b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > index 9732837..0eb3c89 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > @@ -10,23 +10,25 @@
> >   * 64k aligned address free up few of the lower bits of RPN for us
> >   * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> >   */
> > -#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
> > -#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
> > +#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +#define H_PAGE_F_SECOND    _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX   (_RPAGE_RSV3 | _RPAGE_RSV4 | 

Re: [PATCH v2] perf: libdw support for powerpc [ping]

2017-06-21 Thread Milian Wolff
On Mittwoch, 21. Juni 2017 03:07:39 CEST Arnaldo Carvalho de Melo wrote:
> Em Thu, Jun 15, 2017 at 10:46:16AM +0200, Milian Wolff escreveu:
> > On Tuesday, June 13, 2017 5:55:09 PM CEST Ravi Bangoria wrote:
> > Just a quick question: Have you guys applied my recent patch:
> > 
> > commit 5ea0416f51cc93436bbe497c62ab49fd9cb245b6
> > Author: Milian Wolff 
> > Date:   Thu Jun 1 23:00:21 2017 +0200
> > 
> > perf report: Include partial stacks unwound with libdw
> > 
> > So far the whole stack was thrown away when any error occurred before
> > the maximum stack depth was unwound. This is actually a very common
> > scenario though. The stacks that got unwound so far are still
> > interesting. This removes a large chunk of differences when comparing
> > perf script output for libunwind and libdw perf unwinding.
> > 
> > If not, then this could explain the issue you are seeing.
> 
> Hi Milian, can I take this as an Acked-by or Tested-by?

I have no access to any PowerPC hardware. In principle the code looks fine, 
but that's all I can say here.

Cheers

-- 
Milian Wolff | milian.wo...@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


Re: [PATCH] powerpc/64: Initialise thread_info for emergency stacks

2017-06-21 Thread Abdul Haleem
On Tue, 2017-06-20 at 23:58 +1000, Nicholas Piggin wrote:
> Emergency stacks have their thread_info mostly uninitialised, which in
> particular means garbage preempt_count values.
> 
> Emergency stack code runs with interrupts disabled entirely, and is
> used very rarely, so this has been unnoticed so far. It was found by a
> proposed new powerpc watchdog that takes a soft-NMI directly from the
> masked_interrupt handler and using the emergency stack. That crashed at
> BUG_ON(in_nmi()) in nmi_enter(). preempt_count()s were found to be
> garbage.
> 
> Reported-by: Abdul Haleem <abdha...@linux.vnet.ibm.com>
> Signed-off-by: Nicholas Piggin <npig...@gmail.com>
> ---
> 
> FYI, this bug looks to be breaking linux-next on some powerpc
> boxes due to interaction with a proposed new powerpc watchdog
> driver Andrew has in his tree:
> 
> http://marc.info/?l=linuxppc-embedded=149794320519941=2
> 
>  arch/powerpc/include/asm/thread_info.h | 19 +++
>  arch/powerpc/kernel/setup_64.c |  6 +++---
>  2 files changed, 22 insertions(+), 3 deletions(-)

Hi Nicholas,

Thanks for the patch. Verified on next-20170621; PowerPC bare-metal
boots fine with your patch.

Tested-by: Abdul Haleem <abdha...@linux.vnet.ibm.com>

Thanks for all your support.

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [RFC v2 07/12] powerpc: Macro the mask used for checking DSI exception

2017-06-21 Thread Ram Pai
On Wed, Jun 21, 2017 at 12:55:42PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai  writes:
> 
> > Replace the magic number used to check for DSI exception
> > with a meaningful value.
> >
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/include/asm/reg.h   | 9 -
> >  arch/powerpc/kernel/exceptions-64s.S | 2 +-
> >  2 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> > index 7e50e47..2dcb8a1 100644
> > --- a/arch/powerpc/include/asm/reg.h
> > +++ b/arch/powerpc/include/asm/reg.h
> > @@ -272,16 +272,23 @@
> >  #define SPRN_DAR   0x013   /* Data Address Register */
> >  #define SPRN_DBCR  0x136   /* e300 Data Breakpoint Control Reg */
> >  #define SPRN_DSISR 0x012   /* Data Storage Interrupt Status Register */
> > +#define   DSISR_BIT32  0x80000000  /* not defined */
> >  #define   DSISR_NOHPTE 0x40000000  /* no translation found */
> > +#define   DSISR_PAGEATTR_CONFLT    0x20000000  /* page attribute conflict */
> > +#define   DSISR_BIT35  0x10000000  /* not defined */
> >  #define   DSISR_PROTFAULT  0x08000000  /* protection fault */
> >  #define   DSISR_BADACCESS  0x04000000  /* bad access to CI or G */
> >  #define   DSISR_ISSTORE    0x02000000  /* access was a store */
> >  #define   DSISR_DABRMATCH  0x00400000  /* hit data breakpoint */
> > -#define   DSISR_NOSEGMENT  0x00200000  /* SLB miss */
> >  #define   DSISR_KEYFAULT   0x00200000  /* Key fault */
> > +#define   DSISR_BIT43  0x00100000  /* not defined */
> >  #define   DSISR_UNSUPP_MMU 0x00080000  /* Unsupported MMU config */
> >  #define   DSISR_SET_RC 0x00040000  /* Failed setting of R/C bits */
> >  #define   DSISR_PGDIRFAULT 0x00020000  /* Fault on page directory */
> > +#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 | \
> > +   DSISR_PAGEATTR_CONFLT | \
> > +   DSISR_BADACCESS |   \
> > +   DSISR_BIT43)
> >  #define SPRN_TBRL  0x10C   /* Time Base Read Lower Register (user, R/O) */
> >  #define SPRN_TBRU  0x10D   /* Time Base Read Upper Register (user, R/O) */
> >  #define SPRN_CIR   0x11B   /* Chip Information Register (hyper, R/0) */
> > diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> > b/arch/powerpc/kernel/exceptions-64s.S
> > index ae418b8..3fd0528 100644
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
> > .balign IFETCH_ALIGN_BYTES
> >  do_hash_page:
> >  #ifdef CONFIG_PPC_STD_MMU_64
> > -   andis.  r0,r4,0xa410/* weird error? */
> > +   andis.  r0,r4,DSISR_PAGE_FAULT_MASK@h
> > bne-handle_page_fault   /* if not, try to insert a HPTE */
> > andis.  r0,r4,DSISR_DABRMATCH@h
> > bne-handle_dabr_fault
> 
> 
> Thanks for doing this. I always wondered what that 0xa410 indicates. Now
> that it is documented, I am wondering: are those the only DSISR values
> that we want to check early? You also added a few bit positions that are
> expected to carry value 0, but then excluded BIT35. Any reason?

I did not look deeply into why the exact number 0xa410 was used in the
past.  I built the macro DSISR_PAGE_FAULT_MASK using whatever bits make
up 0xa410.  BIT35 if added to the DSISR_PAGE_FAULT_MASK would make it
0xb410. So I did not consider it.

However, the macro for BIT35 is already defined in this patch, if that is
what you were looking for:
+#define   DSISR_BIT35  0x10000000  /* not defined */

RP



[PATCH V4 2/2] powerpc/powernv : Add support for OPAL-OCC command/response interface

2017-06-21 Thread Shilpasri G Bhat
In P9, the OCC (On-Chip Controller) supports a shared-memory based
command-response interface. Within the shared memory there is an OPAL
command buffer and an OCC response buffer that can be used to send
inband commands to the OCC. This patch adds a platform driver to support
the command/response interface between the OCC and the host.

Signed-off-by: Shilpasri G Bhat 
---
- Hold occ->cmd_in_progress in read()
- Reset occ->rsp_consumed if copy_to_user() fails

 arch/powerpc/include/asm/opal-api.h|  41 +++-
 arch/powerpc/include/asm/opal.h|   3 +
 arch/powerpc/platforms/powernv/Makefile|   2 +-
 arch/powerpc/platforms/powernv/opal-occ.c  | 313 +
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 arch/powerpc/platforms/powernv/opal.c  |   8 +
 6 files changed, 366 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-occ.c

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index cb3e624..011d86c 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -42,6 +42,10 @@
 #define OPAL_I2C_STOP_ERR  -24
 #define OPAL_XIVE_PROVISIONING -31
 #define OPAL_XIVE_FREE_ACTIVE  -32
+#define OPAL_OCC_INVALID_STATE -33
+#define OPAL_OCC_BUSY  -34
+#define OPAL_OCC_CMD_TIMEOUT   -35
+#define OPAL_OCC_RSP_MISMATCH  -36
 
 /* API Tokens (in r0) */
 #define OPAL_INVALID_CALL -1
@@ -190,7 +194,8 @@
 #define OPAL_NPU_INIT_CONTEXT  146
 #define OPAL_NPU_DESTROY_CONTEXT   147
 #define OPAL_NPU_MAP_LPAR  148
-#define OPAL_LAST  148
+#define OPAL_OCC_COMMAND   149
+#define OPAL_LAST  149
 
 /* Device tree flags */
 
@@ -829,6 +834,40 @@ struct opal_prd_msg_header {
 
 struct opal_prd_msg;
 
+enum occ_cmd {
+   OCC_CMD_AMESTER_PASS_THRU = 0,
+   OCC_CMD_CLEAR_SENSOR_DATA,
+   OCC_CMD_SET_POWER_CAP,
+   OCC_CMD_SET_POWER_SHIFTING_RATIO,
+   OCC_CMD_SELECT_SENSOR_GROUPS,
+   OCC_CMD_LAST
+};
+
+struct opal_occ_cmd_rsp_msg {
+   __be64 cdata;
+   __be64 rdata;
+   __be16 cdata_size;
+   __be16 rdata_size;
+   u8 cmd;
+   u8 request_id;
+   u8 status;
+};
+
+struct opal_occ_cmd_data {
+   __be16 size;
+   u8 cmd;
+   u8 data[];
+};
+
+struct opal_occ_rsp_data {
+   __be16 size;
+   u8 status;
+   u8 data[];
+};
+
+#define MAX_OPAL_CMD_DATA_LENGTH4090
+#define MAX_OCC_RSP_DATA_LENGTH 8698
+
 #define OCC_RESET   0
 #define OCC_LOAD1
 #define OCC_THROTTLE2
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 03ed493..e55ed79 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -346,6 +346,9 @@ static inline int opal_get_async_rc(struct opal_msg msg)
 
 void opal_wake_poller(void);
 
+int64_t opal_occ_command(int chip_id, struct opal_occ_cmd_rsp_msg *msg,
+bool retry);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_OPAL_H */
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..f5f0902 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,7 +2,7 @@ obj-y   += setup.o opal-wrappers.o opal.o 
opal-async.o idle.o
 obj-y  += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
-obj-y  += opal-kmsg.o
+obj-y  += opal-kmsg.o opal-occ.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-occ.c 
b/arch/powerpc/platforms/powernv/opal-occ.c
new file mode 100644
index 000..b346724
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-occ.c
@@ -0,0 +1,313 @@
+/*
+ * Copyright IBM Corporation 2017
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "opal-occ: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct occ {
+   struct miscdevice dev;
+   struct opal_occ_rsp_data *rsp;
+   atomic_t session;
+   atomic_t cmd_in_progress;
+   