[PATCH v2 2/2] powerpc/eeh: Release EEH device state synchronously

2020-04-19 Thread Sam Bobroff
EEH device state is currently removed (by eeh_remove_device()) during
the device release handler, which is invoked as the device's reference
count drops to zero. This may take some time, or forever, as other
threads may hold references.

However, the PCI device state is released synchronously by
pci_stop_and_remove_bus_device(). This mismatch causes problems; for
example, the device may be re-discovered as a new device before the
release handler has been called, leaving the PCI and EEH state
mismatched.

So instead, call eeh_remove_device() from the bus device removal
handlers, which are called synchronously in the removal path.

Signed-off-by: Sam Bobroff 
---
v2 - Added comment explaining why the add case can't be handled similarly to 
the remove case.

 arch/powerpc/kernel/eeh.c | 31 +++
 arch/powerpc/kernel/pci-hotplug.c |  2 --
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 17cb3e9b5697..64361311bc8e 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1106,6 +1106,37 @@ static int eeh_init(void)
 
 core_initcall_sync(eeh_init);
 
+static int eeh_device_notifier(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct device *dev = data;
+
+   switch (action) {
+   /*
+* Note: It's not possible to perform EEH device addition (i.e.
+* {pseries,pnv}_pcibios_bus_add_device()) here because it depends on
+* the device's resources, which have not yet been set up.
+*/
+   case BUS_NOTIFY_DEL_DEVICE:
+   eeh_remove_device(to_pci_dev(dev));
+   break;
+   default:
+   break;
+   }
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block eeh_device_nb = {
+   .notifier_call = eeh_device_notifier,
+};
+
+static __init int eeh_set_bus_notifier(void)
+{
+   bus_register_notifier(&pci_bus_type, &eeh_device_nb);
+   return 0;
+}
+arch_initcall(eeh_set_bus_notifier);
+
 /**
  * eeh_add_device_early - Enable EEH for the indicated device node
  * @pdn: PCI device node for which to set up EEH
diff --git a/arch/powerpc/kernel/pci-hotplug.c 
b/arch/powerpc/kernel/pci-hotplug.c
index d6a67f814983..28e9aa274f64 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -57,8 +57,6 @@ void pcibios_release_device(struct pci_dev *dev)
struct pci_controller *phb = pci_bus_to_host(dev->bus);
struct pci_dn *pdn = pci_get_pdn(dev);
 
-   eeh_remove_device(dev);
-
if (phb->controller_ops.release_device)
phb->controller_ops.release_device(dev);
 
-- 
2.22.0.216.g00a2a96fc9



[PATCH v2 1/2] powerpc/eeh: fix pseries_eeh_configure_bridge()

2020-04-19 Thread Sam Bobroff
If a device is hot unplugged during EEH recovery, it's possible for the
RTAS call to ibm,configure-pe in pseries_eeh_configure() to return
parameter error (-3); however, negative return values are not checked
for, and this leads to an infinite loop.

Fix this by correctly bailing out on negative values.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 893ba3f562c4..c4ef03bec0de 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -605,7 +605,7 @@ static int pseries_eeh_configure_bridge(struct eeh_pe *pe)
config_addr, BUID_HI(pe->phb->buid),
BUID_LO(pe->phb->buid));
 
-   if (!ret)
+   if (ret <= 0)
return ret;
 
/*
-- 
2.22.0.216.g00a2a96fc9



[PATCH v2 0/2] powerpc/eeh: Release EEH device state synchronously

2020-04-19 Thread Sam Bobroff
Hi everyone,

Here are some fixes and cleanups that have come from other work but that I
think stand on their own.

Only one patch ("Release EEH device state synchronously", suggested by Oliver
O'Halloran) is a significant change: it moves the cleanup of some EEH device
data out of the (possibly asynchronous) device release handler and into the
(synchronously called) bus notifier. This is useful for future work as it makes
it easier to reason about the lifetimes of EEH structures.

Note that I've left a few WARN_ON_ONCEs in the code because I'm paranoid, but I
have not been able to hit them during testing.

Cheers,
Sam.

Notes for v2:

I've dropped both cleanup patches (3/4, 4/4) because that type of cleanup
(replacing a call to eeh_rmv_from_parent_pe() with one to eeh_remove_device())
is incorrect: if called during recovery, it will cause edev->pe to remain set
when it would have been cleared previously. This would lead to stale
information in the edev. I think there should be a way to simplify the code
around EEH_PE_KEEP but I'll look at that separately.

Patch set changelog follows:

Patch set v2: 
Patch 1/2: powerpc/eeh: fix pseries_eeh_configure_bridge()
Patch 2/2: powerpc/eeh: Release EEH device state synchronously
- Added comment explaining why the add case can't be handled similarly to the 
remove case.
Dropped (was 3/4) powerpc/eeh: Remove workaround from eeh_add_device_late()
Dropped (was 4/4) powerpc/eeh: Clean up edev cleanup for VFs

Patch set v1:
Patch 1/4: powerpc/eeh: fix pseries_eeh_configure_bridge()
Patch 2/4: powerpc/eeh: Release EEH device state synchronously
Patch 3/4: powerpc/eeh: Remove workaround from eeh_add_device_late()
Patch 4/4: powerpc/eeh: Clean up edev cleanup for VFs

Sam Bobroff (2):
  powerpc/eeh: fix pseries_eeh_configure_bridge()
  powerpc/eeh: Release EEH device state synchronously

 arch/powerpc/kernel/eeh.c| 31 
 arch/powerpc/kernel/pci-hotplug.c|  2 --
 arch/powerpc/platforms/pseries/eeh_pseries.c |  2 +-
 3 files changed, 32 insertions(+), 3 deletions(-)

-- 
2.22.0.216.g00a2a96fc9



[PATCH v2] powerpc/8xx: Fix STRICT_KERNEL_RWX startup test failure

2020-04-19 Thread Christophe Leroy
WRITE_RO lkdtm test works.

But when selecting CONFIG_DEBUG_RODATA_TEST, the kernel reports
rodata_test: test data was not read only

This is because when rodata test runs, there are still old entries
in TLB.

Flush TLB after setting kernel pages RO or NX.

Fixes: d5f17ee96447 ("powerpc/8xx: don't disable large TLBs with 
CONFIG_STRICT_KERNEL_RWX")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/nohash/8xx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 3189308dece4..d83a12c5bc7f 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -185,6 +185,7 @@ void mmu_mark_initmem_nx(void)
mmu_mapin_ram_chunk(etext8, einittext8, PAGE_KERNEL);
}
}
+   _tlbil_all();
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
@@ -199,6 +200,8 @@ void mmu_mark_rodata_ro(void)
  ~(LARGE_PAGE_SIZE_8M - 1)));
mmu_patch_addis(__dtlbmiss_romem_top, -__pa(_sinittext));
 
+   _tlbil_all();
+
/* Update page tables for PTDUMP and BDI */
mmu_mapin_ram_chunk(0, sinittext, __pgprot(0));
mmu_mapin_ram_chunk(0, etext, PAGE_KERNEL_ROX);
-- 
2.25.0



Re: [PATCH] powerpc/8xx: Fix STRICT_KERNEL_RWX startup test failure

2020-04-19 Thread Christophe Leroy




Le 20/04/2020 à 07:29, Christophe Leroy a écrit :

WRITE_RO lkdtm test works.

But when selecting CONFIG_DEBUG_RODATA_TEST, the kernel reports
rodata_test: test data was not read only

This is because when rodata test runs, there are still old entries
in TLB.

Flush TLB after setting kernel pages RO or NX.

Fixes: d5f17ee96447 ("powerpc/8xx: don't disable large TLBs with 
CONFIG_STRICT_KERNEL_RWX")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
  arch/powerpc/kvm/Makefile| 2 +-


Oops, this change shouldn't be there. Will send v2.


  arch/powerpc/mm/nohash/8xx.c | 3 +++
  2 files changed, 4 insertions(+), 1 deletion(-)



Christophe


[PATCH] powerpc/8xx: Fix STRICT_KERNEL_RWX startup test failure

2020-04-19 Thread Christophe Leroy
WRITE_RO lkdtm test works.

But when selecting CONFIG_DEBUG_RODATA_TEST, the kernel reports
rodata_test: test data was not read only

This is because when rodata test runs, there are still old entries
in TLB.

Flush TLB after setting kernel pages RO or NX.

Fixes: d5f17ee96447 ("powerpc/8xx: don't disable large TLBs with 
CONFIG_STRICT_KERNEL_RWX")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kvm/Makefile| 2 +-
 arch/powerpc/mm/nohash/8xx.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 2bfeaa13befb..906707d15810 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -135,4 +135,4 @@ obj-$(CONFIG_KVM_BOOK3S_32) += kvm.o
 obj-$(CONFIG_KVM_BOOK3S_64_PR) += kvm-pr.o
 obj-$(CONFIG_KVM_BOOK3S_64_HV) += kvm-hv.o
 
-obj-y += $(kvm-book3s_64-builtin-objs-y)
+obj-$(CONFIG_KVM_BOOK3S_64) += $(kvm-book3s_64-builtin-objs-y)
diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 3189308dece4..d83a12c5bc7f 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -185,6 +185,7 @@ void mmu_mark_initmem_nx(void)
mmu_mapin_ram_chunk(etext8, einittext8, PAGE_KERNEL);
}
}
+   _tlbil_all();
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
@@ -199,6 +200,8 @@ void mmu_mark_rodata_ro(void)
  ~(LARGE_PAGE_SIZE_8M - 1)));
mmu_patch_addis(__dtlbmiss_romem_top, -__pa(_sinittext));
 
+   _tlbil_all();
+
/* Update page tables for PTDUMP and BDI */
mmu_mapin_ram_chunk(0, sinittext, __pgprot(0));
mmu_mapin_ram_chunk(0, etext, PAGE_KERNEL_ROX);
-- 
2.25.0



Re: [PATCH 2/2] powerpc/fadump: consider reserved ranges while reserving memory

2020-04-19 Thread Mahesh J Salgaonkar
On 2020-03-11 01:57:10 Wed, Hari Bathini wrote:
> Commit 0962e8004e97 ("powerpc/prom: Scan reserved-ranges node for
> memory reservations") enabled support to parse reserved-ranges DT
> node and reserve kernel memory falling in these ranges for F/W
> purposes. Memory reserved for FADump should not overlap with these
> ranges as it could corrupt memory meant for F/W or crash'ed kernel
> memory to be exported as vmcore.
> 
> But since commit 579ca1a27675 ("powerpc/fadump: make use of memblock's
> bottom up allocation mode"), memblock_find_in_range() is being used to
> find the appropriate area to reserve memory for FADump, which can't
> account for reserved-ranges as these ranges are reserved only after
> FADump memory reservation.
> 
> With reserved-ranges now being populated during early boot, look out
> for these memory ranges while reserving memory for FADump. Without
> this change, MPIPL on PowerNV systems aborts with hostboot failure,
> when memory reserved for FADump is less than 4096MB.
> 
> Fixes: 579ca1a27675 ("powerpc/fadump: make use of memblock's bottom up 
> allocation mode")
> Cc: sta...@vger.kernel.org # v5.4+
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/kernel/fadump.c |   76 
> --
>  1 file changed, 66 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 7fcf4a8f..ab83be9 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -443,10 +443,70 @@ static int __init fadump_get_boot_mem_regions(void)
>   return ret;
>  }
>  
> +/*
> + * Returns true, if the given range overlaps with reserved memory ranges
> + * starting at idx. Also, updates idx to index of overlapping memory range
> + * with the given memory range.
> + * False, otherwise.
> + */
> +static bool overlaps_reserved_ranges(u64 base, u64 end, int *idx)
> +{
> + bool ret = false;
> + int i;
> +
> + for (i = *idx; i < reserved_mrange_info.mem_range_cnt; i++) {
> + u64 rbase = reserved_mrange_info.mem_ranges[i].base;
> + u64 rend = rbase + reserved_mrange_info.mem_ranges[i].size;
> +
> + if (end <= rbase)
> + break;
> +
> + if ((end > rbase) &&  (base < rend)) {
> + *idx = i;
> + ret = true;
> + break;
> + }
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Locate a suitable memory area to reserve memory for FADump. While at it,
> + * lookup reserved-ranges & avoid overlap with them, as they are used by F/W.
> + */
> +static u64 __init fadump_locate_reserve_mem(u64 base, u64 size)
> +{
> + struct fadump_memory_range *mrngs;
> + phys_addr_t mstart, mend;
> + int idx = 0;
> + u64 i;
> +
> + mrngs = reserved_mrange_info.mem_ranges;
> + for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
> + &mstart, &mend, NULL) {
> + pr_debug("%llu) mstart: %llx, mend: %llx, base: %llx\n",
> +  i, mstart, mend, base);
> +
> + if (mstart > base)
> + base = PAGE_ALIGN(mstart);
> +
> + while ((mend > base) && ((mend - base) >= size)) {
> + if (!overlaps_reserved_ranges(base, base + size, &idx))
> + goto out;
> +
> + base = mrngs[idx].base + mrngs[idx].size;
> + base = PAGE_ALIGN(base);

What happens when all the memory ranges found overlap with reserved
ranges? Shouldn't this function return NULL? It looks like in that case
this function returns the last base address set, which is either still
overlapping or not big enough in size.
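
For illustration only (not the posted patch, just a sketch of one way the
failure case could be made explicit): the function could track whether a
fitting, non-overlapping base was actually found and return 0 otherwise, so
the caller can treat 0 as "no suitable area":

static u64 __init fadump_locate_reserve_mem(u64 base, u64 size)
{
	struct fadump_memory_range *mrngs = reserved_mrange_info.mem_ranges;
	phys_addr_t mstart, mend;
	int idx = 0;
	u64 i, ret = 0;

	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
				&mstart, &mend, NULL) {
		if (mstart > base)
			base = PAGE_ALIGN(mstart);

		while ((mend > base) && ((mend - base) >= size)) {
			/* found a fitting area that avoids reserved-ranges */
			if (!overlaps_reserved_ranges(base, base + size, &idx)) {
				ret = base;
				goto out;
			}
			/* skip past the overlapping reserved range and retry */
			base = PAGE_ALIGN(mrngs[idx].base + mrngs[idx].size);
		}
	}
out:
	return ret;	/* 0 means no suitable area was found */
}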

Rest looks good to me.

Thanks,
-Mahesh.

> + }
> + }
> +
> +out:
> + return base;
> +}
> +



Re: [PATCH 1/2] powerpc/fadump: use static allocation for reserved memory ranges

2020-04-19 Thread Mahesh J Salgaonkar
On 2020-03-11 01:57:00 Wed, Hari Bathini wrote:
> At times, memory ranges have to be looked up during early boot, when
> kernel couldn't be initialized for dynamic memory allocation. In fact,
> reserved-ranges look up is needed during FADump memory reservation.
> Without accounting for reserved-ranges in reserving memory for FADump,
> MPIPL boot fails with memory corruption issues. So, extend memory
> ranges handling to support static allocation and populate reserved
> memory ranges during early boot.
> 
> Fixes: dda9dbfeeb7a ("powerpc/fadump: consider reserved ranges while 
> releasing memory")
> Cc: sta...@vger.kernel.org # v5.4+
> Signed-off-by: Hari Bathini 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

> ---
>  arch/powerpc/include/asm/fadump-internal.h |4 +
>  arch/powerpc/kernel/fadump.c   |   77 
> 
>  2 files changed, 48 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/fadump-internal.h 
> b/arch/powerpc/include/asm/fadump-internal.h
> index c814a2b..8d61c8f 100644
> --- a/arch/powerpc/include/asm/fadump-internal.h
> +++ b/arch/powerpc/include/asm/fadump-internal.h
> @@ -64,12 +64,14 @@ struct fadump_memory_range {
>  };
>  
>  /* fadump memory ranges info */
> +#define RNG_NAME_SZ  16
>  struct fadump_mrange_info {
> - charname[16];
> + charname[RNG_NAME_SZ];
>   struct fadump_memory_range  *mem_ranges;
>   u32 mem_ranges_sz;
>   u32 mem_range_cnt;
>   u32 max_mem_ranges;
> + boolis_static;
>  };
>  
>  /* Platform specific callback functions */
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index ff0114a..7fcf4a8f 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -38,8 +38,17 @@ static void __init fadump_reserve_crash_area(u64 base);
>  
>  #ifndef CONFIG_PRESERVE_FA_DUMP
>  static DEFINE_MUTEX(fadump_mutex);
> -struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0 };
> -struct fadump_mrange_info reserved_mrange_info = { "reserved", NULL, 0, 0, 0 
> };
> +struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0, 
> false };
> +
> +#define RESERVED_RNGS_SZ 16384 /* 16K - 128 entries */
> +#define RESERVED_RNGS_CNT  (RESERVED_RNGS_SZ / \
> +  sizeof(struct fadump_memory_range))
> +static struct fadump_memory_range rngs[RESERVED_RNGS_CNT];
> +struct fadump_mrange_info reserved_mrange_info = { "reserved", rngs,
> +RESERVED_RNGS_SZ, 0,
> +RESERVED_RNGS_CNT, true };
> +
> +static void __init early_init_dt_scan_reserved_ranges(unsigned long node);
>  
>  #ifdef CONFIG_CMA
>  static struct cma *fadump_cma;
> @@ -108,6 +117,11 @@ static int __init fadump_cma_init(void) { return 1; }
>  int __init early_init_dt_scan_fw_dump(unsigned long node, const char *uname,
> int depth, void *data)
>  {
> + if (depth == 0) {
> + early_init_dt_scan_reserved_ranges(node);
> + return 0;
> + }
> +
>   if (depth != 1)
>   return 0;
>  
> @@ -726,10 +740,14 @@ void fadump_free_cpu_notes_buf(void)
>  
>  static void fadump_free_mem_ranges(struct fadump_mrange_info *mrange_info)
>  {
> + if (mrange_info->is_static) {
> + mrange_info->mem_range_cnt = 0;
> + return;
> + }
> +
>   kfree(mrange_info->mem_ranges);
> - mrange_info->mem_ranges = NULL;
> - mrange_info->mem_ranges_sz = 0;
> - mrange_info->max_mem_ranges = 0;
> + memset((void *)((u64)mrange_info + RNG_NAME_SZ), 0,
> +(sizeof(struct fadump_mrange_info) - RNG_NAME_SZ));
>  }
>  
>  /*
> @@ -786,6 +804,12 @@ static inline int fadump_add_mem_range(struct 
> fadump_mrange_info *mrange_info,
>   if (mrange_info->mem_range_cnt == mrange_info->max_mem_ranges) {
>   int ret;
>  
> + if (mrange_info->is_static) {
> + pr_err("Reached array size limit for %s memory 
> ranges\n",
> +mrange_info->name);
> + return -ENOSPC;
> + }
> +
>   ret = fadump_alloc_mem_ranges(mrange_info);
>   if (ret)
>   return ret;
> @@ -1202,20 +1226,19 @@ static void sort_and_merge_mem_ranges(struct 
> fadump_mrange_info *mrange_info)
>   * Scan reserved-ranges to consider them while reserving/releasing
>   * memory for FADump.
>   */
> -static inline int fadump_scan_reserved_mem_ranges(void)
> +static void __init early_init_dt_scan_reserved_ranges(unsigned long node)
>  {
> - struct device_node *root;
>   const 

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin
Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> > Note that because lr is clobbered we need at least once normally
>> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> > Otherwise stack frame setup is required to spill it.
>> >> 
>> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> >> registers, but we have some delay establishing the stack (depends on a
>> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> heavy whereas on the caller side you have r1 set up (modulo stack 
>> >> updates), and the system call is a long delay during which time the 
>> >> store queue has significant time to drain.
>> >> 
>> >> My feeling is it would be better for kernel to have these scratch 
>> >> registers.
>> > 
>> > If your new kernel syscall mechanism requires the caller to make a
>> > whole stack frame it otherwise doesn't need and spill registers to it,
>> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> > immediately lost on the userspace side, plus you either waste icache
>> > at the call point or require the syscall to go through a
>> > userspace-side helper function that performs the spill and restore.
>> 
>> You would be surprised how few cycles that takes on a high end CPU. Some 
>> might be a couple of %. I am one for counting cycles mind you, I'm not 
>> being flippant about it. If we can come up with something faster I'd be 
>> up for it.
> 
> If the cycle count is trivial then just do it on the kernel side.

The cycle count is trivial for the user side, because you have r1 ready.
The kernel does not have its stack ready; it has to do mfspr rX ; ld
rY,N(rX) to get a stack to save into.

Which is also wasted work for userspace.

Now that I think about it, no stack frame is even required! lr is saved
into the caller's stack when it's clobbered with an asm, just as when
it's used for a function call.
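
To make that concrete, here is a minimal, hypothetical user-side sketch (the
register and clobber lists are only illustrative of the proposal being
discussed, not a final ABI): listing "lr" in the asm clobbers makes the
compiler save it in the caller's existing frame, so no extra frame is needed
around the scv.

static inline long scv_syscall1(long nr, long arg)
{
	register long r0 __asm__("r0") = nr;
	register long r3 __asm__("r3") = arg;

	/* "lr" in the clobber list: the compiler spills it to the caller's frame */
	__asm__ volatile("scv 0"
			 : "+r"(r0), "+r"(r3)
			 :
			 : "r4", "r5", "r6", "r7", "r8",
			   "r9", "r10", "r11", "r12",
			   "lr", "ctr", "cr0", "cr1",
			   "cr5", "cr6", "cr7", "memory");
	return r3;
}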

>> > The right way to do this is to have the kernel preserve enough
>> > registers that userspace can avoid having any spills. It doesn't have
>> > to preserve everything, probably just enough to save lr. (BTW are
>> 
>> Again, the problem is the kernel doesn't have its dependencies 
>> immediately ready to spill, and spilling (may be) more costly 
>> immediately after the call because we're doing a lot of stores.
>> 
>> I could try measure this. Unfortunately our pipeline simulator tool 
>> doesn't model system calls properly so it's hard to see what's happening 
>> across the user/kernel horizon, I might check if that can be improved
>> or I can hack it by putting some isync in there or something.
> 
> I think it's unlikely to make any real difference to the total number
> of cycles spent which side it happens on, but putting it on the kernel
> side makes it easier to avoid wasting size/icache at each syscall
> site.
> 
>> > syscall arg registers still preserved? If not, this is a major cost on
>> > the userspace side, since any call point that has to loop-and-retry
>> > (e.g. futex) now needs to make its own place to store the original
>> > values.)
>> 
>> Powerpc system calls never did. We could have scv preserve them, but 
>> you'd still need to restore r3. We could make an ABI which does not
>> clobber r3 but puts the return value in r9, say. I'd like to see what
>> the user side code looks like to take advantage of such a thing though.
> 
> Oh wow, I hadn't realized that, but indeed the code we have now is
> allowing for the kernel to clobber them all. So at least this isn't
> getting any worse I guess. I think it was a very poor choice of
> behavior though and a disadvantage vs what other archs do (some of
> them preserve all registers; others preserve only normally call-saved
> ones plus the syscall arg ones and possibly a few other specials).

Well, we could change it. Does the generated code improve significantly
if we take those clobbers away?

Thanks,
Nick


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Rich Felker
On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> >> > Note that because lr is clobbered we need at least once normally
> >> > call-clobbered register that's not syscall clobbered to save lr in.
> >> > Otherwise stack frame setup is required to spill it.
> >> 
> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
> >> registers, but we have some delay establishing the stack (depends on a
> >> load which depends on a mfspr), and entry code tends to be quite store
> >> heavy whereas on the caller side you have r1 set up (modulo stack 
> >> updates), and the system call is a long delay during which time the 
> >> store queue has significant time to drain.
> >> 
> >> My feeling is it would be better for kernel to have these scratch 
> >> registers.
> > 
> > If your new kernel syscall mechanism requires the caller to make a
> > whole stack frame it otherwise doesn't need and spill registers to it,
> > it becomes a lot less attractive. Some of those 90 cycles saved are
> > immediately lost on the userspace side, plus you either waste icache
> > at the call point or require the syscall to go through a
> > userspace-side helper function that performs the spill and restore.
> 
> You would be surprised how few cycles that takes on a high end CPU. Some 
> might be a couple of %. I am one for counting cycles mind you, I'm not 
> being flippant about it. If we can come up with something faster I'd be 
> up for it.

If the cycle count is trivial then just do it on the kernel side.

> > The right way to do this is to have the kernel preserve enough
> > registers that userspace can avoid having any spills. It doesn't have
> > to preserve everything, probably just enough to save lr. (BTW are
> 
> Again, the problem is the kernel doesn't have its dependencies 
> immediately ready to spill, and spilling (may be) more costly 
> immediately after the call because we're doing a lot of stores.
> 
> I could try measure this. Unfortunately our pipeline simulator tool 
> doesn't model system calls properly so it's hard to see what's happening 
> across the user/kernel horizon, I might check if that can be improved
> or I can hack it by putting some isync in there or something.

I think it's unlikely to make any real difference to the total number
of cycles spent which side it happens on, but putting it on the kernel
side makes it easier to avoid wasting size/icache at each syscall
site.

> > syscall arg registers still preserved? If not, this is a major cost on
> > the userspace side, since any call point that has to loop-and-retry
> > (e.g. futex) now needs to make its own place to store the original
> > values.)
> 
> Powerpc system calls never did. We could have scv preserve them, but 
> you'd still need to restore r3. We could make an ABI which does not
> clobber r3 but puts the return value in r9, say. I'd like to see what
> the user side code looks like to take advantage of such a thing though.

Oh wow, I hadn't realized that, but indeed the code we have now is
allowing for the kernel to clobber them all. So at least this isn't
getting any worse I guess. I think it was a very poor choice of
behavior though and a disadvantage vs what other archs do (some of
them preserve all registers; others preserve only normally call-saved
ones plus the syscall arg ones and possibly a few other specials).

Rich


[PATCH v2, RESEND] misc: new driver sram_uapi for user level SRAM access

2020-04-19 Thread Wang Wenhu
Add a generic user/kernel interface that creates a misc device
supporting the ioctl and mmap file operations to access SRAM memory
from user level. Different kinds of SRAM allocation and free APIs can
be registered by a specific SRAM hardware-level driver to the available
list and then be chosen by users to allocate and map SRAM memory from
user level.

It is extremely helpful for user space applications that require high
performance memory accesses, such as embedded networking devices that
process data in user space; PowerPC e500 is one such case.
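
As a rough sketch of the intended usage from user space (the ioctl command
name and request struct below are placeholders invented for illustration;
the real definitions live in include/linux/sram_uapi.h in this patch, and
only the misc device name "sram" is taken from the changelog):

#include <fcntl.h>
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Placeholder ioctl and struct, NOT the real UAPI definitions */
struct sram_alloc_req {
	uint64_t offset;	/* filled in by the driver */
	uint64_t size;		/* requested size, page aligned */
};
#define SRAM_UAPI_IOC_ALLOC	_IOWR('S', 0, struct sram_alloc_req)

int main(void)
{
	struct sram_alloc_req req = { .size = 4096 };
	int fd = open("/dev/sram", O_RDWR);

	if (fd < 0)
		return 1;

	/* ask the driver for a chunk of SRAM, then map it */
	if (ioctl(fd, SRAM_UAPI_IOC_ALLOC, &req) == 0) {
		volatile char *p = mmap(NULL, req.size, PROT_READ | PROT_WRITE,
					MAP_SHARED, fd, req.offset);
		if (p != MAP_FAILED) {
			p[0] = 1;	/* touch the mapping */
			munmap((void *)p, req.size);
		}
	}
	close(fd);
	return 0;
}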

Signed-off-by: Wang Wenhu 
Cc: Greg Kroah-Hartman 
Cc: Arnd Bergmann 
Cc: Christophe Leroy 
Cc: Scott Wood 
Cc: Michael Ellerman 
Cc: Randy Dunlap 
Cc: linuxppc-dev@lists.ozlabs.org
---
Changes since v1: addressed comments from Arnd
 * Changed the ioctl cmd definitions to use the _IO macros
 * Export interfaces for HW-SRAM drivers to register APIs to the available list
 * Modified allocation alignment to PAGE_SIZE
 * Use phys_addr_t as the type of SRAM resource size and offset
 * Support compat_ioctl
 * Misc device name: sram

Note: From now on, the SRAM_UAPI driver is independent of any hardware
drivers, so I am only sending the patch itself as v2, while its v1 was
wrapped together with patches for the Freescale L2-Cache-SRAM device.
Afterwards, I'll create patches for the Freescale L2-Cache-SRAM device
as a separate series.

Link for v1:
 * https://lore.kernel.org/lkml/20200418162157.50428-5-wenhu.w...@vivo.com/
---
 drivers/misc/Kconfig  |  11 ++
 drivers/misc/Makefile |   1 +
 drivers/misc/sram_uapi.c  | 352 ++
 include/linux/sram_uapi.h |  28 +++
 4 files changed, 392 insertions(+)
 create mode 100644 drivers/misc/sram_uapi.c
 create mode 100644 include/linux/sram_uapi.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 99e151475d8f..b19c8b24f18e 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -465,6 +465,17 @@ config PVPANIC
  a paravirtualized device provided by QEMU; it lets a virtual machine
  (guest) communicate panic events to the host.
 
+config SRAM_UAPI
+   bool "Generic SRAM User Level API driver"
+   help
+ This driver allows you to create a misc device which could be used
+ as an interface to allocate SRAM memory from user level.
+
+ It is extremely helpful for some user space applications that require
+ high performance memory accesses.
+
+ If unsure, say N.
+
 source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
 source "drivers/misc/cb710/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 9abf2923d831..794447ca07ca 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_VMWARE_VMCI) += vmw_vmci/
 obj-$(CONFIG_LATTICE_ECP3_CONFIG)  += lattice-ecp3-config.o
 obj-$(CONFIG_SRAM) += sram.o
 obj-$(CONFIG_SRAM_EXEC)+= sram-exec.o
+obj-$(CONFIG_SRAM_UAPI)+= sram_uapi.o
 obj-y  += mic/
 obj-$(CONFIG_GENWQE)   += genwqe/
 obj-$(CONFIG_ECHO) += echo/
diff --git a/drivers/misc/sram_uapi.c b/drivers/misc/sram_uapi.c
new file mode 100644
index ..66c7b56b635f
--- /dev/null
+++ b/drivers/misc/sram_uapi.c
@@ -0,0 +1,352 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Vivo Communication Technology Co. Ltd.
+ * Copyright (C) 2020 Wang Wenhu 
+ * All rights reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRIVER_NAME"sram_uapi"
+
+struct res_info {
+   phys_addr_t offset;
+   phys_addr_t size;
+};
+
+struct sram_resource {
+   struct list_head        list;
+   struct res_info         info;
+   phys_addr_t             phys;
+   void                    *virt;
+   struct vm_area_struct   *vma;
+   struct sram_uapi        *parent;
+};
+
+struct sram_uapi {
+   struct list_head        res_list;
+   struct sram_api         *sa;
+};
+
+static DEFINE_MUTEX(sram_api_list_lock);
+static LIST_HEAD(sram_api_list);
+
+long sram_api_register(struct sram_api *sa)
+{
+   struct sram_api *cur;
+
+   if (!sa || !sa->name || !sa->sram_alloc || !sa->sram_free)
+   return -EINVAL;
+
+   mutex_lock(&sram_api_list_lock);
+   list_for_each_entry(cur, &sram_api_list, list) {
+   if (cur->type == sa->type) {
+   pr_err("error sram %s type %d exists\n", sa->name,
+  sa->type);
+   mutex_unlock(&sram_api_list_lock);
+   return -EEXIST;
+   }
+   }
+
+   kref_init(&sa->kref);
+   list_add_tail(&sa->list, &sram_api_list);
+   pr_info("sram %s type %d registered\n", sa->name, sa->type);
+
+   mutex_unlock(&sram_api_list_lock);
+
+   return 0;
+};
+EXPORT_SYMBOL(sram_api_register);
+
+long sram_api_unregister(struct 

Re: [PATCH] iommu: spapr_tce: Disable compile testing to fix build on book3s_32 config

2020-04-19 Thread Michael Ellerman
Christophe Leroy  writes:
> On 04/14/2020 02:26 PM, Krzysztof Kozlowski wrote:
>> Although SPAPR_TCE_IOMMU itself can be compile tested on certain PowerPC
>> configurations, its presence makes arch/powerpc/kvm/Makefile to select
>> modules which do not build in such configuration.
>> 
>> The arch/powerpc/kvm/ modules use kvm_arch.spapr_tce_tables which exists
>> only with CONFIG_PPC_BOOK3S_64.  However these modules are selected when
>> COMPILE_TEST and SPAPR_TCE_IOMMU are chosen leading to build failures:
>> 
>>  In file included from 
>> arch/powerpc/include/asm/book3s/64/mmu-hash.h:20:0,
>>   from arch/powerpc/kvm/book3s_64_vio_hv.c:22:
>>  arch/powerpc/include/asm/book3s/64/pgtable.h:17:0: error: "_PAGE_EXEC" 
>> redefined [-Werror]
>>   #define _PAGE_EXEC  0x1 /* execute permission */
>> 
>>  In file included from arch/powerpc/include/asm/book3s/32/pgtable.h:8:0,
>>   from arch/powerpc/include/asm/book3s/pgtable.h:8,
>>   from arch/powerpc/include/asm/pgtable.h:18,
>>   from include/linux/mm.h:95,
>>   from arch/powerpc/include/asm/io.h:29,
>>   from include/linux/io.h:13,
>>   from include/linux/irq.h:20,
>>   from arch/powerpc/include/asm/hardirq.h:6,
>>   from include/linux/hardirq.h:9,
>>   from include/linux/kvm_host.h:7,
>>   from arch/powerpc/kvm/book3s_64_vio_hv.c:12:
>>  arch/powerpc/include/asm/book3s/32/hash.h:29:0: note: this is the 
>> location of the previous definition
>>   #define _PAGE_EXEC 0x200 /* software: exec allowed */
>> 
>> Reported-by: Geert Uytterhoeven 
>> Fixes: e93a1695d7fb ("iommu: Enable compile testing for some of drivers")
>> Signed-off-by: Krzysztof Kozlowski 
>> ---
>>   drivers/iommu/Kconfig | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
>> index 58b4a4dbfc78..3532b1ead19d 100644
>> --- a/drivers/iommu/Kconfig
>> +++ b/drivers/iommu/Kconfig
>> @@ -362,7 +362,7 @@ config IPMMU_VMSA
>>   
>>   config SPAPR_TCE_IOMMU
>>  bool "sPAPR TCE IOMMU Support"
>> -depends on PPC_POWERNV || PPC_PSERIES || (PPC && COMPILE_TEST)
>> +depends on PPC_POWERNV || PPC_PSERIES
>>  select IOMMU_API
>>  help
>>Enables bits of IOMMU API required by VFIO. The iommu_ops
>> 
>
> Should it be fixed the other way round, something like:

That doesn't actually fix this specific issue; the code will build but
then fail to link:

  ld: arch/powerpc/../../virt/kvm/vfio.o: in function 
`.kvm_spapr_tce_release_vfio_group':
  vfio.c:(.text.kvm_spapr_tce_release_vfio_group+0xb0): undefined reference to 
`.kvm_spapr_tce_release_iommu_group'
  ld: arch/powerpc/../../virt/kvm/vfio.o: in function `.kvm_vfio_set_group':
  vfio.c:(.text.kvm_vfio_set_group+0x7f4): undefined reference to 
`.kvm_spapr_tce_attach_iommu_group'
  ld: arch/powerpc/kvm/powerpc.o: in function `.kvm_arch_vm_ioctl':
  (.text.kvm_arch_vm_ioctl+0x1a4): undefined reference to 
`.kvm_vm_ioctl_create_spapr_tce'
  ld: (.text.kvm_arch_vm_ioctl+0x230): undefined reference to 
`.kvm_vm_ioctl_create_spapr_tce'
  make[1]: *** [/home/michael/linux/Makefile:1106: vmlinux] Error 1

> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 2bfeaa13befb..906707d15810 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -135,4 +135,4 @@ obj-$(CONFIG_KVM_BOOK3S_32) += kvm.o
>   obj-$(CONFIG_KVM_BOOK3S_64_PR) += kvm-pr.o
>   obj-$(CONFIG_KVM_BOOK3S_64_HV) += kvm-hv.o
>
> -obj-y += $(kvm-book3s_64-builtin-objs-y)
> +obj-$(CONFIG_KVM_BOOK3S_64) += $(kvm-book3s_64-builtin-objs-y)

But this is probably still a good thing to do, as it would have made the
error messages clearer in this case I think.

cheers


Re: [PATCH] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-04-19 Thread Philip Li
On Mon, Apr 13, 2020 at 04:50:38PM +0530, Santosh Sivaraj wrote:
> kbuild test robot  writes:
> 
> > Hi Santosh,
> >
> > Thank you for the patch! Yet something to improve:
> >
> > [auto build test ERROR on powerpc/next]
> > [also build test ERROR on v5.7-rc1 next-20200412]
> > [if your patch is applied to the wrong git tree, please drop us a note to 
> > help
> > improve the system. BTW, we also suggest to use '--base' option to specify 
> > the
> > base tree in git format-patch, please see
> > https://stackoverflow.com/a/37406982]
> 
> This patch depends on "powerpc/mce: Add MCE notification chain" [1].
Got it, thanks for the input. Currently the bot is not able to figure
this out for two separate patch sets; here the --base option may help.

> 
> [1]: 
> https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/
> 
> Thanks,
> Santosh
> 
> >
> > url:
> > https://github.com/0day-ci/linux/commits/Santosh-Sivaraj/papr-scm-Add-bad-memory-ranges-to-nvdimm-bad-ranges/20200401-171233
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> > next
> > config: powerpc-allyesconfig (attached as .config)
> > compiler: powerpc64-linux-gcc (GCC) 9.3.0
> > reproduce:
> > wget 
> > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> > ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # save the attached .config to linux build tree
> > GCC_VERSION=9.3.0 make.cross ARCH=powerpc 
> >
> > If you fix the issue, kindly add following tag as appropriate
> > Reported-by: kbuild test robot 
> >
> > All errors (new ones prefixed by >>):
> >
> >arch/powerpc/platforms/pseries/papr_scm.c: In function 'papr_scm_init':
> >>> arch/powerpc/platforms/pseries/papr_scm.c:584:3: error: implicit 
> >>> declaration of function 'mce_register_notifier'; did you mean 
> >>> 'bus_register_notifier'? [-Werror=implicit-function-declaration]
> >  584 |   mce_register_notifier(&mce_ue_nb);
> >  |   ^
> >  |   bus_register_notifier
> >arch/powerpc/platforms/pseries/papr_scm.c: In function 'papr_scm_exit':
> >>> arch/powerpc/platforms/pseries/papr_scm.c:592:2: error: implicit 
> >>> declaration of function 'mce_unregister_notifier'; did you mean 
> >>> 'bus_unregister_notifier'? [-Werror=implicit-function-declaration]
> >  592 |  mce_unregister_notifier(&mce_ue_nb);
> >  |  ^~~
> >  |  bus_unregister_notifier
> >cc1: some warnings being treated as errors
> >
> > vim +584 arch/powerpc/platforms/pseries/papr_scm.c
> >
> >577  
> >578  static int __init papr_scm_init(void)
> >579  {
> >580  int ret;
> >581  
> >582  ret = platform_driver_register(&papr_scm_driver);
> >583  if (!ret)
> >  > 584  mce_register_notifier(&mce_ue_nb);
> >585  
> >586  return ret;
> >587  }
> >588  module_init(papr_scm_init);
> >589  
> >590  static void __exit papr_scm_exit(void)
> >591  {
> >  > 592  mce_unregister_notifier(&mce_ue_nb);
> >593  platform_driver_unregister(&papr_scm_driver);
> >594  }
> >595  module_exit(papr_scm_exit);
> >596  
> >
> > ---
> > 0-DAY CI Kernel Test Service, Intel Corporation
> > https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
> 


Re: [PATCH v2] powerpc/setup_64: Set cache-line-size based on cache-block-size

2020-04-19 Thread Michael Ellerman
Chris Packham  writes:
> On Thu, 2020-04-16 at 21:43 +1000, Michael Ellerman wrote:
>> Chris Packham  writes:
>> > On Wed, 2020-03-25 at 16:18 +1300, Chris Packham wrote:
>> > > If {i,d}-cache-block-size is set and {i,d}-cache-line-size is
>> > > not,
>> > > use
>> > > the block-size value for both. Per the devicetree spec cache-
>> > > line-
>> > > size
>> > > is only needed if it differs from the block size.
>> > > 
>> > > Signed-off-by: Chris Packham 
>> > > ---
>> > > It looks as though the bsizep = lsizep is not required per the
>> > > spec
>> > > but it's
>> > > probably safer to retain it.
>> > > 
>> > > Changes in v2:
>> > > - Scott pointed out that u-boot should be filling in the cache
>> > > properties
>> > >   (which it does). But it does not specify a cache-line-size
>> > > because
>> > > it
>> > >   provides a cache-block-size and the spec says you don't have to
>> > > if
>> > > they are
>> > >   the same. So the error is in the parsing not in the devicetree
>> > > itself.
>> > > 
>> > 
>> > Ping? This thread went kind of quiet.
>> 
>> I replied in the other thread:
>> 
>>   
>> https://lore.kernel.org/linuxppc-dev/87369xx99u@mpe.ellerman.id.au/
>> 
>> But then the merge window happened which is a busy time.
>> 
>
> Yeah I figured that was the case.
>
>> What I'd really like is a v3 that incorporates the info I wrote in
>> the
>> other thread and a Fixes tag.
>> 
>> If you feel like doing that, that would be great. Otherwise I'll do
>> it
>> tomorrow.
>
> I'll rebase against Linus's tree and have a go a adding some more words
> to the commit message.

Thanks.

cheers


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin
Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> > Note that because lr is clobbered we need at least once normally
>> > call-clobbered register that's not syscall clobbered to save lr in.
>> > Otherwise stack frame setup is required to spill it.
>> 
>> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> registers, but we have some delay establishing the stack (depends on a
>> load which depends on a mfspr), and entry code tends to be quite store
>> heavy whereas on the caller side you have r1 set up (modulo stack 
>> updates), and the system call is a long delay during which time the 
>> store queue has significant time to drain.
>> 
>> My feeling is it would be better for kernel to have these scratch 
>> registers.
> 
> If your new kernel syscall mechanism requires the caller to make a
> whole stack frame it otherwise doesn't need and spill registers to it,
> it becomes a lot less attractive. Some of those 90 cycles saved are
> immediately lost on the userspace side, plus you either waste icache
> at the call point or require the syscall to go through a
> userspace-side helper function that performs the spill and restore.

You would be surprised how few cycles that takes on a high end CPU. Some 
might be a couple of %. I am one for counting cycles mind you, I'm not 
being flippant about it. If we can come up with something faster I'd be 
up for it.

> 
> The right way to do this is to have the kernel preserve enough
> registers that userspace can avoid having any spills. It doesn't have
> to preserve everything, probably just enough to save lr. (BTW are

Again, the problem is the kernel doesn't have its dependencies 
immediately ready to spill, and spilling (may be) more costly 
immediately after the call because we're doing a lot of stores.

I could try measure this. Unfortunately our pipeline simulator tool 
doesn't model system calls properly so it's hard to see what's happening 
across the user/kernel horizon, I might check if that can be improved
or I can hack it by putting some isync in there or something.

> syscall arg registers still preserved? If not, this is a major cost on
> the userspace side, since any call point that has to loop-and-retry
> (e.g. futex) now needs to make its own place to store the original
> values.)

Powerpc system calls never did. We could have scv preserve them, but 
you'd still need to restore r3. We could make an ABI which does not
clobber r3 but puts the return value in r9, say. I'd like to see what
the user side code looks like to take advantage of such a thing though.
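
For the sake of discussion, here is a hypothetical sketch of what a
caller-side retry loop could look like under such an ABI (argument registers
preserved across the syscall, result returned in r9). The wrapper is
invented purely for illustration and is not the real sc/scv ABI:

#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>

/* hypothetical ABI: arguments preserved across the syscall, result in r9 */
static inline long syscall4_preserving(long nr, long a, long b, long c, long d)
{
	register long r0 __asm__("r0") = nr;
	register long r3 __asm__("r3") = a;
	register long r4 __asm__("r4") = b;
	register long r5 __asm__("r5") = c;
	register long r6 __asm__("r6") = d;
	register long r9 __asm__("r9");

	__asm__ volatile("scv 0"
			 : "=r"(r9)
			 : "r"(r0), "r"(r3), "r"(r4), "r"(r5), "r"(r6)
			 : "lr", "ctr", "cr0", "memory");
	return r9;
}

/*
 * With such an ABI, uaddr/op/val would not need to be respilled or
 * reloaded before retrying; they can stay live in the argument registers.
 */
long futex_wait_retry(volatile int *uaddr, int val)
{
	long ret;

	do {
		ret = syscall4_preserving(SYS_futex, (long)uaddr,
					  FUTEX_WAIT, val, 0);
	} while (ret == -EINTR);

	return ret;
}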

Thanks,
Nick


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin
Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
> On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
>> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
>> > * Nicholas Piggin via Libc-alpha  [2020-04-16 
>> > 10:16:54 +1000]:
>> >> Well it would have to test HWCAP and patch in or branch to two 
>> >> completely different sequences including register save/restores yes.
>> >> You could have the same asm and matching clobbers to put the sequence
>> >> inline and then you could patch the one sc/scv instruction I suppose.
>> > 
>> > how would that 'patch' work?
>> > 
>> > there are many reasons why you don't
>> > want libc to write its .text
>> 
>> I guess I don't know what I'm talking about when it comes to libraries. 
>> Shame if there is no good way to load-time patch libc. It's orthogonal
>> to the scv selection though -- if you don't patch you have to 
>> conditional or indirect branch however you implement it.
> 
> Patched pages cannot be shared. The whole design of PIC and shared
> libraries is that the code("text")/rodata is immutable and shared and
> that only a minimal amount of data, packed tightly together (the GOT)
> has to exist per-instance.

Yeah the pages which were patched couldn't be shared across exec, which
is a significant downside, unless you could group all patch sites into
their own section and similarly pack it together (which has issues of
being out of line).

> 
> Also, allowing patching of executable pages is generally frowned upon
> these days because W^X is a desirable hardening property.

Right, it would want be write-protected after being patched.

Thanks,
Nick


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Rich Felker
On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> > Note that because lr is clobbered we need at least once normally
> > call-clobbered register that's not syscall clobbered to save lr in.
> > Otherwise stack frame setup is required to spill it.
> 
> The kernel would like to use r9-r12 for itself. We could do with fewer 
> registers, but we have some delay establishing the stack (depends on a
> load which depends on a mfspr), and entry code tends to be quite store
> heavy whereas on the caller side you have r1 set up (modulo stack 
> updates), and the system call is a long delay during which time the 
> store queue has significant time to drain.
> 
> My feeling is it would be better for kernel to have these scratch 
> registers.

If your new kernel syscall mechanism requires the caller to make a
whole stack frame it otherwise doesn't need and spill registers to it,
it becomes a lot less attractive. Some of those 90 cycles saved are
immediately lost on the userspace side, plus you either waste icache
at the call point or require the syscall to go through a
userspace-side helper function that performs the spill and restore.

The right way to do this is to have the kernel preserve enough
registers that userspace can avoid having any spills. It doesn't have
to preserve everything, probably just enough to save lr. (BTW are
syscall arg registers still preserved? If not, this is a major cost on
the userspace side, since any call point that has to loop-and-retry
(e.g. futex) now needs to make its own place to store the original
values.)

Rich


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Rich Felker
On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> > * Nicholas Piggin via Libc-alpha  [2020-04-16 
> > 10:16:54 +1000]:
> >> Well it would have to test HWCAP and patch in or branch to two 
> >> completely different sequences including register save/restores yes.
> >> You could have the same asm and matching clobbers to put the sequence
> >> inline and then you could patch the one sc/scv instruction I suppose.
> > 
> > how would that 'patch' work?
> > 
> > there are many reasons why you don't
> > want libc to write its .text
> 
> I guess I don't know what I'm talking about when it comes to libraries. 
> Shame if there is no good way to load-time patch libc. It's orthogonal
> to the scv selection though -- if you don't patch you have to 
> conditional or indirect branch however you implement it.

Patched pages cannot be shared. The whole design of PIC and shared
libraries is that the code("text")/rodata is immutable and shared and
that only a minimal amount of data, packed tightly together (the GOT)
has to exist per-instance.

Also, allowing patching of executable pages is generally frowned upon
these days because W^X is a desirable hardening property.

Rich


Re: [PATCH] powerpc/book3s64/kuap: SPRN_AMR modification need CSI instructions before and after

2020-04-19 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of April 20, 2020 10:17 am:
> Excerpts from Aneesh Kumar K.V's message of April 19, 2020 11:53 pm:
>> As per the ISA, a context synchronizing instruction is needed before and after
>> an SPRN_AMR update. Use isync before; the CSI after is implied by the rfid
>> that we will use to switch to a new context.
> 
> Not entirely sure if we need this. This will restore AMR to more 
> permissive, so if it executes ahead of a stray load from this
> context, it won't make it fault.
> 
> That said, leaving this end open makes it harder to reason about
> user access protection I guess, so let's add it.

We probably should test whether it needs updating, like the entry 
code does.

Thanks,
Nick


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin
Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>> 
>> 
>> On 16/04/2020 14:59, Rich Felker wrote:
>> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>> >>
>> >>
>> >> On 16/04/2020 12:37, Rich Felker wrote:
>> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention. Then if the kernel doesn't
>> > provide it (because the kernel is too old) libc would have to provide
>> > its own stub that uses the legacy method and matches the calling
>> > convention of the one the kernel is expected to provide.
>> 
>>  What about pthread cancellation and the requirement of checking the
>>  cancellable syscall anchors in asynchronous cancellation? My plan is
>>  still to use musl strategy on glibc (BZ#12683) and for i686 it 
>>  requires to always use old int$128 for program that uses cancellation
>>  (static case) or just threads (dynamic mode, which should be more
>>  common on glibc).
>> 
>>  Using the i686 strategy of a vDSO bridge symbol would require to always
>>  fallback to 'sc' to still use the same cancellation strategy (and
>>  thus defeating this optimization in such cases).
>> >>>
>> >>> Yes, I assumed it would be the same, ignoring the new syscall
>> >>> mechanism for cancellable syscalls. While there are some exceptions,
>> >>> cancellable syscalls are generally not hot paths but things that are
>> >>> expected to block and to have significant amounts of work to do in
>> >>> kernelspace, so saving a few tens of cycles is rather pointless.
>> >>>
>> >>> It's possible to do a branch/multiple versions of the syscall asm for
>> >>> cancellation but would require extending the cancellation handler to
>> >>> support checking against multiple independent address ranges or using
>> >>> some alternate markup of them.
>> >>
>> >> The main issue is at least for glibc dynamic linking is way more common
>> >> than static linking and once the program become multithread the fallback
>> >> will be always used.
>> > 
>> > I'm not relying on static linking optimizing out the cancellable
>> > version. I'm talking about how cancellable syscalls are pretty much
>> > all "heavy" operations to begin with where a few tens of cycles are in
>> > the realm of "measurement noise" relative to the dominating time
>> > costs.
>> 
>> Yes I am aware, but at same time I am not sure how it plays on real world.
>> For instance, some workloads might issue kernel query syscalls, such as
>> recv, where buffer copying might not be dominant factor. So I see that if
>> the idea is optimizing syscall mechanism, we should try to leverage it
>> as whole in libc.
> 
> Have you timed a minimal recv? I'm not assuming buffer copying is the
> dominant factor. I'm assuming the overhead of all the kernel layers
> involved is dominant.
> 
>> >> And besides the cancellation performance issue, a new bridge vDSO 
>> >> mechanism
>> >> will still require to setup some extra bridge for the case of the older
>> >> kernel.  In the scheme you suggested:
>> >>
>> >>   __asm__("indirect call" ... with common clobbers);
>> >>
>> >> The indirect call will be either the vDSO bridge or an libc provided that
>> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
>> >> against:
>> >>
>> >>if (hwcap & PPC_FEATURE2_SCV) {
>> >>  __asm__(... with some clobbers);
>> >>} else {
>> >>  __asm__(... with different clobbers);
>> >>}
>> > 
>> > If the indirect call can be made roughly as efficiently as the sc
>> > sequence now (which already have some cost due to handling the nasty
>> > error return convention, making the indirect call likely just as small
>> > or smaller), it's O(1) additional code size (and thus icache usage)
>> > rather than O(n) where n is number of syscall points.
>> > 
>> > Of course it would work just as well (for avoiding O(n) growth) to
>> > have a direct call to out-of-line branch like you suggested.
>> 
>> Yes, but does it really matter to optimize this specific usage case
>> for size? glibc, for instance, tries to leverage the syscall mechanism 
>> by adding some complex pre-processor asm directives.  It optimizes
>> the syscall code size in most cases.  For instance, kill in static case 
>> generates on x86_64:
>> 
>>  <__kill>:
>>0:   b8 3e 00 00 00  mov$0x3e,%eax
>>5:   0f 05   syscall 
>>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
>>   13:   c3  retq   
>> 
>> While on musl:
>> 
>>  :
>>0:

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin
Excerpts from Adhemerval Zanella's message of April 17, 2020 4:52 am:
> 
> 
> On 16/04/2020 15:31, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>>>
>>>
>>> On 16/04/2020 14:59, Rich Felker wrote:
 On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>
>
> On 16/04/2020 12:37, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
 My preference would be that it work just like the i386 AT_SYSINFO
 where you just replace "int $128" with "call *%%gs:16" and the kernel
 provides a stub in the vdso that performs either scv or the old
 mechanism with the same calling convention. Then if the kernel doesn't
 provide it (because the kernel is too old) libc would have to provide
 its own stub that uses the legacy method and matches the calling
 convention of the one the kernel is expected to provide.
>>>
>>> What about pthread cancellation and the requirement of checking the
>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>>> requires to always use old int$128 for program that uses cancellation
>>> (static case) or just threads (dynamic mode, which should be more
>>> common on glibc).
>>>
>>> Using the i686 strategy of a vDSO bridge symbol would require to always
>>> fallback to 'sc' to still use the same cancellation strategy (and
>>> thus defeating this optimization in such cases).
>>
>> Yes, I assumed it would be the same, ignoring the new syscall
>> mechanism for cancellable syscalls. While there are some exceptions,
>> cancellable syscalls are generally not hot paths but things that are
>> expected to block and to have significant amounts of work to do in
>> kernelspace, so saving a few tens of cycles is rather pointless.
>>
>> It's possible to do a branch/multiple versions of the syscall asm for
>> cancellation but would require extending the cancellation handler to
>> support checking against multiple independent address ranges or using
>> some alternate markup of them.
>
> The main issue is at least for glibc dynamic linking is way more common
> than static linking and once the program become multithread the fallback
> will be always used.

 I'm not relying on static linking optimizing out the cancellable
 version. I'm talking about how cancellable syscalls are pretty much
 all "heavy" operations to begin with where a few tens of cycles are in
 the realm of "measurement noise" relative to the dominating time
 costs.
>>>
>>> Yes I am aware, but at same time I am not sure how it plays on real world.
>>> For instance, some workloads might issue kernel query syscalls, such as
>>> recv, where buffer copying might not be dominant factor. So I see that if
>>> the idea is optimizing syscall mechanism, we should try to leverage it
>>> as whole in libc.
>> 
>> Have you timed a minimal recv? I'm not assuming buffer copying is the
>> dominant factor. I'm assuming the overhead of all the kernel layers
>> involved is dominant.
> 
> Not really, but reading the advantages of using 'scv' over 'sc' also does
> not outline the real expected gain.  Taking into consideration that this
> should be a micro-optimization (focused on the syscall entry path), I think
> we should use it where possible.

It's around a 90 cycle improvement. Depending on config options and
speculative mitigations in place, this may be roughly 5-20% of a gettid
syscall, which itself probably bears little relationship to what a recv
syscall doing real work would do; it's easy to swamp it with other work.

But it's a pretty big win in terms of how much we try to optimise this
path.

Thanks,
Nick


Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin
Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> * Nicholas Piggin via Libc-alpha  [2020-04-16 
> 10:16:54 +1000]:
>> Well it would have to test HWCAP and patch in or branch to two 
>> completely different sequences including register save/restores yes.
>> You could have the same asm and matching clobbers to put the sequence
>> inline and then you could patch the one sc/scv instruction I suppose.
> 
> how would that 'patch' work?
> 
> there are many reasons why you don't
> want libc to write its .text

I guess I don't know what I'm talking about when it comes to libraries. 
Shame if there is no good way to load-time patch libc. It's orthogonal
to the scv selection though -- if you don't patch you have to 
conditional or indirect branch however you implement it.

Thanks,
Nick


Re: [PATCH] powerpc/book3s64/kuap: SPRN_AMR modification need CSI instructions before and after

2020-04-19 Thread Nicholas Piggin
Excerpts from Aneesh Kumar K.V's message of April 19, 2020 11:53 pm:
> As per the ISA, a context synchronizing instruction (CSI) is needed before
> and after an SPRN_AMR update. Use isync before; the CSI after is implied by
> the rfid that we will use to switch to a new context.

Not entirely sure if we need this. This will restore AMR to a more
permissive value, so if it executes ahead of a stray load from this
context, it won't make that load fault.

That said, leaving this end open makes it harder to reason about
user access protection I guess, so let's add it.

Thanks,
Nick

> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/kup-radix.h | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h 
> b/arch/powerpc/include/asm/book3s/64/kup-radix.h
> index 3bcef989a35d..224658efe2fd 100644
> --- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
> @@ -16,6 +16,7 @@
>  #ifdef CONFIG_PPC_KUAP
>   BEGIN_MMU_FTR_SECTION_NESTED(67)
>   ld  \gpr, STACK_REGS_KUAP(r1)
> + isync
>   mtspr   SPRN_AMR, \gpr
>   END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_RADIX_KUAP, 67)
>  #endif
> @@ -62,8 +63,14 @@
>  
>  static inline void kuap_restore_amr(struct pt_regs *regs)
>  {
> - if (mmu_has_feature(MMU_FTR_RADIX_KUAP))
> + if (mmu_has_feature(MMU_FTR_RADIX_KUAP)) {
> + isync();
>   mtspr(SPRN_AMR, regs->kuap);
> + /*
> +  * No following isync/CSI required because we will be
> +  * returning to a different context using rfid
> +  */
> + }
>  }
>  
>  static inline void kuap_check_amr(void)
> -- 
> 2.25.2
> 
> 


Re: [PATCH AUTOSEL 5.5 73/75] ocxl: Add PCI hotplug dependency to Kconfig

2020-04-19 Thread Sasha Levin

On Mon, Apr 20, 2020 at 02:32:19AM +1000, Andrew Donnellan wrote:

On 19/4/20 12:09 am, Sasha Levin wrote:

From: Frederic Barrat 

[ Upstream commit 49ce94b8677c7d7a15c4d7cbbb9ff1cd8387827b ]

The PCI hotplug framework is used to update the devices when a new
image is written to the FPGA.

Reviewed-by: Alastair D'Silva 
Reviewed-by: Andrew Donnellan 
Signed-off-by: Frederic Barrat 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191121134918.7155-12-fbar...@linux.ibm.com
Signed-off-by: Sasha Levin 


This shouldn't be backported to any of the stable trees.


I'll drop it, thanks!

--
Thanks,
Sasha


[Bug 207359] MegaRAID SAS 9361 controller hang/reset

2020-04-19 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=207359

--- Comment #2 from Cameron (c...@neo-zeon.de) ---
Looking at bug 206123 above, it's worth noting that the amd64 box I'm using for
comparison has SATA disks, though this is probably still a PPC specific issue.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 207359] MegaRAID SAS 9361 controller hang/reset

2020-04-19 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=207359

gyakov...@gentoo.org changed:

   What|Removed |Added

 CC||gyakov...@gentoo.org

--- Comment #1 from gyakov...@gentoo.org ---
In my case I see a similar problem on the same motherboard but with the
aacraid driver (the Microsemi one)

https://bugzilla.kernel.org/show_bug.cgi?id=206123

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 207359] New: MegaRAID SAS 9361 controller hang/reset

2020-04-19 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=207359

Bug ID: 207359
   Summary: MegaRAID SAS 9361 controller hang/reset
   Product: Platform Specific/Hardware
   Version: 2.5
Kernel Version: >=v5.4
  Hardware: PPC-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: PPC-64
  Assignee: platform_ppc...@kernel-bugs.osdl.org
  Reporter: c...@neo-zeon.de
Regression: No

Created attachment 288623
  --> https://bugzilla.kernel.org/attachment.cgi?id=288623&action=edit
dmesg output for controller hang

On a Talos II 2x 36 core (144 thread) POWER9 box, the MegaRAID SAS 9361-16i PCIE
controller can be made to hang pretty consistently with "heavy IO" on kernel
versions greater than 5.3.18.
I am unable to reproduce this on a 16/32 core/thread amd64 box with a MegaRAID
SAS 9361-16i PCIE with the exact same firmware revision.

The box also has a Microsemi SAS HBA which seems unaffected by this.

System details:
Talos II motherboard
2x 36 core (144 thread) POWER9 processors
512GB memory
4k page size
MegaRAID SAS 9361-16i PCIE controller (4 disk RAID10 volume, megaraid_sas
driver)
Microsemi HBA w/4x SSD's

The relevant dmesg messages are attached.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH AUTOSEL 5.5 73/75] ocxl: Add PCI hotplug dependency to Kconfig

2020-04-19 Thread Andrew Donnellan

On 19/4/20 12:09 am, Sasha Levin wrote:

From: Frederic Barrat 

[ Upstream commit 49ce94b8677c7d7a15c4d7cbbb9ff1cd8387827b ]

The PCI hotplug framework is used to update the devices when a new
image is written to the FPGA.

Reviewed-by: Alastair D'Silva 
Reviewed-by: Andrew Donnellan 
Signed-off-by: Frederic Barrat 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191121134918.7155-12-fbar...@linux.ibm.com
Signed-off-by: Sasha Levin 


This shouldn't be backported to any of the stable trees.

--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



Re: [PATCH 3/4] dma-mapping: add a dma_ops_bypass flag to struct device

2020-04-19 Thread Alexey Kardashevskiy



On 19/04/2020 22:25, Joerg Roedel wrote:
> On Sun, Apr 19, 2020 at 10:00:58AM +0200, Christoph Hellwig wrote:
>> The difference is that NULL ops imply the direct mapping is always
>> used, dma_ops_bypass means a direct mapping is used if no bounce buffering
>> using swiotlb is needed, which should also answer your first question.
>> The idea is to consolidate code in the core to use an opportunistic
>> direct mapping instead of the dynamic iommu mapping.  I though the cover
>> letter and commit log explained this well enough, but maybe I need to
>> do a better job.
> 
> Ah right, now I see it, when dma_ops_bypass is set it will only use
> direct mapping when the available memory fits into the device's
> dma_masks, and calls into dma_ops otherwise.
> 
> I wonder how that will interact with an IOMMU driver, which has to make
> sure that the direct mapping is accessible for the device at all.  It
> can either put the device into a passthrough domain for direct mapping
> or into a re-mapped domain, but then all DMA-API calls need to use dma-ops.
> When the dma_mask covers available memory but coherent_mask doesn't,
> the streaming calls will use dma-direct and alloc_coherent() calls into
> dma-ops. There is no way for the IOMMU driver to ensure both works.
> 
> So what are the conditions under which an IOMMU driver would set
> dma_ops_bypass to 1 and get a different result as to when setting
> dev->dma_ops to NULL?


One example is powerpc64/pseries (arch/powerpc/kernel/dma-iommu.c), where
dma_iommu_ops::dma_iommu_dma_supported() (i.e. we still need ops) decides
whether to set dma_ops_bypass to 1. It tries creating a DMA window with a
1:1 mapping that fits the maximum possible RAM address; if that works, then
ops are not needed.


-- 
Alexey
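
A rough sketch of the policy described above, in kernel style. Only the
dma_iommu_dma_supported() name and the dev->dma_ops_bypass flag come from the
code and series under discussion; try_create_direct_window() is a hypothetical
helper standing in for the window-setup details, so treat this as an
illustration rather than the actual arch/powerpc implementation.

/* Hypothetical helper: try to set up a 1:1 DMA window covering all RAM
 * addressable under 'mask'; returns true on success. */
bool try_create_direct_window(struct device *dev, u64 mask);

static int dma_iommu_dma_supported(struct device *dev, u64 mask)
{
	if (try_create_direct_window(dev, mask)) {
		/* dma-direct can be used; swiotlb still bounces if ever needed. */
		dev->dma_ops_bypass = true;
		return 1;
	}
	/* Otherwise keep using the dynamic IOMMU mapping through dma_ops.
	 * (The real code also validates that the mask works for that case.) */
	dev->dma_ops_bypass = false;
	return 1;
}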


[PATCH 10/10] powerpc/book3s64/pkeys: Make initial_allocation_mask static

2020-04-19 Thread Aneesh Kumar K.V
initial_allocation_mask is not used outside this file.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pkeys.h | 1 -
 arch/powerpc/mm/book3s64/pkeys.c | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 652bad7334f3..47c81d41ea9a 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -13,7 +13,6 @@
 
 DECLARE_STATIC_KEY_FALSE(pkey_disabled);
 extern int max_pkey;
-extern u32 initial_allocation_mask; /*  bits set for the initially allocated 
keys */
 extern u32 reserved_allocation_mask; /* bits set for reserved keys */
 
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index b969dc44edc7..78aaa9300ac2 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -15,11 +15,11 @@
 DEFINE_STATIC_KEY_FALSE(pkey_disabled);
 DEFINE_STATIC_KEY_FALSE(execute_pkey_disabled);
 int  max_pkey; /* Maximum key value supported */
-u32  initial_allocation_mask;   /* Bits set for the initially allocated keys */
 /*
  *  Keys marked in the reservation list cannot be allocated by  userspace
  */
 u32  reserved_allocation_mask;
+static u32  initial_allocation_mask;   /* Bits set for the initially allocated 
keys */
 static u64 default_amr;
 static u64 default_iamr;
 /* Allow all keys to be modified by default */
-- 
2.25.2



[PATCH 09/10] powerpc/book3s64/pkeys: Convert pkey_total to max_pkey

2020-04-19 Thread Aneesh Kumar K.V
max_pkey now represents the maximum key value that userspace can allocate.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pkeys.h |  7 +--
 arch/powerpc/mm/book3s64/pkeys.c | 14 +++---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 75d2a2c19c04..652bad7334f3 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -12,7 +12,7 @@
 #include 
 
 DECLARE_STATIC_KEY_FALSE(pkey_disabled);
-extern int pkeys_total; /* total pkeys as per device tree */
+extern int max_pkey;
 extern u32 initial_allocation_mask; /*  bits set for the initially allocated 
keys */
 extern u32 reserved_allocation_mask; /* bits set for reserved keys */
 
@@ -44,7 +44,10 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & ARCH_VM_PKEY_FLAGS) >> VM_PKEY_SHIFT;
 }
 
-#define arch_max_pkey() pkeys_total
+static inline int arch_max_pkey(void)
+{
+   return max_pkey;
+}
 
 #define pkey_alloc_mask(pkey) (0x1 << pkey)
 
diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index 9b3692129139..b969dc44edc7 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -14,7 +14,7 @@
 
 DEFINE_STATIC_KEY_FALSE(pkey_disabled);
 DEFINE_STATIC_KEY_FALSE(execute_pkey_disabled);
-int  pkeys_total;  /* Total pkeys as per device tree */
+int  max_pkey; /* Maximum key value supported */
 u32  initial_allocation_mask;   /* Bits set for the initially allocated keys */
 /*
  *  Keys marked in the reservation list cannot be allocated by  userspace
@@ -84,7 +84,7 @@ static int scan_pkey_feature(void)
 
 static int pkey_initialize(void)
 {
-   int os_reserved, i;
+   int pkeys_total, i;
 
/*
 * We define PKEY_DISABLE_EXECUTE in addition to the arch-neutral
@@ -122,12 +122,12 @@ static int pkey_initialize(void)
 * The OS can manage only 8 pkeys due to its inability to represent them
 * in the Linux 4K PTE. Mark all other keys reserved.
 */
-   os_reserved = pkeys_total - 8;
+   max_pkey = min(8, pkeys_total);
 #else
-   os_reserved = 0;
+   max_pkey = pkeys_total;
 #endif
 
-   if (unlikely((pkeys_total - os_reserved) <= execute_only_key)) {
+   if (unlikely(max_pkey <= execute_only_key)) {
/*
 * Insufficient number of keys to support
 * execute only key. Mark it unavailable.
@@ -174,10 +174,10 @@ static int pkey_initialize(void)
reserved_allocation_mask |= (0x1 << 1);
 
/*
-* Prevent the usage of OS reserved the keys. Update UAMOR
+* Prevent the usage of OS reserved keys. Update UAMOR
 * for those keys.
 */
-   for (i = (pkeys_total - os_reserved); i < pkeys_total; i++) {
+   for (i = max_pkey; i < pkeys_total; i++) {
reserved_allocation_mask |= (0x1 << i);
default_uamor &= ~(0x3ul << pkeyshift(i));
}
-- 
2.25.2
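
As a point of reference for the user-visible limit described above, a minimal
sketch that simply allocates protection keys until the kernel refuses
(typically ENOSPC). It assumes a kernel with pkey support and a glibc new
enough to expose pkey_alloc(); how many allocations succeed depends on the
4K/64K configuration discussed in this series.

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	int n = 0;

	/* Allocate keys until the kernel reports exhaustion. */
	for (;;) {
		int key = pkey_alloc(0, 0);
		if (key < 0) {
			printf("allocated %d keys, then: %s\n", n, strerror(errno));
			break;
		}
		n++;
	}
	return 0;
}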



[PATCH 08/10] powerpc/book3s64/pkeys: Simplify pkey disable branch

2020-04-19 Thread Aneesh Kumar K.V
Make the default value FALSE (pkey enabled) and set to TRUE when we
find the total number of keys supported to be zero.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pkeys.h | 2 +-
 arch/powerpc/mm/book3s64/pkeys.c | 7 +++
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 5dd0a79d1809..75d2a2c19c04 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -11,7 +11,7 @@
 #include 
 #include 
 
-DECLARE_STATIC_KEY_TRUE(pkey_disabled);
+DECLARE_STATIC_KEY_FALSE(pkey_disabled);
 extern int pkeys_total; /* total pkeys as per device tree */
 extern u32 initial_allocation_mask; /*  bits set for the initially allocated 
keys */
 extern u32 reserved_allocation_mask; /* bits set for reserved keys */
diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index 86ea32a9e67c..9b3692129139 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -12,7 +12,7 @@
 #include 
 #include 
 
-DEFINE_STATIC_KEY_TRUE(pkey_disabled);
+DEFINE_STATIC_KEY_FALSE(pkey_disabled);
 DEFINE_STATIC_KEY_FALSE(execute_pkey_disabled);
 int  pkeys_total;  /* Total pkeys as per device tree */
 u32  initial_allocation_mask;   /* Bits set for the initially allocated keys */
@@ -104,9 +104,8 @@ static int pkey_initialize(void)
 
/* scan the device tree for pkey feature */
pkeys_total = scan_pkey_feature();
-   if (pkeys_total)
-   static_branch_disable(_disabled);
-   else {
+   if (!pkeys_total) {
+   /* No support for pkey. Mark it disabled */
static_branch_enable(_disabled);
return 0;
}
-- 
2.25.2



[PATCH 07/10] powerpc/book3s64/pkeys: Convert execute key support to static key

2020-04-19 Thread Aneesh Kumar K.V
Convert the bool to a static key like pkey_disabled.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/pkeys.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index fcfa67172c6a..86ea32a9e67c 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -13,13 +13,13 @@
 #include 
 
 DEFINE_STATIC_KEY_TRUE(pkey_disabled);
+DEFINE_STATIC_KEY_FALSE(execute_pkey_disabled);
 int  pkeys_total;  /* Total pkeys as per device tree */
 u32  initial_allocation_mask;   /* Bits set for the initially allocated keys */
 /*
  *  Keys marked in the reservation list cannot be allocated by  userspace
  */
 u32  reserved_allocation_mask;
-static bool pkey_execute_disable_supported;
 static u64 default_amr;
 static u64 default_iamr;
 /* Allow all keys to be modified by default */
@@ -116,9 +116,7 @@ static int pkey_initialize(void)
 * execute_disable support. Instead we use a PVR check.
 */
if (pvr_version_is(PVR_POWER7) || pvr_version_is(PVR_POWER7p))
-   pkey_execute_disable_supported = false;
-   else
-   pkey_execute_disable_supported = true;
+   static_branch_enable(_pkey_disabled);
 
 #ifdef CONFIG_PPC_4K_PAGES
/*
@@ -214,7 +212,7 @@ static inline void write_amr(u64 value)
 
 static inline u64 read_iamr(void)
 {
-   if (!likely(pkey_execute_disable_supported))
+   if (static_branch_unlikely(_pkey_disabled))
return 0x0UL;
 
return mfspr(SPRN_IAMR);
@@ -222,7 +220,7 @@ static inline u64 read_iamr(void)
 
 static inline void write_iamr(u64 value)
 {
-   if (!likely(pkey_execute_disable_supported))
+   if (static_branch_unlikely(_pkey_disabled))
return;
 
mtspr(SPRN_IAMR, value);
@@ -282,7 +280,7 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
return -EINVAL;
 
if (init_val & PKEY_DISABLE_EXECUTE) {
-   if (!pkey_execute_disable_supported)
+   if (static_branch_unlikely(_pkey_disabled))
return -EINVAL;
new_iamr_bits |= IAMR_EX_BIT;
}
-- 
2.25.2



[PATCH 05/10] powerpc/book3s64/pkeys: Simplify the key initialization

2020-04-19 Thread Aneesh Kumar K.V
Add documentation explaining the execute_only_key. The reservation and 
initialization mask
details are also explained in this patch.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/pkeys.c | 187 ++-
 1 file changed, 108 insertions(+), 79 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index d60e6bfa3e03..fcfa67172c6a 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -15,48 +15,71 @@
 DEFINE_STATIC_KEY_TRUE(pkey_disabled);
 int  pkeys_total;  /* Total pkeys as per device tree */
 u32  initial_allocation_mask;   /* Bits set for the initially allocated keys */
-u32  reserved_allocation_mask;  /* Bits set for reserved keys */
+/*
+ *  Keys marked in the reservation list cannot be allocated by  userspace
+ */
+u32  reserved_allocation_mask;
 static bool pkey_execute_disable_supported;
-static bool pkeys_devtree_defined; /* property exported by device tree */
-static u64 pkey_amr_mask;  /* Bits in AMR not to be touched */
-static u64 pkey_iamr_mask; /* Bits in AMR not to be touched */
-static u64 pkey_uamor_mask;/* Bits in UMOR not to be touched */
+static u64 default_amr;
+static u64 default_iamr;
+/* Allow all keys to be modified by default */
+static u64 default_uamor = ~0x0UL;
+/*
+ * Key used to implement PROT_EXEC mmap. Denies READ/WRITE
+ * We pick key 2 because 0 is special key and 1 is reserved as per ISA.
+ */
 static int execute_only_key = 2;
 
+
 #define AMR_BITS_PER_PKEY 2
 #define AMR_RD_BIT 0x1UL
 #define AMR_WR_BIT 0x2UL
 #define IAMR_EX_BIT 0x1UL
-#define PKEY_REG_BITS (sizeof(u64)*8)
+#define PKEY_REG_BITS (sizeof(u64) * 8)
 #define pkeyshift(pkey) (PKEY_REG_BITS - ((pkey+1) * AMR_BITS_PER_PKEY))
 
-static void scan_pkey_feature(void)
+static int scan_pkey_feature(void)
 {
u32 vals[2];
+   int pkeys_total = 0;
struct device_node *cpu;
 
+   /*
+* Pkey is not supported with Radix translation.
+*/
+   if (radix_enabled())
+   return 0;
+
cpu = of_find_node_by_type(NULL, "cpu");
if (!cpu)
-   return;
+   return 0;
 
if (of_property_read_u32_array(cpu,
-   "ibm,processor-storage-keys", vals, 2))
-   return;
+  "ibm,processor-storage-keys", vals, 2) 
== 0) {
+   /*
+* Since any pkey can be used for data or execute, we will
+* just treat all keys as equal and track them as one entity.
+*/
+   pkeys_total = vals[0];
+   /*  Should we check for IAMR support FIXME!! */
+   } else {
+   /*
+* Let's assume 32 pkeys on P8 bare metal, if its not defined 
by device
+* tree. We make this exception since skiboot forgot to expose 
this
+* property on power8.
+*/
+   if (!firmware_has_feature(FW_FEATURE_LPAR) &&
+   cpu_has_feature(CPU_FTRS_POWER8))
+   pkeys_total = 32;
+   }
 
/*
-* Since any pkey can be used for data or execute, we will just treat
-* all keys as equal and track them as one entity.
+* Adjust the upper limit, based on the number of bits supported by
+* arch-neutral code.
 */
-   pkeys_total = vals[0];
-   pkeys_devtree_defined = true;
-}
-
-static inline bool pkey_mmu_enabled(void)
-{
-   if (firmware_has_feature(FW_FEATURE_LPAR))
-   return pkeys_total;
-   else
-   return cpu_has_feature(CPU_FTR_PKEY);
+   pkeys_total = min_t(int, pkeys_total,
+   ((ARCH_VM_PKEY_FLAGS >> VM_PKEY_SHIFT) + 1));
+   return pkeys_total;
 }
 
 static int pkey_initialize(void)
@@ -80,31 +103,13 @@ static int pkey_initialize(void)
!= (sizeof(u64) * BITS_PER_BYTE));
 
/* scan the device tree for pkey feature */
-   scan_pkey_feature();
-
-   /*
-* Let's assume 32 pkeys on P8 bare metal, if its not defined by device
-* tree. We make this exception since skiboot forgot to expose this
-* property on power8.
-*/
-   if (!pkeys_devtree_defined && !firmware_has_feature(FW_FEATURE_LPAR) &&
-   cpu_has_feature(CPU_FTRS_POWER8))
-   pkeys_total = 32;
-
-   /*
-* Adjust the upper limit, based on the number of bits supported by
-* arch-neutral code.
-*/
-   pkeys_total = min_t(int, pkeys_total,
-   ((ARCH_VM_PKEY_FLAGS >> VM_PKEY_SHIFT)+1));
-
-   if (!pkey_mmu_enabled() || radix_enabled() || !pkeys_total)
-   static_branch_enable(_disabled);
-   else
+   pkeys_total = scan_pkey_feature();
+   if (pkeys_total)

[PATCH 06/10] powerpc/book3s64/pkeys: kill cpu feature key CPU_FTR_PKEY

2020-04-19 Thread Aneesh Kumar K.V
We don't use CPU_FTR_PKEY anymore. Remove the feature bit and mark it
free.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/cputable.h | 10 +-
 arch/powerpc/kernel/dt_cpu_ftrs.c   |  6 --
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 40a4d3c6fd99..b77f8258ee8c 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -198,7 +198,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_STCX_CHECKS_ADDRESSLONG_ASM_CONST(0x8000)
 #define CPU_FTR_POPCNTB
LONG_ASM_CONST(0x0001)
 #define CPU_FTR_POPCNTD
LONG_ASM_CONST(0x0002)
-#define CPU_FTR_PKEY   LONG_ASM_CONST(0x0004)
+/* LONG_ASM_CONST(0x0004) Free */
 #define CPU_FTR_VMX_COPY   LONG_ASM_CONST(0x0008)
 #define CPU_FTR_TM LONG_ASM_CONST(0x0010)
 #define CPU_FTR_CFAR   LONG_ASM_CONST(0x0020)
@@ -437,7 +437,7 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_DSCR | CPU_FTR_SAO  | CPU_FTR_ASYM_SMT | \
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | \
-   CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX | CPU_FTR_PKEY)
+   CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX )
 #define CPU_FTRS_POWER8 (CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\
CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -447,7 +447,7 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \
-   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_PKEY)
+   CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP )
 #define CPU_FTRS_POWER8E (CPU_FTRS_POWER8 | CPU_FTR_PMAO_BUG)
 #define CPU_FTRS_POWER9 (CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | CPU_FTR_ARCH_206 |\
@@ -458,8 +458,8 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
-   CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | CPU_FTR_PKEY | \
-   CPU_FTR_P9_TLBIE_STQ_BUG | CPU_FTR_P9_TLBIE_ERAT_BUG | 
CPU_FTR_P9_TIDR)
+   CPU_FTR_TM_COMP | CPU_FTR_ARCH_300 | CPU_FTR_P9_TLBIE_STQ_BUG | \
+   CPU_FTR_P9_TLBIE_ERAT_BUG | CPU_FTR_P9_TIDR)
 #define CPU_FTRS_POWER9_DD2_0 (CPU_FTRS_POWER9 | CPU_FTR_P9_RADIX_PREFETCH_BUG)
 #define CPU_FTRS_POWER9_DD2_1 (CPU_FTRS_POWER9 | \
   CPU_FTR_P9_RADIX_PREFETCH_BUG | \
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 36bc0d5c4f3a..120ea339ffda 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -747,12 +747,6 @@ static __init void cpufeatures_cpu_quirks(void)
}
 
update_tlbie_feature_flag(version);
-   /*
-* PKEY was not in the initial base or feature node
-* specification, but it should become optional in the next
-* cpu feature version sequence.
-*/
-   cur_cpu_spec->cpu_features |= CPU_FTR_PKEY;
 }
 
 static void __init cpufeatures_setup_finished(void)
-- 
2.25.2



[PATCH 04/10] powerpc/book3s64/pkeys: Explain key 1 reservation details

2020-04-19 Thread Aneesh Kumar K.V
This explains the details w.r.t. key 1.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/pkeys.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index 1199fc2bfaec..d60e6bfa3e03 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -124,7 +124,10 @@ static int pkey_initialize(void)
 #else
os_reserved = 0;
 #endif
-   /* Bits are in LE format. */
+   /*
+* key 1 is recommended not to be used. PowerISA(3.0) page 1015,
+* programming note.
+*/
reserved_allocation_mask = (0x1 << 1) | (0x1 << execute_only_key);
 
/* register mask is in BE format */
-- 
2.25.2



[PATCH 03/10] powerpc/book3s64/pkeys: Move pkey related bits in the linux page table

2020-04-19 Thread Aneesh Kumar K.V
To keep things simple, all the pkey related bits are kept together in the
Linux page table for the 64K config with hash translation. With hash-4k, the
kernel requires 4 bits to store slot details. This is done by overloading
some of the RPN bits to store the slot details. Because of this, PKEY_BIT0
on the 4K config is used for storing hash slot details.

64K before

||RSV1| RSV2| RSV3 | RSV4 | RPN44| RPN43   | | RSV5|
|| P4 |  P3 |  P2  |  P1  | Busy | HASHPTE | |  P0 |

after

||RSV1| RSV2| RSV3 | RSV4 | RPN44 | RPN43   | | RSV5 |
|| P4 |  P3 |  P2  |  P1  | P0| HASHPTE | | Busy |

4k before

|| RSV1 | RSV2 | RSV3 | RSV4 | RPN44| RPN43 | RSV5|
|| Busy |  HASHPTE |  P2  |  P1  | F_SEC| F_GIX |  P0 |

after

|| RSV1| RSV2| RSV3 | RSV4 | Free | RPN43 | RSV5 |
|| HASHPTE |  P2 |  P1  |  P0  | F_SEC| F_GIX | BUSY |

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  | 16 
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 12 ++--
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 17 -
 3 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index f889d56bf8cf..082b98808701 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -34,11 +34,11 @@
 #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
-#define H_PAGE_F_GIX_SHIFT 53
-#define H_PAGE_F_SECOND_RPAGE_RPN44/* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX   (_RPAGE_RPN43 | _RPAGE_RPN42 | _RPAGE_RPN41)
-#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are busy */
-#define H_PAGE_HASHPTE _RPAGE_RSV2 /* software: PTE & hash are busy */
+#define H_PAGE_F_GIX_SHIFT _PAGE_PA_MAX
+#define H_PAGE_F_SECOND_RPAGE_PKEY_BIT0 /* HPTE is in 2ndary 
HPTEG */
+#define H_PAGE_F_GIX   (_RPAGE_RPN43 | _RPAGE_RPN42 | _RPAGE_RPN41)
+#define H_PAGE_BUSY_RPAGE_RSV1
+#define H_PAGE_HASHPTE _RPAGE_PKEY_BIT4
 
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
@@ -59,9 +59,9 @@
 /* memory key bits, only 8 keys supported */
 #define H_PTE_PKEY_BIT40
 #define H_PTE_PKEY_BIT30
-#define H_PTE_PKEY_BIT2_RPAGE_RSV3
-#define H_PTE_PKEY_BIT1_RPAGE_RSV4
-#define H_PTE_PKEY_BIT0_RPAGE_RSV5
+#define H_PTE_PKEY_BIT2_RPAGE_PKEY_BIT3
+#define H_PTE_PKEY_BIT1_RPAGE_PKEY_BIT2
+#define H_PTE_PKEY_BIT0_RPAGE_PKEY_BIT1
 
 
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 0a15fd14cf72..f20de1149ebe 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -32,15 +32,15 @@
  */
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
-#define H_PAGE_BUSY_RPAGE_RPN44 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are busy */
 #define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */
 
 /* memory key bits. */
-#define H_PTE_PKEY_BIT4_RPAGE_RSV1
-#define H_PTE_PKEY_BIT3_RPAGE_RSV2
-#define H_PTE_PKEY_BIT2_RPAGE_RSV3
-#define H_PTE_PKEY_BIT1_RPAGE_RSV4
-#define H_PTE_PKEY_BIT0_RPAGE_RSV5
+#define H_PTE_PKEY_BIT4_RPAGE_PKEY_BIT4
+#define H_PTE_PKEY_BIT3_RPAGE_PKEY_BIT3
+#define H_PTE_PKEY_BIT2_RPAGE_PKEY_BIT2
+#define H_PTE_PKEY_BIT1_RPAGE_PKEY_BIT1
+#define H_PTE_PKEY_BIT0_RPAGE_PKEY_BIT0
 
 /*
  * We need to differentiate between explicit huge page and THP huge
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 368b136517e0..e31369707f9f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -32,11 +32,13 @@
 #define _RPAGE_SW1 0x00800
 #define _RPAGE_SW2 0x00400
 #define _RPAGE_SW3 0x00200
-#define _RPAGE_RSV10x1000UL
-#define _RPAGE_RSV20x0800UL
-#define _RPAGE_RSV30x0400UL
-#define _RPAGE_RSV40x0200UL
-#define _RPAGE_RSV50x00040UL
+#define _RPAGE_RSV10x00040UL
+
+#define _RPAGE_PKEY_BIT4   0x1000UL
+#define _RPAGE_PKEY_BIT3   0x0800UL
+#define _RPAGE_PKEY_BIT2   0x0400UL
+#define _RPAGE_PKEY_BIT1   0x0200UL
+#define _RPAGE_PKEY_BIT0   0x0100UL
 
 #define _PAGE_PTE  

[PATCH 02/10] powerpc/book3s64/pkeys: pkeys are supported only on hash on book3s.

2020-04-19 Thread Aneesh Kumar K.V
Move them to a hash-specific file and add BUG() for the radix path.
---
 .../powerpc/include/asm/book3s/64/hash-pkey.h | 32 
 arch/powerpc/include/asm/book3s/64/pkeys.h| 25 +
 arch/powerpc/include/asm/pkeys.h  | 37 ---
 3 files changed, 64 insertions(+), 30 deletions(-)
 create mode 100644 arch/powerpc/include/asm/book3s/64/hash-pkey.h
 create mode 100644 arch/powerpc/include/asm/book3s/64/pkeys.h

diff --git a/arch/powerpc/include/asm/book3s/64/hash-pkey.h 
b/arch/powerpc/include/asm/book3s/64/hash-pkey.h
new file mode 100644
index ..795010897e5d
--- /dev/null
+++ b/arch/powerpc/include/asm/book3s/64/hash-pkey.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_BOOK3S_64_HASH_PKEY_H
+#define _ASM_POWERPC_BOOK3S_64_HASH_PKEY_H
+
+static inline u64 hash__vmflag_to_pte_pkey_bits(u64 vm_flags)
+{
+   return (((vm_flags & VM_PKEY_BIT0) ? H_PTE_PKEY_BIT0 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT1) ? H_PTE_PKEY_BIT1 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT2) ? H_PTE_PKEY_BIT2 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT3) ? H_PTE_PKEY_BIT3 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT4) ? H_PTE_PKEY_BIT4 : 0x0UL));
+}
+
+static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
+{
+   return (((pteflags & H_PTE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL));
+}
+
+static inline u16 hash__pte_to_pkey_bits(u64 pteflags)
+{
+   return (((pteflags & H_PTE_PKEY_BIT4) ? 0x10 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT3) ? 0x8 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT2) ? 0x4 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT1) ? 0x2 : 0x0UL) |
+   ((pteflags & H_PTE_PKEY_BIT0) ? 0x1 : 0x0UL));
+}
+
+#endif
diff --git a/arch/powerpc/include/asm/book3s/64/pkeys.h 
b/arch/powerpc/include/asm/book3s/64/pkeys.h
new file mode 100644
index ..8174662a9173
--- /dev/null
+++ b/arch/powerpc/include/asm/book3s/64/pkeys.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef _ASM_POWERPC_BOOK3S_64_PKEYS_H
+#define _ASM_POWERPC_BOOK3S_64_PKEYS_H
+
+#include 
+
+static inline u64 vmflag_to_pte_pkey_bits(u64 vm_flags)
+{
+   if (static_branch_likely(_disabled))
+   return 0x0UL;
+
+   if (radix_enabled())
+   BUG();
+   return hash__vmflag_to_pte_pkey_bits(vm_flags);
+}
+
+static inline u16 pte_to_pkey_bits(u64 pteflags)
+{
+   if (radix_enabled())
+   BUG();
+   return hash__pte_to_pkey_bits(pteflags);
+}
+
+#endif /*_ASM_POWERPC_KEYS_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index f8f4d0793789..5dd0a79d1809 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -25,23 +25,18 @@ extern u32 reserved_allocation_mask; /* bits set for 
reserved keys */
PKEY_DISABLE_WRITE  | \
PKEY_DISABLE_EXECUTE)
 
+#ifdef CONFIG_PPC_BOOK3S_64
+#include 
+#else
+#error "Not supported"
+#endif
+
+
 static inline u64 pkey_to_vmflag_bits(u16 pkey)
 {
return (((u64)pkey << VM_PKEY_SHIFT) & ARCH_VM_PKEY_FLAGS);
 }
 
-static inline u64 vmflag_to_pte_pkey_bits(u64 vm_flags)
-{
-   if (static_branch_likely(_disabled))
-   return 0x0UL;
-
-   return (((vm_flags & VM_PKEY_BIT0) ? H_PTE_PKEY_BIT0 : 0x0UL) |
-   ((vm_flags & VM_PKEY_BIT1) ? H_PTE_PKEY_BIT1 : 0x0UL) |
-   ((vm_flags & VM_PKEY_BIT2) ? H_PTE_PKEY_BIT2 : 0x0UL) |
-   ((vm_flags & VM_PKEY_BIT3) ? H_PTE_PKEY_BIT3 : 0x0UL) |
-   ((vm_flags & VM_PKEY_BIT4) ? H_PTE_PKEY_BIT4 : 0x0UL));
-}
-
 static inline int vma_pkey(struct vm_area_struct *vma)
 {
if (static_branch_likely(_disabled))
@@ -51,24 +46,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 
 #define arch_max_pkey() pkeys_total
 
-static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
-{
-   return (((pteflags & H_PTE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL) |
-   ((pteflags & H_PTE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) |
-   ((pteflags & H_PTE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) |
-   ((pteflags & H_PTE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) |
-   ((pteflags & H_PTE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL));
-}
-
-static inline u16 pte_to_pkey_bits(u64 pteflags)
-{
-   return (((pteflags & H_PTE_PKEY_BIT4) ? 0x10 : 0x0UL) |
-   ((pteflags & H_PTE_PKEY_BIT3) ? 0x8 : 0x0UL) |
-   ((pteflags & H_PTE_PKEY_BIT2) ? 0x4 : 0x0UL) |
-   ((pteflags & H_PTE_PKEY_BIT1) ? 0x2 : 0x0UL) |
-   

[PATCH 01/10] powerpc/book3s64/pkeys: Fixup bit numbering

2020-04-19 Thread Aneesh Kumar K.V
This numbers the pkey bits such that they are easy to follow. PKEY_BIT0 is
the lowest-order bit. This makes further changes easier to follow.

No functional change in this patch other than that the Linux page table for
hash translation now maps pkeys differently.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  9 +++
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  8 +++
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  8 +++
 arch/powerpc/include/asm/pkeys.h  | 24 +--
 4 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 3f9ae3585ab9..f889d56bf8cf 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -57,11 +57,12 @@
 #define H_PMD_FRAG_NR  (PAGE_SIZE >> H_PMD_FRAG_SIZE_SHIFT)
 
 /* memory key bits, only 8 keys supported */
-#define H_PTE_PKEY_BIT00
-#define H_PTE_PKEY_BIT10
+#define H_PTE_PKEY_BIT40
+#define H_PTE_PKEY_BIT30
 #define H_PTE_PKEY_BIT2_RPAGE_RSV3
-#define H_PTE_PKEY_BIT3_RPAGE_RSV4
-#define H_PTE_PKEY_BIT4_RPAGE_RSV5
+#define H_PTE_PKEY_BIT1_RPAGE_RSV4
+#define H_PTE_PKEY_BIT0_RPAGE_RSV5
+
 
 /*
  * On all 4K setups, remap_4k_pfn() equates to remap_pfn_range()
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 0729c034e56f..0a15fd14cf72 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -36,11 +36,11 @@
 #define H_PAGE_HASHPTE _RPAGE_RPN43/* PTE has associated HPTE */
 
 /* memory key bits. */
-#define H_PTE_PKEY_BIT0_RPAGE_RSV1
-#define H_PTE_PKEY_BIT1_RPAGE_RSV2
+#define H_PTE_PKEY_BIT4_RPAGE_RSV1
+#define H_PTE_PKEY_BIT3_RPAGE_RSV2
 #define H_PTE_PKEY_BIT2_RPAGE_RSV3
-#define H_PTE_PKEY_BIT3_RPAGE_RSV4
-#define H_PTE_PKEY_BIT4_RPAGE_RSV5
+#define H_PTE_PKEY_BIT1_RPAGE_RSV4
+#define H_PTE_PKEY_BIT0_RPAGE_RSV5
 
 /*
  * We need to differentiate between explicit huge page and THP huge
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 3fa1b962dc27..58fcc959f9d5 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -86,8 +86,8 @@
 #define HPTE_R_PP0 ASM_CONST(0x8000)
 #define HPTE_R_TS  ASM_CONST(0x4000)
 #define HPTE_R_KEY_HI  ASM_CONST(0x3000)
-#define HPTE_R_KEY_BIT0ASM_CONST(0x2000)
-#define HPTE_R_KEY_BIT1ASM_CONST(0x1000)
+#define HPTE_R_KEY_BIT4ASM_CONST(0x2000)
+#define HPTE_R_KEY_BIT3ASM_CONST(0x1000)
 #define HPTE_R_RPN_SHIFT   12
 #define HPTE_R_RPN ASM_CONST(0x0000)
 #define HPTE_R_RPN_3_0 ASM_CONST(0x01fff000)
@@ -103,8 +103,8 @@
 #define HPTE_R_R   ASM_CONST(0x0100)
 #define HPTE_R_KEY_LO  ASM_CONST(0x0e00)
 #define HPTE_R_KEY_BIT2ASM_CONST(0x0800)
-#define HPTE_R_KEY_BIT3ASM_CONST(0x0400)
-#define HPTE_R_KEY_BIT4ASM_CONST(0x0200)
+#define HPTE_R_KEY_BIT1ASM_CONST(0x0400)
+#define HPTE_R_KEY_BIT0ASM_CONST(0x0200)
 #define HPTE_R_KEY (HPTE_R_KEY_LO | HPTE_R_KEY_HI)
 
 #define HPTE_V_1TB_SEG ASM_CONST(0x4000)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 20ebf153c871..f8f4d0793789 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -35,11 +35,11 @@ static inline u64 vmflag_to_pte_pkey_bits(u64 vm_flags)
if (static_branch_likely(_disabled))
return 0x0UL;
 
-   return (((vm_flags & VM_PKEY_BIT0) ? H_PTE_PKEY_BIT4 : 0x0UL) |
-   ((vm_flags & VM_PKEY_BIT1) ? H_PTE_PKEY_BIT3 : 0x0UL) |
+   return (((vm_flags & VM_PKEY_BIT0) ? H_PTE_PKEY_BIT0 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT1) ? H_PTE_PKEY_BIT1 : 0x0UL) |
((vm_flags & VM_PKEY_BIT2) ? H_PTE_PKEY_BIT2 : 0x0UL) |
-   ((vm_flags & VM_PKEY_BIT3) ? H_PTE_PKEY_BIT1 : 0x0UL) |
-   ((vm_flags & VM_PKEY_BIT4) ? H_PTE_PKEY_BIT0 : 0x0UL));
+   ((vm_flags & VM_PKEY_BIT3) ? H_PTE_PKEY_BIT3 : 0x0UL) |
+   ((vm_flags & VM_PKEY_BIT4) ? H_PTE_PKEY_BIT4 : 0x0UL));
 }
 
 static inline int vma_pkey(struct vm_area_struct *vma)
@@ -53,20 +53,20 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 
 static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
 {
-   return (((pteflags & H_PTE_PKEY_BIT0) ? 

[PATCH] powerpc/book3s64/kuap: SPRN_AMR modification need CSI instructions before and after

2020-04-19 Thread Aneesh Kumar K.V
As per the ISA, a context synchronizing instruction (CSI) is needed before and
after an SPRN_AMR update. Use isync before; the CSI after is implied by the rfid
that we will use to switch to a new context.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/kup-radix.h | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h 
b/arch/powerpc/include/asm/book3s/64/kup-radix.h
index 3bcef989a35d..224658efe2fd 100644
--- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
@@ -16,6 +16,7 @@
 #ifdef CONFIG_PPC_KUAP
BEGIN_MMU_FTR_SECTION_NESTED(67)
ld  \gpr, STACK_REGS_KUAP(r1)
+   isync
mtspr   SPRN_AMR, \gpr
END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_RADIX_KUAP, 67)
 #endif
@@ -62,8 +63,14 @@
 
 static inline void kuap_restore_amr(struct pt_regs *regs)
 {
-   if (mmu_has_feature(MMU_FTR_RADIX_KUAP))
+   if (mmu_has_feature(MMU_FTR_RADIX_KUAP)) {
+   isync();
mtspr(SPRN_AMR, regs->kuap);
+   /*
+* No following isync/CSI required because we will be
+* returning to a different context using rfid
+*/
+   }
 }
 
 static inline void kuap_check_amr(void)
-- 
2.25.2



[PATCH v2] misc: new driver sram_uapi for user level SRAM access

2020-04-19 Thread Wang Wenhu
A generic user-kernel interface that allows a misc device created by it to
support the ioctl and mmap file operations for accessing SRAM memory from
user level. Different kinds of SRAM allocation and free APIs can be
registered by a specific SRAM hardware-level driver to the available list
and then be chosen by users to allocate and map SRAM memory from user level.

It is extremely helpful for user space applications that require high
performance memory accesses. One case is that on some embedded networking
devices, data throughput is high and should be processed from user level,
and an SRAM memory buffer for the processes in user space is greatly
preferred.

Signed-off-by: Wang Wenhu 
Cc: Greg Kroah-Hartman 
Cc: Arnd Bergmann 
Cc: Christophe Leroy 
Cc: Scott Wood 
Cc: Michael Ellerman 
Cc: Randy Dunlap 
Cc: linuxppc-dev@lists.ozlabs.org
---
Changes since v1:
 * Addressed comments from Arnd

Note: From this version on, the SRAM_UAPI driver is independent of any
hardware drivers, so I will commit this patch as a series by itself, while v1
of this patch was wrapped together with patches for the Freescale device.
After this I will re-commit the patches for the Freescale device as another
series.

Link for v1:
 * https://lore.kernel.org/lkml/20200418162157.50428-5-wenhu.w...@vivo.com/
---
 drivers/misc/Kconfig |  11 ++
 drivers/misc/Makefile|   1 +
 drivers/misc/sram_uapi.c | 348 +++
 drivers/misc/sram_uapi.h |  37 +
 4 files changed, 397 insertions(+)
 create mode 100644 drivers/misc/sram_uapi.c
 create mode 100644 drivers/misc/sram_uapi.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 99e151475d8f..b19c8b24f18e 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -465,6 +465,17 @@ config PVPANIC
  a paravirtualized device provided by QEMU; it lets a virtual machine
  (guest) communicate panic events to the host.
 
+config SRAM_UAPI
+   bool "Generic SRAM User Level API driver"
+   help
+ This driver allows you to create a misc device which could be used
+ as an interface to allocate SRAM memory from user level.
+
+ It is extremely helpful for some user space applications that require
+ high performance memory accesses.
+
+ If unsure, say N.
+
 source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
 source "drivers/misc/cb710/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 9abf2923d831..794447ca07ca 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_VMWARE_VMCI) += vmw_vmci/
 obj-$(CONFIG_LATTICE_ECP3_CONFIG)  += lattice-ecp3-config.o
 obj-$(CONFIG_SRAM) += sram.o
 obj-$(CONFIG_SRAM_EXEC)+= sram-exec.o
+obj-$(CONFIG_SRAM_UAPI)+= sram_uapi.o
 obj-y  += mic/
 obj-$(CONFIG_GENWQE)   += genwqe/
 obj-$(CONFIG_ECHO) += echo/
diff --git a/drivers/misc/sram_uapi.c b/drivers/misc/sram_uapi.c
new file mode 100644
index ..858e43e559d0
--- /dev/null
+++ b/drivers/misc/sram_uapi.c
@@ -0,0 +1,348 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Vivo Communication Technology Co. Ltd.
+ * Copyright (C) 2020 Wang Wenhu 
+ * All rights reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "sram_uapi.h"
+
+#define DRIVER_NAME"sram_uapi"
+
+struct sram_resource {
+   struct list_headlist;
+   struct res_info info;
+   phys_addr_t phys;
+   void*virt;
+   struct vm_area_struct   *vma;
+   struct sram_uapi*parent;
+};
+
+struct sram_uapi {
+   struct list_headres_list;
+   struct sram_api *sa;
+};
+
+static DEFINE_MUTEX(sram_list_lock);
+static LIST_HEAD(sram_list);
+
+long sram_api_register(struct sram_api *sa)
+{
+   struct sram_api *cur;
+
+   if (!sa || !sa->name || !sa->sram_alloc || !sa->sram_free)
+   return -EINVAL;
+
+   mutex_lock(_list_lock);
+   list_for_each_entry(cur, _list, list) {
+   if (cur->type == sa->type) {
+   pr_err("error sram %s type %d exists\n", sa->name,
+  sa->type);
+   mutex_unlock(_list_lock);
+   return -EEXIST;
+   }
+   }
+
+   kref_init(>kref);
+   list_add_tail(>list, _list);
+   pr_info("sram %s type %d registered\n", sa->name, sa->type);
+
+   mutex_unlock(_list_lock);
+
+   return 0;
+};
+EXPORT_SYMBOL(sram_api_register);
+
+long sram_api_unregister(struct sram_api *sa)
+{
+   struct sram_api *cur, *tmp;
+   long ret = -ENODEV;
+
+   if (!sa || !sa->name || !sa->sram_alloc || !sa->sram_free)
+   return -EINVAL;
+
+   mutex_lock(_list_lock);
+   list_for_each_entry_safe(cur, tmp, _list, list) {
+ 
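
The interface above is meant to be driven from user space through ioctl and
mmap. Since the command numbers and request layout live in sram_uapi.h (not
shown here), every identifier in the following sketch -- the device node name,
SRAM_UAPI_IOC_ALLOC and struct sram_alloc_req -- is hypothetical; it only
illustrates the intended open/ioctl/mmap flow.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/ioctl.h>
#include <unistd.h>

/* Hypothetical request layout and ioctl number -- not from sram_uapi.h. */
struct sram_alloc_req { unsigned long size; };
#define SRAM_UAPI_IOC_ALLOC _IOW('S', 0, struct sram_alloc_req)

int main(void)
{
	struct sram_alloc_req req = { .size = 4096 };
	int fd = open("/dev/sram_uapi", O_RDWR);	/* hypothetical node name */
	void *p;

	if (fd < 0 || ioctl(fd, SRAM_UAPI_IOC_ALLOC, &req) < 0) {
		perror("sram_uapi");
		return 1;
	}
	/* Map the freshly allocated SRAM region into the process. */
	p = mmap(NULL, req.size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	((volatile char *)p)[0] = 0x5a;		/* touch the SRAM-backed mapping */
	munmap(p, req.size);
	close(fd);
	return 0;
}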

Re: [PATCH 3/4] dma-mapping: add a dma_ops_bypass flag to struct device

2020-04-19 Thread Joerg Roedel
On Sun, Apr 19, 2020 at 10:00:58AM +0200, Christoph Hellwig wrote:
> The difference is that NULL ops imply the direct mapping is always
> used, dma_ops_bypass means a direct mapping is used if no bounce buffering
> using swiotlb is needed, which should also answer your first question.
> The idea is to consolidate code in the core to use an opportunistic
> direct mapping instead of the dynamic iommu mapping.  I thought the cover
> letter and commit log explained this well enough, but maybe I need to
> do a better job.

Ah right, now I see it, when dma_ops_bypass is set it will only use
direct mapping when the available memory fits into the device's
dma_masks, and calls into dma_ops otherwise.

I wonder how that will interact with an IOMMU driver, which has to make
sure that the direct mapping is accessible for the device at all.  It
can either put the device into a passthrough domain for direct mapping
or into a re-mapped domain, but then all DMA-API calls need to use dma-ops.
When the dma_mask covers available memory but coherent_mask doesn't,
the streaming calls will use dma-direct and alloc_coherent() calls into
dma-ops. There is no way for the IOMMU driver to ensure both works.

So what are the conditions under which an IOMMU driver would set
dma_ops_bypass to 1 and get a different result as to when setting
dev->dma_ops to NULL?

Regards,

Joerg


Re: remove set_fs calls from the exec and coredump code v2

2020-04-19 Thread Eric W. Biederman
Christoph Hellwig  writes:

> On Fri, Apr 17, 2020 at 05:41:52PM -0500, Eric W. Biederman wrote:
>> > this series gets rid of playing with the address limit in the exec and
>> > coredump code.  Most of this was fairly trivial, the biggest changes are
>> > those to the spufs coredump code.
>> >
>> > Changes since v1:
>> >  - properly spell NUL
>> >  - properly handle the compat siginfo case in ELF coredumps
>> 
>> Quick question is exec from a kernel thread within the scope of what you
>> are looking at?
>> 
>> There is a set_fs(USER_DS) in flush_old_exec whose sole purpose appears
>> to be to allow exec from kernel threads.  Where the kernel threads
>> run with set_fs(KERNEL_DS) until they call exec.
>
> This series doesn't really look at that area.  But I don't think exec
> from a kernel thread makes any sense, and cleaning up how to set the
> initial USER_DS vs KERNEL_DS state is something I'll eventually get to,
> it seems like a major mess at the moment.

Fair enough.  I just wanted to make certain that it is on people's radar
that when the kernel execs init, the arguments are read from kernel
memory, and the set_fs(USER_DS) in flush_old_exec() makes that not
work later.

It is subtle and easy to miss.  So I figured I would mention it since
I have been staring at the exec code a lot lately.

Eric


Re: [PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32

2020-04-19 Thread Christophe Leroy




Le 18/04/2020 à 13:55, Eric W. Biederman a écrit :

Christophe Leroy  writes:


Le 17/04/2020 à 23:09, Eric W. Biederman a écrit :


To remove the use of set_fs in the coredump code there needs to be a
way to convert a kernel siginfo to a userspace compat siginfo.

Call that function copy_siginfo_to_compat and factor it out of
copy_siginfo_to_user32.


I find it a pity to do that.

The existing function could have been easily converted to using
user_access_begin() + user_access_end() and unsafe_put_user() to copy to
userspace, avoiding copying through a temporary structure on the stack.

With your change, it becomes impossible to do that.


I don't follow.  You don't like temporary structures in the coredump
code or temporary structures in copy_siginfo_to_user32?


In copy_siginfo_to_user32()



A temporary structure in copy_siginfo_to_user is pretty much required
so that it can be zeroed to guarantee we don't pass a structure with
holes to userspace.


Why ? We can zeroize the user structure directly, either with 
clear_user() or with some not yet existing unsafe_clear_user() or 
equivalent.




The implementation of copy_siginfo_to_user32 used to use the equivalent
of user_access_begin() and user_access_end() and the code was a mess
that was very difficult to reason about.  I recall there being holes
in the structure that were being copied to userspace.

Meanwhile if you are going to set all of the bytes a cache hot temporary
structure is quite cheap.


But how can we be sure it is cache hot? As we are using memset() to
zeroize it, it won't be loaded from memory since it will use the dcbz
instruction, but at some point in time it will get flushed back to
memory, and that costs anyway. Unless we invalidate it after the copy,
but that becomes complex.


Christophe


Re: [PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32

2020-04-19 Thread Christophe Leroy




Le 19/04/2020 à 10:13, Christoph Hellwig a écrit :

On Sat, Apr 18, 2020 at 06:55:56AM -0500, Eric W. Biederman wrote:

Is that really an issue to use that set_fs() in the coredump code ?


Using set_fs() is pretty bad and something that we would like to remove
from the kernel entirely.  The fewer instances of set_fs() we have the
better.

I forget all of the details but set_fs() is both a type violation and an
attack point when people are attacking the kernel.  The existence of
set_fs() requires some things that should be constants to be variables.
Something about that means that our current code is difficult to protect
from spectre style vulnerabilities.


Yes, set_fs requires variable based address checking in the uaccess
routines for architectures with a shared address space, or even entirely
different code for architectures with separate kernel and user address
spaces.  My plan is to hopefully kill set_fs in its current form a few
merge windows down the road.  We'll probably still need some form of
it to e.g. mark a thread as kernel thread vs also being able to execute
user code, but it will be much more limited than before, called from very
few places and actually be a no-op for many architectures.



Oh nice. Some time ago I proposed a patch to change set_fs() to a 
flip/flop flag based logic, see 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/dd2876b808ea38eb7b7f760ecd6ce06096c61fb5.1580295551.git.christophe.le...@c-s.fr/


But if we manage to get rid of it completely, that's even better.


Re: [PATCH 8/8] exec: open code copy_string_kernel

2020-04-19 Thread Christophe Leroy




Le 19/04/2020 à 10:06, Christoph Hellwig a écrit :

On Sat, Apr 18, 2020 at 10:15:42AM +0200, Christophe Leroy wrote:



Le 14/04/2020 à 09:01, Christoph Hellwig a écrit :

Currently copy_string_kernel is just a wrapper around copy_strings that
simplifies the calling conventions and uses set_fs to allow passing a
kernel pointer.  But due to the fact that we only need to handle a single
kernel argument pointer, the logic can be significantly simplified while
getting rid of the set_fs.



Instead of duplicating almost identical code, can you write a function that
takes whether the source is from user or from kernel, then you just do
things like:

if (from_user)
len = strnlen_user(str, MAX_ARG_STRLEN);
else
len = strnlen(str, MAX_ARG_STRLEN);


if (from_user)
copy_from_user(kaddr+offset, str, bytes_to_copy);
else
memcpy(kaddr+offset, str, bytes_to_copy);


We'll need two different str variables then with and without __user
annotations to keep type safety.  And introduce a branch-y and unreadable
mess in the exec fast path instead of adding a simple and well understood
function for the kernel case that just deals with the much simpler case
of just copying a single arg vector from a kernel address.



About the branch, I was expecting GCC to inline and eliminate the unused 
branch.


Re: [PATCH] KVM: X86: Fix compile error in svm/sev.c

2020-04-19 Thread Xiaoyao Li

On 4/19/2020 3:30 PM, Tianjia Zhang wrote:

The compiler reported the following compilation errors:

arch/x86/kvm/svm/sev.c: In function ‘sev_pin_memory’:
arch/x86/kvm/svm/sev.c:361:3: error: implicit declaration of function
‘release_pages’ [-Werror=implicit-function-declaration]
release_pages(pages, npinned);
^

The reason is that the 'pagemap.h' header file is not included.



FYI.

Boris has sent the Patch:
https://lkml.kernel.org/r/20200411160927.27954-1...@alien8.de

and it's already in kvm master/queue branch





Re: remove set_fs calls from the exec and coredump code v2

2020-04-19 Thread Christoph Hellwig
On Fri, Apr 17, 2020 at 05:41:52PM -0500, Eric W. Biederman wrote:
> > this series gets rid of playing with the address limit in the exec and
> > coredump code.  Most of this was fairly trivial, the biggest changes are
> > those to the spufs coredump code.
> >
> > Changes since v1:
> >  - properly spell NUL
> >  - properly handle the compat siginfo case in ELF coredumps
> 
> Quick question is exec from a kernel thread within the scope of what you
> are looking at?
> 
> There is a set_fs(USER_DS) in flush_old_exec whose sole purpose appears
> to be to allow exec from kernel threads.  Where the kernel threads
> run with set_fs(KERNEL_DS) until they call exec.

This series doesn't really look at that area.  But I don't think exec
from a kernel thread makes any sense, and cleaning up how to set the
initial USER_DS vs KERNEL_DS state is something I'll eventually get to,
it seems like a major mess at the moment.


Re: [PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32

2020-04-19 Thread Christoph Hellwig
On Sat, Apr 18, 2020 at 06:55:56AM -0500, Eric W. Biederman wrote:
> > Is that really an issue to use that set_fs() in the coredump code ?
> 
> Using set_fs() is pretty bad and something that we would like to remove
> from the kernel entirely.  The fewer instances of set_fs() we have the
> better.
> 
> I forget all of the details but set_fs() is both a type violation and an
> attack point when people are attacking the kernel.  The existence of
> set_fs() requires some things that should be constants to be variables.
> Something about that means that our current code is difficult to protect
> from spectre style vulnerabilities.

Yes, set_fs requires variable based address checking in the uaccess
routines for architectures with a shared address space, or even entirely
different code for architectures with separate kernel and user address
spaces.  My plan is to hopefully kill set_fs in its current form a few
merge windows down the road.  We'll probably still need some form of
it to e.g. mark a thread as kernel thread vs also being able to execute
> user code, but it will be much more limited than before, called from very
few places and actually be a no-op for many architectures.


Re: [PATCH 8/8] exec: open code copy_string_kernel

2020-04-19 Thread Christoph Hellwig
On Sat, Apr 18, 2020 at 10:15:42AM +0200, Christophe Leroy wrote:
>
>
> Le 14/04/2020 à 09:01, Christoph Hellwig a écrit :
>> Currently copy_string_kernel is just a wrapper around copy_strings that
>> simplifies the calling conventions and uses set_fs to allow passing a
>> kernel pointer.  But due to the fact that we only need to handle a single
>> kernel argument pointer, the logic can be significantly simplified while
>> getting rid of the set_fs.
>
>
> Instead of duplicating almost identical code, can you write a function that 
> takes whether the source is from user or from kernel, then you just do 
> things like:
>
>   if (from_user)
>   len = strnlen_user(str, MAX_ARG_STRLEN);
>   else
>   len = strnlen(str, MAX_ARG_STRLEN);
>
>
>   if (from_user)
>   copy_from_user(kaddr+offset, str, bytes_to_copy);
>   else
>   memcpy(kaddr+offset, str, bytes_to_copy);

We'll need two different str variables then with and without __user
annotations to keep type safety.  And introduce a branch-y and unreadable
mess in the exec fast path instead of adding a simple and well understood
function for the kernel case that just deals with the much simpler case
of just copying a single arg vector from a kernel address.


Re: [PATCH 1/2] signal: Factor copy_siginfo_to_external32 from copy_siginfo_to_user32

2020-04-19 Thread Christoph Hellwig
On Sat, Apr 18, 2020 at 10:05:19AM +0200, Christophe Leroy wrote:
>
>
> Le 17/04/2020 à 23:09, Eric W. Biederman a écrit :
>>
>> To remove the use of set_fs in the coredump code there needs to be a
>> way to convert a kernel siginfo to a userspace compat siginfo.
>>
>> Call that function copy_siginfo_to_compat and factor it out of
>> copy_siginfo_to_user32.
>
> I find it a pity to do that.
>
> The existing function could have been easily converted to using 
> user_access_begin() + user_access_end() and unsafe_put_user() to copy 
> to userspace, avoiding copying through a temporary structure on the stack.
>
> With your change, it becomes impossible to do that.

As Eric said we need a struct to clear all padding.  Note that I
thought about converting to unsafe_copy_to_user in my variant, as we
can pretty easily do that if we pre-fill the structure earlier.  But
I didn't want to throw in such unrelated changes for now - I'll volunteer
to do it later, though.
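
To make the padding point concrete in a self-contained way (generic C, not
the actual siginfo layout): assigning fields individually leaves the
compiler-inserted padding bytes indeterminate, while memset()ing a temporary
first guarantees every byte that later reaches userspace is defined.

#include <stdio.h>
#include <string.h>

struct demo {
	char a;		/* 1 byte, typically followed by 7 padding bytes on LP64 */
	long b;
};

static void dump(const char *tag, const struct demo *d)
{
	unsigned char raw[sizeof(*d)];

	memcpy(raw, d, sizeof(raw));
	printf("%s:", tag);
	for (size_t i = 0; i < sizeof(raw); i++)
		printf(" %02x", raw[i]);
	printf("\n");
}

int main(void)
{
	struct demo d1, d2;

	d1.a = 1;	/* padding bytes of d1 remain indeterminate */
	d1.b = 2;	/* (they may happen to be zero, but nothing guarantees it) */

	memset(&d2, 0, sizeof(d2));	/* kernel-side temporary: all bytes defined */
	d2.a = 1;
	d2.b = 2;

	dump("field-assigned", &d1);
	dump("memset first  ", &d2);
	return 0;
}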


Re: [PATCH 2/8] signal: clean up __copy_siginfo_to_user32

2020-04-19 Thread Christoph Hellwig
On Fri, Apr 17, 2020 at 04:08:23PM -0500, Eric W. Biederman wrote:
> This bothers me because it makes all architectures pay for the sins of
> x32.  Further it starts burying the details of what is happening in
> architecture specific helpers.  Hiding the fact that there is only
> one niche architecture that does anything weird.
> 
> I am very sensitive to hiding away signal handling details right now
> because way too much of the signal handling code got hidden in
> architecture specific files and was quite buggy as a result.

I much prefer the unconditional flags over the crazy ifdef mess in
the current code and your version.  Except for that and the rather
strange and confusing copy_siginfo_to_external32 name, it pretty much
looks the same.


Re: [PATCH 3/4] dma-mapping: add a dma_ops_bypass flag to struct device

2020-04-19 Thread Christoph Hellwig
On Sat, Apr 18, 2020 at 02:42:05PM +0200, Joerg Roedel wrote:
> Hi Christoph,
> 
> On Tue, Apr 14, 2020 at 02:25:05PM +0200, Christoph Hellwig wrote:
> > +static inline bool dma_map_direct(struct device *dev,
> > +   const struct dma_map_ops *ops)
> > +{
> > +   if (likely(!ops))
> > +   return true;
> > +   if (!dev->dma_ops_bypass)
> > +   return false;
> > +
> > +   return min_not_zero(*dev->dma_mask, dev->bus_dma_limit) >=
> > +   dma_direct_get_required_mask(dev);
> 
> Why is the dma-mask check done here? The dma-direct code handles memory
> outside of the devices dma-mask with swiotlb, no?
> 
> I also don't quite get what the difference between setting the
> dma_ops_bypass flag non-zero and setting ops to NULL is.

The difference is that NULL ops imply the direct mapping is always
used, dma_ops_bypass means a direct mapping is used if no bounce buffering
using swiotlb is needed, which should also answer your first question.
The idea is to consolidate code in the core to use an opportunistic
direct mapping instead of the dynamic iommu mapping.  I thought the cover
letter and commit log explained this well enough, but maybe I need to
do a better job.
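
In other words, a simplified sketch of the dispatch the quoted helper enables
(not the exact core code): the generic map path takes the direct mapping
whenever dma_map_direct() says bounce buffering cannot be needed, and only
falls back to the IOMMU ops otherwise:

static dma_addr_t dma_map_page_sketch(struct device *dev, struct page *page,
				      size_t offset, size_t size,
				      enum dma_data_direction dir,
				      unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (dma_map_direct(dev, ops))
		/* no ops, or bypass allowed and the mask covers all memory */
		return dma_direct_map_page(dev, page, offset, size, dir, attrs);

	return ops->map_page(dev, page, offset, size, dir, attrs);
}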


[PATCH 7/7] KVM: MIPS: clean up redundant kvm_run parameters in assembly

2020-04-19 Thread Tianjia Zhang
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function
parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time.
This patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
 arch/mips/include/asm/kvm_host.h |  4 ++--
 arch/mips/kvm/entry.c| 15 +--
 arch/mips/kvm/mips.c |  3 ++-
 arch/mips/kvm/trap_emul.c|  2 +-
 arch/mips/kvm/vz.c   |  2 +-
 5 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 971439297cea..db915c55166d 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -310,7 +310,7 @@ struct kvm_mmu_memory_cache {
 #define KVM_MIPS_GUEST_TLB_SIZE64
 struct kvm_vcpu_arch {
void *guest_ebase;
-   int (*vcpu_run)(struct kvm_run *run, struct kvm_vcpu *vcpu);
+   int (*vcpu_run)(struct kvm_vcpu *vcpu);
 
/* Host registers preserved across guest mode execution */
unsigned long host_stack;
@@ -821,7 +821,7 @@ int kvm_mips_emulation_init(struct kvm_mips_callbacks 
**install_callbacks);
 /* Debug: dump vcpu state */
 int kvm_arch_vcpu_dump_regs(struct kvm_vcpu *vcpu);
 
-extern int kvm_mips_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu);
+extern int kvm_mips_handle_exit(struct kvm_vcpu *vcpu);
 
 /* Building of entry/exception code */
 int kvm_mips_entry_setup(void);
diff --git a/arch/mips/kvm/entry.c b/arch/mips/kvm/entry.c
index 16e1c93b484f..e3f29af3b6cd 100644
--- a/arch/mips/kvm/entry.c
+++ b/arch/mips/kvm/entry.c
@@ -204,7 +204,7 @@ static inline void build_set_exc_base(u32 **p, unsigned int 
reg)
  * Assemble the start of the vcpu_run function to run a guest VCPU. The 
function
  * conforms to the following prototype:
  *
- * int vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu);
+ * int vcpu_run(struct kvm_vcpu *vcpu);
  *
  * The exit from the guest and return to the caller is handled by the code
  * generated by kvm_mips_build_ret_to_host().
@@ -217,8 +217,7 @@ void *kvm_mips_build_vcpu_run(void *addr)
unsigned int i;
 
/*
-* A0: run
-* A1: vcpu
+* A0: vcpu
 */
 
/* k0/k1 not being used in host kernel context */
@@ -237,10 +236,10 @@ void *kvm_mips_build_vcpu_run(void *addr)
kvm_mips_build_save_scratch(&p, V1, K1);
 
/* VCPU scratch register has pointer to vcpu */
-   UASM_i_MTC0(&p, A1, scratch_vcpu[0], scratch_vcpu[1]);
+   UASM_i_MTC0(&p, A0, scratch_vcpu[0], scratch_vcpu[1]);
 
/* Offset into vcpu->arch */
-   UASM_i_ADDIU(&p, K1, A1, offsetof(struct kvm_vcpu, arch));
+   UASM_i_ADDIU(&p, K1, A0, offsetof(struct kvm_vcpu, arch));
 
/*
 * Save the host stack to VCPU, used for exception processing
@@ -628,10 +627,7 @@ void *kvm_mips_build_exit(void *addr)
/* Now that context has been saved, we can use other registers */
 
/* Restore vcpu */
-   UASM_i_MFC0(&p, S1, scratch_vcpu[0], scratch_vcpu[1]);
-
-   /* Restore run (vcpu->run) */
-   UASM_i_LW(&p, S0, offsetof(struct kvm_vcpu, run), S1);
+   UASM_i_MFC0(&p, S0, scratch_vcpu[0], scratch_vcpu[1]);
 
/*
 * Save Host level EPC, BadVaddr and Cause to VCPU, useful to process
@@ -793,7 +789,6 @@ void *kvm_mips_build_exit(void *addr)
 * with this in the kernel
 */
uasm_i_move(&p, A0, S0);
-   uasm_i_move(&p, A1, S1);
UASM_i_LA(&p, T9, (unsigned long)kvm_mips_handle_exit);
uasm_i_jalr(&p, RA, T9);
 UASM_i_ADDIU(&p, SP, SP, -CALLFRAME_SIZ);
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 9710477a9827..32850470c037 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1186,8 +1186,9 @@ static void kvm_mips_set_c0_status(void)
 /*
  * Return value is in the form (errcode<<2 | RESUME_FLAG_HOST | RESUME_FLAG_NV)
  */
-int kvm_mips_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu)
+int kvm_mips_handle_exit(struct kvm_vcpu *vcpu)
 {
+   struct kvm_run *run = vcpu->run;
u32 cause = vcpu->arch.host_cp0_cause;
u32 exccode = (cause >> CAUSEB_EXCCODE) & 0x1f;
u32 __user *opc = (u32 __user *) vcpu->arch.pc;
diff --git a/arch/mips/kvm/trap_emul.c b/arch/mips/kvm/trap_emul.c
index d822f3aee3dc..04c864cc356a 100644
--- a/arch/mips/kvm/trap_emul.c
+++ b/arch/mips/kvm/trap_emul.c
@@ -1238,7 +1238,7 @@ static int kvm_trap_emul_vcpu_run(struct kvm_vcpu *vcpu)
 */
kvm_mips_suspend_mm(cpu);
 
-   r = vcpu->arch.vcpu_run(vcpu->run, vcpu);
+   r = vcpu->arch.vcpu_run(vcpu);
 
/* We may have migrated while handling guest exits */
cpu = smp_processor_id();
diff --git a/arch/mips/kvm/vz.c b/arch/mips/kvm/vz.c
index 94f1d23828e3..c5878fa0636d 100644
--- a/arch/mips/kvm/vz.c
+++ b/arch/mips/kvm/vz.c
@@ -3152,7 +3152,7 @@ 

[PATCH 5/7] KVM: PPC: clean up redundant kvm_run parameters in assembly

2020-04-19 Thread Tianjia Zhang
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function
parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time.
This patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
 arch/powerpc/include/asm/kvm_ppc.h|  2 +-
 arch/powerpc/kvm/book3s_interrupts.S  | 17 -
 arch/powerpc/kvm/book3s_pr.c  |  9 -
 arch/powerpc/kvm/booke.c  |  9 -
 arch/powerpc/kvm/booke_interrupts.S   |  9 -
 arch/powerpc/kvm/bookehv_interrupts.S | 10 +-
 6 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index ccf66b3a4c1d..0a056c64c317 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -59,7 +59,7 @@ enum xlate_readwrite {
 };
 
 extern int kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
-extern int __kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu);
+extern int __kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
 extern void kvmppc_handler_highmem(void);
 
 extern void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_interrupts.S 
b/arch/powerpc/kvm/book3s_interrupts.S
index f7ad99d972ce..0eff749d8027 100644
--- a/arch/powerpc/kvm/book3s_interrupts.S
+++ b/arch/powerpc/kvm/book3s_interrupts.S
@@ -55,8 +55,7 @@
  /
 
 /* Registers:
- *  r3: kvm_run pointer
- *  r4: vcpu pointer
+ *  r3: vcpu pointer
  */
 _GLOBAL(__kvmppc_vcpu_run)
 
@@ -68,8 +67,8 @@ kvm_start_entry:
/* Save host state to the stack */
PPC_STLU r1, -SWITCH_FRAME_SIZE(r1)
 
-   /* Save r3 (kvm_run) and r4 (vcpu) */
-   SAVE_2GPRS(3, r1)
+   /* Save r3 (vcpu) */
+   SAVE_GPR(3, r1)
 
/* Save non-volatile registers (r14 - r31) */
SAVE_NVGPRS(r1)
@@ -82,11 +81,11 @@ kvm_start_entry:
PPC_STL r0, _LINK(r1)
 
/* Load non-volatile guest state from the vcpu */
-   VCPU_LOAD_NVGPRS(r4)
+   VCPU_LOAD_NVGPRS(r3)
 
 kvm_start_lightweight:
/* Copy registers into shadow vcpu so we can access them in real mode */
-   mr  r3, r4
+   mr  r4, r3
bl  FUNC(kvmppc_copy_to_svcpu)
nop
REST_GPR(4, r1)
@@ -191,10 +190,10 @@ after_sprg3_load:
PPC_STL r31, VCPU_GPR(R31)(r7)
 
/* Pass the exit number as 3rd argument to kvmppc_handle_exit */
-   lwz r5, VCPU_TRAP(r7)
+   lwz r4, VCPU_TRAP(r7)
 
-   /* Restore r3 (kvm_run) and r4 (vcpu) */
-   REST_2GPRS(3, r1)
+   /* Restore r3 (vcpu) */
+   REST_GPR(3, r1)
bl  FUNC(kvmppc_handle_exit_pr)
 
/* If RESUME_GUEST, get back in the loop */
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index ef54f917bdaf..01c8fe5abe0d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1151,9 +1151,9 @@ static int kvmppc_exit_pr_progint(struct kvm_vcpu *vcpu, 
unsigned int exit_nr)
return r;
 }
 
-int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
- unsigned int exit_nr)
+int kvmppc_handle_exit_pr(struct kvm_vcpu *vcpu, unsigned int exit_nr)
 {
+   struct kvm_run *run = vcpu->run;
int r = RESUME_HOST;
int s;
 
@@ -1826,7 +1826,6 @@ static void kvmppc_core_vcpu_free_pr(struct kvm_vcpu 
*vcpu)
 
 static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
 {
-   struct kvm_run *run = vcpu->run;
int ret;
 #ifdef CONFIG_ALTIVEC
unsigned long uninitialized_var(vrsave);
@@ -1834,7 +1833,7 @@ static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
 
/* Check if we can run the vcpu at all */
if (!vcpu->arch.sane) {
-   run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
ret = -EINVAL;
goto out;
}
@@ -1861,7 +1860,7 @@ static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
 
kvmppc_fix_ee_before_entry();
 
-   ret = __kvmppc_vcpu_run(run, vcpu);
+   ret = __kvmppc_vcpu_run(vcpu);
 
kvmppc_clear_debug(vcpu);
 
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 26b3f5900b72..942039aae598 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -732,12 +732,11 @@ int kvmppc_core_check_requests(struct kvm_vcpu *vcpu)
 
 int kvmppc_vcpu_run(struct kvm_vcpu *vcpu)
 {
-   struct kvm_run *run = vcpu->run;
int ret, s;
struct debug_reg debug;
 
if (!vcpu->arch.sane) {
-   run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
return -EINVAL;
}
 
@@ -779,7 +778,7 @@ int kvmppc_vcpu_run(struct kvm_vcpu *vcpu)

[PATCH 6/7] KVM: MIPS: clean up redundant 'kvm_run' parameters

2020-04-19 Thread Tianjia Zhang
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function
parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time.
This patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
 arch/mips/include/asm/kvm_host.h |  28 +---
 arch/mips/kvm/emulate.c  |  59 ++--
 arch/mips/kvm/mips.c |  11 ++-
 arch/mips/kvm/trap_emul.c| 114 ++-
 arch/mips/kvm/vz.c   |  26 +++
 5 files changed, 87 insertions(+), 151 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 2c343c346b79..971439297cea 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -812,8 +812,8 @@ struct kvm_mips_callbacks {
   const struct kvm_one_reg *reg, s64 v);
int (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
int (*vcpu_put)(struct kvm_vcpu *vcpu, int cpu);
-   int (*vcpu_run)(struct kvm_run *run, struct kvm_vcpu *vcpu);
-   void (*vcpu_reenter)(struct kvm_run *run, struct kvm_vcpu *vcpu);
+   int (*vcpu_run)(struct kvm_vcpu *vcpu);
+   void (*vcpu_reenter)(struct kvm_vcpu *vcpu);
 };
 extern struct kvm_mips_callbacks *kvm_mips_callbacks;
 int kvm_mips_emulation_init(struct kvm_mips_callbacks **install_callbacks);
@@ -868,7 +868,6 @@ extern int kvm_mips_handle_mapped_seg_tlb_fault(struct 
kvm_vcpu *vcpu,
 
 extern enum emulation_result kvm_mips_handle_tlbmiss(u32 cause,
 u32 *opc,
-struct kvm_run *run,
 struct kvm_vcpu *vcpu,
 bool write_fault);
 
@@ -975,83 +974,67 @@ static inline bool kvm_is_ifetch_fault(struct 
kvm_vcpu_arch *vcpu)
 
 extern enum emulation_result kvm_mips_emulate_inst(u32 cause,
   u32 *opc,
-  struct kvm_run *run,
   struct kvm_vcpu *vcpu);
 
 long kvm_mips_guest_exception_base(struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_syscall(u32 cause,
  u32 *opc,
- struct kvm_run *run,
  struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_tlbmiss_ld(u32 cause,
 u32 *opc,
-struct kvm_run *run,
 struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_tlbinv_ld(u32 cause,
u32 *opc,
-   struct kvm_run *run,
struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_tlbmiss_st(u32 cause,
 u32 *opc,
-struct kvm_run *run,
 struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_tlbinv_st(u32 cause,
u32 *opc,
-   struct kvm_run *run,
struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_tlbmod(u32 cause,
 u32 *opc,
-struct kvm_run *run,
 struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_fpu_exc(u32 cause,
  u32 *opc,
- struct kvm_run *run,
  struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_handle_ri(u32 cause,
u32 *opc,
-   struct kvm_run *run,
struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_ri_exc(u32 cause,
 u32 *opc,
-struct kvm_run *run,
 struct kvm_vcpu *vcpu);
 
 extern enum emulation_result kvm_mips_emulate_bp_exc(u32 cause,
  

[PATCH 4/7] KVM: PPC: clean up redundant 'kvm_run' parameters

2020-04-19 Thread Tianjia Zhang
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function
parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time.
This patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
 arch/powerpc/include/asm/kvm_book3s.h| 16 +++---
 arch/powerpc/include/asm/kvm_ppc.h   | 27 +
 arch/powerpc/kvm/book3s.c|  4 +-
 arch/powerpc/kvm/book3s.h|  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c  | 12 ++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  4 +-
 arch/powerpc/kvm/book3s_emulate.c| 10 ++--
 arch/powerpc/kvm/book3s_hv.c | 60 ++--
 arch/powerpc/kvm/book3s_hv_nested.c  | 11 ++--
 arch/powerpc/kvm/book3s_paired_singles.c | 72 
 arch/powerpc/kvm/book3s_pr.c | 30 +-
 arch/powerpc/kvm/booke.c | 36 ++--
 arch/powerpc/kvm/booke.h |  8 +--
 arch/powerpc/kvm/booke_emulate.c |  2 +-
 arch/powerpc/kvm/e500_emulate.c  | 15 +++--
 arch/powerpc/kvm/emulate.c   | 10 ++--
 arch/powerpc/kvm/emulate_loadstore.c | 32 +--
 arch/powerpc/kvm/powerpc.c   | 72 
 arch/powerpc/kvm/trace_hv.h  |  6 +-
 19 files changed, 212 insertions(+), 217 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 506e4df2d730..66dbb1f85d59 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -155,12 +155,11 @@ extern void kvmppc_mmu_unmap_page(struct kvm_vcpu *vcpu, 
struct kvmppc_pte *pte)
 extern int kvmppc_mmu_map_segment(struct kvm_vcpu *vcpu, ulong eaddr);
 extern void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong eaddr, ulong 
seg_size);
 extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
-extern int kvmppc_book3s_hv_page_fault(struct kvm_run *run,
-   struct kvm_vcpu *vcpu, unsigned long addr,
-   unsigned long status);
+extern int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
+   unsigned long addr, unsigned long status);
 extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr,
unsigned long slb_v, unsigned long valid);
-extern int kvmppc_hv_emulate_mmio(struct kvm_run *run, struct kvm_vcpu *vcpu,
+extern int kvmppc_hv_emulate_mmio(struct kvm_vcpu *vcpu,
unsigned long gpa, gva_t ea, int is_store);
 
 extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct hpte_cache 
*pte);
@@ -174,8 +173,7 @@ extern void kvmppc_mmu_hpte_sysexit(void);
 extern int kvmppc_mmu_hv_init(void);
 extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, unsigned long hc);
 
-extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
-   struct kvm_vcpu *vcpu,
+extern int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
 extern unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
gva_t eaddr, void *to, void *from,
@@ -234,7 +232,7 @@ extern void kvmppc_trigger_fac_interrupt(struct kvm_vcpu 
*vcpu, ulong fac);
 extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
   bool upper, u32 val);
 extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
-extern int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu 
*vcpu);
+extern int kvmppc_emulate_paired_single(struct kvm_vcpu *vcpu);
 extern kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa,
bool writing, bool *writable);
 extern void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
@@ -300,12 +298,12 @@ void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 
dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
 long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
-int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu,
+int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu,
  u64 time_limit, unsigned long lpcr);
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
 void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
   struct hv_guest_state *hr);
-long int kvmhv_nested_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu);
+long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 94f5a32acaf1..ccf66b3a4c1d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -58,28 

[PATCH 0/7] clean up redundant 'kvm_run' parameters

2020-04-19 Thread Tianjia Zhang
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function
parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time.
This patch does a unified cleanup of these remaining redundant parameters.

This series of patches completely cleans up the arm64, mips, ppc, and
s390 architectures (there is no such redundant code on x86). Due to
the large amount of modified code, a separate patch is made for each
platform. On the ppc platform, there is also a redundant structure
pointer of 'kvm_run' in 'vcpu_arch', which has also been cleaned
separately.
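
The shape of the cleanup is the same in every patch of the series (the function
name below is made up, purely to illustrate the pattern):

/*
 * Old prototype: int handle_foo_exit(struct kvm_run *run, struct kvm_vcpu *vcpu);
 * New prototype: the run structure is derived from the vcpu instead.
 */
static int handle_foo_exit(struct kvm_vcpu *vcpu)
{
	struct kvm_run *run = vcpu->run;

	/* body unchanged, run is used exactly as before */
	run->exit_reason = KVM_EXIT_UNKNOWN;
	return 0;
}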

Thanks and best.

Tianjia Zhang (7):
  KVM: s390: clean up redundant 'kvm_run' parameters
  KVM: arm64: clean up redundant 'kvm_run' parameters
  KVM: PPC: Remove redundant kvm_run from vcpu_arch
  KVM: PPC: clean up redundant 'kvm_run' parameters
  KVM: PPC: clean up redundant kvm_run parameters in assembly
  KVM: MIPS: clean up redundant 'kvm_run' parameters
  KVM: MIPS: clean up redundant kvm_run parameters in assembly

 arch/arm64/include/asm/kvm_coproc.h  |  12 +--
 arch/arm64/include/asm/kvm_host.h|  11 +-
 arch/arm64/include/asm/kvm_mmu.h |   2 +-
 arch/arm64/kvm/handle_exit.c |  36 +++
 arch/arm64/kvm/sys_regs.c|  13 ++-
 arch/mips/include/asm/kvm_host.h |  32 +-
 arch/mips/kvm/emulate.c  |  59 ---
 arch/mips/kvm/entry.c|  15 +--
 arch/mips/kvm/mips.c |  14 +--
 arch/mips/kvm/trap_emul.c| 114 +---
 arch/mips/kvm/vz.c   |  26 ++---
 arch/powerpc/include/asm/kvm_book3s.h|  16 ++-
 arch/powerpc/include/asm/kvm_host.h  |   1 -
 arch/powerpc/include/asm/kvm_ppc.h   |  27 +++--
 arch/powerpc/kvm/book3s.c|   4 +-
 arch/powerpc/kvm/book3s.h|   2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  12 +--
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |   4 +-
 arch/powerpc/kvm/book3s_emulate.c|  10 +-
 arch/powerpc/kvm/book3s_hv.c |  64 ++--
 arch/powerpc/kvm/book3s_hv_nested.c  |  12 +--
 arch/powerpc/kvm/book3s_interrupts.S |  17 ++-
 arch/powerpc/kvm/book3s_paired_singles.c |  72 ++---
 arch/powerpc/kvm/book3s_pr.c |  33 +++---
 arch/powerpc/kvm/booke.c |  39 +++
 arch/powerpc/kvm/booke.h |   8 +-
 arch/powerpc/kvm/booke_emulate.c |   2 +-
 arch/powerpc/kvm/booke_interrupts.S  |   9 +-
 arch/powerpc/kvm/bookehv_interrupts.S|  10 +-
 arch/powerpc/kvm/e500_emulate.c  |  15 ++-
 arch/powerpc/kvm/emulate.c   |  10 +-
 arch/powerpc/kvm/emulate_loadstore.c |  32 +++---
 arch/powerpc/kvm/powerpc.c   |  72 ++---
 arch/powerpc/kvm/trace_hv.h  |   6 +-
 arch/s390/kvm/kvm-s390.c | 127 ---
 virt/kvm/arm/arm.c   |   6 +-
 virt/kvm/arm/mmio.c  |  11 +-
 virt/kvm/arm/mmu.c   |   5 +-
 38 files changed, 441 insertions(+), 519 deletions(-)

-- 
2.17.1



[PATCH 2/7] KVM: arm64: clean up redundant 'kvm_run' parameters

2020-04-19 Thread Tianjia Zhang
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function
parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time.
This patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
 arch/arm64/include/asm/kvm_coproc.h | 12 +-
 arch/arm64/include/asm/kvm_host.h   | 11 -
 arch/arm64/include/asm/kvm_mmu.h|  2 +-
 arch/arm64/kvm/handle_exit.c| 36 ++---
 arch/arm64/kvm/sys_regs.c   | 13 +--
 virt/kvm/arm/arm.c  |  6 ++---
 virt/kvm/arm/mmio.c | 11 +
 virt/kvm/arm/mmu.c  |  5 ++--
 8 files changed, 46 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_coproc.h 
b/arch/arm64/include/asm/kvm_coproc.h
index 0185ee8b8b5e..454373704b8a 100644
--- a/arch/arm64/include/asm/kvm_coproc.h
+++ b/arch/arm64/include/asm/kvm_coproc.h
@@ -27,12 +27,12 @@ struct kvm_sys_reg_target_table {
 void kvm_register_target_sys_reg_table(unsigned int target,
   struct kvm_sys_reg_target_table *table);
 
-int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run);
-int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
-int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
-int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
-int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
-int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run);
+int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu);
+int kvm_handle_cp14_32(struct kvm_vcpu *vcpu);
+int kvm_handle_cp14_64(struct kvm_vcpu *vcpu);
+int kvm_handle_cp15_32(struct kvm_vcpu *vcpu);
+int kvm_handle_cp15_64(struct kvm_vcpu *vcpu);
+int kvm_handle_sys_reg(struct kvm_vcpu *vcpu);
 
 #define kvm_coproc_table_init kvm_sys_reg_table_init
 void kvm_sys_reg_table_init(void);
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4..3fab32e4948c 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -481,18 +481,15 @@ u64 __kvm_call_hyp(void *hypfn, ...);
 void force_vm_exit(const cpumask_t *mask);
 void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
 
-int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
-   int exception_index);
-void handle_exit_early(struct kvm_vcpu *vcpu, struct kvm_run *run,
-  int exception_index);
+int handle_exit(struct kvm_vcpu *vcpu, int exception_index);
+void handle_exit_early(struct kvm_vcpu *vcpu, int exception_index);
 
 /* MMIO helpers */
 void kvm_mmio_write_buf(void *buf, unsigned int len, unsigned long data);
 unsigned long kvm_mmio_read_buf(const void *buf, unsigned int len);
 
-int kvm_handle_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run);
-int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
-phys_addr_t fault_ipa);
+int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
+int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
 int kvm_perf_init(void);
 int kvm_perf_teardown(void);
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 30b0e8d6b895..2ec7b9bb25d3 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -159,7 +159,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
  phys_addr_t pa, unsigned long size, bool writable);
 
-int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
+int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
 
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
 
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index aacfc55de44c..ec3a66642ea5 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -25,7 +25,7 @@
 #define CREATE_TRACE_POINTS
 #include "trace.h"
 
-typedef int (*exit_handle_fn)(struct kvm_vcpu *, struct kvm_run *);
+typedef int (*exit_handle_fn)(struct kvm_vcpu *);
 
 static void kvm_handle_guest_serror(struct kvm_vcpu *vcpu, u32 esr)
 {
@@ -33,7 +33,7 @@ static void kvm_handle_guest_serror(struct kvm_vcpu *vcpu, 
u32 esr)
kvm_inject_vabt(vcpu);
 }
 
-static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run *run)
+static int handle_hvc(struct kvm_vcpu *vcpu)
 {
int ret;
 
@@ -50,7 +50,7 @@ static int handle_hvc(struct kvm_vcpu *vcpu, struct kvm_run 
*run)
return ret;
 }
 
-static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run *run)
+static int handle_smc(struct kvm_vcpu *vcpu)
 {
/*
 * "If an SMC instruction executed at Non-secure EL1 is
@@ -69,7 +69,7 @@ static int handle_smc(struct kvm_vcpu *vcpu, struct kvm_run 
*run)
  * Guest access to FP/ASIMD registers are routed 

[PATCHv3 00/50] Add log level to show_stack()

2020-04-19 Thread Dmitry Safonov
Changes to v3:
- Collected more architectual Acks and Reviewed-by
- Fixed compilation on sparc64

Changes to v2:
- Removed excessive pr_cont("\n") (nits by Senozhatsky)
- Leave backtrace debugging messages with pr_debug()
  (noted by Russell King and Will Deacon)
- Correct microblaze_unwind_inner() declaration
  (Thanks to Michal Simek and kbuild test robot)
- Fix copy'n'paste typo in show_stack_loglvl() for sparc
  (kbuild robot)
- Fix backtrace output on xtensa
  (Thanks Max Filippov)
- Add loglevel to show_stack() on s390 (kbuild robot)
- Collected all Reviewed-by and Acked-by (thanks!)

v2: https://lore.kernel.org/linux-riscv/20200316143916.195608-1-d...@arista.com/
v1: https://lore.kernel.org/linux-riscv/20191106030542.868541-1-d...@arista.com/

Add log level argument to show_stack().
Done in three stages:
1. Introducing show_stack_loglvl() for every architecture
2. Migrating old users with an explicit log level
3. Renaming show_stack_loglvl() into show_stack()
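
A rough sketch of stages 1 and 3 (bodies illustrative, only the signatures
matter here):

/* stage 1: the new variant carries the log level down to the arch code */
void show_stack_loglvl(struct task_struct *task, unsigned long *sp,
		       const char *loglvl)
{
	/* arch backtrace printing prefixes every line with loglvl */
}

/* until stage 3 renames the function above to show_stack(), the old
 * entry point just forwards the historical default level */
void show_stack(struct task_struct *task, unsigned long *sp)
{
	show_stack_loglvl(task, sp, KERN_DEFAULT);
}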

Justification:
o It's a design mistake to move a business-logic decision
  into platform realization detail.
o I have currently two patches sets that would benefit from this work:
  Removing console_loglevel jumps in sysrq driver [1]
  Hung task warning before panic [2] - suggested by Tetsuo (but he
  probably didn't realise what it would involve).
o While doing (1) and (2), the backtraces were matched to the headers
  and other messages for each situation - so there won't be a situation
  where the backtrace is printed but the headers are missing because
  they have a lower log level (or the reverse).
o As a result, the console_loglevel games for kdb in (2) are removed.

The least important point for upstream, but maybe still worth noting, is that
every company I've worked in so far had an off-list patch to print
backtrace with the needed log level (but only for the architecture they
cared about).
If you have other ideas how you will benefit from show_stack() with
a log level - please, reply to this cover letter.

See also discussion on v1:
https://lore.kernel.org/linux-riscv/20191106083538.z5nlpuf64cigx...@pathway.suse.cz/

Cc: Andrew Morton 
Cc: Greg Kroah-Hartman 
Cc: Ingo Molnar 
Cc: Jiri Slaby 
Cc: Petr Mladek 
Cc: Sergey Senozhatsky 
Cc: Steven Rostedt 
Cc: Tetsuo Handa 

Thanks,
Dmitry

[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-d...@arista.com/T/#u
[2]: 
https://lkml.kernel.org/r/41fd7652-df1f-26f6-aba0-b87ebae07...@i-love.sakura.ne.jp

Dmitry Safonov (50):
  kallsyms/printk: Add loglvl to print_ip_sym()
  alpha: Add show_stack_loglvl()
  arc: Add show_stack_loglvl()
  arm/asm: Add loglvl to c_backtrace()
  arm: Add loglvl to unwind_backtrace()
  arm: Add loglvl to dump_backtrace()
  arm: Wire up dump_backtrace_{entry,stm}
  arm: Add show_stack_loglvl()
  arm64: Add loglvl to dump_backtrace()
  arm64: Add show_stack_loglvl()
  c6x: Add show_stack_loglvl()
  csky: Add show_stack_loglvl()
  h8300: Add show_stack_loglvl()
  hexagon: Add show_stack_loglvl()
  ia64: Pass log level as arg into ia64_do_show_stack()
  ia64: Add show_stack_loglvl()
  m68k: Add show_stack_loglvl()
  microblaze: Add loglvl to microblaze_unwind_inner()
  microblaze: Add loglvl to microblaze_unwind()
  microblaze: Add show_stack_loglvl()
  mips: Add show_stack_loglvl()
  nds32: Add show_stack_loglvl()
  nios2: Add show_stack_loglvl()
  openrisc: Add show_stack_loglvl()
  parisc: Add show_stack_loglvl()
  powerpc: Add show_stack_loglvl()
  riscv: Add show_stack_loglvl()
  s390: Add show_stack_loglvl()
  sh: Add loglvl to dump_mem()
  sh: Remove needless printk()
  sh: Add loglvl to printk_address()
  sh: Add loglvl to show_trace()
  sh: Add show_stack_loglvl()
  sparc: Add show_stack_loglvl()
  um/sysrq: Remove needless variable sp
  um: Add show_stack_loglvl()
  unicore32: Remove unused pmode argument in c_backtrace()
  unicore32: Add loglvl to c_backtrace()
  unicore32: Add show_stack_loglvl()
  x86: Add missing const qualifiers for log_lvl
  x86: Add show_stack_loglvl()
  xtensa: Add loglvl to show_trace()
  xtensa: Add show_stack_loglvl()
  sysrq: Use show_stack_loglvl()
  x86/amd_gart: Print stacktrace for a leak with KERN_ERR
  power: Use show_stack_loglvl()
  kdb: Don't play with console_loglevel
  sched: Print stack trace with KERN_INFO
  kernel: Use show_stack_loglvl()
  kernel: Rename show_stack_loglvl() => show_stack()

 arch/alpha/kernel/traps.c| 22 +++
 arch/arc/include/asm/bug.h   |  3 ++-
 arch/arc/kernel/stacktrace.c | 17 +++-
 arch/arc/kernel/troubleshoot.c   |  2 +-
 arch/arm/include/asm/bug.h   |  3 ++-
 arch/arm/include/asm/traps.h |  3 ++-
 arch/arm/include/asm/unwind.h|  3 ++-
 arch/arm/kernel/traps.c  | 39 +++
 arch/arm/kernel/unwind.c |  5 ++--
 arch/arm/lib/backtrace-clang.S   |  9 +--
 arch/arm/lib/backtrace.S | 14 +++---
 arch/arm64/include/asm/stacktrace.h  |  3 ++-
 

[PATCH 3/7] KVM: PPC: Remove redundant kvm_run from vcpu_arch

2020-04-19 Thread Tianjia Zhang
The 'kvm_run' field already exists in the 'vcpu' structure and
duplicates the 'kvm_run' field in 'vcpu_arch', so the latter
can be deleted.

Signed-off-by: Tianjia Zhang 
---
 arch/powerpc/include/asm/kvm_host.h | 1 -
 arch/powerpc/kvm/book3s_hv.c| 6 ++
 arch/powerpc/kvm/book3s_hv_nested.c | 3 +--
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1dc63101ffe1..2745ff8faa01 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -795,7 +795,6 @@ struct kvm_vcpu_arch {
struct mmio_hpte_cache_entry *pgfault_cache;
 
struct task_struct *run_task;
-   struct kvm_run *kvm_run;
 
spinlock_t vpa_update_lock;
struct kvmppc_vpa vpa;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 93493f0cbfe8..413ea2dcb10c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2934,7 +2934,7 @@ static void post_guest_process(struct kvmppc_vcore *vc, 
bool is_master)
 
ret = RESUME_GUEST;
if (vcpu->arch.trap)
-   ret = kvmppc_handle_exit_hv(vcpu->arch.kvm_run, vcpu,
+   ret = kvmppc_handle_exit_hv(vcpu->run, vcpu,
vcpu->arch.run_task);
 
vcpu->arch.ret = ret;
@@ -3920,7 +3920,6 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
spin_lock(&vc->lock);
vcpu->arch.ceded = 0;
vcpu->arch.run_task = current;
-   vcpu->arch.kvm_run = kvm_run;
vcpu->arch.stolen_logged = vcore_stolen_time(vc, mftb());
vcpu->arch.state = KVMPPC_VCPU_RUNNABLE;
vcpu->arch.busy_preempt = TB_NIL;
@@ -3973,7 +3972,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
if (signal_pending(v->arch.run_task)) {
kvmppc_remove_runnable(vc, v);
v->stat.signal_exits++;
-   v->arch.kvm_run->exit_reason = KVM_EXIT_INTR;
+   v->run->exit_reason = KVM_EXIT_INTR;
v->arch.ret = -EINTR;
wake_up(&v->arch.cpu_run);
}
@@ -4049,7 +4048,6 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
vc = vcpu->arch.vcore;
vcpu->arch.ceded = 0;
vcpu->arch.run_task = current;
-   vcpu->arch.kvm_run = kvm_run;
vcpu->arch.stolen_logged = vcore_stolen_time(vc, mftb());
vcpu->arch.state = KVMPPC_VCPU_RUNNABLE;
vcpu->arch.busy_preempt = TB_NIL;
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index dc97e5be76f6..5a3987f3ebf3 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -290,8 +290,7 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
r = RESUME_HOST;
break;
}
-   r = kvmhv_run_single_vcpu(vcpu->arch.kvm_run, vcpu, hdec_exp,
- lpcr);
+   r = kvmhv_run_single_vcpu(vcpu->run, vcpu, hdec_exp, lpcr);
} while (is_kvmppc_resume_guest(r));
 
/* save L2 state for return */
-- 
2.17.1



[PATCH 1/7] KVM: s390: clean up redundant 'kvm_run' parameters

2020-04-19 Thread Tianjia Zhang
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function
parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time.
This patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
 arch/s390/kvm/kvm-s390.c | 127 +--
 1 file changed, 67 insertions(+), 60 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 443af3ead739..cf420d013ba3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4173,24 +4173,25 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
return rc;
 }
 
-static void sync_regs_fmt2(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static void sync_regs_fmt2(struct kvm_vcpu *vcpu)
 {
+   struct kvm_run *run = vcpu->run;
struct runtime_instr_cb *riccb;
struct gs_cb *gscb;
 
-   riccb = (struct runtime_instr_cb *) &kvm_run->s.regs.riccb;
-   gscb = (struct gs_cb *) &kvm_run->s.regs.gscb;
-   vcpu->arch.sie_block->gpsw.mask = kvm_run->psw_mask;
-   vcpu->arch.sie_block->gpsw.addr = kvm_run->psw_addr;
-   if (kvm_run->kvm_dirty_regs & KVM_SYNC_ARCH0) {
-   vcpu->arch.sie_block->todpr = kvm_run->s.regs.todpr;
-   vcpu->arch.sie_block->pp = kvm_run->s.regs.pp;
-   vcpu->arch.sie_block->gbea = kvm_run->s.regs.gbea;
-   }
-   if (kvm_run->kvm_dirty_regs & KVM_SYNC_PFAULT) {
-   vcpu->arch.pfault_token = kvm_run->s.regs.pft;
-   vcpu->arch.pfault_select = kvm_run->s.regs.pfs;
-   vcpu->arch.pfault_compare = kvm_run->s.regs.pfc;
+   riccb = (struct runtime_instr_cb *) &run->s.regs.riccb;
+   gscb = (struct gs_cb *) &run->s.regs.gscb;
+   vcpu->arch.sie_block->gpsw.mask = run->psw_mask;
+   vcpu->arch.sie_block->gpsw.addr = run->psw_addr;
+   if (run->kvm_dirty_regs & KVM_SYNC_ARCH0) {
+   vcpu->arch.sie_block->todpr = run->s.regs.todpr;
+   vcpu->arch.sie_block->pp = run->s.regs.pp;
+   vcpu->arch.sie_block->gbea = run->s.regs.gbea;
+   }
+   if (run->kvm_dirty_regs & KVM_SYNC_PFAULT) {
+   vcpu->arch.pfault_token = run->s.regs.pft;
+   vcpu->arch.pfault_select = run->s.regs.pfs;
+   vcpu->arch.pfault_compare = run->s.regs.pfc;
if (vcpu->arch.pfault_token == KVM_S390_PFAULT_TOKEN_INVALID)
kvm_clear_async_pf_completion_queue(vcpu);
}
@@ -4198,7 +4199,7 @@ static void sync_regs_fmt2(struct kvm_vcpu *vcpu, struct 
kvm_run *kvm_run)
 * If userspace sets the riccb (e.g. after migration) to a valid state,
 * we should enable RI here instead of doing the lazy enablement.
 */
-   if ((kvm_run->kvm_dirty_regs & KVM_SYNC_RICCB) &&
+   if ((run->kvm_dirty_regs & KVM_SYNC_RICCB) &&
test_kvm_facility(vcpu->kvm, 64) &&
riccb->v &&
!(vcpu->arch.sie_block->ecb3 & ECB3_RI)) {
@@ -4209,7 +4210,7 @@ static void sync_regs_fmt2(struct kvm_vcpu *vcpu, struct 
kvm_run *kvm_run)
 * If userspace sets the gscb (e.g. after migration) to non-zero,
 * we should enable GS here instead of doing the lazy enablement.
 */
-   if ((kvm_run->kvm_dirty_regs & KVM_SYNC_GSCB) &&
+   if ((run->kvm_dirty_regs & KVM_SYNC_GSCB) &&
test_kvm_facility(vcpu->kvm, 133) &&
gscb->gssm &&
!vcpu->arch.gs_enabled) {
@@ -4218,10 +4219,10 @@ static void sync_regs_fmt2(struct kvm_vcpu *vcpu, 
struct kvm_run *kvm_run)
vcpu->arch.sie_block->ecd |= ECD_HOSTREGMGMT;
vcpu->arch.gs_enabled = 1;
}
-   if ((kvm_run->kvm_dirty_regs & KVM_SYNC_BPBC) &&
+   if ((run->kvm_dirty_regs & KVM_SYNC_BPBC) &&
test_kvm_facility(vcpu->kvm, 82)) {
vcpu->arch.sie_block->fpf &= ~FPF_BPBC;
-   vcpu->arch.sie_block->fpf |= kvm_run->s.regs.bpbc ? FPF_BPBC : 
0;
+   vcpu->arch.sie_block->fpf |= run->s.regs.bpbc ? FPF_BPBC : 0;
}
if (MACHINE_HAS_GS) {
preempt_disable();
@@ -4232,45 +4233,47 @@ static void sync_regs_fmt2(struct kvm_vcpu *vcpu, 
struct kvm_run *kvm_run)
}
if (vcpu->arch.gs_enabled) {
current->thread.gs_cb = (struct gs_cb *)
-   &vcpu->run->s.regs.gscb;
+   &run->s.regs.gscb;
restore_gs_cb(current->thread.gs_cb);
}
preempt_enable();
}
-   /* SIE will load etoken directly from SDNX and therefore kvm_run */
+   /* SIE will load etoken directly from SDNX and therefore run */
 }
 
-static void sync_regs(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+static void sync_regs(struct kvm_vcpu *vcpu)
 {
-   if 

Re: [PATCH v3 00/50] Add log level to show_stack()

2020-04-19 Thread Markus Elfring
> Changes to v3:
> - Collected more architectual Acks and Reviewed-by

* I suggest avoiding the typo in this description.

* Please separate the tag “PATCH” from the version descriptor in the subject.
  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=50cc09c18985eacbbd666acfd7be2391394733f5#n709

Regards,
Markus


[PATCH] KVM: X86: Fix compile error in svm/sev.c

2020-04-19 Thread Tianjia Zhang
The compiler reported the following compilation errors:

arch/x86/kvm/svm/sev.c: In function ‘sev_pin_memory’:
arch/x86/kvm/svm/sev.c:361:3: error: implicit declaration of function
‘release_pages’ [-Werror=implicit-function-declaration]
   release_pages(pages, npinned);
   ^

The reason is that the 'pagemap.h' header file is not included.

Signed-off-by: Tianjia Zhang 
---
 arch/x86/kvm/svm/sev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 0e3fc311d7da..3ef99e87c1db 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "x86.h"
 #include "svm.h"
-- 
2.17.1



Re: [PATCH v6,4/4] drivers: misc: new driver sram_uapi for user level SRAM access

2020-04-19 Thread 王文虎
>> A generic User-Kernel interface that allows a misc device created
>> by it to support file-operations of ioctl and mmap to access SRAM
>> memory from user level. Different kinds of SRAM allocation and free
>> APIs could be added to the available array and could be configured
>> from user level.
>
>Having a generic user level interface seems reasonable, but it would
>be helpful to list one or more particular use cases.

OK, I will use FSL_85XX_SRAM as an example to describe it.

>
>> +if SRAM_UAPI
>> +
>> +config FSL_85XX_SRAM_UAPI
>> +   bool "Freescale MPC85xx Cache-SRAM UAPI support"
>> +   depends on FSL_SOC_BOOKE && PPC32
>> +   select FSL_85XX_CACHE_SRAM
>> +   help
>> + This adds the Freescale MPC85xx Cache-SRAM memory allocation and
>> + free interfaces to the available SRAM API array, which finally 
>> could
>> + be used from user level to access the Freescale MPC85xx Cache-SRAM
>> + memory.
>
>Why do you need  a hardware specific Kconfig option here, shouldn't
>this just use the generic kernel abstraction for the sram?
>
Yes, I will add an interface for hardware drivers to register their specific
APIs instead of the definitions here.

>> +struct sram_api {
>> +   u32 type;
>> +   long (*sram_alloc)(u32 size, phys_addr_t *phys, u32 align);
>> +   void (*sram_free)(void *ptr);
>> +};
>> +
>> +struct sram_uapi {
>> +   struct list_headres_list;
>> +   struct sram_api *sa;
>> +};
>> +
>> +enum SRAM_TYPE {
>> +#ifdef FSL_85XX_CACHE_SRAM
>> +   SRAM_TYPE_FSL_85XX_CACHE_SRAM,
>> +#endif
>> +   SRAM_TYPE_MAX,
>> +};
>> +
>> +/* keep the SRAM_TYPE value the same with array index */
>> +static struct sram_api srams[] = {
>> +#ifdef FSL_85XX_CACHE_SRAM
>> +   {
>> +   .type   = SRAM_TYPE_FSL_85XX_CACHE_SRAM,
>> +   .sram_alloc = mpc85xx_cache_sram_alloc,
>> +   .sram_free  = mpc85xx_cache_sram_free,
>> +   },
>> +#endif
>> +};
>
>If there is a indeed a requirement for hardware specific functions,
>I'd say these should be registered from the hardware specific driver
>rather than the generic driver having to know about every single
>instance.

Yes, as you mentioned above, the interfaces should be registered
by hardware drivers, and I'd use a set of generic abstractions for the
definitions.

>> +static long sram_uapi_ioctl(struct file *filp, unsigned int cmd,
>> +   unsigned long arg)
>> +{
>> +   struct sram_uapi *uapi = filp->private_data;
>> +   struct sram_resource *res;
>> +   struct res_info info;
>> +   long ret = -EINVAL;
>> +   int size;
>> +   u32 type;
>> +
>> +   if (!uapi)
>> +   return ret;
>> +
>> +   switch (cmd) {
>> +   case SRAM_UAPI_IOCTL_SET_SRAM_TYPE:
>> +   size = copy_from_user((void *)&type, (const void __user 
>> *)arg,
>> + sizeof(type));
>
>This could be a simpler get_user().

Addressed.
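
For the record, a sketch of that simplification (names taken from the quoted
driver, the helper itself is hypothetical):

static long sram_uapi_set_type(struct sram_uapi *uapi, unsigned long arg)
{
	u32 type;

	/* one scalar: get_user() is enough and reports -EFAULT properly */
	if (get_user(type, (u32 __user *)arg))
		return -EFAULT;
	if (type >= SRAM_TYPE_MAX)
		return -EINVAL;

	uapi->sa = &srams[type];
	return 0;
}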

>
>> +static const struct file_operations sram_uapi_ops = {
>> +   .owner = THIS_MODULE,
>> +   .open = sram_uapi_open,
>> +   .unlocked_ioctl = sram_uapi_ioctl,
>> +   .mmap = sram_uapi_mmap,
>> +   .release = sram_uapi_release,
>> +};
>
>If you have a .unlocked_ioctl callback, there should also be a
>.compat_ioctl one. This can normally point to compat_ptr_ioctl().

Addressed

>> +
>> +static struct miscdevice sram_uapi_miscdev = {
>> +   MISC_DYNAMIC_MINOR,
>> +   "sram-uapi",
>> +   &sram_uapi_ops,
>> +};
>
>The name of the character device should not contain "uapi", that
>is kind of implied here.

Addressed

>> +
>> +#define SRAM_UAPI_IOCTL_SET_SRAM_TYPE  0
>> +#define SRAM_UAPI_IOCTL_ALLOC  1
>> +#define SRAM_UAPI_IOCTL_FREE   2
>> +
>> +struct res_info {
>> +   u32 offset;
>> +   u32 size;
>> +};
>
>This is of course not a proper ioctl interface at all, please see
>Documentation/driver-api/ioctl.rst for how to define the numbers
>in a uapi header file.
>
>The offset/size arguments should probably be 64 bit wide.

OK, I will reference ioctl.rst and make an improvement, and I think
phys_addr_t would be a better choice.
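
One caveat while reworking this: phys_addr_t is a kernel-internal type and
cannot appear in a uapi header, so fixed-width types are the usual choice
there. A sketch of _IO()-style numbers for the next revision (magic number
and struct layout are assumptions, not from the posted patch):

#include <linux/ioctl.h>
#include <linux/types.h>

struct sram_uapi_res_info {
	__u64 offset;
	__u64 size;
};

#define SRAM_UAPI_IOC_MAGIC		0xf5	/* assumed; would need an entry in ioctl-number.rst */
#define SRAM_UAPI_IOC_SET_SRAM_TYPE	_IOW(SRAM_UAPI_IOC_MAGIC, 0, __u32)
#define SRAM_UAPI_IOC_ALLOC		_IOWR(SRAM_UAPI_IOC_MAGIC, 1, struct sram_uapi_res_info)
#define SRAM_UAPI_IOC_FREE		_IOW(SRAM_UAPI_IOC_MAGIC, 2, struct sram_uapi_res_info)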

Thanks,
Wenhu




[PATCH] tools/testing/selftests/powerpc/tm: Remove duplicate headers

2020-04-19 Thread jagdsh . linux
From: Jagadeesh Pagadala 

Code cleanup: Remove duplicate headers which are included twice.

Signed-off-by: Jagadeesh Pagadala 
---
 tools/testing/selftests/powerpc/tm/tm-poison.c  | 1 -
 tools/testing/selftests/powerpc/tm/tm-vmx-unavail.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/tm/tm-poison.c 
b/tools/testing/selftests/powerpc/tm/tm-poison.c
index 9775584..f0257c6 100644
--- a/tools/testing/selftests/powerpc/tm/tm-poison.c
+++ b/tools/testing/selftests/powerpc/tm/tm-poison.c
@@ -20,7 +20,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "tm.h"
 
diff --git a/tools/testing/selftests/powerpc/tm/tm-vmx-unavail.c 
b/tools/testing/selftests/powerpc/tm/tm-vmx-unavail.c
index e2a0c07..9ef37a9 100644
--- a/tools/testing/selftests/powerpc/tm/tm-vmx-unavail.c
+++ b/tools/testing/selftests/powerpc/tm/tm-vmx-unavail.c
@@ -17,7 +17,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "tm.h"
 #include "utils.h"
-- 
1.8.3.1



[PATCH] tools/testing/selftests/powerpc/mm: Remove duplicate headers

2020-04-19 Thread jagdsh . linux
From: Jagadeesh Pagadala 

Code cleanup: Remove duplicate headers which are included twice.

Signed-off-by: Jagadeesh Pagadala 
---
 tools/testing/selftests/powerpc/mm/tlbie_test.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/mm/tlbie_test.c 
b/tools/testing/selftests/powerpc/mm/tlbie_test.c
index f85a093..48344a7 100644
--- a/tools/testing/selftests/powerpc/mm/tlbie_test.c
+++ b/tools/testing/selftests/powerpc/mm/tlbie_test.c
@@ -33,7 +33,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
1.8.3.1