Re: [PATCH 00/16] Remove hash page table slot tracking from linux PTE

2017-10-26 Thread Paul Mackerras
On Fri, Oct 27, 2017 at 10:57:13AM +0530, Aneesh Kumar K.V wrote:
> 
> 
> On 10/27/2017 10:04 AM, Paul Mackerras wrote:
> >How do we interpret these numbers?  Are they times, or speed?  Is
> >larger better or worse?
> 
> Sorry for not including the details. They are times in seconds, so smaller is
> better. The test case is a modified mmap_bench from the powerpc selftests.
> 
> >
> >Can you give us the mean and standard deviation for each set of 5
> >please?
> >
> 
> powernv without patch
> median = 51.432255
> stdev = 0.370835
> 
> with patch
> median = 50.739922
> stdev = 0.06419662
> 
> pseries without patch
> median = 116.617884
> stdev = 3.04531023
> 
> with patch no hcall
> median = 119.42494
> stdev = 0.85874552
> 
> with patch and hcall
> median = 117.735808
> stdev = 2.7624151

So on powernv, the patch set *improves* performance by about 1.3%
(almost 2 standard deviations).  Do we know why that is?

On pseries, performance is about 2.4% worse without new hcalls, but
that is less than 1 standard deviation.  With new hcalls, performance
is 0.95% worse, only a third of a standard deviation.  I think we need
to do more measurements to try to get a more accurate picture here.
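
For reference, working those figures out from the medians and standard
deviations quoted above:

  powernv:              (51.432255 - 50.739922) / 51.432255    ~= 1.35% faster, 0.69 / 0.37 ~= 1.9 stddev
  pseries, no hcall:    (119.424940 - 116.617884) / 116.617884 ~= 2.4% slower,  2.81 / 3.05 ~= 0.9 stddev
  pseries, with hcall:  (117.735808 - 116.617884) / 116.617884 ~= 0.96% slower, 1.12 / 3.05 ~= 0.37 stddev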

Were the pseries numbers done on KVM or PowerVM?  Could you do a set
of measurements on the other one too please?  (I assume the numbers
with the new hcall were done on KVM, and can't be done on PowerVM.)

Paul.


Re: [PATCH 00/16] Remove hash page table slot tracking from linux PTE

2017-10-26 Thread Aneesh Kumar K.V



On 10/27/2017 10:04 AM, Paul Mackerras wrote:

On Fri, Oct 27, 2017 at 09:38:17AM +0530, Aneesh Kumar K.V wrote:

Hi,

With hash translation mode we always tracked the hash PTE slot details in the
linux page table. This occupied space in the linux page table and also limited
our ability to support linux features that require additional PTE bits. This
series attempts to lift that limitation by not tracking the slot number in the
linux page table. We still track slot details for Transparent Hugepage
entries, because an invalidate there would otherwise require us to go through
all 256 hash PTE slots, so tracking whether a hash page table entry is valid
helps us avoid a lot of hcalls. With THP entries we don't keep the slot
details in the primary linux page table entry but in the second half of the
page table, hence tracking slot details for THP doesn't take up space in the
PTE.

Even though we don't track the slot, PAPR hcalls for removing/updating a hash
page table entry still expect the slot number. On pseries we find the slot
using the H_READ hcall with the H_READ_4 flag, which implies an additional two
hcalls in the updatepp and remove paths. The patch series also attempts to
limit that impact by adding new hcalls that remove/update a hash page table
entry using the hash value instead of the slot number.

Below are the performance numbers observed when running a workload that does
the following sequence:

for (5000 iterations) {
	mmap(128M)
	touch each of the 2048 pages
	munmap()
}

The test is run with address randomization off and swap disabled in both the
host and the guest.
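
A minimal user-space sketch of that loop (only an approximation of the
modified mmap_bench selftest, assuming 64K base pages so that 128M is 2048
pages):

#include <sys/mman.h>

#define MAP_SIZE	(128UL << 20)	/* 128M mapping */
#define BASE_PAGE	(64UL << 10)	/* 64K base pages -> 2048 pages */

int main(void)
{
	unsigned long i, off;

	for (i = 0; i < 5000; i++) {
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		/* touch every page so each one takes a hash fault */
		for (off = 0; off < MAP_SIZE; off += BASE_PAGE)
			p[off] = 1;
		munmap(p, MAP_SIZE);
	}
	return 0;
}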


|------------+----------+---------------+---------------------------+------------------------|
| iterations | platform | without patch | With series and no hcall  | With series and hcall  |
|------------+----------+---------------+---------------------------+------------------------|
|          1 | powernv  |               |                 50.818343 |                        |
|          2 | powernv  |               |                 50.744123 |                        |
|          3 | powernv  |               |                 50.721603 |                        |
|          4 | powernv  |               |                 50.739922 |                        |
|          5 | powernv  |               |                 50.638555 |                        |
|          1 | powernv  |     51.388249 |                           |                        |
|          2 | powernv  |     51.789701 |                           |                        |
|          3 | powernv  |     52.240394 |                           |                        |
|          4 | powernv  |     51.432255 |                           |                        |
|          5 | powernv  |     51.392947 |                           |                        |
|------------+----------+---------------+---------------------------+------------------------|
|          1 | pseries  |               |                           |             123.154394 |
|          2 | pseries  |               |                           |             122.253956 |
|          3 | pseries  |               |                           |             117.666344 |
|          4 | pseries  |               |                           |             117.681479 |
|          5 | pseries  |               |                           |             117.735808 |
|          1 | pseries  |               |                119.424940 |                        |
|          2 | pseries  |               |                117.663078 |                        |
|          3 | pseries  |               |                118.345584 |                        |
|          4 | pseries  |               |                119.620934 |                        |
|          5 | pseries  |               |                119.463185 |                        |
|          1 | pseries  |    122.810867 |                           |                        |
|          2 | pseries  |    115.760801 |                           |                        |
|          3 | pseries  |    115.257030 |                           |                        |
|          4 | pseries  |    116.617884 |                           |                        |
|          5 | pseries  |    117.247036 |                           |                        |
|------------+----------+---------------+---------------------------+------------------------|



How do we interpret these numbers?  Are they times, or speed?  Is
larger better or worse?


Sorry for not including the details. They are times in seconds, so smaller is
better. The test case is a modified mmap_bench from the powerpc selftests.




Can you give us the mean and standard deviation for each set of 5
please?



powernv without patch
median = 51.432255
stdev = 0.370835

with patch
median = 50.739922
stdev = 0.06419662

pseries without patch
median = 116.617884
stdev = 3.04531023

with patch no hcall
median = 119.42494
stdev = 0.85874552

with patch and hcall
median = 117.735808
stdev = 2.7624151

-aneesh





Re: [PATCH 00/16] Remove hash page table slot tracking from linux PTE

2017-10-26 Thread Paul Mackerras
On Fri, Oct 27, 2017 at 09:38:17AM +0530, Aneesh Kumar K.V wrote:
> Hi,
> 
> With hash translation mode we always tracked the hash PTE slot details in the
> linux page table. This occupied space in the linux page table and also
> limited our ability to support linux features that require additional PTE
> bits. This series attempts to lift that limitation by not tracking the slot
> number in the linux page table. We still track slot details for Transparent
> Hugepage entries, because an invalidate there would otherwise require us to
> go through all 256 hash PTE slots, so tracking whether a hash page table
> entry is valid helps us avoid a lot of hcalls. With THP entries we don't keep
> the slot details in the primary linux page table entry but in the second half
> of the page table, hence tracking slot details for THP doesn't take up space
> in the PTE.
> 
> Even though we don't track the slot, PAPR hcalls for removing/updating a hash
> page table entry still expect the slot number. On pseries we find the slot
> using the H_READ hcall with the H_READ_4 flag, which implies an additional
> two hcalls in the updatepp and remove paths. The patch series also attempts
> to limit that impact by adding new hcalls that remove/update a hash page
> table entry using the hash value instead of the slot number.
> 
> Below are the performance numbers observed when running a workload that does
> the following sequence:
> 
> for (5000 iterations) {
> 	mmap(128M)
> 	touch each of the 2048 pages
> 	munmap()
> }
> 
> The test is run with address randomization off and swap disabled in both the
> host and the guest.
> 
> 
> |------------+----------+---------------+---------------------------+------------------------|
> | iterations | platform | without patch | With series and no hcall  | With series and hcall  |
> |------------+----------+---------------+---------------------------+------------------------|
> |          1 | powernv  |               |                 50.818343 |                        |
> |          2 | powernv  |               |                 50.744123 |                        |
> |          3 | powernv  |               |                 50.721603 |                        |
> |          4 | powernv  |               |                 50.739922 |                        |
> |          5 | powernv  |               |                 50.638555 |                        |
> |          1 | powernv  |     51.388249 |                           |                        |
> |          2 | powernv  |     51.789701 |                           |                        |
> |          3 | powernv  |     52.240394 |                           |                        |
> |          4 | powernv  |     51.432255 |                           |                        |
> |          5 | powernv  |     51.392947 |                           |                        |
> |------------+----------+---------------+---------------------------+------------------------|
> |          1 | pseries  |               |                           |             123.154394 |
> |          2 | pseries  |               |                           |             122.253956 |
> |          3 | pseries  |               |                           |             117.666344 |
> |          4 | pseries  |               |                           |             117.681479 |
> |          5 | pseries  |               |                           |             117.735808 |
> |          1 | pseries  |               |                119.424940 |                        |
> |          2 | pseries  |               |                117.663078 |                        |
> |          3 | pseries  |               |                118.345584 |                        |
> |          4 | pseries  |               |                119.620934 |                        |
> |          5 | pseries  |               |                119.463185 |                        |
> |          1 | pseries  |    122.810867 |                           |                        |
> |          2 | pseries  |    115.760801 |                           |                        |
> |          3 | pseries  |    115.257030 |                           |                        |
> |          4 | pseries  |    116.617884 |                           |                        |
> |          5 | pseries  |    117.247036 |                           |                        |
> |------------+----------+---------------+---------------------------+------------------------|
>

How do we interpret these numbers?  Are they times, or speed?  Is
larger better or worse?

Can you give us the mean and standard deviation for each set of 5
please?

Paul.


[PATCH 14/16] powerpc/mm/pseries: Use HASH_REMOVE hcall in guest

2017-10-26 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/lpar.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index cd5cf5bd53f1..41512aaa8c8e 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -454,6 +454,27 @@ static void pSeries_lpar_hpte_invalidate(unsigned long 
slot, unsigned long vpn,
BUG_ON(lpar_rc != H_SUCCESS);
 }
 
+static void __pseries_lpar_hash_invalidate(unsigned long hash, unsigned long 
vpn,
+  int psize, int apsize,
+  int ssize, int local)
+{
+   unsigned long want_v;
+   unsigned long lpar_rc;
+   unsigned long dummy1, dummy2;
+
+   pr_devel("inval : hash=%lx, vpn=%016lx, psize: %d, local: %d\n",
+hash, vpn, psize, local);
+
+   want_v = hpte_encode_avpn(vpn, psize, ssize);
+   lpar_rc = plpar_pte_hash_remove(H_AVPN, hash, want_v, &dummy1, &dummy2);
+   if (lpar_rc == H_NOT_FOUND)
+   return;
+
+   BUG_ON(lpar_rc != H_SUCCESS);
+
+}
+
+
 static void pSeries_lpar_hash_invalidate(unsigned long hash, unsigned long vpn,
 int psize, int apsize,
 int ssize, int local)
@@ -466,6 +487,9 @@ static void pSeries_lpar_hash_invalidate(unsigned long 
hash, unsigned long vpn,
pr_devel("inval : hash=%lx, vpn=%016lx, psize: %d, local: %d\n",
 hash, vpn, psize, local);
 
+   if (firmware_has_feature(FW_FEATURE_HASH_API))
+   return __pseries_lpar_hash_invalidate(hash, vpn, psize,
+ apsize, ssize, local);
want_v = hpte_encode_avpn(vpn, psize, ssize);
slot = __pSeries_lpar_hpte_find(hash, want_v);
if (slot < 0)
-- 
2.13.6



[PATCH 16/16] powerpc/mm/pseries: Use HASH_BULK_REMOVE hcall in guest

2017-10-26 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/lpar.c | 40 ---
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 4ea9224cbeb6..6dffdf654a28 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -684,19 +684,43 @@ static void pSeries_lpar_flush_hash_range(unsigned long 
number, int local)
for (i = 0; i < number; i++) {
vpn = batch->vpn[i];
pte_iterate_hashed_subpages(vpn, psize, index, shift) {
-   long slot;
-
-   slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
-   if (slot < 0)
-   continue;
-   pix = plpar_bluk_remove(param, pix, slot, vpn,
-   psize, ssize, local);
+   if (!firmware_has_feature(FW_FEATURE_HASH_API)) {
+   long slot;
+
+   slot = pSeries_lpar_hpte_find(vpn, psize, 
ssize);
+   if (slot < 0)
+   continue;
+   pix = plpar_bluk_remove(param, pix, slot, vpn,
+ psize, ssize, local);
+   } else {
+   unsigned long hash;
+   hash = hpt_hash(vpn, 
mmu_psize_defs[psize].shift, ssize);
+   /* trim the top bits, we overload them below */
+   hash &= MAX_HTAB_MASK;
+   param[pix] = HBR_REQUEST | HBR_AVPN | hash;
+   param[pix+1] = hpte_encode_avpn(vpn, psize, 
ssize);
+   pix += 2;
+   if (pix == 8) {
+   rc = plpar_hcall9(H_HASH_BULK_REMOVE, 
param,
+ param[0], param[1], 
param[2],
+ param[3], param[4], 
param[5],
+ param[6], param[7]);
+   BUG_ON(rc != H_SUCCESS);
+   pix = 0;
+   }
+   }
} pte_iterate_hashed_end();
}
if (pix) {
+   unsigned long hcall;
+
/* We have a flush pending */
param[pix] = HBR_END;
-   rc = plpar_hcall9(H_BULK_REMOVE, param, param[0], param[1],
+   if (!firmware_has_feature(FW_FEATURE_HASH_API))
+   hcall = H_BULK_REMOVE;
+   else
+   hcall = H_HASH_BULK_REMOVE;
+   rc = plpar_hcall9(hcall, param, param[0], param[1],
  param[2], param[3], param[4], param[5],
  param[6], param[7]);
BUG_ON(rc != H_SUCCESS);
-- 
2.13.6



[PATCH 15/16] powerpc/mm/pseries: Move slot based bulk remove to helper

2017-10-26 Thread Aneesh Kumar K.V
---
 arch/powerpc/platforms/pseries/lpar.c | 51 +--
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 41512aaa8c8e..4ea9224cbeb6 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -632,6 +632,34 @@ static int pSeries_lpar_hpte_removebolted(unsigned long ea,
return 0;
 }
 
+static int plpar_bluk_remove(unsigned long *param, int index, unsigned long 
slot,
+unsigned long vpn, unsigned long psize,
+unsigned long ssize, int local)
+{
+   unsigned long rc;
+   if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
+   /*
+* lpar doesn't use the passed actual page size
+*/
+   pSeries_lpar_hpte_invalidate(slot, vpn, psize,
+0, ssize, local);
+   } else {
+   param[index] = HBR_REQUEST | HBR_AVPN | slot;
+   param[index+1] = hpte_encode_avpn(vpn, psize,
+   ssize);
+   index += 2;
+   if (index == 8) {
+   rc = plpar_hcall9(H_BULK_REMOVE, param,
+ param[0], param[1], param[2],
+ param[3], param[4], param[5],
+ param[6], param[7]);
+   BUG_ON(rc != H_SUCCESS);
+   index = 0;
+   }
+   }
+   return index;
+}
+
 /*
  * Take a spinlock around flushes to avoid bouncing the hypervisor tlbie
  * lock.
@@ -661,29 +689,12 @@ static void pSeries_lpar_flush_hash_range(unsigned long 
number, int local)
slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
if (slot < 0)
continue;
-   if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
-   /*
-* lpar doesn't use the passed actual page size
-*/
-   pSeries_lpar_hpte_invalidate(slot, vpn, psize,
-0, ssize, local);
-   } else {
-   param[pix] = HBR_REQUEST | HBR_AVPN | slot;
-   param[pix+1] = hpte_encode_avpn(vpn, psize,
-   ssize);
-   pix += 2;
-   if (pix == 8) {
-   rc = plpar_hcall9(H_BULK_REMOVE, param,
-   param[0], param[1], param[2],
-   param[3], param[4], param[5],
-   param[6], param[7]);
-   BUG_ON(rc != H_SUCCESS);
-   pix = 0;
-   }
-   }
+   pix = plpar_bluk_remove(param, pix, slot, vpn,
+   psize, ssize, local);
} pte_iterate_hashed_end();
}
if (pix) {
+   /* We have a flush pending */
param[pix] = HBR_END;
rc = plpar_hcall9(H_BULK_REMOVE, param, param[0], param[1],
  param[2], param[3], param[4], param[5],
-- 
2.13.6



[PATCH 09/16] powerpc/mm: Add new firmware feature HASH API

2017-10-26 Thread Aneesh Kumar K.V
We will use this feature to check whether the hypervisor implements the
hash-based remove and protect hcalls.
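
Later patches in the series gate the new code paths on this feature bit; the
protect path in patch 13, for example, ends up looking roughly like this
(sketch only, see that patch for the exact flow):

	if (firmware_has_feature(FW_FEATURE_HASH_API)) {
		/* new hcall: pass the hash value, no slot lookup needed */
		lpar_rc = plpar_pte_hash_protect(flags, hash, want_v);
	} else {
		/* old path: find the slot via H_READ_4, then H_PROTECT */
		slot = __pSeries_lpar_hpte_find(hash, want_v);
		if (slot < 0)
			return -1;
		lpar_rc = plpar_pte_protect(flags, slot, want_v);
	}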

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/firmware.h   | 3 ++-
 arch/powerpc/kvm/powerpc.c| 4 
 arch/powerpc/platforms/pseries/firmware.c | 1 +
 include/uapi/linux/kvm.h  | 1 +
 4 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 8645897472b1..152d704ac3c3 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -51,6 +51,7 @@
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
 #define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRN		ASM_CONST(0x0002)
+#define FW_FEATURE_HASH_API	ASM_CONST(0x0004)
 
 #ifndef __ASSEMBLY__
 
@@ -67,7 +68,7 @@ enum {
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
-   FW_FEATURE_HPT_RESIZE,
+   FW_FEATURE_HPT_RESIZE | FW_FEATURE_HASH_API,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 3480faaf1ef8..6fb91198dc90 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -637,6 +637,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
/* Disable this on POWER9 until code handles new HPTE format */
r = !!hv_enabled && !cpu_has_feature(CPU_FTR_ARCH_300);
break;
+   case KVM_CAP_SPAPR_HASH_API:
+   /* Only enable for HV kvm */
+   r = is_kvmppc_hv_enabled(kvm);
+   break;
 #endif
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
case KVM_CAP_PPC_FWNMI:
diff --git a/arch/powerpc/platforms/pseries/firmware.c 
b/arch/powerpc/platforms/pseries/firmware.c
index 63cc82ad58ac..32081d4406e8 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -65,6 +65,7 @@ hypertas_fw_features_table[] = {
{FW_FEATURE_SET_MODE,   "hcall-set-mode"},
{FW_FEATURE_BEST_ENERGY,"hcall-best-energy-1*"},
{FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"},
+   {FW_FEATURE_HASH_API,   "hcall-hash-api"},
 };
 
 /* Build up the firmware features bitmask using the contents of
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 838887587411..780433b1f179 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -930,6 +930,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SMT_POSSIBLE 147
 #define KVM_CAP_HYPERV_SYNIC2 148
 #define KVM_CAP_HYPERV_VP_INDEX 149
+#define KVM_CAP_SPAPR_HASH_API 150
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.13.6



[PATCH 13/16] powerpc/mm/pseries: Use HASH_PROTECT hcall in guest

2017-10-26 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/lpar.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 52d2e3038c05..cd5cf5bd53f1 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -388,15 +388,20 @@ static long pSeries_lpar_hash_updatepp(unsigned long hash,
unsigned long want_v;
 
want_v = hpte_encode_avpn(vpn, psize, ssize);
-
pr_devel("update: avpnv=%016lx, hash=%016lx, f=%lx, psize: %d ...",
 want_v, hash, flags, psize);
 
+   if (firmware_has_feature(FW_FEATURE_HASH_API)) {
+   lpar_rc = plpar_pte_hash_protect(flags, hash, want_v);
+   goto err_out;
+   }
slot = __pSeries_lpar_hpte_find(hash, want_v);
if (slot < 0)
return -1;
 
lpar_rc = plpar_pte_protect(flags, slot, want_v);
+
+err_out:
if (lpar_rc == H_NOT_FOUND) {
pr_devel("not found !\n");
return -1;
-- 
2.13.6



[PATCH 12/16] powerpc/kvm/hash: Implement HASH_BULK_REMOVE hcall

2017-10-26 Thread Aneesh Kumar K.V
This is equivalent to the H_BULK_REMOVE hcall, but takes a hash value as the
argument instead of the hash PTE slot number. We will use this later to speed
up the bulk remove operation in the guest: instead of finding the slot number
with the H_READ_4 hcall, we can use the hash value directly with this hcall.

Only the H_AVPN operation flag is supported.
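
For reference, the per-request parameter layout, as decoded by
kvmppc_h_hash_bulk_remove() below, mirrors H_BULK_REMOVE with the hash value
taking the place of the PTE index (flag names as in the existing bulk-remove
code):

	/*
	 * Each request is a pair of doublewords in the 8-entry buffer:
	 *   dword 0, bits 63:62 - request type (1 = HBR_REQUEST, 3 = end of list)
	 *   dword 0, bits 57:56 - flags; only HBR_AVPN (2) is accepted
	 *   dword 0, bits 55:0  - hash value (instead of the HPT slot number)
	 *   dword 1             - AVPN, i.e. hpte_encode_avpn(vpn, psize, ssize)
	 */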

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +
 arch/powerpc/include/asm/hvcall.h |  3 +-
 arch/powerpc/kvm/book3s_hv.c  |  1 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 95 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  1 +
 5 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 8b1d924a2f85..c24157fa200c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -68,6 +68,8 @@
  */
 
 #define HPTES_PER_GROUP 8
+/* ISA defines max HTAB SIZE bits 46 */
+#define MAX_HTAB_MASK ((1UL << 46) - 1)
 
 #define HPTE_V_SSIZE_SHIFT 62
 #define HPTE_V_AVPN_SHIFT  7
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 725d4fadec82..c4feb950dd9f 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -293,7 +293,8 @@
 #define H_INT_RESET 0x3D0
 #define H_HASH_REMOVE  0x3D4
 #define H_HASH_PROTECT 0x3D8
-#define MAX_HCALL_OPCODE   H_HASH_PROTECT
+#define H_HASH_BULK_REMOVE 0x3DC
+#define MAX_HCALL_OPCODE   H_HASH_BULK_REMOVE
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 822e91ba1dbe..9c6db0cb8a1c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4173,6 +4173,7 @@ static unsigned int default_hcall_list[] = {
 #endif
H_HASH_PROTECT,
H_HASH_REMOVE,
+   H_HASH_BULK_REMOVE,
0
 };
 
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index d6782fab2584..24668e499a01 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -752,6 +752,101 @@ long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
return ret;
 }
 
+long kvmppc_h_hash_bulk_remove(struct kvm_vcpu *vcpu)
+{
+   struct kvm *kvm = vcpu->kvm;
+   unsigned long *args = &vcpu->arch.gpr[4];
+   __be64 *hp, *hptes[4];
+   unsigned long tlbrb[4];
+   long int i, j, k, n, pte_index[4];
+   unsigned long flags, req, hash, rcbits;
+   int global;
+   long int ret = H_SUCCESS;
+   struct revmap_entry *rev, *revs[4];
+   u64 hp0, hp1;
+
+   if (kvm_is_radix(kvm))
+   return H_FUNCTION;
+
+   global = global_invalidates(kvm);
+   for (i = 0; i < 4 && ret == H_SUCCESS; ) {
+   n = 0;
+   for (; i < 4; ++i) {
+   j = i * 2;
+   hash = args[j];
+   flags = hash >> 56;
+   hash &= ((1ul << 56) - 1);
+   req = flags >> 6;
+   flags &= 3;
+   if (req == 3) { /* no more requests */
+   i = 4;
+   break;
+   }
+   /* only support avpn flag */
+   if (req != 1 || flags != 2) {
+   /* parameter error */
+   args[j] = ((0xa0 | flags) << 56) + hash;
+   ret = H_PARAMETER;
+   break;
+   }
+   /*
+* We wait here to take lock for all hash values
+* FIXME!! will that deadlock ?
+*/
+   hp = kvmppc_find_hpte_slot(kvm, hash,
+  args[j + 1], &pte_index[n]);
+   if (!hp) {
+   args[j] = ((0x90 | flags) << 56) + hash;
+   continue;
+   }
+   hp0 = be64_to_cpu(hp[0]);
+   hp1 = be64_to_cpu(hp[1]);
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   hp0 = hpte_new_to_old_v(hp0, hp1);
+   hp1 = hpte_new_to_old_r(hp1);
+   }
+   args[j] = ((0x80 | flags) << 56) + hash;
+   rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index[n]]);
+   note_hpte_modification(kvm, rev);
+
+   if (!(hp0 & HPTE_V_VALID)) {
+   /* insert R and C bits from PTE */
+   rcbits = rev->guest_rpte & 

[PATCH 11/16] powerpc/kvm/hash: Implement HASH_PROTECT hcall

2017-10-26 Thread Aneesh Kumar K.V
This is equivalent to the H_PROTECT hcall, but takes a hash value as the
argument instead of the hash PTE slot number. We will use this later to speed
up the invalidate operation in the guest: instead of finding the slot number
with the H_READ_4 hcall, we can use the hash value directly with this hcall.

The H_AVPN flag is required; otherwise the hcall returns an error.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/hvcall.h |  3 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  7 +++
 arch/powerpc/kvm/book3s_hv.c  |  1 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 74 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  1 +
 5 files changed, 63 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 92980217a076..725d4fadec82 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -292,7 +292,8 @@
 #define H_INT_SYNC  0x3CC
 #define H_INT_RESET 0x3D0
 #define H_HASH_REMOVE  0x3D4
-#define MAX_HCALL_OPCODE   H_HASH_REMOVE
+#define H_HASH_PROTECT 0x3D8
+#define MAX_HCALL_OPCODE   H_HASH_PROTECT
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 8160fea9b5bc..27e30ca6105d 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -226,6 +226,13 @@ static inline long plpar_pte_protect(unsigned long flags, 
unsigned long ptex,
return plpar_hcall_norets(H_PROTECT, flags, ptex, avpn);
 }
 
+static inline long plpar_pte_hash_protect(unsigned long flags,
+ unsigned long hash,
+ unsigned long avpn)
+{
+   return plpar_hcall_norets(H_HASH_PROTECT, flags, hash, avpn);
+}
+
 static inline long plpar_resize_hpt_prepare(unsigned long flags,
unsigned long shift)
 {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 56e7f52ed324..822e91ba1dbe 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4171,6 +4171,7 @@ static unsigned int default_hcall_list[] = {
H_XIRR,
H_XIRR_X,
 #endif
+   H_HASH_PROTECT,
H_HASH_REMOVE,
0
 };
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 7ebeb1be8380..d6782fab2584 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -752,33 +752,14 @@ long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
return ret;
 }
 
-long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
- unsigned long pte_index, unsigned long avpn,
- unsigned long va)
+long __kvmppc_do_hash_protect(struct kvm *kvm, __be64 *hpte,
+ unsigned long flags, unsigned long pte_index)
 {
-   struct kvm *kvm = vcpu->kvm;
-   __be64 *hpte;
+   u64 pte_v, pte_r;
struct revmap_entry *rev;
unsigned long v, r, rb, mask, bits;
-   u64 pte_v, pte_r;
-
-   if (kvm_is_radix(kvm))
-   return H_FUNCTION;
-   if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
-   return H_PARAMETER;
 
-   hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
-   while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
-   cpu_relax();
v = pte_v = be64_to_cpu(hpte[0]);
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
-   v = hpte_new_to_old_v(v, be64_to_cpu(hpte[1]));
-   if ((v & (HPTE_V_ABSENT | HPTE_V_VALID)) == 0 ||
-   ((flags & H_AVPN) && (v & ~0x7fUL) != avpn)) {
-   __unlock_hpte(hpte, pte_v);
-   return H_NOT_FOUND;
-   }
-
pte_r = be64_to_cpu(hpte[1]);
bits = (flags << 55) & HPTE_R_PP0;
bits |= (flags << 48) & HPTE_R_KEY_HI;
@@ -823,6 +804,55 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long 
flags,
return H_SUCCESS;
 }
 
+long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
+ unsigned long pte_index, unsigned long avpn,
+ unsigned long va)
+{
+   __be64 *hpte;
+   u64 v, pte_v;
+   struct kvm *kvm = vcpu->kvm;
+
+   if (kvm_is_radix(kvm))
+   return H_FUNCTION;
+   if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
+   return H_PARAMETER;
+
+   hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
+   while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
+   cpu_relax();
+   v = pte_v = be64_to_cpu(hpte[0]);
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   v = hpte_new_to_old_v(v, be64_to_cpu(hpte[1]));
+   if ((v & (HPTE_V_ABSENT | HPTE_V_VALID)) == 0 ||
+   ((flags & H_AVPN) && (v & ~0x7fUL) != avpn)) {
+   

[PATCH 10/16] powerpc/kvm/hash: Implement HASH_REMOVE hcall

2017-10-26 Thread Aneesh Kumar K.V
This is equivalent to the H_REMOVE hcall, but takes a hash value as the
argument instead of the hash PTE slot number. We will use this later to speed
up the invalidate operation in the guest: instead of finding the slot number
with the H_READ_4 hcall, we can use the hash value directly with this hcall.

The only supported flag for the operation is H_AVPN.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/hvcall.h |   3 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  16 
 arch/powerpc/kvm/book3s_hv.c  |   1 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   | 134 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |   2 +
 5 files changed, 138 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 3d34dc0869f6..92980217a076 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -291,7 +291,8 @@
 #define H_INT_ESB   0x3C8
 #define H_INT_SYNC  0x3CC
 #define H_INT_RESET 0x3D0
-#define MAX_HCALL_OPCODE   H_INT_RESET
+#define H_HASH_REMOVE  0x3D4
+#define MAX_HCALL_OPCODE   H_HASH_REMOVE
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index c7b164836bc3..8160fea9b5bc 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -124,6 +124,22 @@ static inline long plpar_pte_remove(unsigned long flags, 
unsigned long ptex,
return rc;
 }
 
+static inline long plpar_pte_hash_remove(unsigned long flags, unsigned long 
hash,
+   unsigned long avpn, unsigned long 
*old_pteh_ret,
+   unsigned long *old_ptel_ret)
+{
+   long rc;
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   rc = plpar_hcall(H_HASH_REMOVE, retbuf, flags, hash, avpn);
+
+   *old_pteh_ret = retbuf[0];
+   *old_ptel_ret = retbuf[1];
+
+   return rc;
+}
+
+
 /* plpar_pte_remove_raw can be called in real mode. It calls plpar_hcall_raw */
 static inline long plpar_pte_remove_raw(unsigned long flags, unsigned long 
ptex,
unsigned long avpn, unsigned long *old_pteh_ret,
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 73bf1ebfa78f..56e7f52ed324 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4171,6 +4171,7 @@ static unsigned int default_hcall_list[] = {
H_XIRR,
H_XIRR_X,
 #endif
+   H_HASH_REMOVE,
0
 };
 
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index d80240ba6de4..7ebeb1be8380 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -465,34 +465,21 @@ static void do_tlbies(struct kvm *kvm, unsigned long 
*rbvalues,
}
 }
 
-long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
-   unsigned long pte_index, unsigned long avpn,
-   unsigned long *hpret)
+static long __kvmppc_do_hash_remove(struct kvm *kvm, __be64 *hpte,
+   unsigned long pte_index,
+   unsigned long *hpret)
 {
-   __be64 *hpte;
+
unsigned long v, r, rb;
struct revmap_entry *rev;
u64 pte, orig_pte, pte_r;
 
-   if (kvm_is_radix(kvm))
-   return H_FUNCTION;
-   if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
-   return H_PARAMETER;
-   hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
-   while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
-   cpu_relax();
pte = orig_pte = be64_to_cpu(hpte[0]);
pte_r = be64_to_cpu(hpte[1]);
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
pte = hpte_new_to_old_v(pte, pte_r);
pte_r = hpte_new_to_old_r(pte_r);
}
-   if ((pte & (HPTE_V_ABSENT | HPTE_V_VALID)) == 0 ||
-   ((flags & H_AVPN) && (pte & ~0x7fUL) != avpn) ||
-   ((flags & H_ANDCOND) && (pte & avpn) != 0)) {
-   __unlock_hpte(hpte, orig_pte);
-   return H_NOT_FOUND;
-   }
 
	rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index]);
v = pte & ~HPTE_V_HVLOCK;
@@ -525,6 +512,35 @@ long kvmppc_do_h_remove(struct kvm *kvm, unsigned long 
flags,
hpret[1] = r;
return H_SUCCESS;
 }
+
+long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
+   unsigned long pte_index, unsigned long avpn,
+   unsigned long *hpret)
+{
+   __be64 *hpte;
+   u64 pte, orig_pte, pte_r;
+
+   if (kvm_is_radix(kvm))
+   return H_FUNCTION;
+   if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
+   return H_PARAMETER;
+   hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
+   

[PATCH 08/16] powerpc/mm/hash: Don't track hash pte slot number in linux page table.

2017-10-26 Thread Aneesh Kumar K.V
Now that we have updated all MMU hash operations to work with the hash value
instead of the slot, remove slot tracking completely. We also remove real_pte,
because without slot tracking the 4K, 64K and 64K-subpage cases all have a
similar PTE format.

One side effect of this is that we no longer track whether we have taken a
fault on a 4K subpage of a 64K page config, which means an invalidate will try
to invalidate all the 4K subpages.

To minimize the impact of this, THP still tracks the slot details. With THP we
have 4096 subpages and we want to avoid calling invalidate on all of them. For
THP we don't track slot details as part of the linux page table; they are
tracked in the deposited page table.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h   | 16 +++-
 arch/powerpc/include/asm/book3s/64/hash-64k.h  | 44 +-
 arch/powerpc/include/asm/book3s/64/hash.h  |  5 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h   | 26 --
 arch/powerpc/include/asm/book3s/64/tlbflush-hash.h |  3 +-
 arch/powerpc/include/asm/pgtable-be-types.h| 10 ---
 arch/powerpc/include/asm/pgtable-types.h   |  9 ---
 arch/powerpc/mm/dump_linuxpagetables.c | 10 ---
 arch/powerpc/mm/hash64_4k.c|  2 -
 arch/powerpc/mm/hash64_64k.c   | 93 +-
 arch/powerpc/mm/hash_native_64.c   | 12 +--
 arch/powerpc/mm/hash_utils_64.c| 22 +
 arch/powerpc/mm/hugetlbpage-hash64.c   |  4 -
 arch/powerpc/mm/tlb_hash64.c   |  9 +--
 arch/powerpc/platforms/pseries/lpar.c  |  4 +-
 15 files changed, 49 insertions(+), 220 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 0c4e470571ca..d65dcb5826ff 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -17,8 +17,7 @@
 #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
 /* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
-H_PAGE_F_SECOND | H_PAGE_F_GIX)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
 /*
  * Not supported by 4k linux page size
  */
@@ -27,6 +26,19 @@
 #define H_PAGE_COMBO   0x0
 #define H_PTE_FRAG_NR  0
 #define H_PTE_FRAG_SIZE_SHIFT  0
+
+#define pte_iterate_hashed_subpages(vpn, psize, index, shift)  \
+   do {\
+   index = 0;  \
+   shift = mmu_psize_defs[psize].shift;\
+
+#define pte_iterate_hashed_end() } while(0)
+/*
+ * We expect this to be called only for user addresses or kernel virtual
+ * addresses other than the linear mapping.
+ */
+#define pte_pagesize_index(mm, addr, pte)  MMU_PAGE_4K
+
 /*
  * On all 4K setups, remap_4k_pfn() equates to remap_pfn_range()
  */
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837aaae8..ab36323b8a3e 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -25,8 +25,7 @@
 #define H_PAGE_COMBO_VALID (H_PAGE_F_GIX | H_PAGE_F_SECOND)
 
 /* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
-H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)
 /*
  * we support 16 fragments per PTE page of 64K size.
  */
@@ -40,55 +39,16 @@
 
 #ifndef __ASSEMBLY__
 #include 
-
-/*
- * With 64K pages on hash table, we have a special PTE format that
- * uses a second "half" of the page table to encode sub-page information
- * in order to deal with 64K made of 4K HW pages. Thus we override the
- * generic accessors and iterators here
- */
-#define __real_pte __real_pte
-static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
-{
-   real_pte_t rpte;
-   unsigned long *hidxp;
-
-   rpte.pte = pte;
-   rpte.hidx = 0;
-   if (pte_val(pte) & H_PAGE_COMBO) {
-   /*
-* Make sure we order the hidx load against the H_PAGE_COMBO
-* check. The store side ordering is done in __hash_page_4K
-*/
-   smp_rmb();
-   hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-   rpte.hidx = *hidxp;
-   }
-   return rpte;
-}
-
-static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long 
index)
-{
-   if ((pte_val(rpte.pte) & H_PAGE_COMBO))
-   return (rpte.hidx >> (index<<2)) & 0xf;
-   return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
-}
-
-#define __rpte_to_pte(r)   ((r).pte)
-extern bool __rpte_sub_valid(real_pte_t rpte, unsigned long index);
 /*
  * Trick: 

[PATCH 07/16] powerpc/mm: Add hash updatepp callback

2017-10-26 Thread Aneesh Kumar K.V
Add hash based updatepp callback and use that during hash pte fault.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  6 +
 arch/powerpc/mm/hash64_4k.c   |  7 +
 arch/powerpc/mm/hash64_64k.c  | 19 +++---
 arch/powerpc/mm/hash_native_64.c  | 37 +++
 arch/powerpc/mm/hugetlbpage-hash64.c  |  9 ++-
 arch/powerpc/platforms/ps3/htab.c | 29 +
 arch/powerpc/platforms/pseries/lpar.c | 31 ++
 7 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 79f141e721ee..8b1d924a2f85 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -145,6 +145,12 @@ struct mmu_hash_ops {
 unsigned long vpn,
 int bpsize, int apsize,
 int ssize, unsigned long flags);
+   long(*hash_updatepp)(unsigned long hash,
+unsigned long newpp,
+unsigned long vpn,
+int bpsize, int apsize,
+int ssize, unsigned long flags);
+
void(*hpte_updateboltedpp)(unsigned long newpp,
   unsigned long ea,
   int psize, int ssize);
diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 975793de0914..afb79100f0ce 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -65,12 +65,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * There MIGHT be an HPTE for this pte
 */
hash = hpt_hash(vpn, shift, ssize);
-   if (old_pte & H_PAGE_F_SECOND)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
-   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
+   if (mmu_hash_ops.hash_updatepp(hash, rflags, vpn, MMU_PAGE_4K,
   MMU_PAGE_4K, ssize, flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
}
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index f1eb538721fc..096fdfaf6f1c 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -53,7 +53,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
unsigned long *hidxp;
unsigned long hpte_group;
unsigned int subpg_index;
-   unsigned long rflags, pa, hidx;
+   unsigned long rflags, pa;
unsigned long old_pte, new_pte, subpg_pte;
unsigned long vpn, hash, slot;
unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
@@ -127,17 +127,11 @@ int __hash_page_4K(unsigned long ea, unsigned long 
access, unsigned long vsid,
int ret;
 
hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(rpte, subpg_index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-
-   ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
+   ret = mmu_hash_ops.hash_updatepp(hash, rflags, vpn,
 MMU_PAGE_4K, MMU_PAGE_4K,
 ssize, flags);
/*
-*if we failed because typically the HPTE wasn't really here
+* if we failed because typically the HPTE wasn't really here
 * we try an insertion.
 */
if (ret == -1)
@@ -268,12 +262,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
 * There MIGHT be an HPTE for this pte
 */
hash = hpt_hash(vpn, shift, ssize);
-   if (old_pte & H_PAGE_F_SECOND)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
-   if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
+   if (mmu_hash_ops.hash_updatepp(hash, rflags, vpn, MMU_PAGE_64K,
   MMU_PAGE_64K, ssize,
   flags) == -1)
old_pte &= ~_PAGE_HPTEFLAGS;
diff --git 

[PATCH 06/16] powerpc/mm: Switch flush_hash_range to not use slot

2017-10-26 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_native_64.c  | 28 
 arch/powerpc/platforms/pseries/lpar.c | 17 -
 2 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index f473a78baab7..8e2e6b92aa27 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -707,10 +707,8 @@ static void native_hpte_clear(void)
 static void native_flush_hash_range(unsigned long number, int local)
 {
unsigned long vpn;
-   unsigned long hash, index, hidx, shift, slot;
+   unsigned long hash, index, shift;
struct hash_pte *hptep;
-   unsigned long hpte_v;
-   unsigned long want_v;
unsigned long flags;
real_pte_t pte;
	struct ppc64_tlb_batch *batch = this_cpu_ptr(&ppc64_tlb_batch);
@@ -730,23 +728,13 @@ static void native_flush_hash_range(unsigned long number, 
int local)
 
pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(pte, index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-   hptep = htab_address + slot;
-   want_v = hpte_encode_avpn(vpn, psize, ssize);
-   native_lock_hpte(hptep);
-   hpte_v = be64_to_cpu(hptep->v);
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
-   hpte_v = hpte_new_to_old_v(hpte_v,
-   be64_to_cpu(hptep->r));
-   if (!HPTE_V_COMPARE(hpte_v, want_v) ||
-   !(hpte_v & HPTE_V_VALID))
-   native_unlock_hpte(hptep);
-   else
-   hptep->v = 0;
+   hptep = native_hpte_find(hash, vpn, psize, ssize);
+   if (!hptep)
+   continue;
+   /*
+* Invalidate the hpte. NOTE: this also unlocks it
+*/
+   hptep->v = 0;
} pte_iterate_hashed_end();
}
 
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index e366252e0e93..d32469e40bbc 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -580,14 +580,14 @@ static int pSeries_lpar_hpte_removebolted(unsigned long 
ea,
 static void pSeries_lpar_flush_hash_range(unsigned long number, int local)
 {
unsigned long vpn;
-   unsigned long i, pix, rc;
+   unsigned long i, rc;
unsigned long flags = 0;
	struct ppc64_tlb_batch *batch = this_cpu_ptr(&ppc64_tlb_batch);
int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
unsigned long param[PLPAR_HCALL9_BUFSIZE];
-   unsigned long hash, index, shift, hidx, slot;
+   unsigned long index, shift;
real_pte_t pte;
-   int psize, ssize;
+   int psize, ssize, pix;
 
if (lock_tlbie)
	spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags);
@@ -599,12 +599,11 @@ static void pSeries_lpar_flush_hash_range(unsigned long 
number, int local)
vpn = batch->vpn[i];
pte = batch->pte[i];
pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
-   hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(pte, index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
+   long slot;
+
+   slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
+   if (slot < 0)
+   continue;
if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
/*
 * lpar doesn't use the passed actual page size
-- 
2.13.6



[PATCH 05/16] powerpc/mm: use hash_invalidate for __kernel_map_pages()

2017-10-26 Thread Aneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_utils_64.c | 32 +---
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index b197fe57547e..8635b241e2d5 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -119,11 +119,6 @@ EXPORT_SYMBOL_GPL(mmu_slb_size);
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;
 #endif
-#ifdef CONFIG_DEBUG_PAGEALLOC
-static u8 *linear_map_hash_slots;
-static unsigned long linear_map_hash_count;
-static DEFINE_SPINLOCK(linear_map_hash_lock);
-#endif /* CONFIG_DEBUG_PAGEALLOC */
 struct mmu_hash_ops mmu_hash_ops;
 EXPORT_SYMBOL(mmu_hash_ops);
 
@@ -1744,7 +1739,7 @@ long hpte_insert_repeating(unsigned long hash, unsigned 
long vpn,
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
+static void kernel_map_linear_page(unsigned long vaddr)
 {
unsigned long hash;
unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
@@ -1761,12 +1756,7 @@ static void kernel_map_linear_page(unsigned long vaddr, 
unsigned long lmi)
ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
HPTE_V_BOLTED,
mmu_linear_psize, mmu_kernel_ssize);
-
BUG_ON (ret < 0);
-   spin_lock(&linear_map_hash_lock);
-   BUG_ON(linear_map_hash_slots[lmi] & 0x80);
-   linear_map_hash_slots[lmi] = ret | 0x80;
-   spin_unlock(&linear_map_hash_lock);
 }
 
 static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
@@ -1776,35 +1766,23 @@ static void kernel_unmap_linear_page(unsigned long 
vaddr, unsigned long lmi)
unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
 
hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
-   spin_lock(&linear_map_hash_lock);
-   BUG_ON(!(linear_map_hash_slots[lmi] & 0x80));
-   hidx = linear_map_hash_slots[lmi] & 0x7f;
-   linear_map_hash_slots[lmi] = 0;
-   spin_unlock(&linear_map_hash_lock);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-   mmu_hash_ops.hpte_invalidate(slot, vpn, mmu_linear_psize,
+   mmu_hash_ops.hash_invalidate(hash, vpn, mmu_linear_psize,
 mmu_linear_psize,
 mmu_kernel_ssize, 0);
 }
 
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-   unsigned long flags, vaddr, lmi;
+   unsigned long flags, vaddr;
int i;
 
local_irq_save(flags);
for (i = 0; i < numpages; i++, page++) {
vaddr = (unsigned long)page_address(page);
-   lmi = __pa(vaddr) >> PAGE_SHIFT;
-   if (lmi >= linear_map_hash_count)
-   continue;
if (enable)
-   kernel_map_linear_page(vaddr, lmi);
+   kernel_map_linear_page(vaddr);
else
-   kernel_unmap_linear_page(vaddr, lmi);
+   kernel_unmap_linear_page(vaddr);
}
local_irq_restore(flags);
 }
-- 
2.13.6



[PATCH 04/16] powerpc/mm: Add hash invalidate callback

2017-10-26 Thread Aneesh Kumar K.V
Add a hash-based invalidate callback and use it in flush_hash_page().
Note: in a later patch we will drop the slot tracking completely. At that
point we will also lose the __rpte_sub_valid() check in
pte_iterate_hashed_subpages(), which means we call the invalidate for all
subpages irrespective of whether we took a hash fault on them or not.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  4 ++
 arch/powerpc/mm/hash_native_64.c  | 27 
 arch/powerpc/mm/hash_utils_64.c   | 11 ++---
 arch/powerpc/platforms/ps3/htab.c | 59 +++
 arch/powerpc/platforms/pseries/lpar.c | 26 
 5 files changed, 119 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 508275bb05d5..79f141e721ee 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -136,6 +136,10 @@ struct mmu_hash_ops {
   unsigned long vpn,
   int bpsize, int apsize,
   int ssize, int local);
+   void(*hash_invalidate)(unsigned long hash,
+  unsigned long vpn,
+  int bpsize, int apsize,
+  int ssize, int local);
long(*hpte_updatepp)(unsigned long slot,
 unsigned long newpp,
 unsigned long vpn,
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 496b1680ba24..f473a78baab7 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -497,6 +497,32 @@ static void native_hpte_invalidate(unsigned long slot, 
unsigned long vpn,
local_irq_restore(flags);
 }
 
+static void native_hash_invalidate(unsigned long hash, unsigned long vpn,
+  int bpsize, int apsize, int ssize, int local)
+{
+   unsigned long flags;
+   struct hash_pte *hptep;
+
+   DBG_LOW("invalidate(vpn=%016lx, hash: %lx)\n", vpn, hash);
+   local_irq_save(flags);
+   hptep = native_hpte_find(hash, vpn, bpsize, ssize);
+   if (hptep) {
+   /*
+* Invalidate the hpte. NOTE: this also unlocks it
+*/
+   hptep->v = 0;
+   }
+   /*
+* We need to invalidate the TLB always because hpte_remove doesn't do
+* a tlb invalidate. If a hash bucket gets full, we "evict" a more/less
+* random entry from it. When we do that we don't invalidate the TLB
+* (hpte_remove) because we assume the old translation is still
+* technically "valid".
+*/
+   tlbie(vpn, bpsize, apsize, ssize, local);
+   local_irq_restore(flags);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void native_hugepage_invalidate(unsigned long vsid,
   unsigned long addr,
@@ -776,6 +802,7 @@ static int native_register_proc_table(unsigned long base, 
unsigned long page_siz
 void __init hpte_init_native(void)
 {
mmu_hash_ops.hpte_invalidate= native_hpte_invalidate;
+   mmu_hash_ops.hash_invalidate= native_hash_invalidate;
mmu_hash_ops.hpte_updatepp  = native_hpte_updatepp;
mmu_hash_ops.hpte_updateboltedpp = native_hpte_updateboltedpp;
mmu_hash_ops.hpte_removebolted = native_hpte_removebolted;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 4d4662a77c14..b197fe57547e 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1598,23 +1598,18 @@ static inline void tm_flush_hash_page(int local)
 void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
 unsigned long flags)
 {
-   unsigned long hash, index, shift, hidx, slot;
+   unsigned long hash, index, shift;
int local = flags & HPTE_LOCAL_UPDATE;
 
DBG_LOW("flush_hash_page(vpn=%016lx)\n", vpn);
pte_iterate_hashed_subpages(pte, psize, vpn, index, shift) {
hash = hpt_hash(vpn, shift, ssize);
-   hidx = __rpte_to_hidx(pte, index);
-   if (hidx & _PTEIDX_SECONDARY)
-   hash = ~hash;
-   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
-   DBG_LOW(" sub %ld: hash=%lx, hidx=%lx\n", index, slot, hidx);
+   DBG_LOW(" sub %ld: hash=%lx\n", index, hash);
/*
 * We use same base page size and actual psize, because we don't
 * use these functions for hugepage
 */
-   

[PATCH 02/16] powerpc/mm: Update native_hpte_find to return hash pte

2017-10-26 Thread Aneesh Kumar K.V
The helper now also does a secondary hash search so that we can use this in
other functions.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hash_native_64.c | 73 
 1 file changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 3848af167df9..496b1680ba24 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -351,32 +351,49 @@ static long native_hpte_updatepp(unsigned long slot, 
unsigned long newpp,
return ret;
 }
 
-static long native_hpte_find(unsigned long vpn, int psize, int ssize)
+/* returns a locked hash pte */
+struct hash_pte *native_hpte_find(unsigned long hash, unsigned long vpn,
+ unsigned long bpsize, unsigned long ssize)
 {
+   int i;
+   unsigned long hpte_v;
struct hash_pte *hptep;
-   unsigned long hash;
-   unsigned long i;
-   long slot;
-   unsigned long want_v, hpte_v;
-
-   hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, ssize);
-   want_v = hpte_encode_avpn(vpn, psize, ssize);
+   unsigned long want_v, slot;
+   bool secondary_search = false;
 
-   /* Bolted mappings are only ever in the primary group */
+   want_v = hpte_encode_avpn(vpn, bpsize, ssize);
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   for (i = 0; i < HPTES_PER_GROUP; i++) {
-   hptep = htab_address + slot;
+
+   /*
+* search for hpte in the primary group
+*/
+search_again:
+   hptep = htab_address + slot;
+   for (i = 0; i < HPTES_PER_GROUP; i++, hptep++) {
+   /* check locklessly first */
hpte_v = be64_to_cpu(hptep->v);
if (cpu_has_feature(CPU_FTR_ARCH_300))
hpte_v = hpte_new_to_old_v(hpte_v, 
be64_to_cpu(hptep->r));
+   if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
+   continue;
 
-   if (HPTE_V_COMPARE(hpte_v, want_v) && (hpte_v & HPTE_V_VALID))
-   /* HPTE matches */
-   return slot;
-   ++slot;
+   native_lock_hpte(hptep);
+   hpte_v = be64_to_cpu(hptep->v);
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   hpte_v = hpte_new_to_old_v(hpte_v, 
be64_to_cpu(hptep->r));
+   if (unlikely(!HPTE_V_COMPARE(hpte_v, want_v) ||
+!(hpte_v & HPTE_V_VALID)))
+   native_unlock_hpte(hptep);
+   else
+   return hptep;
}
-
-   return -1;
+   if (!secondary_search) {
+   /* Search for hpte in the secondary group */
+   slot = (~hash & htab_hash_mask) * HPTES_PER_GROUP;
+   secondary_search = true;
+   goto search_again;
+   }
+   return NULL;
 }
 
 /*
@@ -389,23 +406,22 @@ static long native_hpte_find(unsigned long vpn, int 
psize, int ssize)
 static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
   int psize, int ssize)
 {
-   unsigned long vpn;
-   unsigned long vsid;
-   long slot;
+   unsigned long hash;
+   unsigned long vpn, vsid;
struct hash_pte *hptep;
 
vsid = get_kernel_vsid(ea, ssize);
vpn = hpt_vpn(ea, vsid, ssize);
-
-   slot = native_hpte_find(vpn, psize, ssize);
-   if (slot == -1)
+   hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, ssize);
+   hptep = native_hpte_find(hash, vpn, psize, ssize);
+   if (!hptep)
panic("could not find page to bolt\n");
-   hptep = htab_address + slot;
 
/* Update the HPTE */
hptep->r = cpu_to_be64((be64_to_cpu(hptep->r) &
~(HPTE_R_PPP | HPTE_R_N)) |
   (newpp & (HPTE_R_PPP | HPTE_R_N)));
+   native_unlock_hpte(hptep);
/*
 * Ensure it is out of the tlb too. Bolted entries base and
 * actual page size will be same.
@@ -422,18 +438,17 @@ static int native_hpte_removebolted(unsigned long ea, int 
psize, int ssize)
 {
unsigned long vpn;
unsigned long vsid;
-   long slot;
+   unsigned long hash;
struct hash_pte *hptep;
 
vsid = get_kernel_vsid(ea, ssize);
vpn = hpt_vpn(ea, vsid, ssize);
+   hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, ssize);
 
-   slot = native_hpte_find(vpn, psize, ssize);
-   if (slot == -1)
+   hptep = native_hpte_find(hash, vpn, psize, ssize);
+   if (!hptep)
return -ENOENT;
 
-   hptep = htab_address + slot;
-
VM_WARN_ON(!(be64_to_cpu(hptep->v) & HPTE_V_BOLTED));
 
/* Invalidate the hpte */
-- 
2.13.6



[PATCH 03/16] powerpc/pseries: Update hpte find helper to take hash value

2017-10-26 Thread Aneesh Kumar K.V
The helper now also does a secondary hash search so that we can use this in
other functions.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/lpar.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 495ba4e7336d..edab68d9f9f3 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -328,15 +328,21 @@ static long pSeries_lpar_hpte_updatepp(unsigned long slot,
return 0;
 }
 
-static long __pSeries_lpar_hpte_find(unsigned long want_v, unsigned long 
hpte_group)
+static long __pSeries_lpar_hpte_find(unsigned long hash, unsigned long want_v)
 {
long lpar_rc;
unsigned long i, j;
+   unsigned long hpte_group;
+   bool secondary_search = false;
struct {
unsigned long pteh;
unsigned long ptel;
} ptes[4];
 
+   /* first check primary */
+   hpte_group = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+
+search_again:
for (i = 0; i < HPTES_PER_GROUP; i += 4, hpte_group += 4) {
 
lpar_rc = plpar_pte_read_4(0, hpte_group, (void *)ptes);
@@ -346,31 +352,31 @@ static long __pSeries_lpar_hpte_find(unsigned long 
want_v, unsigned long hpte_gr
for (j = 0; j < 4; j++) {
if (HPTE_V_COMPARE(ptes[j].pteh, want_v) &&
(ptes[j].pteh & HPTE_V_VALID))
-   return i + j;
+   return hpte_group + j;
}
}
-
+   if (!secondary_search) {
+   hpte_group = (~hash & htab_hash_mask) * HPTES_PER_GROUP;
+   secondary_search = true;
+   goto search_again;
+   }
return -1;
 }
 
 static long pSeries_lpar_hpte_find(unsigned long vpn, int psize, int ssize)
 {
long slot;
-   unsigned long hash;
-   unsigned long want_v;
-   unsigned long hpte_group;
+   unsigned long hash, want_v;
 
hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, ssize);
want_v = hpte_encode_avpn(vpn, psize, ssize);
-
-   /* Bolted entries are always in the primary group */
-   hpte_group = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot = __pSeries_lpar_hpte_find(want_v, hpte_group);
+   slot = __pSeries_lpar_hpte_find(hash, want_v);
if (slot < 0)
return -1;
-   return hpte_group + slot;
+   return slot;
 }
 
+
 static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp,
 unsigned long ea,
 int psize, int ssize)
-- 
2.13.6



[PATCH 01/16] powerpc/mm/hash: Remove the superfluous bitwise operation when find hpte group

2017-10-26 Thread Aneesh Kumar K.V
When computing the starting slot number for a hash page table group we used
to do this:

hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;

Multiplying by 8 (HPTES_PER_GROUP) implies the last three bits are already 0,
hence we really don't need to clear them separately.
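
A trivial user-space check of that claim (illustrative only, not part of the
patch):

#include <assert.h>

int main(void)
{
	unsigned long hash;

	for (hash = 0; hash < (1UL << 20); hash++) {
		/* HPTES_PER_GROUP == 8, i.e. a left shift by 3 */
		unsigned long group = hash * 8;
		assert(group == (group & ~0x7UL));
	}
	return 0;
}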

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/dump_hashpagetable.c |  2 +-
 arch/powerpc/mm/hash64_4k.c  |  8 
 arch/powerpc/mm/hash64_64k.c | 16 
 arch/powerpc/mm/hash_utils_64.c  | 10 --
 arch/powerpc/mm/hugepage-hash64.c|  9 -
 5 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/mm/dump_hashpagetable.c 
b/arch/powerpc/mm/dump_hashpagetable.c
index 5c4c93dcff19..2384d40bfcf4 100644
--- a/arch/powerpc/mm/dump_hashpagetable.c
+++ b/arch/powerpc/mm/dump_hashpagetable.c
@@ -260,7 +260,7 @@ static int pseries_find(unsigned long ea, int psize, bool 
primary, u64 *v, u64 *
/* to check in the secondary hash table, we invert the hash */
if (!primary)
hash = ~hash;
-   hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+   hpte_group = (hash & htab_hash_mask) * HPTES_PER_GROUP;
/* see if we can find an entry in the hpte with this hash */
for (i = 0; i < HPTES_PER_GROUP; i += 4, hpte_group += 4) {
lpar_rc = plpar_pte_read_4(0, hpte_group, (void *)ptes);
diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c12d6d..975793de0914 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -81,7 +81,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
hash = hpt_hash(vpn, shift, ssize);
 
 repeat:
-   hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & 
~0x7UL;
+   hpte_group = (hash & htab_hash_mask) * HPTES_PER_GROUP;
 
/* Insert into the hash table, primary slot */
slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, pa, rflags, 0,
@@ -90,7 +90,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
 * Primary is full, try the secondary
 */
if (unlikely(slot == -1)) {
-   hpte_group = ((~hash & htab_hash_mask) * 
HPTES_PER_GROUP) & ~0x7UL;
+   hpte_group = (~hash & htab_hash_mask) * HPTES_PER_GROUP;
slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, pa,
rflags,
HPTE_V_SECONDARY,
@@ -98,8 +98,8 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
MMU_PAGE_4K, ssize);
if (slot == -1) {
if (mftb() & 0x1)
-   hpte_group = ((hash & htab_hash_mask) *
- HPTES_PER_GROUP) & ~0x7UL;
+   hpte_group = (hash & htab_hash_mask) *
+   HPTES_PER_GROUP;
mmu_hash_ops.hpte_remove(hpte_group);
/*
 * FIXME!! Should be try the group from which 
we removed ?
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 1a68cb19b0e3..f1eb538721fc 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -163,7 +163,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, 
unsigned long vsid,
}
hash = hpt_hash(vpn, shift, ssize);
 repeat:
-   hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+   hpte_group = (hash & htab_hash_mask) * HPTES_PER_GROUP;
 
/* Insert into the hash table, primary slot */
slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, pa, rflags, 0,
@@ -172,15 +172,15 @@ int __hash_page_4K(unsigned long ea, unsigned long 
access, unsigned long vsid,
 * Primary is full, try the secondary
 */
if (unlikely(slot == -1)) {
-   hpte_group = ((~hash & htab_hash_mask) * HPTES_PER_GROUP) & 
~0x7UL;
+   hpte_group = (~hash & htab_hash_mask) * HPTES_PER_GROUP;
slot = mmu_hash_ops.hpte_insert(hpte_group, vpn, pa,
rflags, HPTE_V_SECONDARY,
MMU_PAGE_4K, MMU_PAGE_4K,
ssize);
if (slot == -1) {
if (mftb() & 0x1)
-   hpte_group = ((hash & htab_hash_mask) *
- HPTES_PER_GROUP) & ~0x7UL;
+   hpte_group = (hash & 

[PATCH 00/16] Remove hash page table slot tracking from linux PTE

2017-10-26 Thread Aneesh Kumar K.V
Hi,

With hash translation mode we always tracked the hash pte slot details in linux 
page table.
This occupied space in the linux page table and also limited our ability to
support linux features that require additional PTE bits. This series attempts
to lift this
limitation by not tracking slot number in linux page table. We still track slot 
details
w.r.t Transparent Hugepage entries because an invalidate there requires us to 
go through
all the 256 hash pte slots. So tracking whether hash page table entry is valid 
helps us in
avoiding a lot of hcalls there. With THP entries we don't keep slot details in 
the primary
linux page table entry but in the second half of page table. Hence tracking 
slot details
for THP doesn't take up space in PTE.

Even though we don't track slot, for removing/updating hash page table entry, 
PAPR hcalls expect
hash page table slot details. On pseries we find slot using H_READ hcall using 
H_READ_4 flags.
This implies an additional 2 hcalls in the updatepp and remove paths. The patch
series also attempts to limit the impact of this by adding new hcalls that do
remove/update of a hash page table entry using the hash value instead of the
hash page table slot.

Below are the performance numbers observed when running a workload that does
the below sequence

for(5000) {
mmap(128M)
touch every page of 2048 page
munmap()
}

The test is run with address randomization off, swap disabled in both host and 
guest.
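
Roughly, the measured loop is equivalent to the sketch below (a minimal
approximation of the modified mmap_bench, assuming anonymous private mappings
and 64K pages, i.e. 2048 pages per 128M mapping):

#include <stdlib.h>
#include <sys/mman.h>

#define MAP_SIZE	(128UL << 20)	/* 128M */
#define PAGE_SZ		(64UL << 10)	/* 64K pages -> 2048 pages per map */

int main(void)
{
	for (int i = 0; i < 5000; i++) {
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return EXIT_FAILURE;
		/* touch every page so each one takes a hash fault */
		for (unsigned long off = 0; off < MAP_SIZE; off += PAGE_SZ)
			p[off] = 1;
		munmap(p, MAP_SIZE);
	}
	return EXIT_SUCCESS;
}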


|------------+----------+---------------+---------------------------+-----------------------|
| iterations | platform | without patch | With series and no hcall  | With series and hcall |
|------------+----------+---------------+---------------------------+-----------------------|
|          1 | powernv  |               |                 50.818343 |                       |
|          2 | powernv  |               |                 50.744123 |                       |
|          3 | powernv  |               |                 50.721603 |                       |
|          4 | powernv  |               |                 50.739922 |                       |
|          5 | powernv  |               |                 50.638555 |                       |
|          1 | powernv  |     51.388249 |                           |                       |
|          2 | powernv  |     51.789701 |                           |                       |
|          3 | powernv  |     52.240394 |                           |                       |
|          4 | powernv  |     51.432255 |                           |                       |
|          5 | powernv  |     51.392947 |                           |                       |
|------------+----------+---------------+---------------------------+-----------------------|
|          1 | pseries  |               |                           |            123.154394 |
|          2 | pseries  |               |                           |            122.253956 |
|          3 | pseries  |               |                           |            117.666344 |
|          4 | pseries  |               |                           |            117.681479 |
|          5 | pseries  |               |                           |            117.735808 |
|          1 | pseries  |               |                119.424940 |                       |
|          2 | pseries  |               |                117.663078 |                       |
|          3 | pseries  |               |                118.345584 |                       |
|          4 | pseries  |               |                119.620934 |                       |
|          5 | pseries  |               |                119.463185 |                       |
|          1 | pseries  |    122.810867 |                           |                       |
|          2 | pseries  |    115.760801 |                           |                       |
|          3 | pseries  |    115.257030 |                           |                       |
|          4 | pseries  |    116.617884 |                           |                       |
|          5 | pseries  |    117.247036 |                           |                       |
|------------+----------+---------------+---------------------------+-----------------------|

-aneesh

Aneesh Kumar K.V (16):
  powerpc/mm/hash: Remove the superfluous bitwise operation when find
hpte group
  powerpc/mm: Update native_hpte_find to return hash pte
  powerpc/pseries: Update hpte find helper to take hash value
  powerpc/mm: Add hash invalidate callback
  powerpc/mm: use hash_invalidate for __kernel_map_pages()
  powerpc/mm: Switch flush_hash_range to not use slot
  powerpc/mm: Add hash updatepp callback
  powerpc/mm/hash: Don't track hash pte slot number in linux page table.
  powerpc/mm: Add new firmware feature HASH API
  powerpc/kvm/hash: Implement HASH_REMOVE hcall
  powerpc/kvm/hash: Implement HASH_PROTECT hcall
  powerpc/kvm/hash: Implement HASH_BULK_REMOVE hcall
  powerpc/mm/pseries: 

[PATCH kernel] vfio/spapr: Add trace points for map/unmap

2017-10-26 Thread Alexey Kardashevskiy
This adds trace_map/trace_unmap tracepoints to the spapr driver. Type1 already
uses these via the IOMMU API (iommu_map/__iommu_unmap).

Signed-off-by: Alexey Kardashevskiy 
---

Example:
 qemu-system-ppc-8655  [096]   724.662740: unmap:IOMMU: 
iova=0x3000 size=4096 unmapped_size=4096
 qemu-system-ppc-8656  [104]   724.970912: map:  IOMMU: 
iova=0x0800 paddr=0x7ffef7ff size=65536
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 63112c36ab2d..4531486c77c6 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -502,17 +503,19 @@ static int tce_iommu_clear(struct tce_container 
*container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
-   unsigned long oldhpa;
+   unsigned long oldhpa, unmapped, firstentry = entry, totalpages = pages;
long ret;
enum dma_data_direction direction;
 
-   for ( ; pages; --pages, ++entry) {
+   for (unmapped = 0; pages; --pages, ++entry) {
direction = DMA_NONE;
oldhpa = 0;
ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
if (ret)
continue;
 
+   ++unmapped;
+
if (direction == DMA_NONE)
continue;
 
@@ -523,6 +526,9 @@ static int tce_iommu_clear(struct tce_container *container,
 
tce_iommu_unuse_page(container, oldhpa);
}
+   trace_unmap(firstentry << tbl->it_page_shift,
+   totalpages << tbl->it_page_shift,
+   unmapped << tbl->it_page_shift);
 
return 0;
 }
@@ -965,6 +971,8 @@ static long tce_iommu_ioctl(void *iommu_data,
direction);
 
iommu_flush_tce(tbl);
+   if (!ret)
+   trace_map(param.iova, param.vaddr, param.size);
 
return ret;
}
-- 
2.11.0



[PATCH 4/4] m68k/mac: Add mutual exclusion for IOP interrupt polling

2017-10-26 Thread Finn Thain
The IOP interrupt handler iop_ism_irq() is used by the adb-iop
driver to poll for ADB request completion. Unfortunately, it is not
re-entrant. Fix the race condition by adding an iop_ism_irq_poll()
function with suitable mutual exclusion.

Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Cc: Benjamin Herrenschmidt 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/m68k/include/asm/mac_iop.h | 1 +
 arch/m68k/mac/iop.c | 9 +
 drivers/macintosh/adb-iop.c | 4 +---
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/m68k/include/asm/mac_iop.h b/arch/m68k/include/asm/mac_iop.h
index 42566fd052bc..d2a08e004e2c 100644
--- a/arch/m68k/include/asm/mac_iop.h
+++ b/arch/m68k/include/asm/mac_iop.h
@@ -158,6 +158,7 @@ extern void iop_complete_message(struct iop_msg *);
 extern void iop_upload_code(uint, __u8 *, uint, __u16);
 extern void iop_download_code(uint, __u8 *, uint, __u16);
 extern __u8 *iop_compare_code(uint, __u8 *, uint, __u16);
+extern void iop_ism_irq_poll(uint);
 
 extern void iop_register_interrupts(void);
 
diff --git a/arch/m68k/mac/iop.c b/arch/m68k/mac/iop.c
index a2ea52db7d18..9bfa17015768 100644
--- a/arch/m68k/mac/iop.c
+++ b/arch/m68k/mac/iop.c
@@ -598,3 +598,12 @@ irqreturn_t iop_ism_irq(int irq, void *dev_id)
}
return IRQ_HANDLED;
 }
+
+void iop_ism_irq_poll(uint iop_num)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   iop_ism_irq(0, (void *)iop_num);
+   local_irq_restore(flags);
+}
diff --git a/drivers/macintosh/adb-iop.c b/drivers/macintosh/adb-iop.c
index f5f4da3d0b67..4b0ad3995497 100644
--- a/drivers/macintosh/adb-iop.c
+++ b/drivers/macintosh/adb-iop.c
@@ -29,8 +29,6 @@
 
 /*#define DEBUG_ADB_IOP*/
 
-extern void iop_ism_irq(int, void *);
-
 static struct adb_request *current_req;
 static struct adb_request *last_req;
 #if 0
@@ -265,7 +263,7 @@ int adb_iop_autopoll(int devs)
 void adb_iop_poll(void)
 {
if (adb_iop_state == idle) adb_iop_start();
-   iop_ism_irq(0, (void *) ADB_IOP);
+   iop_ism_irq_poll(ADB_IOP);
 }
 
 int adb_iop_reset_bus(void)
-- 
2.13.6



Re: [PATCH v5 12/18] MODSIGN: Export module signature definitions

2017-10-26 Thread Mimi Zohar
On Thu, 2017-10-26 at 20:47 -0200, Thiago Jung Bauermann wrote:
> Mimi Zohar  writes:
> 
> > On Tue, 2017-10-17 at 22:53 -0200, Thiago Jung Bauermann wrote:
> >> IMA will use the module_signature format for append signatures, so export
> >> the relevant definitions and factor out the code which verifies that the
> >> appended signature trailer is valid.
> >> 
> >> Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
> >> and be able to use validate_module_signature without having to depend on
> >> CONFIG_MODULE_SIG.
> >> 
> >> Signed-off-by: Thiago Jung Bauermann 
> >
> > Reviewed-by: Mimi Zohar 
> >
> > One minor comment below...
> 
> Thanks!
> 
> >> diff --git a/kernel/module_signing.c b/kernel/module_signing.c
> >> index 937c844bee4a..204c60d4cc9f 100644
> >> --- a/kernel/module_signing.c
> >> +++ b/kernel/module_signing.c
> >> @@ -11,36 +11,38 @@
> >> 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >>  #include "module-internal.h"
> >> 
> >> -enum pkey_id_type {
> >> -  PKEY_ID_PGP,/* OpenPGP generated key ID */
> >> -  PKEY_ID_X509,   /* X.509 arbitrary subjectKeyIdentifier */
> >> -  PKEY_ID_PKCS7,  /* Signature in PKCS#7 message */
> >> -};
> >> -
> >> -/*
> >> - * Module signature information block.
> >> - *
> >> - * The constituents of the signature section are, in order:
> >> +/**
> >> + * validate_module_sig - validate that the given signature is sane
> >>   *
> >> - *- Signer's name
> >> - *- Key identifier
> >> - *- Signature data
> >> - *- Information block
> >> + * @ms:   Signature to validate.
> >> + * @file_len: Size of the file to which @ms is appended.
> >>   */
> >> -struct module_signature {
> >> -  u8  algo;   /* Public-key crypto algorithm [0] */
> >> -  u8  hash;   /* Digest algorithm [0] */
> >> -  u8  id_type;/* Key identifier type [PKEY_ID_PKCS7] */
> >> -  u8  signer_len; /* Length of signer's name [0] */
> >> -  u8  key_id_len; /* Length of key identifier [0] */
> >> -  u8  __pad[3];
> >> -  __be32  sig_len;/* Length of signature data */
> >> -};
> >> +int validate_module_sig(const struct module_signature *ms, size_t 
> >> file_len)
> >> +{
> >> +  if (be32_to_cpu(ms->sig_len) >= file_len - sizeof(*ms))
> >> +  return -EBADMSG;
> >> +  else if (ms->id_type != PKEY_ID_PKCS7) {
> >> +  pr_err("Module is not signed with expected PKCS#7 message\n");
> >> +  return -ENOPKG;
> >> +  } else if (ms->algo != 0 ||
> >> + ms->hash != 0 ||
> >> + ms->signer_len != 0 ||
> >> + ms->key_id_len != 0 ||
> >> + ms->__pad[0] != 0 ||
> >> + ms->__pad[1] != 0 ||
> >> + ms->__pad[2] != 0) {
> >> +  pr_err("PKCS#7 signature info has unexpected non-zero 
> >> params\n");
> >> +  return -EBADMSG;
> >> +  }
> >> +
> >
> > When moving code from one place to another, it's easier to review when
> > there aren't code changes as well. In this case, the original code
> > doesn't have "else clauses".
> 
> Indeed. I changed the code back to using separate if clauses, making
> only the changes that are required for the refactoring.
> 
> > Here some of the if/then/else clauses
> > have braces others don't. There shouldn't be a mixture.
> 
> Does this still apply when the if clauses are separate as in the
> original code? Should the first if still have braces?

No, the original code was fine. 



Re: [PATCH v5 12/18] MODSIGN: Export module signature definitions

2017-10-26 Thread Thiago Jung Bauermann

Mimi Zohar  writes:

> On Tue, 2017-10-17 at 22:53 -0200, Thiago Jung Bauermann wrote:
>> IMA will use the module_signature format for append signatures, so export
>> the relevant definitions and factor out the code which verifies that the
>> appended signature trailer is valid.
>> 
>> Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
>> and be able to use validate_module_signature without having to depend on
>> CONFIG_MODULE_SIG.
>> 
>> Signed-off-by: Thiago Jung Bauermann 
>
> Reviewed-by: Mimi Zohar 
>
> One minor comment below...

Thanks!

>> diff --git a/kernel/module_signing.c b/kernel/module_signing.c
>> index 937c844bee4a..204c60d4cc9f 100644
>> --- a/kernel/module_signing.c
>> +++ b/kernel/module_signing.c
>> @@ -11,36 +11,38 @@
>> 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>>  #include "module-internal.h"
>> 
>> -enum pkey_id_type {
>> -PKEY_ID_PGP,/* OpenPGP generated key ID */
>> -PKEY_ID_X509,   /* X.509 arbitrary subjectKeyIdentifier */
>> -PKEY_ID_PKCS7,  /* Signature in PKCS#7 message */
>> -};
>> -
>> -/*
>> - * Module signature information block.
>> - *
>> - * The constituents of the signature section are, in order:
>> +/**
>> + * validate_module_sig - validate that the given signature is sane
>>   *
>> - *  - Signer's name
>> - *  - Key identifier
>> - *  - Signature data
>> - *  - Information block
>> + * @ms: Signature to validate.
>> + * @file_len:   Size of the file to which @ms is appended.
>>   */
>> -struct module_signature {
>> -u8  algo;   /* Public-key crypto algorithm [0] */
>> -u8  hash;   /* Digest algorithm [0] */
>> -u8  id_type;/* Key identifier type [PKEY_ID_PKCS7] */
>> -u8  signer_len; /* Length of signer's name [0] */
>> -u8  key_id_len; /* Length of key identifier [0] */
>> -u8  __pad[3];
>> -__be32  sig_len;/* Length of signature data */
>> -};
>> +int validate_module_sig(const struct module_signature *ms, size_t file_len)
>> +{
>> +if (be32_to_cpu(ms->sig_len) >= file_len - sizeof(*ms))
>> +return -EBADMSG;
>> +else if (ms->id_type != PKEY_ID_PKCS7) {
>> +pr_err("Module is not signed with expected PKCS#7 message\n");
>> +return -ENOPKG;
>> +} else if (ms->algo != 0 ||
>> +   ms->hash != 0 ||
>> +   ms->signer_len != 0 ||
>> +   ms->key_id_len != 0 ||
>> +   ms->__pad[0] != 0 ||
>> +   ms->__pad[1] != 0 ||
>> +   ms->__pad[2] != 0) {
>> +pr_err("PKCS#7 signature info has unexpected non-zero 
>> params\n");
>> +return -EBADMSG;
>> +}
>> +
>
> When moving code from one place to another, it's easier to review when
> there aren't code changes as well. In this case, the original code
> doesn't have "else clauses".

Indeed. I changed the code back to using separate if clauses, making
only the changes that are required for the refactoring.

> Here some of the if/then/else clauses
> have braces others don't. There shouldn't be a mixture.

Does this still apply when the if clauses are separate as in the
original code? Should the first if still have braces?

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH v5 18/18] ima: Write modsig to the measurement list

2017-10-26 Thread Thiago Jung Bauermann

Hello Mimi,

Thanks for your review.

Mimi Zohar  writes:

> On Tue, 2017-10-17 at 22:53 -0200, Thiago Jung Bauermann wrote:
>
>> diff --git a/security/integrity/ima/ima_main.c 
>> b/security/integrity/ima/ima_main.c
>> index 6a2d960fbd92..0d3390de7432 100644
>> --- a/security/integrity/ima/ima_main.c
>> +++ b/security/integrity/ima/ima_main.c
>> @@ -246,7 +246,35 @@ static int process_measurement(struct file *file, char 
>> *buf, loff_t size,
>>  rc = ima_appraise_measurement(func, iint, file, buf, size,
>>pathname, &xattr_value,
>>&xattr_len, opened);
>> -if (action & IMA_MEASURE)
>> +
>> +/*
>> + * MODSIG has one corner case we need to deal with here:
>> + *
>> + * Suppose the policy has one measure rule for one hook and an appraise
>> + * rule for a different hook. Suppose also that the template requires
>> + * the signature to be stored in the measurement list.
>> + *
>> + * If a file containing a MODSIG is measured by the first hook before
>> + * being appraised by the second one, the corresponding entry in the
>> + * measurement list will not contain the MODSIG because we only fetch it
>> + * for IMA_APPRAISAL. We don't fetch it earlier because if the file has
>> + * both a DIGSIG and a MODSIG it's not possible to know which one will
>> + * be valid without actually doing the appraisal.
>> + *
>> + * Therefore, after appraisal of a MODSIG signature we need to store the
>> + * measurement again if the current template requires storing the
>> + * signature.
>
> Yes, all true, but this long comment doesn't belong here in the middle
> of process_measurement().
>
>> + * With the opposite ordering (the appraise rule triggering before the
>> + * measurement rule) there is the same problem but it's not possible to
>> + * do anything about it because at the time we are appraising the
>> + * signature it's impossible to know whether a measurement will ever
>> + * need to be stored for this file.
>> + */
>
> With the template format "ima-sig", the verified file signature needs
> to be included in the measurement list. Based on this file signature,
> the attestation server can validate the signature.
>
> In this case, where the appraisal comes first followed by the
> measurement, the appraised file signature is included in the
> measurement list. I don't see the problem here.

I think I forgot that during appraisal the modsig is copied into the
iint cache and that it will be used when the measure rule is triggered.
I'll drop that last paragraph.

>
>> +if ((action & IMA_MEASURE) || ((iint->flags & IMA_MEASURE) &&
>> +   xattr_value &&
>> +   xattr_value->type == IMA_MODSIG &&
>> +   ima_current_template_has_sig()))
>
> Like the clean up you did elsewhere, this new set of tests should be
> made into a function. The comment could placed along with the new
> function.

Ok. I didn't create a function because these tests are only done here,
but I agree that it will make the code clearer, and be a better place
for the big comment as well. Will do in the next version.
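
Something along these lines, perhaps (hypothetical name, just sketching the
condition from the hunk above as a helper):

	static bool ima_modsig_needs_remeasure(int action,
					       struct integrity_iint_cache *iint,
					       struct evm_ima_xattr_data *xattr_value)
	{
		if (action & IMA_MEASURE)
			return true;
		/* re-store the measurement after appraising a modsig if the
		 * current template records the signature */
		return (iint->flags & IMA_MEASURE) && xattr_value &&
		       xattr_value->type == IMA_MODSIG &&
		       ima_current_template_has_sig();
	}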

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH v5 00/18] Appended signatures support for IMA appraisal

2017-10-26 Thread Mimi Zohar
On Tue, 2017-10-17 at 22:53 -0200, Thiago Jung Bauermann wrote:
> Hello,
> 
> The main highlight in this version is that it fixes a bug where the modsig
> wasn't being included in the measurement list if the appraised file was
> already measured by another rule. The fix is in the last patch.
> 
> Another change is that the last patch in the v4 series ("ima: Support
> module-style appended signatures for appraisal") has been broken up into
> smaller patches. I may have overdone it...
> 
> Finally, I have added some patches removing superfluous parentheses from
> expressions. IMO these patches make it easier (and more pleasant) to read
> the code, and thus easier to understand it. Since I'm not sure how welcome
> the changes are, I split them in 3 "levels" in increasing potential for
> conflict with patches from other people (they can be squashed together when
> applied):
> 
> 1. patch 2 contains the bare minimum, changing only lines that are also
>touched by other patches in the series;
> 
> 2. patch 3 cleans up all the files that are touched by this patch series;
> 
> 3. patch 4 cleans up all other EVM and IMA files that weren't already fixed
>by the previous patches.
> 
> If unwanted, patches 3 and 4 can be simply skipped without affecting the
> rest of the patches. I have already rebased them from v4.13-rc2 to
> v4.14-rc3 and now to linux-integrity/next with very few easy to resolve
> conflicts, so I think they are worth keeping.
> 
> These patches apply on top of today's linux-integrity/next.

This cover letter and the patch descriptions are well written,
explaining what and why you're making this change.  The problem is
that I don't agree that fewer parentheses makes the code more
readable.  When you repost the patches (for other reasons), please
don't include these changes.

thanks,

Mimi



[PATCH V5 2/2] pseries/initnodes: Ensure nodes initialized for hotplug

2017-10-26 Thread Michael Bringmann
pseries/nodes: On pseries systems which allow 'hot-add' of CPU,
it may occur that the new resources are to be inserted into nodes
that were not used for memory resources at bootup.  Many different
configurations of PowerPC resources may need to be supported depending
upon the environment.  This patch fixes some problems encountered at
runtime with configurations that support memory-less nodes, or that
hot-add resources during system execution after boot.

Signed-off-by: Michael Bringmann 
---
Changes in V5:
  -- Reorganize code used to associatively map CPUs to nodes and
 ensure initialization of nodes at runtime.
---
 arch/powerpc/mm/numa.c |   45 +
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 18fc40d..4d615a4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -551,7 +551,7 @@ static int numa_setup_cpu(unsigned long lcpu)
nid = of_node_to_nid_single(cpu);
 
 out_present:
-   if (nid < 0 || !node_online(nid))
+   if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
map_cpu_to_node(lcpu, nid);
@@ -867,7 +867,7 @@ void __init dump_numa_cpu_topology(void)
 }
 
 /* Initialize NODE_DATA for a node on the local memory */
-static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
+static void setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
u64 spanned_pages = end_pfn - start_pfn;
const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
@@ -1310,6 +1310,40 @@ static long vphn_get_associativity(unsigned long cpu,
return rc;
 }
 
+static inline int find_cpu_nid(int cpu)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   int new_nid;
+
+   /* Use associativity from first thread for all siblings */
+   vphn_get_associativity(cpu, associativity);
+   new_nid = associativity_to_nid(associativity);
+   if (new_nid < 0 || !node_possible(new_nid))
+   new_nid = first_online_node;
+
+   /*
+* Need to ensure that NODE_DATA is allocated/initialized for
+* a node from available memory (see memblock_alloc_try_nid).
+* Code executed after boot (like local_memory_node) may not
+* know enough to recover fully for memoryless nodes. 
+*/
+   if (NODE_DATA(new_nid) == NULL) {
+   setup_node_data(new_nid, 0, 0);
+   try_online_node(new_nid);
+
+   /*
+* Node structures successfully initialized, but
+* we still need the memoryless node to be offline.
+*/
+   node_set_offline(new_nid);
+   }
+
+   if (NODE_DATA(new_nid)->node_spanned_pages == 0)
+   return first_online_node;
+
+   return new_nid;
+}
+
 /*
  * Update the CPU maps and sysfs entries for a single CPU when its NUMA
  * characteristics change. This function doesn't perform any locking and is
@@ -1377,7 +1411,6 @@ int numa_update_cpu_topology(bool cpus_locked)
 {
unsigned int cpu, sibling, changed = 0;
struct topology_update_data *updates, *ud;
-   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
cpumask_t updated_cpus;
struct device *dev;
int weight, new_nid, i = 0;
@@ -1415,11 +1448,7 @@ int numa_update_cpu_topology(bool cpus_locked)
continue;
}
 
-   /* Use associativity from first thread for all siblings */
-   vphn_get_associativity(cpu, associativity);
-   new_nid = associativity_to_nid(associativity);
-   if (new_nid < 0 || !node_online(new_nid))
-   new_nid = first_online_node;
+   new_nid = find_cpu_nid(cpu);
 
if (new_nid == numa_cpu_lookup_table[cpu]) {
cpumask_andnot(&cpu_associativity_changes_mask,



[PATCH V5 1/2] pseries/nodes: Ensure enough nodes avail for operations

2017-10-26 Thread Michael Bringmann
pseries/nodes: On pseries systems which allow 'hot-add' of CPU or
memory resources, it may occur that the new resources are to be
inserted into nodes that were not used for these resources at bootup.
In the kernel, any node that is used must be defined and initialized.
This patch ensures that sufficient nodes are defined to support
configuration requirements after boot, as well as at boot.

This patch extracts the value of the lowest domain level (number
of allocable resources) from the device tree property
"ibm,max-associativity-domains" to use as the maximum number of nodes
to set up as possibly available in the system.  This new setting will
override the instruction,

nodes_and(node_possible_map, node_possible_map, node_online_map);

presently seen in the function arch/powerpc/mm/numa.c:initmem_init().

If the property is not present at boot, no operation will be performed
to define or enable additional nodes.

Signed-off-by: Michael Bringmann 
---
Changes in V5:
  -- Remove unnecessary error check
---
 arch/powerpc/mm/numa.c |   38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index eb604b3..18fc40d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -892,6 +892,35 @@ static void __init setup_node_data(int nid, u64 start_pfn, 
u64 end_pfn)
NODE_DATA(nid)->node_spanned_pages = spanned_pages;
 }
 
+static void __init find_possible_nodes(void)
+{
+   struct device_node *rtas;
+   u32 numnodes, i;
+
+   if (min_common_depth <= 0)
+   return;
+
+   rtas = of_find_node_by_path("/rtas");
+   if (!rtas)
+   return;
+
+   if (of_property_read_u32_index(rtas,
+   "ibm,max-associativity-domains",
+   min_common_depth, &numnodes))
+   goto out;
+
+   pr_info("numa: Nodes = %d (mcd = %d)\n", numnodes,
+   min_common_depth);
+
+   for (i = 0; i < numnodes; i++) {
+   if (!node_possible(i))
+   node_set(i, node_possible_map);
+   }
+
+out:
+   of_node_put(rtas);
+}
+
 void __init initmem_init(void)
 {
int nid, cpu;
@@ -905,12 +934,15 @@ void __init initmem_init(void)
memblock_dump_all();
 
/*
-* Reduce the possible NUMA nodes to the online NUMA nodes,
-* since we do not support node hotplug. This ensures that  we
-* lower the maximum NUMA node ID to what is actually present.
+* Modify the set of possible NUMA nodes to reflect information
+* available about the set of online nodes, and the set of nodes
+* that we expect to make use of for this platform's affinity
+* calculations.
 */
nodes_and(node_possible_map, node_possible_map, node_online_map);
 
+   find_possible_nodes();
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;
 



[PATCH V5 0/2] pseries/nodes: Fix issues with memoryless nodes

2017-10-26 Thread Michael Bringmann
pseries/nodes: Ensure enough nodes avail for operations

pseries/initnodes: Ensure nodes initialized for hotplug

Signed-off-by: Michael Bringmann 

Michael Bringmann (2):
  pseries/nodes: Ensure enough nodes avail for operations
  pseries/initnodes: Ensure nodes initialized for hotplug
---
Changes in V5:
  -- Remove unnecessary error check
  -- Reorganize code used to associatively map CPUs to nodes and
 ensure initialization of nodes at runtime.



Re: [PATCH v5 13/18] PKCS#7: Introduce pkcs7_get_message_sig and verify_pkcs7_message_sig

2017-10-26 Thread Mimi Zohar
On Tue, 2017-10-17 at 22:53 -0200, Thiago Jung Bauermann wrote:
> IMA will need to access the digest used in the signature so that it can
> verify files containing module-style appended signatures. For this purpose,
> add function pkcs7_get_message_sig.
> 
> It will also need to verify an already parsed PKCS#7 message. For this
> purpose, add function verify_pkcs7_message_signature which takes a struct
> pkcs7_message for verification instead of the raw bytes that
> verify_pkcs7_signature takes.
> 
> Signed-off-by: Thiago Jung Bauermann 

Reviewed-by: Mimi Zohar 

> ---
>  certs/system_keyring.c| 60 
> +--
>  crypto/asymmetric_keys/pkcs7_parser.c | 12 +++
>  include/crypto/pkcs7.h|  2 ++
>  include/linux/verification.h  | 10 ++
>  4 files changed, 68 insertions(+), 16 deletions(-)
> 
> diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> index 6251d1b27f0c..6a8684959780 100644
> --- a/certs/system_keyring.c
> +++ b/certs/system_keyring.c
> @@ -190,33 +190,26 @@ late_initcall(load_system_certificate_list);
>  #ifdef CONFIG_SYSTEM_DATA_VERIFICATION
> 
>  /**
> - * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
> + * verify_pkcs7_message_sig - Verify a PKCS#7-based signature on system data.
>   * @data: The data to be verified (NULL if expecting internal data).
>   * @len: Size of @data.
> - * @raw_pkcs7: The PKCS#7 message that is the signature.
> - * @pkcs7_len: The size of @raw_pkcs7.
> + * @pkcs7: The PKCS#7 message that is the signature.
>   * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
>   *   (void *)1UL for all trusted keys).
>   * @usage: The use to which the key is being put.
>   * @view_content: Callback to gain access to content.
>   * @ctx: Context for callback.
>   */
> -int verify_pkcs7_signature(const void *data, size_t len,
> -const void *raw_pkcs7, size_t pkcs7_len,
> -struct key *trusted_keys,
> -enum key_being_used_for usage,
> -int (*view_content)(void *ctx,
> -const void *data, size_t len,
> -size_t asn1hdrlen),
> -void *ctx)
> +int verify_pkcs7_message_sig(const void *data, size_t len,
> +  struct pkcs7_message *pkcs7,
> +  struct key *trusted_keys,
> +  enum key_being_used_for usage,
> +  int (*view_content)(void *ctx, const void *data,
> +  size_t len, size_t asn1hdrlen),
> +  void *ctx)
>  {
> - struct pkcs7_message *pkcs7;
>   int ret;
> 
> - pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
> - if (IS_ERR(pkcs7))
> - return PTR_ERR(pkcs7);
> -
>   /* The data should be detached - so we need to supply it. */
>   if (data && pkcs7_supply_detached_data(pkcs7, data, len) < 0) {
>   pr_err("PKCS#7 signature with non-detached data\n");
> @@ -258,6 +251,41 @@ int verify_pkcs7_signature(const void *data, size_t len,
>   }
> 
>  error:
> + pr_devel("<==%s() = %d\n", __func__, ret);
> + return ret;
> +}
> +
> +/**
> + * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
> + * @data: The data to be verified (NULL if expecting internal data).
> + * @len: Size of @data.
> + * @raw_pkcs7: The PKCS#7 message that is the signature.
> + * @pkcs7_len: The size of @raw_pkcs7.
> + * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
> + *   (void *)1UL for all trusted keys).
> + * @usage: The use to which the key is being put.
> + * @view_content: Callback to gain access to content.
> + * @ctx: Context for callback.
> + */
> +int verify_pkcs7_signature(const void *data, size_t len,
> +const void *raw_pkcs7, size_t pkcs7_len,
> +struct key *trusted_keys,
> +enum key_being_used_for usage,
> +int (*view_content)(void *ctx,
> +const void *data, size_t len,
> +size_t asn1hdrlen),
> +void *ctx)
> +{
> + struct pkcs7_message *pkcs7;
> + int ret;
> +
> + pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
> + if (IS_ERR(pkcs7))
> + return PTR_ERR(pkcs7);
> +
> + ret = verify_pkcs7_message_sig(data, len, pkcs7, trusted_keys, usage,
> +view_content, ctx);
> +
>   pkcs7_free_message(pkcs7);
>   pr_devel("<==%s() = %d\n", __func__, ret);
>   return ret;
> diff --git a/crypto/asymmetric_keys/pkcs7_parser.c 

Re: [PATCH v5 12/18] MODSIGN: Export module signature definitions

2017-10-26 Thread Mimi Zohar
On Tue, 2017-10-17 at 22:53 -0200, Thiago Jung Bauermann wrote:
> IMA will use the module_signature format for append signatures, so export
> the relevant definitions and factor out the code which verifies that the
> appended signature trailer is valid.
> 
> Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
> and be able to use validate_module_signature without having to depend on
> CONFIG_MODULE_SIG.
> 
> Signed-off-by: Thiago Jung Bauermann 

Reviewed-by: Mimi Zohar 

One minor comment below...

> ---
>  include/linux/module.h   |  3 --
>  include/linux/module_signature.h | 47 +
>  init/Kconfig |  6 +++-
>  kernel/Makefile  |  2 +-
>  kernel/module.c  |  1 +
>  kernel/module_signing.c  | 74 
> +---
>  6 files changed, 85 insertions(+), 48 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index fe5aa3736707..108874af9838 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -23,9 +23,6 @@
>  #include 
>  #include 
> 
> -/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
> -#define MODULE_SIG_STRING "~Module signature appended~\n"
> -
>  /* Not Yet Implemented */
>  #define MODULE_SUPPORTED_DEVICE(name)
> 
> diff --git a/include/linux/module_signature.h 
> b/include/linux/module_signature.h
> new file mode 100644
> index ..e80728e5b86c
> --- /dev/null
> +++ b/include/linux/module_signature.h
> @@ -0,0 +1,47 @@
> +/* Module signature handling.
> + *
> + * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowe...@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_MODULE_SIGNATURE_H
> +#define _LINUX_MODULE_SIGNATURE_H
> +
> +/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
> +#define MODULE_SIG_STRING "~Module signature appended~\n"
> +
> +enum pkey_id_type {
> + PKEY_ID_PGP,/* OpenPGP generated key ID */
> + PKEY_ID_X509,   /* X.509 arbitrary subjectKeyIdentifier */
> + PKEY_ID_PKCS7,  /* Signature in PKCS#7 message */
> +};
> +
> +/*
> + * Module signature information block.
> + *
> + * The constituents of the signature section are, in order:
> + *
> + *   - Signer's name
> + *   - Key identifier
> + *   - Signature data
> + *   - Information block
> + */
> +struct module_signature {
> + u8  algo;   /* Public-key crypto algorithm [0] */
> + u8  hash;   /* Digest algorithm [0] */
> + u8  id_type;/* Key identifier type [PKEY_ID_PKCS7] */
> + u8  signer_len; /* Length of signer's name [0] */
> + u8  key_id_len; /* Length of key identifier [0] */
> + u8  __pad[3];
> + __be32  sig_len;/* Length of signature data */
> +};
> +
> +int validate_module_sig(const struct module_signature *ms, size_t file_len);
> +int mod_verify_sig(const void *mod, unsigned long *_modlen);
> +
> +#endif /* _LINUX_MODULE_SIGNATURE_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 78cb2461012e..230e9f3aedaf 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1748,7 +1748,7 @@ config MODULE_SRCVERSION_ALL
>  config MODULE_SIG
>   bool "Module signature verification"
>   depends on MODULES
> - select SYSTEM_DATA_VERIFICATION
> + select MODULE_SIG_FORMAT
>   help
> Check modules for valid signatures upon load: the signature
> is simply appended to the module. For more information see
> @@ -1763,6 +1763,10 @@ config MODULE_SIG
> debuginfo strip done by some packagers (such as rpmbuild) and
> inclusion into an initramfs that wants the module size reduced.
> 
> +config MODULE_SIG_FORMAT
> + def_bool n
> + select SYSTEM_DATA_VERIFICATION
> +
>  config MODULE_SIG_FORCE
>   bool "Require modules to be validly signed"
>   depends on MODULE_SIG
> diff --git a/kernel/Makefile b/kernel/Makefile
> index ed470aac53da..20fd6d76232e 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -57,7 +57,7 @@ obj-y += up.o
>  endif
>  obj-$(CONFIG_UID16) += uid16.o
>  obj-$(CONFIG_MODULES) += module.o
> -obj-$(CONFIG_MODULE_SIG) += module_signing.o
> +obj-$(CONFIG_MODULE_SIG_FORMAT) += module_signing.o
>  obj-$(CONFIG_KALLSYMS) += kallsyms.o
>  obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
>  obj-$(CONFIG_CRASH_CORE) += crash_core.o
> diff --git a/kernel/module.c b/kernel/module.c
> index de66ec825992..a259e9df570d 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  

Re: [PATCH v5 18/18] ima: Write modsig to the measurement list

2017-10-26 Thread Mimi Zohar
On Tue, 2017-10-17 at 22:53 -0200, Thiago Jung Bauermann wrote:

> diff --git a/security/integrity/ima/ima_main.c 
> b/security/integrity/ima/ima_main.c
> index 6a2d960fbd92..0d3390de7432 100644
> --- a/security/integrity/ima/ima_main.c
> +++ b/security/integrity/ima/ima_main.c
> @@ -246,7 +246,35 @@ static int process_measurement(struct file *file, char 
> *buf, loff_t size,
>   rc = ima_appraise_measurement(func, iint, file, buf, size,
> pathname, &xattr_value,
> &xattr_len, opened);
> - if (action & IMA_MEASURE)
> +
> + /*
> +  * MODSIG has one corner case we need to deal with here:
> +  *
> +  * Suppose the policy has one measure rule for one hook and an appraise
> +  * rule for a different hook. Suppose also that the template requires
> +  * the signature to be stored in the measurement list.
> +  *
> +  * If a file containing a MODSIG is measured by the first hook before
> +  * being appraised by the second one, the corresponding entry in the
> +  * measurement list will not contain the MODSIG because we only fetch it
> +  * for IMA_APPRAISAL. We don't fetch it earlier because if the file has
> +  * both a DIGSIG and a MODSIG it's not possible to know which one will
> +  * be valid without actually doing the appraisal.
> +  *
> +  * Therefore, after appraisal of a MODSIG signature we need to store the
> +  * measurement again if the current template requires storing the
> +  * signature.

Yes, all true, but this long comment doesn't belong here in the middle
of process_measurement(). 

> +  * With the opposite ordering (the appraise rule triggering before the
> +  * measurement rule) there is the same problem but it's not possible to
> +  * do anything about it because at the time we are appraising the
> +  * signature it's impossible to know whether a measurement will ever
> +  * need to be stored for this file.
> +  */

With the template format "ima-sig", the verified file signature needs
to be included in the measurement list.  Based on this file signature,
the attestation server can validate the signature.

In this case, where the appraisal comes first followed by the
measurement, the appraised file signature is included in the
measurement list.  I don't see the problem here.

> + if ((action & IMA_MEASURE) || ((iint->flags & IMA_MEASURE) &&
> +xattr_value &&
> +xattr_value->type == IMA_MODSIG &&
> +ima_current_template_has_sig()))

Like the clean up you did elsewhere, this new set of tests should be
made into a function.  The comment could placed along with the new
function.

Mimi

>   ima_store_measurement(iint, file, pathname,
> xattr_value, xattr_len, pcr);
>   if (action & IMA_AUDIT)
> 



Re: [PATCH] powerpc/tm: fix live state of vs0/32 in tm_reclaim

2017-10-26 Thread Breno Leitao
Hi Cyril, Mikey,

On Thu, Oct 26, 2017 at 03:57:33PM +1100, Cyril Bur wrote:
> On Wed, 2017-07-05 at 11:02 +1000, Michael Neuling wrote:
> > On Tue, 2017-07-04 at 16:45 -0400, Gustavo Romero wrote:
> > > Currently tm_reclaim() can return with a corrupted vs0 (fp0) or vs32 (v0)
> > > due to the fact vs0 is used to save FPSCR and vs32 is used to save VSCR.
> > 
> 
> Hi Mikey,
> 
> This completely fell off my radar, we do need something merged!
> 
> For what its worth I like the original patch.
> 
> > tm_reclaim() should have no state live in the registers once it returns.  It
> > should all be saved in the thread struct. The above is not an issue in my 
> > book.
> > 
> 
> Yeah, this is something I agree with, however, if that is the case then
> why have tm_recheckpoint() do partial reloads?

I think it is an optimization which unfortunately causes a lot of
bugs in the code, and makes the whole code hard to read and debug. I would
vote for changing this design to something simpler.

My suggestion is to remove the partial reclaim and partial
recheckpoint altogether.

In this case, all the checkpointed registers would go to ckvec/ckfp/ckpt
area in the thread. Independently if registers are bogus or not.

Later, __tm_recheckpoint() would recheckpoint all values from the
ckvec/ckfp area. Of course we can still change the values if
we need to, for example to enable a feature such as FP that might be
disabled (and whose registers are therefore bogus). This would continue
to happen between the reclaim and the recheckpoint. Something along
these lines:

giveup_all(container_of(thr, struct task_struct, thread));

tm_reclaim(thr, MSR_FP | MSR_VEC, cause);

if ((thr->ckpt_regs.msr & MSR_FP) == 0)
memcpy(&thr->ckfp_state, &thr->fp_state,
   sizeof(struct thread_fp_state));
if ((thr->ckpt_regs.msr & MSR_VEC) == 0)
memcpy(&thr->ckvr_state, &thr->vr_state,
   sizeof(struct thread_vr_state));

tm_recheckpoint(thr, MSR_FP | MSR_VEC);

> A partial reload only makes sense if we can be sure that reclaim will
> have left the state at least (partially) correct - not with (as is the
> case today) one corrupted fp or Altivec reg.

Even in this case, something can happen before the recheckpoint and
mess with the 'hot' registers that will be placed in the checkpointed
area. I am wondering if an interrupt could happen in the meantime that
calls some function that leaves the registers dirty, such as memcpy(?).
(Is that possible?)

> > Having a quick look at the code, I think there's and issue but we need 
> > something
> > more like this (completely untested).
> > 
> > When we recheckpoint inside an fp unavail, we need to recheckpoint vec if 
> > it was
> > enabled.  Currently we only ever recheckpoint the FP which seems like a 
> > bug. 
> > Visa versa for the other way around.
> > 
> 
> In your example, we don't need to reload VEC if we can trust that
> reclaim left the checkpointed regs on the CPU correctly - this patch
> achieves this.

Right, it is better to have this patch than nothing, but why not
recheckpoint everything? That would make the problems much easier to
debug, since we would only need to inspect the ckfp/ckvec area before
the recheckpoint. Today we need to inspect both the ck area and the
'hot' live registers, which is tricky to do. :-)

> Of course I'm more than happy to reduce complexity and not have this
> optimisation at all but then we should remove the entire parameter to
> tm_recheckpoint(). Any in between feels dangerous.

In fact, Gustavo already created a patch that does so, but we didn't
submit it because we also felt it was dangerous to propose such a big
change. Our idea is to just reclaim/recheckpoint with FP|VEC on, and once
we are confident that nothing is broken, remove the second parameter.

In order to do it, we have the following suggestion:

1) Fix the two major issues we have right now[1]. In order to do so, we would
start enabling VEC|FP in every recheckpoint. This would be the initial step
towards simplifying the whole code:

This is the proposed patch:
 https://github.com/leitao/linux/commit/80a73de53061237948bca76d97fbba18a0e2aba0

2) Force VEC|FP on treclaim

3) Simplify the whole trecheckpoint() and tm_reclaim() code, removing the MSR
argument, since FP|VEC would always be on.

Thank you,
Breno

[1] 
https://github.com/leitao/linux/commit/747c6e7ce00c971c347ca0b9c5069a4ebd715ef6

> > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> > index d4e545d27e..d1184264e2 100644
> > --- a/arch/powerpc/kernel/traps.c
> > +++ b/arch/powerpc/kernel/traps.c
> > @@ -1589,7 +1589,7 @@ void fp_unavailable_tm(struct pt_regs *regs)
> >  * If VMX is in use, the VRs now hold checkpointed values,
> >  * so we don't want to load the VRs from the thread_struct.
> >  */
> > -   tm_recheckpoint(>thread, MSR_FP);
> > +   tm_recheckpoint(>thread, regs->msr);
> >  
> > /* If VMX is in use, get the 

[RFC PATCH] powerpc/powernv: OPAL do not process events from hard interrupt context

2017-10-26 Thread Nicholas Piggin
Using irq_work for processing OPAL events can cause latency spikes and
is not really required. OPAL events are not performance critical, and
we already have kopald to poll and run events, so have kopald run them
all. Rather than scheduling them as irq_work, just run them directly
from kopald. Enable and disable interrupts between processing each
event.

Event handlers themselves should continue to use threaded handlers,
workqueues, etc. as appropriate to avoid high interrupts-off
latencies.

Signed-off-by: Nicholas Piggin 
---
Is there any reason for processing OPAL events right after the OPAL
interrupt? Or for using irq_work rather than the poller? I can't see why. This
patch should reduce the maximum irqs-off time.
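
For context, the kopald side (the opal.c hunk is cut off in this excerpt)
would drive the new helpers roughly like the sketch below; treat it as an
approximation rather than the actual hunk:

static int kopald(void *unused)
{
	set_freezable();
	do {
		try_to_freeze();

		/* run all pending events with interrupts enabled */
		opal_handle_events();

		set_current_state(TASK_INTERRUPTIBLE);
		if (opal_recheck_events())
			/* more events arrived while we were processing */
			__set_current_state(TASK_RUNNING);
		else
			schedule_timeout(msecs_to_jiffies(opal_heartbeat));
	} while (!kthread_should_stop());

	return 0;
}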

 arch/powerpc/platforms/powernv/opal-irqchip.c | 85 ---
 arch/powerpc/platforms/powernv/opal.c | 23 
 arch/powerpc/platforms/powernv/powernv.h  |  3 +-
 3 files changed, 51 insertions(+), 60 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c 
b/arch/powerpc/platforms/powernv/opal-irqchip.c
index 9d1b8c0aaf93..646bfac8e3f5 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -38,37 +37,47 @@ struct opal_event_irqchip {
unsigned long mask;
 };
 static struct opal_event_irqchip opal_event_irqchip;
-
+static u64 last_outstanding_events;
 static unsigned int opal_irq_count;
 static unsigned int *opal_irqs;
 
-static void opal_handle_irq_work(struct irq_work *work);
-static u64 last_outstanding_events;
-static struct irq_work opal_event_irq_work = {
-   .func = opal_handle_irq_work,
-};
-
-void opal_handle_events(uint64_t events)
+void opal_handle_events(void)
 {
-   int virq, hwirq = 0;
-   u64 mask = opal_event_irqchip.mask;
+   __be64 events = 0;
+   u64 e;
+
+   e = last_outstanding_events & opal_event_irqchip.mask;
+again:
+   while (e) {
+   int virq, hwirq;
+
+   hwirq = fls64(e) - 1;
+   e &= ~BIT_ULL(hwirq);
+
+   local_irq_disable();
+   virq = irq_find_mapping(opal_event_irqchip.domain, hwirq);
+   if (virq) {
+   irq_enter();
+   generic_handle_irq(virq);
+   irq_exit();
+   }
+   local_irq_enable();
 
-   if (!in_irq() && (events & mask)) {
-   last_outstanding_events = events;
-   irq_work_queue(&opal_event_irq_work);
-   return;
+   cond_resched();
}
+   last_outstanding_events = 0;
+   if (opal_poll_events(&events) != OPAL_SUCCESS)
+   return;
+   e = be64_to_cpu(events) & opal_event_irqchip.mask;
+   if (e)
+   goto again;
+}
 
-   while (events & mask) {
-   hwirq = fls64(events) - 1;
-   if (BIT_ULL(hwirq) & mask) {
-   virq = irq_find_mapping(opal_event_irqchip.domain,
-   hwirq);
-   if (virq)
-   generic_handle_irq(virq);
-   }
-   events &= ~BIT_ULL(hwirq);
-   }
+bool opal_recheck_events(void)
+{
+   if (last_outstanding_events & opal_event_irqchip.mask)
+   return true;
+   return false;
 }
 
 static void opal_event_mask(struct irq_data *d)
@@ -78,24 +87,9 @@ static void opal_event_mask(struct irq_data *d)
 
 static void opal_event_unmask(struct irq_data *d)
 {
-   __be64 events;
-
set_bit(d->hwirq, &opal_event_irqchip.mask);
-
-   opal_poll_events(&events);
-   last_outstanding_events = be64_to_cpu(events);
-
-   /*
-* We can't just handle the events now with opal_handle_events().
-* If we did we would deadlock when opal_event_unmask() is called from
-* handle_level_irq() with the irq descriptor lock held, because
-* calling opal_handle_events() would call generic_handle_irq() and
-* then handle_level_irq() which would try to take the descriptor lock
-* again. Instead queue the events for later.
-*/
if (last_outstanding_events & opal_event_irqchip.mask)
-   /* Need to retrigger the interrupt */
-   irq_work_queue(&opal_event_irq_work);
+   opal_wake_poller();
 }
 
 static int opal_event_set_type(struct irq_data *d, unsigned int flow_type)
@@ -136,16 +130,13 @@ static irqreturn_t opal_interrupt(int irq, void *data)
__be64 events;
 
opal_handle_interrupt(virq_to_hw(irq), &events);
-   opal_handle_events(be64_to_cpu(events));
+   last_outstanding_events = be64_to_cpu(events);
+   if (last_outstanding_events & opal_event_irqchip.mask)
+   opal_wake_poller();
 
return IRQ_HANDLED;
 }
 
-static void opal_handle_irq_work(struct irq_work *work)
-{
-   

Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-10-26 Thread Andrea Arcangeli
Hello Laurent,

Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea...@intel.com> shows
significant slowdown even for brk/malloc ops both single and
multi threaded.

The single threaded case I think is the most important because it has
zero chance of getting back any benefit later during page faults.

Could you check if:

1. it's possible to change vm_write_begin to be a noop if mm->mm_count is
   <= 1? Hint: clone() will run single threaded so there's no way it can run
   in the middle of a begin/end critical section (clone could set an
   MMF flag to possibly keep the sequence counter activated if a child
   thread exits and mm_count drops to 1 while the other cpu is in the
   middle of a critical section in the other thread).

2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
   freeing happened only once a MMF flag is set? That will at least
   reduce the risk of temporary memory waste until the next RCU grace
   period. The read of the MMF will scale fine. Of course to allow
   point 1 and 2 then the page fault should also take the mmap_sem
   until the MMF flag is set.

Could you also investigate a much bigger change: I wonder if it's
possible to drop the sequence number entirely from the vma and stop
using sequence numbers entirely (which is likely the source of the
single threaded regression in point 1 that may explain the report in
the above message-id), and just call the vma rbtree lookup once again
and check that everything is still the same in the vma and the PT lock
obtained is still a match to finish the anon page fault and fill the
pte?

Then of course we also need to add a method to the read-write
semaphore so it tells us if there's already one user holding the read
mmap_sem and we're the second one.  If we're the second one (or more
than second) only then we should skip taking the down_read mmap_sem.
Even a multithreaded app won't ever skip taking the mmap_sem until
there's sign of runtime contention, and it won't have to run the way
more expensive sequence number-less revalidation during page faults,
unless we get an immediate scalability payoff because we already know
the mmap_sem is already contended and there are multiple nested
threads in the page fault handler of the same mm.

Perhaps we'd need something more advanced than a
down_read_trylock_if_not_hold() (which has to guaranteed not to write
to any cacheline) and we'll have to count the per-thread exponential
backoff of mmap_sem frequency, but starting with
down_read_trylock_if_not_hold() would be good I think.
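
To illustrate the heuristic (purely a sketch, not a proposed kernel API):
whether somebody else is already in the mmap_sem can be approximated with a
plain read of the rwsem state, which does not dirty the lock cacheline, e.g.:

	/* illustrative only: go speculative only under detected contention */
	static inline bool mmap_sem_contended_hint(struct mm_struct *mm)
	{
		return rwsem_is_locked(&mm->mmap_sem);
	}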

This is not how the current patch works, the current patch uses a
sequence number because it pretends to go lockless always and in turn
has to slow down all vma updates fast paths or the revalidation
slowsdown performance for page fault too much (as it always
revalidates).

I think it would be much better to go speculative only when there's
"detected" runtime contention on the mmap_sem with
down_read_trylock_if_not_hold() and that will make the revalidation
cost not an issue to worry about because normally we won't have to
revalidate the vma at all during page fault. In turn by making the
revalidation more expensive by starting a vma rbtree lookup from
scratch, we can drop the sequence number entirely and that should
simplify the patch tremendously because all vm_write_begin/end would
disappear from the patch and in turn the mmap/brk slowdown measured by
the message-id above, should disappear as well.
   
Thanks,
Andrea


Re: [v5,22/22] powerpc/mm: Add speculative page fault

2017-10-26 Thread kemi
Some regressions were found by LKP-tools (Linux Kernel Performance) with this
patch series, tested on Intel 2S/4S Skylake platforms.
The regression result is sorted by the metric will-it-scale.per_process_ops.

Branch:Laurent-Dufour/Speculative-page-faults/20171011-213456(V4 patch series)
Commit id:
 base:9a4b4dd1d8700dd5771f11dd2c048e4363efb493
 head:56a4a8962fb32555a42eefdc9a19eeedd3e8c2e6
Benchmark suite:will-it-scale
Download link:https://github.com/antonblanchard/will-it-scale/tree/master/tests
Metrics:
 will-it-scale.per_process_ops=processes/nr_cpu
 will-it-scale.per_thread_ops=threads/nr_cpu

tbox:lkp-skl-4sp1(nr_cpu=192,memory=768G)
kconfig:CONFIG_TRANSPARENT_HUGEPAGE is not set
testcase          base        change    head        metric
brk1              2251803     -18.1%    1843535     will-it-scale.per_process_ops
                  341101      -17.5%    281284      will-it-scale.per_thread_ops
malloc1           48833        -9.2%    44343       will-it-scale.per_process_ops
                  31555        +2.9%    32473       will-it-scale.per_thread_ops
page_fault3       913019       -8.5%    835203      will-it-scale.per_process_ops
                  233978      -18.1%    191593      will-it-scale.per_thread_ops
mmap2             95892        -6.6%    89536       will-it-scale.per_process_ops
                  90180       -13.7%    77803       will-it-scale.per_thread_ops
mmap1             109586       -4.7%    104414      will-it-scale.per_process_ops
                  104477      -12.4%    91484       will-it-scale.per_thread_ops
sched_yield       4964649      -2.1%    4859927     will-it-scale.per_process_ops
                  4946759      -1.7%    4864924     will-it-scale.per_thread_ops
write1            1345159      -1.3%    1327719     will-it-scale.per_process_ops
                  1228754      -2.2%    1201915     will-it-scale.per_thread_ops
page_fault2       202519       -1.0%    200545      will-it-scale.per_process_ops
                  96573       -10.4%    86526       will-it-scale.per_thread_ops
page_fault1       225608       -0.9%    223585      will-it-scale.per_process_ops
                  105945      +14.4%    121199      will-it-scale.per_thread_ops

tbox:lkp-skl-4sp1(nr_cpu=192,memory=768G)
kconfig:CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
testcase          base        change    head        metric
context_switch1   333780      -23.0%    256927      will-it-scale.per_process_ops
brk1              2263539     -18.8%    1837462     will-it-scale.per_process_ops
                  325854      -15.7%    274752      will-it-scale.per_thread_ops
malloc1           48746       -13.5%    42148       will-it-scale.per_process_ops
mmap1             106860      -12.4%    93634       will-it-scale.per_process_ops
                  98082       -18.9%    79506       will-it-scale.per_thread_ops
mmap2             92468       -11.3%    82059       will-it-scale.per_process_ops
                  80468        -8.9%    73343       will-it-scale.per_thread_ops
page_fault3       900709       -9.1%    818851      will-it-scale.per_process_ops
                  229837      -18.3%    187769      will-it-scale.per_thread_ops
write1            1327409      -1.7%    1305048     will-it-scale.per_process_ops
                  1215658      -1.6%    1196479     will-it-scale.per_thread_ops
writeseek3        300639       -1.6%    295882      will-it-scale.per_process_ops
                  231118       -2.2%    225929      will-it-scale.per_thread_ops
signal1           122011       -1.5%    120155      will-it-scale.per_process_ops
futex1            5123778      -1.2%    5062087     will-it-scale.per_process_ops
page_fault2       202321       -1.0%    200289      will-it-scale.per_process_ops
                  93073        -9.8%    83927       will-it-scale.per_thread_ops

tbox:lkp-skl-2sp2(nr_cpu=112,memory=64G)
kconfig:CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
testcase          base        change    head        metric
brk1              2177903     -20.0%    1742054     will-it-scale.per_process_ops
                  434558      -15.3%    367896      will-it-scale.per_thread_ops
malloc1           64871       -10.3%    58174       will-it-scale.per_process_ops
page_fault3       882435       -9.0%    802892      will-it-scale.per_process_ops
                  299176      -15.7%    252170      will-it-scale.per_thread_ops

Re: [PATCH v2 3/3] powerpc/64s/radix: Fix process table entry cache invalidation

2017-10-26 Thread Aneesh Kumar K.V
Nicholas Piggin  writes:

> According to the architecture, the process table entry cache must be
> flushed with tlbie RIC=2.
>
> Currently the process table entry is set to invalid right before the
> PID is returned to the allocator, with no invalidation. This works on
> existing implementations that are known to not cache the process table
> entry for any except the current PIDR.
>
> It is architecturally correct and cleaner to invalidate with RIC=2
> after clearing the process table entry and before the PID is returned
> to the allocator. This can be done by clearing the entry in
> arch_exit_mmap, which runs before the final flush, and by ensuring
> the final flush (fullmm) is always a RIC=2 variant.
>

Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 
>  arch/powerpc/mm/mmu_context_book3s64.c | 25 -
>  arch/powerpc/mm/tlb-radix.c|  6 +-
>  3 files changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index a0d7145d6cd2..20eae6f76247 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -164,9 +164,13 @@ static inline void arch_dup_mmap(struct mm_struct *oldmm,
>  {
>  }
>
> +#ifndef CONFIG_PPC_BOOK3S_64
>  static inline void arch_exit_mmap(struct mm_struct *mm)
>  {
>  }
> +#else
> +extern void arch_exit_mmap(struct mm_struct *mm);
> +#endif
>
>  static inline void arch_unmap(struct mm_struct *mm,
> struct vm_area_struct *vma,
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
> b/arch/powerpc/mm/mmu_context_book3s64.c
> index 05e15386d4cb..6d724dab27c2 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -216,19 +216,34 @@ void destroy_context(struct mm_struct *mm)
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
>   WARN_ON_ONCE(!list_empty(&mm->context.iommu_group_mem_list));
>  #endif
> + if (radix_enabled())
> + WARN_ON(process_tb[mm->context.id].prtb0 != 0);
> + else
> + subpage_prot_free(mm);
> + destroy_pagetable_page(mm);
> + __destroy_context(mm->context.id);
> + mm->context.id = MMU_NO_CONTEXT;
> +}
> +
> +void arch_exit_mmap(struct mm_struct *mm)
> +{
>   if (radix_enabled()) {
>   /*
>* Radix doesn't have a valid bit in the process table
>* entries. However we know that at least P9 implementation
>* will avoid caching an entry with an invalid RTS field,
>* and 0 is invalid. So this will do.
> +  *
> +  * This runs before the "fullmm" tlb flush in exit_mmap,
> +  * which does a RIC=2 tlbie to clear the process table
> +  * entry. See the "fullmm" comments in tlb-radix.c.
> +  *
> +  * No barrier required here after the store because
> +  * this process will do the invalidate, which starts with
> +  * ptesync.
>*/
>   process_tb[mm->context.id].prtb0 = 0;
> - } else
> - subpage_prot_free(mm);
> - destroy_pagetable_page(mm);
> - __destroy_context(mm->context.id);
> - mm->context.id = MMU_NO_CONTEXT;
> + }
>  }
>
>  #ifdef CONFIG_PPC_RADIX_MMU
> diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
> index 18170dc264aa..0c8464653aa3 100644
> --- a/arch/powerpc/mm/tlb-radix.c
> +++ b/arch/powerpc/mm/tlb-radix.c
> @@ -297,10 +297,14 @@ void radix__tlb_flush(struct mmu_gather *tlb)
>   psize = radix_get_mmu_psize(page_size);
>   /*
>* if page size is not something we understand, do a full mm flush
> +  *
> +  * A "fullmm" flush must always do a flush_all_mm (RIC=2) flush
> +  * that flushes the process table entry cache upon process teardown.
> +  * See the comment for radix in arch_exit_mmap().
>*/
>   if (psize != -1 && !tlb->fullmm && !tlb->need_flush_all)
>   radix__flush_tlb_range_psize(mm, tlb->start, tlb->end, psize);
> - else if (tlb->need_flush_all) {
> + else if (tlb->fullmm || tlb->need_flush_all) {
>   tlb->need_flush_all = 0;
>   radix__flush_all_mm(mm);
>   } else
> -- 
> 2.13.3



Re: [PATCH v2 2/3] powerpc/64s/radix: tlbie improve preempt handling

2017-10-26 Thread Aneesh Kumar K.V
Nicholas Piggin  writes:

> Preempt should be consistently disabled for mm_is_thread_local tests,
> so bring the rest of these under preempt_disable().
>
> Preempt does not need to be disabled for the mm->context.id tests,
> which allows simplification and removal of gotos.
>

Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/mm/tlb-radix.c | 48 
> +
>  1 file changed, 22 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
> index 892544ebc2d2..18170dc264aa 100644
> --- a/arch/powerpc/mm/tlb-radix.c
> +++ b/arch/powerpc/mm/tlb-radix.c
> @@ -186,16 +186,15 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
>  {
>   unsigned long pid;
>
> - preempt_disable();
>   pid = mm->context.id;
>   if (unlikely(pid == MMU_NO_CONTEXT))
> - goto no_context;
> + return;
>
> + preempt_disable();
>   if (!mm_is_thread_local(mm))
>   _tlbie_pid(pid, RIC_FLUSH_TLB);
>   else
>   _tlbiel_pid(pid, RIC_FLUSH_TLB);
> -no_context:
>   preempt_enable();
>  }
>  EXPORT_SYMBOL(radix__flush_tlb_mm);
> @@ -204,16 +203,15 @@ void radix__flush_all_mm(struct mm_struct *mm)
>  {
>   unsigned long pid;
>
> - preempt_disable();
>   pid = mm->context.id;
>   if (unlikely(pid == MMU_NO_CONTEXT))
> - goto no_context;
> + return;
>
> + preempt_disable();
>   if (!mm_is_thread_local(mm))
>   _tlbie_pid(pid, RIC_FLUSH_ALL);
>   else
>   _tlbiel_pid(pid, RIC_FLUSH_ALL);
> -no_context:
>   preempt_enable();
>  }
>  EXPORT_SYMBOL(radix__flush_all_mm);
> @@ -230,15 +228,14 @@ void radix__flush_tlb_page_psize(struct mm_struct *mm, 
> unsigned long vmaddr,
>   unsigned long pid;
>   unsigned long ap = mmu_get_ap(psize);
>
> - preempt_disable();
>   pid = mm->context.id;
>   if (unlikely(pid == MMU_NO_CONTEXT))
> - goto bail;
> + return;
> + preempt_disable();
>   if (!mm_is_thread_local(mm))
>   _tlbie_va(vmaddr, pid, ap, RIC_FLUSH_TLB);
>   else
>   _tlbiel_va(vmaddr, pid, ap, RIC_FLUSH_TLB);
> -bail:
>   preempt_enable();
>  }
>
> @@ -322,54 +319,53 @@ void radix__flush_tlb_range_psize(struct mm_struct *mm, 
> unsigned long start,
>  {
>   unsigned long pid;
>   unsigned long addr;
> - int local = mm_is_thread_local(mm);
> + bool local;
>   unsigned long ap = mmu_get_ap(psize);
>   unsigned long page_size = 1UL << mmu_psize_defs[psize].shift;
>
> -
> - preempt_disable();
>   pid = mm->context.id;
>   if (unlikely(pid == MMU_NO_CONTEXT))
> - goto err_out;
> + return;
>
> + preempt_disable();
> + local = mm_is_thread_local(mm);
>   if (end == TLB_FLUSH_ALL ||
>   (end - start) > tlb_single_page_flush_ceiling * page_size) {
>   if (local)
>   _tlbiel_pid(pid, RIC_FLUSH_TLB);
>   else
>   _tlbie_pid(pid, RIC_FLUSH_TLB);
> - goto err_out;
> - }
> - for (addr = start; addr < end; addr += page_size) {
> + } else {
> + for (addr = start; addr < end; addr += page_size) {
>
> - if (local)
> - _tlbiel_va(addr, pid, ap, RIC_FLUSH_TLB);
> - else
> - _tlbie_va(addr, pid, ap, RIC_FLUSH_TLB);
> + if (local)
> + _tlbiel_va(addr, pid, ap, RIC_FLUSH_TLB);
> + else
> + _tlbie_va(addr, pid, ap, RIC_FLUSH_TLB);
> + }
>   }
> -err_out:
>   preempt_enable();
>  }
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
>  {
> - int local = mm_is_thread_local(mm);
>   unsigned long ap = mmu_get_ap(mmu_virtual_psize);
>   unsigned long pid, end;
> + bool local;
>
> - preempt_disable();
>   pid = mm->context.id;
>   if (unlikely(pid == MMU_NO_CONTEXT))
> - goto no_context;
> + return;
>
>   /* 4k page size, just blow the world */
>   if (PAGE_SIZE == 0x1000) {
>   radix__flush_all_mm(mm);
> - preempt_enable();
>   return;
>   }
>
> + preempt_disable();
> + local = mm_is_thread_local(mm);
>   /* Otherwise first do the PWC */
>   if (local)
>   _tlbiel_pid(pid, RIC_FLUSH_PWC);
> @@ -384,7 +380,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, 
> unsigned long addr)
>   else
>   _tlbie_va(addr, pid, ap, RIC_FLUSH_TLB);
>   }
> -no_context:
> +
>   preempt_enable();
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -- 
> 2.13.3



Re: [PATCH v2 1/3] powerpc/64s/radix: fix preempt imbalance in TLB flush

2017-10-26 Thread Aneesh Kumar K.V
Nicholas Piggin  writes:


Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/mm/tlb-radix.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
> index 3a07d7a5e2fe..892544ebc2d2 100644
> --- a/arch/powerpc/mm/tlb-radix.c
> +++ b/arch/powerpc/mm/tlb-radix.c
> @@ -358,7 +358,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, 
> unsigned long addr)
>   unsigned long ap = mmu_get_ap(mmu_virtual_psize);
>   unsigned long pid, end;
>
> -
> + preempt_disable();
>   pid = mm->context.id;
>   if (unlikely(pid == MMU_NO_CONTEXT))
>   goto no_context;
> @@ -366,6 +366,7 @@ void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, 
> unsigned long addr)
>   /* 4k page size, just blow the world */
>   if (PAGE_SIZE == 0x1000) {
>   radix__flush_all_mm(mm);
> + preempt_enable();
>   return;
>   }
>
> -- 
> 2.13.3



Re: [PATCH 3/7] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-10-26 Thread Ram Pai
On Wed, Oct 25, 2017 at 11:18:49AM +0200, Michael Ellerman wrote:
> Ram Pai  writes:
> > On Mon, Oct 23, 2017 at 02:17:39PM +0530, Aneesh Kumar K.V wrote:
> >> Michael Ellerman  writes:
> >> > Ram Pai  writes:
> ...
> >> > Should be:
> >> >  /*
> >> >   * Initialize all hidx entries to invalid value, the first time
> >> >  * the PTE is about to allocate a 4K HPTE.
> >> >   */
> >> >
> >> >> +   if (!(old_pte & H_PAGE_COMBO))
> >> >> +   rpte.hidx = ~0x0UL;
> >> >> +
> >> >
> >> > Paul had the idea that if we biased the slot number by 1, we could make
> >> > the "invalid" value be == 0.
> >> >
> >> > That would avoid needing to that above, and also mean the value is
> >> > correctly invalid from the get-go, which would be good IMO.
> >> >
> >> > I think now that you've added the slot accessors it would be pretty easy
> >> > to do.
> >> 
> >> That would imply we lose one slot in the primary group, which means we
> >> will do extra work in some cases because our primary now has only 7
> >> slots. And in the case of pseries, the hypervisor will always return the
> >> lowest available slot, which implies we will do extra hcalls in case of
> >> an hpte insert to an empty group?
> >
> > No, that is not the idea.  The idea is that slot 'F' in the secondary
> > will continue to be an invalid slot, but will be represented as
> > offset-by-one in the PTE.  In other words, 0 will be represented as 1,
> > 1 as 2, and n as (n+1)%32.
> 
> Right.
> 
> > The idea seems feasible.  It has the advantage that 0 in the PTE
> > means an invalid slot. But it can be confusing to the casual code
> > reader. Will need to put in a big-huge comment to explain that.
> 
> This code is already confusing to *any* reader, so I don't think it's a
> worry. :)

I just got it coded and working, but I see no advantage to implementing
the shifted value. The hidx in the secondary part of the pte still needs
to be initialized to all-zeros, because it could contain the value of
the hidx corresponding to the 64k-backed hpte, which needs to be cleared.

I will send the patch anyway, but we should not apply it, since I see
no apparent gain.
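
For reference, a minimal standalone sketch of the offset-by-one encoding
being discussed.  The helper names and the 5-bit mask are assumptions for
illustration only; this is not code from either the original series or the
follow-up patch:

/*
 * Sketch of the offset-by-one hidx encoding: a stored value of 0 means
 * "no HPTE allocated yet", and a real slot n is stored as n + 1
 * (i.e. n is represented as (n + 1) % 32, per the discussion above).
 */
#define HIDX_INVALID   0UL

static inline unsigned long hidx_encode_slot(unsigned long slot)
{
        return (slot + 1) & 0x1f;       /* store n as n + 1 */
}

static inline int hidx_is_valid(unsigned long hidx)
{
        return hidx != HIDX_INVALID;    /* zero-filled PTE reads as "no slot" */
}

static inline unsigned long hidx_decode_slot(unsigned long hidx)
{
        return hidx - 1;                /* only meaningful when hidx_is_valid() */
}

With this encoding a zero-filled PTE already reads as invalid, which is the
property suggested earlier in the thread; the point above is that the
secondary hidx bits still have to be cleared explicitly when a previously
64k-backed PTE starts allocating 4K HPTEs, so the initialization cost does
not go away.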

RP

> 
> cheers

-- 
Ram Pai