Re: [RFC PATCH 1/4] libnvdimm/namespace: Make namespace size validation arch dependent

2019-10-30 Thread Aneesh Kumar K.V

On 10/29/19 11:00 AM, Dan Williams wrote:

On Mon, Oct 28, 2019 at 9:35 PM Aneesh Kumar K.V wrote:


On 10/29/19 4:38 AM, Dan Williams wrote:

On Mon, Oct 28, 2019 at 2:48 AM Aneesh Kumar K.V wrote:


The page size used to map the namespace is arch dependent. For example,
architectures like ppc64 use a 16MB page size for the direct mapping. If the
namespace size is not aligned to the mapping page size, we can observe a
kernel crash during namespace init and destroy.

This is due to the kernel doing a partial map/unmap of the resource range:

BUG: Unable to handle kernel data access at 0xc00100040600
Faulting instruction address: 0xc0090790
NIP [c0090790] arch_add_memory+0xc0/0x130
LR [c0090744] arch_add_memory+0x74/0x130
Call Trace:
   arch_add_memory+0x74/0x130 (unreliable)
   memremap_pages+0x74c/0xa30
   devm_memremap_pages+0x3c/0xa0
   pmem_attach_disk+0x188/0x770
   nvdimm_bus_probe+0xd8/0x470
   really_probe+0x148/0x570
   driver_probe_device+0x19c/0x1d0
   device_driver_attach+0xcc/0x100
   bind_store+0x134/0x1c0
   drv_attr_store+0x44/0x60
   sysfs_kf_write+0x74/0xc0
   kernfs_fop_write+0x1b4/0x290
   __vfs_write+0x3c/0x70
   vfs_write+0xd0/0x260
   ksys_write+0xdc/0x130
   system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V 
---
 arch/arm64/mm/flush.c     | 11 +++++++++++
 arch/powerpc/lib/pmem.c   | 21 +++++++++++++++++++--
 arch/x86/mm/pageattr.c    | 12 ++++++++++++
 include/linux/libnvdimm.h |  1 +
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index ac485163a4a7..90c54c600023 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -91,4 +91,15 @@ void arch_invalidate_pmem(void *addr, size_t size)
  __inval_dcache_area(addr, size);
   }
   EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
+
+unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size)
+{
+	u32 remainder;
+
+	div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder);
+	if (remainder)
+		return PAGE_SIZE * ndr_mappings;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(arch_validate_namespace_size);
   #endif
diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
index 377712e85605..2e661a08dae5 100644
--- a/arch/powerpc/lib/pmem.c
+++ b/arch/powerpc/lib/pmem.c
@@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size)
  unsigned long start = (unsigned long) addr;
  flush_dcache_range(start, start + size);
   }
-EXPORT_SYMBOL(arch_wb_cache_pmem);
+EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);

   void arch_invalidate_pmem(void *addr, size_t size)
   {
  unsigned long start = (unsigned long) addr;
  flush_dcache_range(start, start + size);
   }
-EXPORT_SYMBOL(arch_invalidate_pmem);
+EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
+
+unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size)
+{
+	u32 remainder;
+	unsigned long linear_map_size;
+
+	if (radix_enabled())
+		linear_map_size = PAGE_SIZE;
+	else
+		linear_map_size = (1UL << mmu_psize_defs[mmu_linear_psize].shift);


This seems more a "supported_alignments" problem, and less a namespace
size or PAGE_SIZE problem, because if the starting address is
misaligned this size validation can still succeed when it shouldn't.
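
(To make that concrete, a minimal sketch, not code from this thread: a
validation helper would have to cover the start address as well:)

	/* Sketch: both the resource start and size must be multiples of the
	 * linear-map page size, otherwise the first or last mapping page is
	 * only partially covered by the namespace.
	 */
	static bool region_fully_mappable(resource_size_t start,
					  resource_size_t size,
					  unsigned long map_size)
	{
		return IS_ALIGNED(start, map_size) && IS_ALIGNED(size, map_size);
	}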




Isn't supported_alignments an indication of how the user wants the namespace
to be mapped into applications? I.e., with the above restrictions we can
still do both 64K and 16M mappings of the namespace to userspace.


True, for the pfn device and the device-dax mapping size, but I'm
suggesting adding another instance of alignment control at the raw
namespace level. That would need to be disconnected from the
device-dax page mapping granularity.



Can you explain what you mean by the raw namespace level? We don't have
multiple values against which we need to check the alignment of the
namespace start and namespace size.


If you can outline how and where you would like to enforce that check I 
can start working on it.


-aneesh



[PATCH v3] cpufreq: powernv: fix stack bloat and hard limit on num cpus

2019-10-30 Thread John Hubbard
The following build warning occurred on powerpc 64-bit builds:

drivers/cpufreq/powernv-cpufreq.c: In function 'init_chip_info':
drivers/cpufreq/powernv-cpufreq.c:1070:1: warning: the frame size of
1040 bytes is larger than 1024 bytes [-Wframe-larger-than=]

This is with a cross-compiler based on gcc 8.1.0, which I got from:
  https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/

The warning is due to putting 1024 bytes on the stack:

unsigned int chip[256];

...and it's also undesirable to have a hard limit on the number of
CPUs here.

Fix both problems by dynamically allocating based on num_possible_cpus,
as recommended by Michael Ellerman.

Fixes: 053819e0bf840 ("cpufreq: powernv: Handle throttling due to Pmax capping at chip level")
Cc: Michael Ellerman 
Cc: Shilpasri G Bhat 
Cc: Preeti U Murthy 
Cc: Viresh Kumar 
Cc: Rafael J. Wysocki 
Cc: linux...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: John Hubbard 
---

Changes since v2: applied fixes from Michael Ellerman's review:

* Changed from CONFIG_NR_CPUS to num_possible_cpus()

* Fixed up commit description: added a note about exactly which
  compiler generates the warning. And softened up wording about
  the limitation on number of CPUs.

Changes since v1: includes Viresh's review commit fixes.

thanks,
John Hubbard
NVIDIA


 drivers/cpufreq/powernv-cpufreq.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 6061850e59c9..56f4bc0d209e 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -1041,9 +1041,14 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int init_chip_info(void)
 {
-   unsigned int chip[256];
+   unsigned int *chip;
unsigned int cpu, i;
unsigned int prev_chip_id = UINT_MAX;
+   int ret = 0;
+
+   chip = kcalloc(num_possible_cpus(), sizeof(*chip), GFP_KERNEL);
+   if (!chip)
+   return -ENOMEM;
 
for_each_possible_cpu(cpu) {
unsigned int id = cpu_to_chip_id(cpu);
@@ -1055,8 +1060,10 @@ static int init_chip_info(void)
}
 
chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
-   if (!chips)
-   return -ENOMEM;
+   if (!chips) {
+   ret = -ENOMEM;
+   goto free_and_return;
+   }
 
for (i = 0; i < nr_chips; i++) {
chips[i].id = chip[i];
@@ -1066,7 +1073,9 @@ static int init_chip_info(void)
		per_cpu(chip_info, cpu) = &chips[i];
}
 
-   return 0;
+free_and_return:
+   kfree(chip);
+   return ret;
 }
 
 static inline void clean_chip_info(void)
-- 
2.23.0



Re: [PATCH v2] cpufreq: powernv: fix stack bloat and NR_CPUS limitation

2019-10-30 Thread John Hubbard

On 10/30/19 7:39 PM, Michael Ellerman wrote:

Hi John,

Sorry I didn't reply to this sooner, too many patches :/

John Hubbard  writes:

The following build warning occurred on powerpc 64-bit builds:

drivers/cpufreq/powernv-cpufreq.c: In function 'init_chip_info':
drivers/cpufreq/powernv-cpufreq.c:1070:1: warning: the frame size of 1040 bytes 
is larger than 1024 bytes [-Wframe-larger-than=]


Oddly I don't see that warning in my builds, eg with GCC9:

   https://travis-ci.org/linuxppc/linux/jobs/604870722


This is with a cross-compiler based on gcc 8.1.0, which I got from:
  https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/

I'll put that in the v3 commit description.




This is due to putting 1024 bytes on the stack:

 unsigned int chip[256];

...and while looking at this, it also has a bug: it fails with a stack
overrun, if CONFIG_NR_CPUS > 256.


It _probably_ doesn't, because it only increments the index when the
chip_id of the CPU changes, ie. it doesn't create a chip for every CPU.
But I agree it's flaky the way it's written.


I'll soften up the wording accordingly.




Fix both problems by dynamically allocating based on CONFIG_NR_CPUS.


Shouldn't it use num_possible_cpus() ?

Given the for loop is over possible CPUs that seems like the upper
bound. In practice it should be lower because some CPUs will share a
chip.



OK, I see, that's more consistent with the code, I'll change to that.


thanks,
--
John Hubbard
NVIDIA





Fixes: 053819e0bf840 ("cpufreq: powernv: Handle throttling due to Pmax capping at chip level")
Cc: Shilpasri G Bhat 
Cc: Preeti U Murthy 
Cc: Viresh Kumar 
Cc: Rafael J. Wysocki 
Cc: linux...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: John Hubbard 
---

Changes since v1: includes Viresh's review commit fixes.

  drivers/cpufreq/powernv-cpufreq.c | 17 +++++++++++++----
  1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 6061850e59c9..5b2e968cb5ea 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -1041,9 +1041,14 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int init_chip_info(void)
 {
-	unsigned int chip[256];
+	unsigned int *chip;
 	unsigned int cpu, i;
 	unsigned int prev_chip_id = UINT_MAX;
+	int ret = 0;
+
+	chip = kcalloc(CONFIG_NR_CPUS, sizeof(*chip), GFP_KERNEL);
+	if (!chip)
+		return -ENOMEM;
 
 	for_each_possible_cpu(cpu) {
 		unsigned int id = cpu_to_chip_id(cpu);
@@ -1055,8 +1060,10 @@ static int init_chip_info(void)
 	}
 
 	chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
-	if (!chips)
-		return -ENOMEM;
+	if (!chips) {
+		ret = -ENOMEM;
+		goto free_and_return;
+	}
 
 	for (i = 0; i < nr_chips; i++) {
 		chips[i].id = chip[i];
@@ -1066,7 +1073,9 @@ static int init_chip_info(void)
 		per_cpu(chip_info, cpu) = &chips[i];
 	}
 
-	return 0;
+free_and_return:
+	kfree(chip);
+	return ret;
 }
 
 static inline void clean_chip_info(void)
-- 
2.23.0


[RFC PATCH v10 9/9] powerpc/ima: indicate kernel modules appended signatures are enforced

2019-10-30 Thread Mimi Zohar
The arch-specific kernel module policy rule requires kernel modules to be
signed, either with an IMA signature stored as an xattr or with an appended
signature.  As a result, appended signatures on kernel modules could be
enforced without "sig_enforce" being set or reflected in
/sys/module/module/parameters/sig_enforce.  This patch sets
"sig_enforce".

Signed-off-by: Mimi Zohar 
Cc: Jessica Yu 
---
 arch/powerpc/kernel/ima_arch.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
index b9de0fb45bb9..e34116255ced 100644
--- a/arch/powerpc/kernel/ima_arch.c
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -62,13 +62,17 @@ static const char *const secure_and_trusted_rules[] = {
  */
 const char *const *arch_get_ima_policy(void)
 {
-   if (is_ppc_secureboot_enabled())
+   if (is_ppc_secureboot_enabled()) {
+   if (IS_ENABLED(CONFIG_MODULE_SIG))
+   set_module_sig_enforced();
+
if (is_ppc_trustedboot_enabled())
return secure_and_trusted_rules;
else
return secure_rules;
-   else if (is_ppc_trustedboot_enabled())
+   } else if (is_ppc_trustedboot_enabled()) {
return trusted_rules;
+   }
 
return NULL;
 }
-- 
2.7.5



[PATCH v10 8/9] powerpc/ima: update ima arch policy to check for blacklist

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

This patch updates the arch-specific policies for PowerNV system to make
sure that the binary hash is not blacklisted.

Signed-off-by: Nayna Jain 
Cc: Jessica Yu 
Signed-off-by: Mimi Zohar 
---
 arch/powerpc/kernel/ima_arch.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
index 0ef5956c9753..b9de0fb45bb9 100644
--- a/arch/powerpc/kernel/ima_arch.c
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -23,9 +23,9 @@ bool arch_ima_get_secureboot(void)
  * is not enabled.
  */
 static const char *const secure_rules[] = {
-	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+	"appraise func=KEXEC_KERNEL_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #ifndef CONFIG_MODULE_SIG_FORCE
-	"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+	"appraise func=MODULE_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #endif
 #endif
NULL
 };
@@ -49,9 +49,9 @@ static const char *const trusted_rules[] = {
 static const char *const secure_and_trusted_rules[] = {
 	"measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
 	"measure func=MODULE_CHECK template=ima-modsig",
-	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+	"appraise func=KEXEC_KERNEL_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #ifndef CONFIG_MODULE_SIG_FORCE
-	"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+	"appraise func=MODULE_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #endif
 #endif
NULL
 };
-- 
2.7.5



[PATCH v10 7/9] ima: check against blacklisted hashes for files with modsig

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

Asymmetric private keys are used to sign multiple files.  The kernel
currently supports checking against blacklisted keys.  However, if the
public key is blacklisted, any file signed by the blacklisted key will
automatically fail signature verification.  Blacklisting the public key
is not fine-grained enough, as we might want to blacklist only a
particular file.

This patch adds support for checking against the blacklisted hash of
the file, without the appended signature, based on the IMA policy.  It
defines a new policy option "appraise_flag=check_blacklist".

In addition to the blacklisted binary hashes stored in the firmware "dbx"
variable, the Linux kernel may be configured to load blacklisted binary
hashes onto the .blacklist keyring as well.  The following example shows
how to blacklist a specific kernel module hash.

$ sha256sum kernel/kheaders.ko
77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3  kernel/kheaders.ko

$ grep BLACKLIST .config
CONFIG_SYSTEM_BLACKLIST_KEYRING=y
CONFIG_SYSTEM_BLACKLIST_HASH_LIST="blacklist-hash-list"

$ cat certs/blacklist-hash-list
"bin:77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3"

Update the IMA custom measurement and appraisal policy rules
(/etc/ima-policy):

measure func=MODULE_CHECK template=ima-modsig
appraise func=MODULE_CHECK appraise_flag=check_blacklist
appraise_type=imasig|modsig

After building, installing, and rebooting the kernel:

 545660333 ---lswrv  0 0   \_ blacklist:
bin:77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3

measure func=MODULE_CHECK template=ima-modsig
appraise func=MODULE_CHECK appraise_flag=check_blacklist
appraise_type=imasig|modsig

modprobe: ERROR: could not insert 'kheaders': Permission denied

10 0c9834db5a0182c1fb0cdc5d3adcf11a11fd83dd ima-sig sha256:3bc6ed4f0b4d6e31bc1dbc9ef844605abc7afdc6d81a57d77a1ec9407997c402 /usr/lib/modules/5.4.0-rc3+/kernel/kernel/kheaders.ko

10 82aad2bcc3fa8ed94762356b5c14838f3bcfa6a0 ima-modsig sha256:3bc6ed4f0b4d6e31bc1dbc9ef844605abc7afdc6d81a57d77a1ec9407997c402 /usr/lib/modules/5.4.0-rc3+/kernel/kernel/kheaders.ko sha256:77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3 3082029a06092a864886f70d010702a082028b30820287020101310d300b0609608648016503040201300b06092a864886f70d01070131820264

10 25b72217cc1152b44b134ce2cd68f12dfb71acb3 ima-buf sha256:8b58427fedcf8f4b20bc8dc007f2e232bf7285d7b93a66476321f9c2a3aa132b blacklisted-hash 77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3

Signed-off-by: Nayna Jain 
Cc: Jessica Yu 
Cc: David Howells 
[zo...@linux.ibm.com: updated patch description]
Signed-off-by: Mimi Zohar 
---
 Documentation/ABI/testing/ima_policy  |  4 
 security/integrity/ima/ima.h  |  8 
 security/integrity/ima/ima_appraise.c | 33 +
 security/integrity/ima/ima_main.c | 12 
 security/integrity/ima/ima_policy.c   | 12 ++--
 security/integrity/integrity.h|  1 +
 6 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy
index 29ebe9afdac4..29aaedf33246 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -25,6 +25,7 @@ Description:
lsm:[[subj_user=] [subj_role=] [subj_type=]
 [obj_user=] [obj_role=] [obj_type=]]
option: [[appraise_type=]] [template=] [permit_directio]
+   [appraise_flag=]
base:   func:= 
[BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
[FIRMWARE_CHECK]
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
@@ -38,6 +39,9 @@ Description:
fowner:= decimal value
lsm:are LSM specific
option: appraise_type:= [imasig] [imasig|modsig]
+   appraise_flag:= [check_blacklist]
+			Currently, blacklist check is only for files signed with appended
+			signature.
template:= name of a defined IMA template type
(eg, ima-ng). Only valid when action is "measure".
pcr:= decimal value
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index a65772ffa427..df4ca482fb53 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -256,6 +256,8 @@ int ima_policy_show(struct seq_file *m, void *v);
 #define IMA_APPRAISE_KEXEC 0x40
 
 #ifdef CONFIG_IMA_APPRAISE
+int ima_check_blacklist(struct integrity_iint_cache *iint,
+   const struct modsig *modsig, int pcr);
 int ima_appraise_measurement(enum ima_hooks func,
 struct integrity_iint_cache *iint,
 struct file *file, const unsigned 

[PATCH v10 6/9] certs: add wrapper function to check blacklisted binary hash

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

The -EKEYREJECTED error returned by the existing is_hash_blacklisted() is
misleading when checking against the blacklisted hash of a binary.

This patch adds a wrapper function is_binary_blacklisted() that returns
-EPERM if the binary is blacklisted.

Signed-off-by: Nayna Jain 
Cc: David Howells 
Reviewed-by: Mimi Zohar 
---
 certs/blacklist.c             | 9 +++++++++
 include/keys/system_keyring.h | 6 ++++++
 2 files changed, 15 insertions(+)

diff --git a/certs/blacklist.c b/certs/blacklist.c
index ec00bf337eb6..6514f9ebc943 100644
--- a/certs/blacklist.c
+++ b/certs/blacklist.c
@@ -135,6 +135,15 @@ int is_hash_blacklisted(const u8 *hash, size_t hash_len, const char *type)
 }
 EXPORT_SYMBOL_GPL(is_hash_blacklisted);
 
+int is_binary_blacklisted(const u8 *hash, size_t hash_len)
+{
+   if (is_hash_blacklisted(hash, hash_len, "bin") == -EKEYREJECTED)
+   return -EPERM;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(is_binary_blacklisted);
+
 /*
  * Initialise the blacklist
  */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index c1a96fdf598b..fb8b07daa9d1 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -35,12 +35,18 @@ extern int restrict_link_by_builtin_and_secondary_trusted(
 extern int mark_hash_blacklisted(const char *hash);
 extern int is_hash_blacklisted(const u8 *hash, size_t hash_len,
   const char *type);
+extern int is_binary_blacklisted(const u8 *hash, size_t hash_len);
 #else
 static inline int is_hash_blacklisted(const u8 *hash, size_t hash_len,
  const char *type)
 {
return 0;
 }
+
+static inline int is_binary_blacklisted(const u8 *hash, size_t hash_len)
+{
+   return 0;
+}
 #endif
 
 #ifdef CONFIG_IMA_BLACKLIST_KEYRING
-- 
2.7.5



[PATCH v10 5/9] ima: make process_buffer_measurement() generic

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

process_buffer_measurement() is limited to measuring the kexec boot
command line. This patch makes process_buffer_measurement() more
generic, allowing it to measure other types of buffer data (e.g.
blacklisted binary hashes or key hashes).

process_buffer_measurement() may be called directly from an IMA
hook or as an auxiliary measurement record. In both cases the buffer
measurement is based on policy. This patch modifies the function to
conditionally retrieve the policy defined PCR and template for the IMA
hook case.
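
(As a rough sketch of the intended use, assuming the wiring added later in
this series by patches 6/9 and 7/9; the helper name here is hypothetical:)

	/* Hypothetical caller: func == NONE (0) makes
	 * process_buffer_measurement() skip the policy lookup and extend the
	 * caller-supplied pcr with the "ima-buf" template.
	 */
	static int check_and_measure_blacklist(const u8 *digest, u32 digestsize,
					       int pcr)
	{
		int rc = is_binary_blacklisted(digest, digestsize);

		if (rc == -EPERM)	/* blacklisted: record the hash itself */
			process_buffer_measurement(digest, digestsize,
						   "blacklisted-hash", NONE, pcr);
		return rc;
	}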

Signed-off-by: Nayna Jain 
[zo...@linux.ibm.com: added comment in process_buffer_measurement()]
Signed-off-by: Mimi Zohar 
---
 security/integrity/ima/ima.h  |  3 ++
 security/integrity/ima/ima_main.c | 58 +++
 2 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 3689081aaf38..a65772ffa427 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -217,6 +217,9 @@ void ima_store_measurement(struct integrity_iint_cache 
*iint, struct file *file,
   struct evm_ima_xattr_data *xattr_value,
   int xattr_len, const struct modsig *modsig, int pcr,
   struct ima_template_desc *template_desc);
+void process_buffer_measurement(const void *buf, int size,
+   const char *eventname, enum ima_hooks func,
+   int pcr);
 void ima_audit_measurement(struct integrity_iint_cache *iint,
   const unsigned char *filename);
 int ima_alloc_init_template(struct ima_event_data *event_data,
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 60027c643ecd..a26e3ad4e886 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -626,14 +626,14 @@ int ima_load_data(enum kernel_load_data_id id)
  * @buf: pointer to the buffer that needs to be added to the log.
  * @size: size of buffer(in bytes).
  * @eventname: event name to be used for the buffer entry.
- * @cred: a pointer to a credentials structure for user validation.
- * @secid: the secid of the task to be validated.
+ * @func: IMA hook
+ * @pcr: pcr to extend the measurement
  *
  * Based on policy, the buffer is measured into the ima log.
  */
-static void process_buffer_measurement(const void *buf, int size,
-  const char *eventname,
-  const struct cred *cred, u32 secid)
+void process_buffer_measurement(const void *buf, int size,
+   const char *eventname, enum ima_hooks func,
+   int pcr)
 {
int ret = 0;
struct ima_template_entry *entry = NULL;
@@ -642,19 +642,45 @@ static void process_buffer_measurement(const void *buf, int size,
.filename = eventname,
.buf = buf,
.buf_len = size};
-   struct ima_template_desc *template_desc = NULL;
+   struct ima_template_desc *template = NULL;
struct {
struct ima_digest_data hdr;
char digest[IMA_MAX_DIGEST_SIZE];
} hash = {};
int violation = 0;
-   int pcr = CONFIG_IMA_MEASURE_PCR_IDX;
int action = 0;
+   u32 secid;
 
-	action = ima_get_action(NULL, cred, secid, 0, KEXEC_CMDLINE, &pcr,
-				&template_desc);
-	if (!(action & IMA_MEASURE))
-		return;
+	/*
+	 * Both LSM hooks and auxiliary based buffer measurements are
+	 * based on policy.  To avoid code duplication, differentiate
+	 * between the LSM hooks and auxiliary buffer measurements,
+	 * retrieving the policy rule information only for the LSM hook
+	 * buffer measurements.
+	 */
+	if (func) {
+		security_task_getsecid(current, &secid);
+		action = ima_get_action(NULL, current_cred(), secid, 0, func,
+					&pcr, &template);
+		if (!(action & IMA_MEASURE))
+			return;
+	}
+
+   if (!pcr)
+   pcr = CONFIG_IMA_MEASURE_PCR_IDX;
+
+   if (!template) {
+   template = lookup_template_desc("ima-buf");
+   ret = template_desc_init_fields(template->fmt,
+   &(template->fields),
+   &(template->num_fields));
+   if (ret < 0) {
+   pr_err("template %s init failed, result: %d\n",
+  (strlen(template->name) ?
+   template->name : template->fmt), ret);
+   return;
+   }
+   }
 
	iint.ima_hash = &hash.hdr;
iint.ima_hash->algo = ima_hash_algo;

[PATCH v10 4/9] powerpc/ima: define trusted boot policy

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

This patch defines an arch-specific trusted boot only policy and a
combined secure and trusted boot policy.

Signed-off-by: Nayna Jain 
Signed-off-by: Mimi Zohar 
---
 arch/powerpc/kernel/ima_arch.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
index d88913dc0da7..0ef5956c9753 100644
--- a/arch/powerpc/kernel/ima_arch.c
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -31,13 +31,44 @@ static const char *const secure_rules[] = {
 };
 
 /*
+ * The "trusted_rules" are enabled only on "trustedboot" enabled systems.
+ * These rules add the kexec kernel image and kernel modules file hashes to
+ * the IMA measurement list.
+ */
+static const char *const trusted_rules[] = {
+   "measure func=KEXEC_KERNEL_CHECK",
+   "measure func=MODULE_CHECK",
+   NULL
+};
+
+/*
+ * The "secure_and_trusted_rules" contains rules for both the secure boot and
+ * trusted boot. The "template=ima-modsig" option includes the appended
+ * signature, when available, in the IMA measurement list.
+ */
+static const char *const secure_and_trusted_rules[] = {
+   "measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
+   "measure func=MODULE_CHECK template=ima-modsig",
+   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#ifndef CONFIG_MODULE_SIG_FORCE
+   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+   NULL
+};
+
+/*
  * Returns the relevant IMA arch-specific policies based on the system secure
  * boot state.
  */
 const char *const *arch_get_ima_policy(void)
 {
if (is_ppc_secureboot_enabled())
-   return secure_rules;
+   if (is_ppc_trustedboot_enabled())
+   return secure_and_trusted_rules;
+   else
+   return secure_rules;
+   else if (is_ppc_trustedboot_enabled())
+   return trusted_rules;
 
return NULL;
 }
-- 
2.7.5



[PATCH v10 1/9] powerpc: detect the secure boot mode of the system

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

This patch defines a function to detect the secure boot state of a
PowerNV system.

The PPC_SECURE_BOOT config represents the base enablement of secure boot
for powerpc.

Signed-off-by: Nayna Jain 
---
 arch/powerpc/Kconfig                   | 10 ++++++++++
 arch/powerpc/include/asm/secure_boot.h | 23 +++++++++++++++++++++++
 arch/powerpc/kernel/Makefile           |  2 ++
 arch/powerpc/kernel/secure_boot.c      | 32 ++++++++++++++++++++++++++++++++
 4 files changed, 67 insertions(+)
 create mode 100644 arch/powerpc/include/asm/secure_boot.h
 create mode 100644 arch/powerpc/kernel/secure_boot.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16e..56ea0019b616 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -934,6 +934,16 @@ config PPC_MEM_KEYS
 
  If unsure, say y.
 
+config PPC_SECURE_BOOT
+   prompt "Enable secure boot support"
+   bool
+   depends on PPC_POWERNV
+   help
+ Systems with firmware secure boot enabled need to define security
+ policies to extend secure boot to the OS. This config allows a user
+ to enable OS secure boot on systems that have firmware support for
+ it. If in doubt say N.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/secure_boot.h b/arch/powerpc/include/asm/secure_boot.h
new file mode 100644
index ..07d0fe0ca81f
--- /dev/null
+++ b/arch/powerpc/include/asm/secure_boot.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Secure boot definitions
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ */
+#ifndef _ASM_POWER_SECURE_BOOT_H
+#define _ASM_POWER_SECURE_BOOT_H
+
+#ifdef CONFIG_PPC_SECURE_BOOT
+
+bool is_ppc_secureboot_enabled(void);
+
+#else
+
+static inline bool is_ppc_secureboot_enabled(void)
+{
+   return false;
+}
+
+#endif
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index a7ca8fe62368..e2a54fa240ac 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -161,6 +161,8 @@ ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),)
 obj-y  += ucall.o
 endif
 
+obj-$(CONFIG_PPC_SECURE_BOOT)  += secure_boot.o
+
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
 KCOV_INSTRUMENT_prom_init.o := n
diff --git a/arch/powerpc/kernel/secure_boot.c b/arch/powerpc/kernel/secure_boot.c
new file mode 100644
index ..63dc82c50862
--- /dev/null
+++ b/arch/powerpc/kernel/secure_boot.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ */
+#include <linux/types.h>
+#include <linux/of.h>
+#include <asm/secure_boot.h>
+
+bool is_ppc_secureboot_enabled(void)
+{
+   struct device_node *node;
+   bool enabled = false;
+
+	node = of_find_compatible_node(NULL, NULL, "ibm,secvar-v1");
+	if (!of_device_is_available(node)) {
+		pr_err("Cannot find secure variable node in device tree; failing to secure state\n");
+		goto out;
+	}
+
+   /*
+* secureboot is enabled if os-secure-enforcing property exists,
+* else disabled.
+*/
+   enabled = of_property_read_bool(node, "os-secure-enforcing");
+
+out:
+   of_node_put(node);
+
+   pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
+   return enabled;
+}
-- 
2.7.5



[PATCH v10 3/9] powerpc: detect the trusted boot state of the system

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

While secure boot permits only properly verified signed kernels to be
booted, trusted boot calculates the file hash of the kernel image and
stores the measurement prior to boot, which can subsequently be compared
against known good values via attestation services.

This patch reads the trusted boot state of a PowerNV system. The state
is used to conditionally enable additional measurement rules in the IMA
arch-specific policies.

Signed-off-by: Nayna Jain 
---
 arch/powerpc/include/asm/secure_boot.h |  6 ++++++
 arch/powerpc/kernel/secure_boot.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/secure_boot.h b/arch/powerpc/include/asm/secure_boot.h
index 07d0fe0ca81f..a2ff556916c6 100644
--- a/arch/powerpc/include/asm/secure_boot.h
+++ b/arch/powerpc/include/asm/secure_boot.h
@@ -11,6 +11,7 @@
 #ifdef CONFIG_PPC_SECURE_BOOT
 
 bool is_ppc_secureboot_enabled(void);
+bool is_ppc_trustedboot_enabled(void);
 
 #else
 
@@ -19,5 +20,10 @@ static inline bool is_ppc_secureboot_enabled(void)
return false;
 }
 
+static inline bool is_ppc_trustedboot_enabled(void)
+{
+   return false;
+}
+
 #endif
 #endif
diff --git a/arch/powerpc/kernel/secure_boot.c b/arch/powerpc/kernel/secure_boot.c
index 63dc82c50862..a6a5f17ede03 100644
--- a/arch/powerpc/kernel/secure_boot.c
+++ b/arch/powerpc/kernel/secure_boot.c
@@ -7,6 +7,17 @@
 #include <linux/of.h>
 #include <asm/secure_boot.h>
 
+static struct device_node *get_ppc_fw_sb_node(void)
+{
+   static const struct of_device_id ids[] = {
+   { .compatible = "ibm,secureboot-v1", },
+   { .compatible = "ibm,secureboot-v2", },
+   {},
+   };
+
+   return of_find_matching_node(NULL, ids);
+}
+
 bool is_ppc_secureboot_enabled(void)
 {
struct device_node *node;
@@ -30,3 +41,18 @@ bool is_ppc_secureboot_enabled(void)
pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
return enabled;
 }
+
+bool is_ppc_trustedboot_enabled(void)
+{
+   struct device_node *node;
+   bool enabled = false;
+
+   node = get_ppc_fw_sb_node();
+   enabled = of_property_read_bool(node, "trusted-enabled");
+
+   of_node_put(node);
+
+   pr_info("Trusted boot mode %s\n", enabled ? "enabled" : "disabled");
+
+   return enabled;
+}
-- 
2.7.5



[PATCH v10 2/9] powerpc/ima: add support to initialize ima policy rules

2019-10-30 Thread Mimi Zohar
From: Nayna Jain 

PowerNV systems use a Linux-based bootloader, which relies on the IMA
subsystem to enforce different secure boot modes.  Since the verification
policy may differ based on the secure boot mode of the system, the
policies must be defined at runtime.

This patch implements arch-specific support to define IMA policy
rules based on the runtime secure boot mode of the system.

This patch provides arch-specific IMA policies if PPC_SECURE_BOOT
config is enabled.

Signed-off-by: Nayna Jain 
Signed-off-by: Mimi Zohar 
---
 arch/powerpc/Kconfig           |  1 +
 arch/powerpc/kernel/Makefile   |  2 +-
 arch/powerpc/kernel/ima_arch.c | 43 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/ima.h            |  3 ++-
 4 files changed, 47 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kernel/ima_arch.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 56ea0019b616..c795039bdc73 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -938,6 +938,7 @@ config PPC_SECURE_BOOT
prompt "Enable secure boot support"
bool
depends on PPC_POWERNV
+   depends on IMA_ARCH_POLICY
help
  Systems with firmware secure boot enabled need to define security
  policies to extend secure boot to the OS. This config allows a user
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index e2a54fa240ac..e8eb2955b7d5 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -161,7 +161,7 @@ ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),)
 obj-y  += ucall.o
 endif
 
-obj-$(CONFIG_PPC_SECURE_BOOT)  += secure_boot.o
+obj-$(CONFIG_PPC_SECURE_BOOT)  += secure_boot.o ima_arch.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
new file mode 100644
index ..d88913dc0da7
--- /dev/null
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ */
+
+#include <linux/ima.h>
+#include <asm/secure_boot.h>
+
+bool arch_ima_get_secureboot(void)
+{
+   return is_ppc_secureboot_enabled();
+}
+
+/*
+ * The "secure_rules" are enabled only on "secureboot" enabled systems.
+ * These rules verify the file signatures against known good values.
+ * The "appraise_type=imasig|modsig" option allows the known good signature
+ * to be stored as an xattr or as an appended signature.
+ *
+ * To avoid duplicate signature verification as much as possible, the IMA
+ * policy rule for module appraisal is added only if CONFIG_MODULE_SIG_FORCE
+ * is not enabled.
+ */
+static const char *const secure_rules[] = {
+   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#ifndef CONFIG_MODULE_SIG_FORCE
+   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+   NULL
+};
+
+/*
+ * Returns the relevant IMA arch-specific policies based on the system secure
+ * boot state.
+ */
+const char *const *arch_get_ima_policy(void)
+{
+   if (is_ppc_secureboot_enabled())
+   return secure_rules;
+
+   return NULL;
+}
diff --git a/include/linux/ima.h b/include/linux/ima.h
index 1c37f17f7203..6d904754d858 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -29,7 +29,8 @@ extern void ima_kexec_cmdline(const void *buf, int size);
 extern void ima_add_kexec_buffer(struct kimage *image);
 #endif
 
-#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390)
+#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \
+   || defined(CONFIG_PPC_SECURE_BOOT)
 extern bool arch_ima_get_secureboot(void);
 extern const char * const *arch_get_ima_policy(void);
 #else
-- 
2.7.5



[PATCH v10 0/9] powerpc: Enabling IMA arch specific secure boot policies

2019-10-30 Thread Mimi Zohar
This patchset extends the previous version[1] by adding support for
checking against a blacklist of binary hashes.

The IMA subsystem supports custom, built-in, arch-specific policies to
define the files to be measured and appraised. These policies are honored
based on priority, where arch-specific policy is the highest and custom
is the lowest.

PowerNV systems use a Linux-based bootloader to kexec the OS. The
bootloader kernel relies on IMA for signature verification of the OS
kernel before doing the kexec. This patchset adds support for powerpc
arch-specific IMA policies that are conditionally defined based on a
system's secure boot and trusted boot states. The OS secure boot and
trusted boot states are determined via device-tree properties.

The verification needs to be performed only for binaries that are not
blacklisted. The kernel currently only checks against the blacklist of
keys. However, doing so results in blacklisting all the binaries that
are signed by the same key. In order to prevent just one particular
binary from being loaded, it must be checked against a blacklist of
binary hashes. This patchset also adds support to IMA for checking
against a hash blacklist for files signed with an appended signature.

[1] http://patchwork.ozlabs.org/cover/1149262/ 

Changelog:

v10: (Mimi posting patch set on Nayna's behalf)
- Minor patch description changes
- Include comment in process_buffer_measurement()
- Additional patch: Enforcing kernel module appended signatures should
be reflected in "/sys/module/module/parameters/sig_enforce".
- Trimmed Cc list.

v9:
* Includes feedbacks from Michael
  * fix the missing of_node_put()
* Includes Mimi's feedbacks
  * fix the policy show() function to display check_blacklist
  * fix the other comment related and patch description
  * add the example of blacklist in the Patch 7/8
Note: Patch 7/8 is giving errors when checkpatch.pl is run because
of the format of showing measurement record as part of the example. I am
not very sure if that can be fixed as we need to represent the
measurements as is.

v8:
* Updates the Patch Description as per Michael's and Mimi's feedback
* Includes feedbacks from Michael for the device tree and policies
  * removes the arch-policy hack by defining three arrays.
  * fixes related to device-tree calls 
  * other code specific feedbacks
* Includes feedbacks from Mimi on the blacklist
  * generic blacklist function is modified than previous version
  * other coding fixes

v7:
* Removes patch related to dt-bindings as per input from Rob Herring. 
* fixes Patch 1/8 to use new device-tree updates as per Oliver
  feedback to device-tree documentation in skiboot mailing list.
(https://lists.ozlabs.org/pipermail/skiboot/2019-September/015329.html)
* Includes feedbacks from Mimi, Thiago
  * moves function get_powerpc_fw_sb_node() from Patch 1 to Patch 3 
  * fixes Patch 2/8 to use CONFIG_MODULE_SIG_FORCE.
  * updates Patch description in Patch 5/8
  * adds a new patch to add wrapper is_binary_blacklisted()
  * removes the patch that deprecated permit_directio

v6:
* includes feedbacks from Michael Ellerman on the patchset v5
  * removed email ids from comments
  * add the doc for the device-tree
  * renames the secboot.c to secure_boot.c and secboot.h to secure_boot.h
  * other code specific fixes
* split the patches to differentiate between secureboot and trustedboot
state of the system
* adds the patches to support the blacklisting of the binary hash.

v5:
* secureboot state is now read via device tree entry rather than OPAL
secure variables
* ima arch policies are updated to use policy based template for
measurement rules

v4:
* Fixed the build issue as reported by Satheesh Rajendran.

v3:
* OPAL APIs in Patch 1 are updated to provide generic interface based on
key/keylen. This patchset updates kernel OPAL APIs to be compatible with
generic interface.
* Patch 2 is cleaned up to use new OPAL APIs.
* Since OPAL can support different types of backend which can vary in the
variable interpretation, the Patch 2 is updated to add a check for the
backend version
* OPAL API now expects consumer to first check the supported backend version
before calling other secvar OPAL APIs. This check is now added in patch 2.
* IMA policies in Patch 3 is updated to specify appended signature and
per policy template.
* The patches now are free of any EFIisms.

v2:

* Removed Patch 1: powerpc/include: Override unneeded early ioremap
functions
* Updated Subject line and patch description of the Patch 1 of this series
* Removed dependency of OPAL_SECVAR on EFI, CPU_BIG_ENDIAN and UCS2_STRING
* Changed OPAL APIs from static to non-static. Added opal-secvar.h for the
same
* Removed EFI hooks from opal_secvar.c
* Removed opal_secvar_get_next(), opal_secvar_enqueue() and
opal_query_variable_info() function
* get_powerpc_sb_mode() in secboot.c now directly calls OPAL Runtime API
rather than via EFI hooks.
* Fixed log messages in get_powerpc_sb_mode() function.
* Added 

Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-30 Thread Yunsheng Lin
On 2019/10/30 18:14, Peter Zijlstra wrote:
> On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote:
>> When passing the return value of dev_to_node() to cpumask_of_node()
>> without checking if the device's node id is NUMA_NO_NODE, there is
>> global-out-of-bounds detected by KASAN.
>>
>> From the discussion [1], NUMA_NO_NODE really means no node affinity,
>> which also means all cpus should be usable. So the cpumask_of_node()
>> should always return all cpus online when user passes the node id as
>> NUMA_NO_NODE, just like similar semantic that page allocator handles
>> NUMA_NO_NODE.
>>
>> But we cannot really copy the page allocator logic. Simply because the
>> page allocator doesn't enforce the near node affinity. It just picks it
>> up as a preferred node but then it is free to fallback to any other numa
>> node. This is not the case here and node_to_cpumask_map will only restrict
>> to the particular node's cpus which would have really non deterministic
>> behavior depending on where the code is executed. So in fact we really
>> want to return cpu_online_mask for NUMA_NO_NODE.
>>
>> Also there is a debugging version of node_to_cpumask_map() for x86 and
>> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this
>> patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map().
>>
>> [1] https://lkml.org/lkml/2019/9/11/66
>> Signed-off-by: Yunsheng Lin 
>> Suggested-by: Michal Hocko 
>> Acked-by: Michal Hocko 
>> Acked-by: Paul Burton  # MIPS bits
> 
> Still:
> 
> Nacked-by: Peter Zijlstra (Intel) 

It seems I still misunderstood your meaning by "We must not silently accept
NO_NODE there" in [1].

I am not sure if there is still disagreement that the NO_NODE state for
dev->numa_node should exist at all.

From the previous discussion [2], you seem to propose doing a "wild guess" or
"fixup" for all devices (including virtual and physical) with NO_NODE, which
means NO_NODE is no longer needed and should be removed once the "wild guess"
or "fixup" is done. So maybe the reason for your nack here is that there
should be no other NO_NODE handling or fixing before that "wild guess" or
"fixup" process is finished, making node_to_cpumask_map() NUMA_NO_NODE
awareness unnecessary.

Or is your reason for the nack still specific to PCIe devices without a NUMA
node, i.e. the "wild guess" needs to be done for that case before making
node_to_cpumask_map() NUMA_NO_NODE aware?

Please help clarify the reason for the nack. Or is there some other reason
for the nack that I missed from the previous discussion?

Thanks.

[1] 
https://lore.kernel.org/lkml/2019101539.gx2...@hirez.programming.kicks-ass.net/
[2] 
https://lore.kernel.org/lkml/20191014094912.gy2...@hirez.programming.kicks-ass.net/
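
(For reference, a simplified sketch of the behaviour argued for above, based
on the commit message rather than the exact v7 diff:)

	/* NUMA_NO_NODE means no node affinity, so every online CPU is usable. */
	static inline const struct cpumask *cpumask_of_node(int node)
	{
		if (node == NUMA_NO_NODE)
			return cpu_online_mask;
		return node_to_cpumask_map[node];
	}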



Re: [PATCH v2] cpufreq: powernv: fix stack bloat and NR_CPUS limitation

2019-10-30 Thread Michael Ellerman
Hi John,

Sorry I didn't reply to this sooner, too many patches :/

John Hubbard  writes:
> The following build warning occurred on powerpc 64-bit builds:
>
> drivers/cpufreq/powernv-cpufreq.c: In function 'init_chip_info':
> drivers/cpufreq/powernv-cpufreq.c:1070:1: warning: the frame size of 1040 
> bytes is larger than 1024 bytes [-Wframe-larger-than=]

Oddly I don't see that warning in my builds, eg with GCC9:

  https://travis-ci.org/linuxppc/linux/jobs/604870722

> This is due to putting 1024 bytes on the stack:
>
> unsigned int chip[256];
>
> ...and while looking at this, it also has a bug: it fails with a stack
> overrun, if CONFIG_NR_CPUS > 256.

It _probably_ doesn't, because it only increments the index when the
chip_id of the CPU changes, ie. it doesn't create a chip for every CPU.
But I agree it's flaky the way it's written.
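
A condensed sketch of the loop in question (the standalone wrapper is
hypothetical; the counting logic is the driver's):

	static int count_chips(void)
	{
		unsigned int cpu, prev_chip_id = UINT_MAX;
		int nr_chips = 0;

		for_each_possible_cpu(cpu) {
			unsigned int id = cpu_to_chip_id(cpu);

			/* Grows per chip, not per CPU -- but only if CPUs on
			 * the same chip are numbered contiguously, which is
			 * the flakiness being discussed.
			 */
			if (prev_chip_id != id) {
				prev_chip_id = id;
				nr_chips++;
			}
		}
		return nr_chips;
	}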

> Fix both problems by dynamically allocating based on CONFIG_NR_CPUS.

Shouldn't it use num_possible_cpus() ?

Given the for loop is over possible CPUs that seems like the upper
bound. In practice it should be lower because some CPUs will share a
chip.

cheers


> Fixes: 053819e0bf840 ("cpufreq: powernv: Handle throttling due to Pmax 
> capping at chip level")
> Cc: Shilpasri G Bhat 
> Cc: Preeti U Murthy 
> Cc: Viresh Kumar 
> Cc: Rafael J. Wysocki 
> Cc: linux...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: John Hubbard 
> ---
>
> Changes since v1: includes Viresh's review commit fixes.
>
>  drivers/cpufreq/powernv-cpufreq.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 6061850e59c9..5b2e968cb5ea 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -1041,9 +1041,14 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>  
>  static int init_chip_info(void)
>  {
> - unsigned int chip[256];
> + unsigned int *chip;
>   unsigned int cpu, i;
>   unsigned int prev_chip_id = UINT_MAX;
> + int ret = 0;
> +
> + chip = kcalloc(CONFIG_NR_CPUS, sizeof(*chip), GFP_KERNEL);
> + if (!chip)
> + return -ENOMEM;
>  
>   for_each_possible_cpu(cpu) {
>   unsigned int id = cpu_to_chip_id(cpu);
> @@ -1055,8 +1060,10 @@ static int init_chip_info(void)
>   }
>  
>   chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
> - if (!chips)
> - return -ENOMEM;
> + if (!chips) {
> + ret = -ENOMEM;
> + goto free_and_return;
> + }
>  
>   for (i = 0; i < nr_chips; i++) {
>   chips[i].id = chip[i];
> @@ -1066,7 +1073,9 @@ static int init_chip_info(void)
> 	per_cpu(chip_info, cpu) = &chips[i];
>   }
>  
> - return 0;
> +free_and_return:
> + kfree(chip);
> + return ret;
>  }
>  
>  static inline void clean_chip_info(void)
> -- 
> 2.23.0


Re: Pull request: scottwood/linux.git next

2019-10-30 Thread Jason Yan
Hi Michael,

Can you pull this into linux-next so that we can test it on linux-next for
some time?


Thanks,
Jason


On 2019/10/23 7:21, Scott Wood wrote:

This contains KASLR support for book3e 32-bit.

The following changes since commit 612ee81b9461475b5a5612c2e8d71559dd3c7920:

   powerpc/papr_scm: Fix an off-by-one check in papr_scm_meta_{get, set} (2019-10-10 20:15:53 +1100)

are available in the Git repository at:

   git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next

for you to fetch changes up to 9df1ef3f1376ec5d3a1b51a4546c94279bcd88ca:

   powerpc/fsl_booke/32: Document KASLR implementation (2019-10-21 16:09:16 -0500)


Jason Yan (12):
   powerpc: unify definition of M_IF_NEEDED
   powerpc: move memstart_addr and kernstart_addr to init-common.c
   powerpc: introduce kernstart_virt_addr to store the kernel base
   powerpc/fsl_booke/32: introduce create_kaslr_tlb_entry() helper
   powerpc/fsl_booke/32: introduce reloc_kernel_entry() helper
   powerpc/fsl_booke/32: implement KASLR infrastructure
   powerpc/fsl_booke/32: randomize the kernel image offset
   powerpc/fsl_booke/kaslr: clear the original kernel if randomized
   powerpc/fsl_booke/kaslr: support nokaslr cmdline parameter
   powerpc/fsl_booke/kaslr: dump out kernel offset information on panic
   powerpc/fsl_booke/kaslr: export offset in VMCOREINFO ELF notes
   powerpc/fsl_booke/32: Document KASLR implementation

  Documentation/powerpc/kaslr-booke32.rst   |  42 +++
  arch/powerpc/Kconfig  |  11 +
  arch/powerpc/include/asm/nohash/mmu-book3e.h  |  11 +-
  arch/powerpc/include/asm/page.h   |   7 +
  arch/powerpc/kernel/early_32.c|   5 +-
  arch/powerpc/kernel/exceptions-64e.S  |  12 +-
  arch/powerpc/kernel/fsl_booke_entry_mapping.S |  25 +-
  arch/powerpc/kernel/head_fsl_booke.S  |  61 +++-
  arch/powerpc/kernel/machine_kexec.c   |   1 +
  arch/powerpc/kernel/misc_64.S |   7 +-
  arch/powerpc/kernel/setup-common.c|  20 ++
  arch/powerpc/mm/init-common.c |   7 +
  arch/powerpc/mm/init_32.c |   5 -
  arch/powerpc/mm/init_64.c |   5 -
  arch/powerpc/mm/mmu_decl.h|  11 +
  arch/powerpc/mm/nohash/Makefile   |   1 +
  arch/powerpc/mm/nohash/fsl_booke.c|   8 +-
  arch/powerpc/mm/nohash/kaslr_booke.c  | 401 ++
  18 files changed, 587 insertions(+), 53 deletions(-)
  create mode 100644 Documentation/powerpc/kaslr-booke32.rst
  create mode 100644 arch/powerpc/mm/nohash/kaslr_booke.c






Re: [PATCH 1/3] arch: ipcbuf.h: make uapi asm/ipcbuf.h self-contained

2019-10-30 Thread Andrew Morton
On Thu, 31 Oct 2019 10:33:00 +0900 Masahiro Yamada wrote:

> Hi Andrew,
> 
> I think this patch has already been picked up to your tree,
> but I noticed a typo in the commit message just now.
> Please see below.
> 
> ...
>
> > Include  to make it self-contained, and add it to
> 
> Include  to make ...
> 
> Could you please fix it up locally?

No probs, done.


Re: [PATCH 1/3] arch: ipcbuf.h: make uapi asm/ipcbuf.h self-contained

2019-10-30 Thread Masahiro Yamada
Hi Andrew,

I think this patch has already been picked up to your tree,
but I noticed a typo in the commit message just now.
Please see below.

On Wed, Oct 30, 2019 at 3:40 PM Masahiro Yamada wrote:
>
> The user-space cannot compile <asm/ipcbuf.h> due to some missing type
> definitions. For example, building it for x86 fails as follows:
>
>   CC  usr/include/asm/ipcbuf.h.s
> In file included from ./usr/include/asm/ipcbuf.h:1:0,
>  from :32:
> ./usr/include/asm-generic/ipcbuf.h:21:2: error: unknown type name 
> ‘__kernel_key_t’
>   __kernel_key_t  key;
>   ^~
> ./usr/include/asm-generic/ipcbuf.h:22:2: error: unknown type name 
> ‘__kernel_uid32_t’
>   __kernel_uid32_t uid;
>   ^~~~
> ./usr/include/asm-generic/ipcbuf.h:23:2: error: unknown type name 
> ‘__kernel_gid32_t’
>   __kernel_gid32_t gid;
>   ^~~~
> ./usr/include/asm-generic/ipcbuf.h:24:2: error: unknown type name 
> ‘__kernel_uid32_t’
>   __kernel_uid32_t cuid;
>   ^~~~
> ./usr/include/asm-generic/ipcbuf.h:25:2: error: unknown type name 
> ‘__kernel_gid32_t’
>   __kernel_gid32_t cgid;
>   ^~~~
> ./usr/include/asm-generic/ipcbuf.h:26:2: error: unknown type name 
> ‘__kernel_mode_t’
>   __kernel_mode_t  mode;
>   ^~~
> ./usr/include/asm-generic/ipcbuf.h:28:35: error: ‘__kernel_mode_t’ undeclared 
> here (not in a function)
>   unsigned char  __pad1[4 - sizeof(__kernel_mode_t)];
>^~~
> ./usr/include/asm-generic/ipcbuf.h:31:2: error: unknown type name 
> ‘__kernel_ulong_t’
>   __kernel_ulong_t __unused1;
>   ^~~~
> ./usr/include/asm-generic/ipcbuf.h:32:2: error: unknown type name 
> ‘__kernel_ulong_t’
>   __kernel_ulong_t __unused2;
>   ^~~~
>
> It is just a matter of missing include directive.
>
> Include  to make it self-contained, and add it to

Include  to make ...

Could you please fix it up locally?


Thank you.
Masahiro Yamada



> the compile-test coverage.
>
> Signed-off-by: Masahiro Yamada 
> ---
>
>  arch/s390/include/uapi/asm/ipcbuf.h   | 2 ++
>  arch/sparc/include/uapi/asm/ipcbuf.h  | 2 ++
>  arch/xtensa/include/uapi/asm/ipcbuf.h | 2 ++
>  include/uapi/asm-generic/ipcbuf.h | 2 ++
>  usr/include/Makefile  | 1 -
>  5 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/include/uapi/asm/ipcbuf.h 
> b/arch/s390/include/uapi/asm/ipcbuf.h
> index 5b1c4f47c656..1030cd186899 100644
> --- a/arch/s390/include/uapi/asm/ipcbuf.h
> +++ b/arch/s390/include/uapi/asm/ipcbuf.h
> @@ -2,6 +2,8 @@
>  #ifndef __S390_IPCBUF_H__
>  #define __S390_IPCBUF_H__
>
> +#include <linux/posix_types.h>
> +
>  /*
>   * The user_ipc_perm structure for S/390 architecture.
>   * Note extra padding because this structure is passed back and forth
> diff --git a/arch/sparc/include/uapi/asm/ipcbuf.h 
> b/arch/sparc/include/uapi/asm/ipcbuf.h
> index 9d0d125500e2..5b933a598a33 100644
> --- a/arch/sparc/include/uapi/asm/ipcbuf.h
> +++ b/arch/sparc/include/uapi/asm/ipcbuf.h
> @@ -2,6 +2,8 @@
>  #ifndef __SPARC_IPCBUF_H
>  #define __SPARC_IPCBUF_H
>
> +#include <linux/posix_types.h>
> +
>  /*
>   * The ipc64_perm structure for sparc/sparc64 architecture.
>   * Note extra padding because this structure is passed back and forth
> diff --git a/arch/xtensa/include/uapi/asm/ipcbuf.h 
> b/arch/xtensa/include/uapi/asm/ipcbuf.h
> index a57afa0b606f..3bd0642f6660 100644
> --- a/arch/xtensa/include/uapi/asm/ipcbuf.h
> +++ b/arch/xtensa/include/uapi/asm/ipcbuf.h
> @@ -12,6 +12,8 @@
>  #ifndef _XTENSA_IPCBUF_H
>  #define _XTENSA_IPCBUF_H
>
> +#include <linux/posix_types.h>
> +
>  /*
>   * Pad space is left for:
>   * - 32-bit mode_t and seq
> diff --git a/include/uapi/asm-generic/ipcbuf.h 
> b/include/uapi/asm-generic/ipcbuf.h
> index 7d80dbd336fb..41a01b494fc7 100644
> --- a/include/uapi/asm-generic/ipcbuf.h
> +++ b/include/uapi/asm-generic/ipcbuf.h
> @@ -2,6 +2,8 @@
>  #ifndef __ASM_GENERIC_IPCBUF_H
>  #define __ASM_GENERIC_IPCBUF_H
>
> +#include <linux/posix_types.h>
> +
>  /*
>   * The generic ipc64_perm structure:
>   * Note extra padding because this structure is passed back and forth
> diff --git a/usr/include/Makefile b/usr/include/Makefile
> index 57b20f7b6729..70f8fe256aed 100644
> --- a/usr/include/Makefile
> +++ b/usr/include/Makefile
> @@ -16,7 +16,6 @@ override c_flags = $(UAPI_CFLAGS) -Wp,-MD,$(depfile) 
> -I$(objtree)/usr/include
>  # Please consider to fix the header first.
>  #
>  # Sorted alphabetically.
> -header-test- += asm/ipcbuf.h
>  header-test- += asm/msgbuf.h
>  header-test- += asm/sembuf.h
>  header-test- += asm/shmbuf.h
> --
> 2.17.1
>


-- 
Best Regards
Masahiro Yamada


[Bug 205327] kmemleak reports various leaks in "swapper/0"

2019-10-30 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205327

--- Comment #4 from Erhard F. (erhar...@mailbox.org) ---
(In reply to Michael Ellerman from comment #3)
> That looks like a pretty straight forward memory leak here:
[...]
> Do you want to send a patch?
Thanks for pointing out! Yes, in this case writing a patch is within reach of
my coding skills. ;)

Have to check the procedure of how to submit a patch to the kernel first, as I
have not done this yet.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[RFC PATCH 5/5] powerpc: make iowrite32 and friends static inline when no indirection

2019-10-30 Thread Rasmus Villemoes
When porting a powerpc-only driver to work on another architecture,
one has to change e.g. out_be32 to iowrite32be. Unfortunately, while
the other target architecture (in my case arm) may have static inline
definitions of iowrite32 and friends, this change pessimizes the
existing powerpc users of that driver since out_be32() is inline while
iowrite32be() is out-of-line.

When neither CONFIG_PPC_INDIRECT_PIO nor CONFIG_PPC_INDIRECT_MMIO is
set (e.g. all of PPC32), there's no reason for those to be out-of-line
as they compile to just two or three instructions. So copy the
definitions from iomap.c into io.h, make them static inline, and add
the self-define macro boilerplate to prevent asm-generic/iomap.h from
providing extern declarations.
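
(A compressed sketch of that guard pattern, for readers unfamiliar with it;
the generic-header snippet below is illustrative, not copied from
asm-generic/iomap.h:)

	/* arch header: define a macro with the same name as the function... */
	#define ioread32 ioread32
	static inline unsigned int ioread32(void __iomem *addr)
	{
		return readl(addr);
	}

	/* ...so a later generic header can test for it and stay quiet: */
	#ifndef ioread32
	extern unsigned int ioread32(void __iomem *addr);
	#endif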

This means that kernel/iomap.c is now only compiled when
!CONFIG_PPC_INDIRECT_PIO && CONFIG_PPC_INDIRECT_MMIO - a combination I
don't think currently exists. So it's possible that file could simply
be deleted.

Signed-off-by: Rasmus Villemoes 
---
 arch/powerpc/include/asm/io.h | 172 ++
 arch/powerpc/kernel/Makefile  |   2 +-
 2 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index a63ec938636d..a59310620067 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -638,6 +638,178 @@ static inline void name at \
 #define writel_relaxed(v, addr)	writel(v, addr)
 #define writeq_relaxed(v, addr)	writeq(v, addr)
 
+#if !defined(CONFIG_PPC_INDIRECT_PIO) && !defined(CONFIG_PPC_INDIRECT_MMIO)
+
+#define ioread8 ioread8
+static inline unsigned int ioread8(void __iomem *addr)
+{
+   return readb(addr);
+}
+#define ioread16 ioread16
+static inline unsigned int ioread16(void __iomem *addr)
+{
+   return readw(addr);
+}
+#define ioread16be ioread16be
+static inline unsigned int ioread16be(void __iomem *addr)
+{
+   return readw_be(addr);
+}
+#define ioread32 ioread32
+static inline unsigned int ioread32(void __iomem *addr)
+{
+   return readl(addr);
+}
+#define ioread32be ioread32be
+static inline unsigned int ioread32be(void __iomem *addr)
+{
+   return readl_be(addr);
+}
+#ifdef __powerpc64__
+#define ioread64 ioread64
+static inline u64 ioread64(void __iomem *addr)
+{
+   return readq(addr);
+}
+#define ioread64_lo_hi ioread64_lo_hi
+static inline u64 ioread64_lo_hi(void __iomem *addr)
+{
+   return readq(addr);
+}
+#define ioread64_hi_lo ioread64_hi_lo
+static inline u64 ioread64_hi_lo(void __iomem *addr)
+{
+   return readq(addr);
+}
+#define ioread64be ioread64be
+static inline u64 ioread64be(void __iomem *addr)
+{
+   return readq_be(addr);
+}
+#define ioread64be_lo_hi ioread64be_lo_hi
+static inline u64 ioread64be_lo_hi(void __iomem *addr)
+{
+   return readq_be(addr);
+}
+#define ioread64be_hi_lo ioread64be_hi_lo
+static inline u64 ioread64be_hi_lo(void __iomem *addr)
+{
+   return readq_be(addr);
+}
+#endif /* __powerpc64__ */
+
+#define iowrite8 iowrite8
+static inline void iowrite8(u8 val, void __iomem *addr)
+{
+   writeb(val, addr);
+}
+#define iowrite16 iowrite16
+static inline void iowrite16(u16 val, void __iomem *addr)
+{
+   writew(val, addr);
+}
+#define iowrite16be iowrite16be
+static inline void iowrite16be(u16 val, void __iomem *addr)
+{
+   writew_be(val, addr);
+}
+#define iowrite32 iowrite32
+static inline void iowrite32(u32 val, void __iomem *addr)
+{
+   writel(val, addr);
+}
+#define iowrite32be iowrite32be
+static inline void iowrite32be(u32 val, void __iomem *addr)
+{
+   writel_be(val, addr);
+}
+#ifdef __powerpc64__
+#define iowrite64 iowrite64
+static inline void iowrite64(u64 val, void __iomem *addr)
+{
+   writeq(val, addr);
+}
+#define iowrite64_lo_hi iowrite64_lo_hi
+static inline void iowrite64_lo_hi(u64 val, void __iomem *addr)
+{
+   writeq(val, addr);
+}
+#define iowrite64_hi_lo iowrite64_hi_lo
+static inline void iowrite64_hi_lo(u64 val, void __iomem *addr)
+{
+   writeq(val, addr);
+}
+#define iowrite64be iowrite64be
+static inline void iowrite64be(u64 val, void __iomem *addr)
+{
+   writeq_be(val, addr);
+}
+#define iowrite64be_lo_hi iowrite64be_lo_hi
+static inline void iowrite64be_lo_hi(u64 val, void __iomem *addr)
+{
+   writeq_be(val, addr);
+}
+#define iowrite64be_hi_lo iowrite64be_hi_lo
+static inline void iowrite64be_hi_lo(u64 val, void __iomem *addr)
+{
+   writeq_be(val, addr);
+}
+#endif /* __powerpc64__ */
+
+/*
+ * These are the "repeat read/write" functions. Note the
+ * non-CPU byte order. We do things in "IO byteorder"
+ * here.
+ *
+ * FIXME! We could make these do EEH handling if we really
+ * wanted. Not clear if we do.
+ */
+#define ioread8_rep ioread8_rep
+static inline void ioread8_rep(void __iomem *addr, void *dst, unsigned long count)
+{
+   readsb(addr, dst, count);
+}
+#define ioread16_rep ioread16_rep
+static inline void ioread16_rep(void __iomem *addr, void *dst, unsigned long count)
+{
+	readsw(addr, dst, count);
+}

[RFC PATCH 4/5] powerpc: make pcibios_vaddr_is_ioport() static

2019-10-30 Thread Rasmus Villemoes
The only caller of pcibios_vaddr_is_ioport() is in pci-common.c, so we
can make it static and move it into the same #ifndef block as its
caller.

Signed-off-by: Rasmus Villemoes 
---
 arch/powerpc/include/asm/pci-bridge.h | 9 -
 arch/powerpc/kernel/pci-common.c  | 4 ++--
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index ea6ec65970ef..deb29a1c9708 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -283,14 +283,5 @@ extern struct pci_controller *pcibios_alloc_controller(struct device_node *dev);
 extern void pcibios_free_controller(struct pci_controller *phb);
 extern void pcibios_free_controller_deferred(struct pci_host_bridge *bridge);
 
-#ifdef CONFIG_PCI
-extern int pcibios_vaddr_is_ioport(void __iomem *address);
-#else
-static inline int pcibios_vaddr_is_ioport(void __iomem *address)
-{
-   return 0;
-}
-#endif /* CONFIG_PCI */
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_PCI_BRIDGE_H */
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index d89a2426b405..928d7576c6c2 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -277,7 +277,8 @@ static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 #endif
 }
 
-int pcibios_vaddr_is_ioport(void __iomem *address)
+#ifndef CONFIG_PPC_INDIRECT_PIO
+static int pcibios_vaddr_is_ioport(void __iomem *address)
 {
int ret = 0;
struct pci_controller *hose;
@@ -296,7 +297,6 @@ int pcibios_vaddr_is_ioport(void __iomem *address)
return ret;
 }
 
-#ifndef CONFIG_PPC_INDIRECT_PIO
 void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
 {
if (isa_vaddr_is_ioport(addr))
-- 
2.23.0



[RFC PATCH 0/5] powerpc: make iowrite32be etc. inline

2019-10-30 Thread Rasmus Villemoes
When trying to make the QUICC Engine drivers compile on arm, I
mechanically (with coccinelle) changed out_be32() to iowrite32be()
etc. Christophe pointed out [1][2] that this would pessimize the
powerpc SoCs, since the IO accesses would now incur function call
overhead. He asked that I try to make those IO accessors inline on
ppc, and this is the best I could come up with.
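
For concreteness, the mechanical conversion looks like this sketch
(note the swapped argument order: the ppc-only accessor takes the
address first, while the ioread/iowrite family takes the value first;
the register name is made up):

	/* before, powerpc-only accessor: */
	out_be32(&regs->ctrl, val);

	/* after, portable big-endian accessor: */
	iowrite32be(val, &regs->ctrl);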

At first I tried something that wouldn't need to touch anything
outside arch/powerpc/, but I ended up with conditional inclusion of
asm-generic headers and/or duplicating a lot of their contents.

The diffstat may become a little better if kernel/iomap.c can indeed
be removed (due to !CONFIG_PPC_INDIRECT_PIO &&
CONFIG_PPC_INDIRECT_MMIO never happening).

[1] https://lore.kernel.org/lkml/6ee121cf-0e3d-4aa0-2593-fcb00995e...@c-s.fr/
[2] https://lore.kernel.org/lkml/886d5218-6d6b-824c-3ab9-63aafe41f...@c-s.fr/

Rasmus Villemoes (5):
  asm-generic: move pci_iounmap from iomap.h to pci_iomap.h
  asm-generic: employ "ifndef foo; define foo foo" idiom in iomap.h
  powerpc: move pci_iounmap() from iomap.c to pci-common.c
  powerpc: make pcibios_vaddr_is_ioport() static
  powerpc: make iowrite32 and friends static inline when no indirection

 arch/powerpc/include/asm/io.h | 172 ++
 arch/powerpc/include/asm/pci-bridge.h |   9 --
 arch/powerpc/kernel/Makefile  |   2 +-
 arch/powerpc/kernel/iomap.c   |  13 --
 arch/powerpc/kernel/pci-common.c  |  15 ++-
 include/asm-generic/iomap.h   | 104 +---
 include/asm-generic/pci_iomap.h   |   7 ++
 7 files changed, 282 insertions(+), 40 deletions(-)

-- 
2.23.0



[RFC PATCH 2/5] asm-generic: employ "ifndef foo; define foo foo" idiom in iomap.h

2019-10-30 Thread Rasmus Villemoes
Make it possible for an architecture to include asm-generic/iomap.h
without necessarily getting all the external declarations for
iowrite32 and friends. For example, the architecture could (maybe just
in some configurations) provide static inline versions of some or all
of these, but still use asm-generic/iomap.h for the
ARCH_HAS_IOREMAP_WT/WC logic.

This will be used on powerpc.
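
For illustration, an architecture header could then provide its own
inline version before including asm-generic/iomap.h, along these lines
(hypothetical arch sketch, not part of this patch):

	/* arch/foo/include/asm/io.h */
	#define ioread32 ioread32
	static inline unsigned int ioread32(void __iomem *addr)
	{
		return readl(addr);
	}

	/* The #ifndef in asm-generic/iomap.h now skips the extern
	 * declaration for ioread32, but everything else (including
	 * the ARCH_HAS_IOREMAP_WT/WC logic) is still picked up.
	 */
	#include <asm-generic/iomap.h>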

Signed-off-by: Rasmus Villemoes 
---
 include/asm-generic/iomap.h | 94 ++---
 1 file changed, 88 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/iomap.h b/include/asm-generic/iomap.h
index 5f8321e8fea9..1b247d3b9fbb 100644
--- a/include/asm-generic/iomap.h
+++ b/include/asm-generic/iomap.h
@@ -26,47 +26,105 @@
  * in the low address range. Architectures for which this is not
  * true can't use this generic implementation.
  */
+#ifndef ioread8
+#define ioread8 ioread8
 extern unsigned int ioread8(void __iomem *);
+#endif
+#ifndef ioread16
+#define ioread16 ioread16
 extern unsigned int ioread16(void __iomem *);
+#endif
+#ifndef ioread16be
+#define ioread16be ioread16be
 extern unsigned int ioread16be(void __iomem *);
+#endif
+#ifndef ioread32
+#define ioread32 ioread32
 extern unsigned int ioread32(void __iomem *);
+#endif
+#ifndef ioread32be
+#define ioread32be ioread32be
 extern unsigned int ioread32be(void __iomem *);
+#endif
 #ifdef CONFIG_64BIT
+#ifndef ioread64
+#define ioread64 ioread64
 extern u64 ioread64(void __iomem *);
+#endif
+#ifndef ioread64be
+#define ioread64be ioread64be
 extern u64 ioread64be(void __iomem *);
 #endif
+#endif /* CONFIG_64BIT */
 
 #ifdef readq
+#ifndef ioread64_lo_hi
 #define ioread64_lo_hi ioread64_lo_hi
-#define ioread64_hi_lo ioread64_hi_lo
-#define ioread64be_lo_hi ioread64be_lo_hi
-#define ioread64be_hi_lo ioread64be_hi_lo
 extern u64 ioread64_lo_hi(void __iomem *addr);
+#endif
+#ifndef ioread64_hi_lo
+#define ioread64_hi_lo ioread64_hi_lo
 extern u64 ioread64_hi_lo(void __iomem *addr);
+#endif
+#ifndef ioread64be_lo_hi
+#define ioread64be_lo_hi ioread64be_lo_hi
 extern u64 ioread64be_lo_hi(void __iomem *addr);
+#endif
+#ifndef ioread64be_hi_lo
+#define ioread64be_hi_lo ioread64be_hi_lo
 extern u64 ioread64be_hi_lo(void __iomem *addr);
 #endif
+#endif /* readq */
 
+#ifndef iowrite8
+#define iowrite8 iowrite8
 extern void iowrite8(u8, void __iomem *);
+#endif
+#ifndef iowrite16
+#define iowrite16 iowrite16
 extern void iowrite16(u16, void __iomem *);
+#endif
+#ifndef iowrite16be
+#define iowrite16be iowrite16be
 extern void iowrite16be(u16, void __iomem *);
+#endif
+#ifndef iowrite32
+#define iowrite32 iowrite32
 extern void iowrite32(u32, void __iomem *);
+#endif
+#ifndef iowrite32be
+#define iowrite32be iowrite32be
 extern void iowrite32be(u32, void __iomem *);
+#endif
 #ifdef CONFIG_64BIT
+#ifndef iowrite64
+#define iowrite64 iowrite64
 extern void iowrite64(u64, void __iomem *);
+#endif
+#ifndef iowrite64be
+#define iowrite64be iowrite64be
 extern void iowrite64be(u64, void __iomem *);
 #endif
+#endif /* CONFIG_64BIT */
 
 #ifdef writeq
+#ifndef iowrite64_lo_hi
 #define iowrite64_lo_hi iowrite64_lo_hi
-#define iowrite64_hi_lo iowrite64_hi_lo
-#define iowrite64be_lo_hi iowrite64be_lo_hi
-#define iowrite64be_hi_lo iowrite64be_hi_lo
 extern void iowrite64_lo_hi(u64 val, void __iomem *addr);
+#endif
+#ifndef iowrite64_hi_lo
+#define iowrite64_hi_lo iowrite64_hi_lo
 extern void iowrite64_hi_lo(u64 val, void __iomem *addr);
+#endif
+#ifndef iowrite64be_lo_hi
+#define iowrite64be_lo_hi iowrite64be_lo_hi
 extern void iowrite64be_lo_hi(u64 val, void __iomem *addr);
+#endif
+#ifndef iowrite64be_hi_lo
+#define iowrite64be_hi_lo iowrite64be_hi_lo
 extern void iowrite64be_hi_lo(u64 val, void __iomem *addr);
 #endif
+#endif /* writeq */
 
 /*
  * "string" versions of the above. Note that they
@@ -79,19 +137,43 @@ extern void iowrite64be_hi_lo(u64 val, void __iomem *addr);
  * memory across multiple ports, use "memcpy_toio()"
  * and friends.
  */
+#ifndef ioread8_rep
+#define ioread8_rep ioread8_rep
 extern void ioread8_rep(void __iomem *port, void *buf, unsigned long count);
+#endif
+#ifndef ioread16_rep
+#define ioread16_rep ioread16_rep
 extern void ioread16_rep(void __iomem *port, void *buf, unsigned long count);
+#endif
+#ifndef ioread32_rep
+#define ioread32_rep ioread32_rep
 extern void ioread32_rep(void __iomem *port, void *buf, unsigned long count);
+#endif
 
+#ifndef iowrite8_rep
+#define iowrite8_rep iowrite8_rep
 extern void iowrite8_rep(void __iomem *port, const void *buf, unsigned long count);
+#endif
+#ifndef iowrite16_rep
+#define iowrite16_rep iowrite16_rep
 extern void iowrite16_rep(void __iomem *port, const void *buf, unsigned long count);
+#endif
+#ifndef iowrite32_rep
+#define iowrite32_rep iowrite32_rep
 extern void iowrite32_rep(void __iomem *port, const void *buf, unsigned long count);
+#endif
 
 #ifdef CONFIG_HAS_IOPORT_MAP
 /* Create a virtual mapping cookie for an IO port range */
+#ifndef ioport_map
+#define ioport_map ioport_map

[RFC PATCH 1/5] asm-generic: move pci_iounmap from iomap.h to pci_iomap.h

2019-10-30 Thread Rasmus Villemoes
pci_iounmap seems misplaced in iomap.h. Move it to where its cousins
live. Since iomap.h includes pci_iomap.h, anybody relying on getting
the declaration through iomap.h will still get it.

It would be natural to put the static inline version inside the

#elif defined(CONFIG_GENERIC_PCI_IOMAP)

branch. Since GENERIC_IOMAP selects GENERIC_PCI_IOMAP, that would be ok for
those who select GENERIC_IOMAP. However, I fear there are some that
select GENERIC_PCI_IOMAP without GENERIC_IOMAP, and which perhaps have
their own pci_iounmap stub definition somewhere. So for now at least,
define the static inline version under the same conditions as it were
in iomap.h.

Signed-off-by: Rasmus Villemoes 
---
 include/asm-generic/iomap.h | 10 --
 include/asm-generic/pci_iomap.h |  7 +++
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/include/asm-generic/iomap.h b/include/asm-generic/iomap.h
index a008f504a2d0..5f8321e8fea9 100644
--- a/include/asm-generic/iomap.h
+++ b/include/asm-generic/iomap.h
@@ -101,16 +101,6 @@ extern void ioport_unmap(void __iomem *);
 #define ioremap_wt ioremap_nocache
 #endif
 
-#ifdef CONFIG_PCI
-/* Destroy a virtual mapping cookie for a PCI BAR (memory or IO) */
-struct pci_dev;
-extern void pci_iounmap(struct pci_dev *dev, void __iomem *);
-#elif defined(CONFIG_GENERIC_IOMAP)
-struct pci_dev;
-static inline void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
-{ }
-#endif
-
 #include 
 
 #endif
diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
index d4f16dcc2ed7..c2f78db2420e 100644
--- a/include/asm-generic/pci_iomap.h
+++ b/include/asm-generic/pci_iomap.h
@@ -18,6 +18,8 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
 extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
unsigned long offset,
unsigned long maxlen);
+/* Destroy a virtual mapping cookie for a PCI BAR (memory or IO) */
+extern void pci_iounmap(struct pci_dev *dev, void __iomem *);
 /* Create a virtual mapping cookie for a port on a given PCI device.
  * Do not call this directly, it exists to make it easier for architectures
  * to override */
@@ -52,4 +54,9 @@ static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
 }
 #endif
 
+#if !defined(CONFIG_PCI) && defined(CONFIG_GENERIC_IOMAP)
+static inline void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
+{ }
+#endif
+
 #endif /* __ASM_GENERIC_IO_H */
-- 
2.23.0



[RFC PATCH 3/5] powerpc: move pci_iounmap() from iomap.c to pci-common.c

2019-10-30 Thread Rasmus Villemoes
As preparation for making iowrite32 and friends static inlines, move
the definition of pci_iounmap() from iomap.c to pci-common.c. This
definition of pci_iounmap() is compiled in when
!CONFIG_PPC_INDIRECT_PIO && CONFIG_PCI - we're just interchanging
which condition is in the Kbuild logic and which is in the .c file.

Signed-off-by: Rasmus Villemoes 
---
 arch/powerpc/kernel/iomap.c  | 13 -
 arch/powerpc/kernel/pci-common.c | 13 +
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/iomap.c b/arch/powerpc/kernel/iomap.c
index 5ac84efc6ede..b22fa8db5068 100644
--- a/arch/powerpc/kernel/iomap.c
+++ b/arch/powerpc/kernel/iomap.c
@@ -182,16 +182,3 @@ void ioport_unmap(void __iomem *addr)
 }
 EXPORT_SYMBOL(ioport_map);
 EXPORT_SYMBOL(ioport_unmap);
-
-#ifdef CONFIG_PCI
-void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
-{
-   if (isa_vaddr_is_ioport(addr))
-   return;
-   if (pcibios_vaddr_is_ioport(addr))
-   return;
-   iounmap(addr);
-}
-
-EXPORT_SYMBOL(pci_iounmap);
-#endif /* CONFIG_PCI */
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 1c448cf25506..d89a2426b405 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -295,6 +296,18 @@ int pcibios_vaddr_is_ioport(void __iomem *address)
return ret;
 }
 
+#ifndef CONFIG_PPC_INDIRECT_PIO
+void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
+{
+   if (isa_vaddr_is_ioport(addr))
+   return;
+   if (pcibios_vaddr_is_ioport(addr))
+   return;
+   iounmap(addr);
+}
+EXPORT_SYMBOL(pci_iounmap);
+#endif /* CONFIG_PPC_INDIRECT_PIO */
+
 unsigned long pci_address_to_pio(phys_addr_t address)
 {
struct pci_controller *hose;
-- 
2.23.0



Re: [RFC PATCH 0/9] Fixes and Enablement of ibm,drc-info property

2019-10-30 Thread Tyrel Datwyler
On 10/1/19 1:02 PM, Bjorn Helgaas wrote:
> On Tue, Oct 01, 2019 at 01:12:05AM -0500, Tyrel Datwyler wrote:
>> There was an initial previous effort yo add support for the PAPR
>> architected ibm,drc-info property. This property provides a more
>> memory compact representation of a paritions Dynamic Reconfig
>> Connectors (DRC). These can otherwise be thought of the currently
>> partitioned, or available, but yet to be partitioned, system resources
>> such as cpus, memory, and physical/logical IOA devices.
>>
>> The initial implementation proved buggy and was fully turned of by
>> disabling the bit in the appropriate CAS support vector. We now have
>> PowerVM firmware in the field that supports this new property, and 
>> further to suppport partitions with 24TB+ or possible memory this
>> property is required to perform platform migration.
>>
>> This serious fixup the short comings of the previous implementation
>> in the areas of general implementation, cpu hotplug, and IOA hotplug.
>>
>> Tyrel Datwyler (9):
>>   powerpc/pseries: add cpu DLPAR support for drc-info property
>>   powerpc/pseries: fix bad drc_index_start value parsing of drc-info
>> entry
>>   powerpc/pseries: fix drc-info mappings of logical cpus to drc-index
>>   PCI: rpaphp: fix up pointer to first drc-info entry
>>   PCI: rpaphp: don't rely on firmware feature to imply drc-info support
>>   PCI: rpaphp: add drc-info support for hotplug slot registration
>>   PCI: rpaphp: annotate and correctly byte swap DRC properties
>>   PCI: rpaphp: correctly match ibm,my-drc-index to drc-name when using
>> drc-info
>>   powerpc: Enable support for ibm,drc-info property
>>
>>  arch/powerpc/kernel/prom_init.c |   2 +-
>>  arch/powerpc/platforms/pseries/hotplug-cpu.c| 117 --
>>  arch/powerpc/platforms/pseries/of_helpers.c |   8 +-
>>  arch/powerpc/platforms/pseries/pseries_energy.c |  23 ++---
>>  drivers/pci/hotplug/rpaphp_core.c   | 124 
>> +---
>>  5 files changed, 191 insertions(+), 83 deletions(-)
> 
> Michael, I assume you'll take care of this.  If I were applying, I
> would capitalize the commit subjects and fix the typos in the commit
> logs, e.g.,
> 
>   s/the this/the/
>   s/the the/that the/
>   s/short coming/shortcoming/
>   s/seperate/separate/
>   s/bid endian/big endian/
>   s/were appropriate/where appropriate/
>   s/name form/name from/
> 
> etc.  git am also complains about space before tab whitespace errors.
> And it adds a few lines >80 chars.
> 

I'll fix all those up in the next spin.

-Tyrel


Re: [RFC PATCH 2/9] powerpc/pseries: fix bad drc_index_start value parsing of drc-info entry

2019-10-30 Thread Tyrel Datwyler
On 10/10/19 12:04 PM, Nathan Lynch wrote:
> Tyrel Datwyler  writes:
>> The ibm,drc-info property is an array property that contains drc-info
>> entries such that each entry is made up of 2 string encoded elements
>> followed by 5 int encoded elements. The of_read_drc_info_cell()
>> helper contains comments that correctly name the expected elements
>> and their encoding. However, the usage of of_prop_next_string() and
>> of_prop_next_u32() introduced a subtle skipping of the first u32.
>> This is a result of of_prop_next_string() returning a pointer to the
>> next property value, which is not a string but actually a (__be32 *).
>> As a result, the following call to of_prop_next_u32() passes over the
>> current int encoded value and actually stores the next one wrongly.
>>
>> Simply endian swap the current value in place after reading the first
>> two string values. The remaining int encoded values can then be read
>> correctly using of_prop_next_u32().
> 
> Good but I think it would make more sense for a fix for
> of_read_drc_info_cell() to precede any patch in the series which
> introduces new callers, such as patch #1.
> 

Not sure it matters that much, since everything in the series is required to
properly enable a working drc-info implementation, and the call already exists, so
it doesn't break bisectability. It ended up second in the series since testing
patch 1 exposed the flaw.

I'll go ahead and move it first, and add the appropriate Fixes: tag as well,
which is currently missing.
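
(For reference, the conventional format is a one-line trailer such as

	Fixes: 123456789abc ("powerpc/pseries: subject of the original commit")

where the 12-character commit id and subject here are hypothetical
stand-ins.)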

-Tyrel


Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc

2019-10-30 Thread Russell Currey
On Wed, 2019-10-30 at 09:58 +0100, Christophe Leroy wrote:
> 
> Le 30/10/2019 à 08:31, Russell Currey a écrit :
> > v4 cover letter: 
> > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html
> > v3 cover letter: 
> > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html
> > 
> > Changes since v4:
> > [1/5]: Addressed review comments from Michael Ellerman
> > (thanks!)
> > [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on
> >ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid
> >STRICT_MODULE_RWX being *on by default* in cases where
> >STRICT_KERNEL_RWX is *unavailable*
> > [5/5]: split skiroot_defconfig changes out into its own patch
> > 
> > The whole Kconfig situation is really weird and confusing, I
> > believe the
> > correct resolution is to change arch/Kconfig but the consequences
> > are so
> > minor that I don't think it's worth it, especially given that I
> > expect
> > powerpc to have mandatory strict RWX Soon(tm).
> 
> I'm not sure strict RWX can be made mandatory due to the impact it has
> on some subarches:
> - On the 8xx, unless all areas are 8Mbytes aligned, there is a
> significant overhead on TLB misses. And aligning everything to 8M is a
> waste of RAM, which is not acceptable on systems having very little RAM.
> - On hash book3s32, we are able to map the kernel with BATs. With a few
> alignment constraints, we are able to provide STRICT_KERNEL_RWX. But we
> are unable to provide exec protection at page granularity, only on
> 256Mbytes segments. So for modules, we have to have the vmspace X. It is
> also not possible to have a kernel area RO; only user areas can be
> made RO.
> 

Yes, sorry, this was thoughtless from me, since in my mind I was just
thinking about the platforms I primarily work on (book3s64).

> Christophe
> 
> > Russell Currey (5):
> >powerpc/mm: Implement set_memory() routines
> >powerpc/kprobes: Mark newly allocated probes as RO
> >powerpc/mm/ptdump: debugfs handler for W+X checks at runtime
> >powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
> >powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig
> > 
> >   arch/powerpc/Kconfig   |  2 +
> >   arch/powerpc/Kconfig.debug |  6 +-
> >   arch/powerpc/configs/skiroot_defconfig |  1 +
> >   arch/powerpc/include/asm/set_memory.h  | 32 +++
> >   arch/powerpc/kernel/kprobes.c  |  3 +
> >   arch/powerpc/mm/Makefile   |  1 +
> >   arch/powerpc/mm/pageattr.c | 77
> > ++
> >   arch/powerpc/mm/ptdump/ptdump.c| 21 ++-
> >   8 files changed, 140 insertions(+), 3 deletions(-)
> >   create mode 100644 arch/powerpc/include/asm/set_memory.h
> >   create mode 100644 arch/powerpc/mm/pageattr.c
> > 



Re: [PATCH v5 5/5] powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig

2019-10-30 Thread Joel Stanley
On Wed, 30 Oct 2019 at 07:31, Russell Currey  wrote:
>
> skiroot_defconfig is the only powerpc defconfig with STRICT_KERNEL_RWX
> enabled, and if you want memory protection for kernel text you'd want it
> for modules too, so enable STRICT_MODULE_RWX there.
>
> Signed-off-by: Russell Currey 

Acked-by: Joel Stanley 

> ---
>  arch/powerpc/configs/skiroot_defconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/configs/skiroot_defconfig 
> b/arch/powerpc/configs/skiroot_defconfig
> index 1253482a67c0..719d899081b3 100644
> --- a/arch/powerpc/configs/skiroot_defconfig
> +++ b/arch/powerpc/configs/skiroot_defconfig
> @@ -31,6 +31,7 @@ CONFIG_PERF_EVENTS=y
>  CONFIG_SLAB_FREELIST_HARDENED=y
>  CONFIG_JUMP_LABEL=y
>  CONFIG_STRICT_KERNEL_RWX=y
> +CONFIG_STRICT_MODULE_RWX=y
>  CONFIG_MODULES=y
>  CONFIG_MODULE_UNLOAD=y
>  CONFIG_MODULE_SIG=y
> --
> 2.23.0
>


Re: [RFC PATCH 1/9] powerpc/pseries: add cpu DLPAR support for drc-info property

2019-10-30 Thread Tyrel Datwyler
On 10/10/19 11:56 AM, Nathan Lynch wrote:
> Hi Tyrel,
> 
> Tyrel Datwyler  writes:
>> +static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index)
>> +{
>> +const __be32 *indexes;
>> +int i;
>> +
>> +if (of_find_property(parent, "ibm,drc-info", NULL))
>> +return drc_info_valid_index(parent, drc_index);
>> +
>> +indexes = of_get_property(parent, "ibm,drc-indexes", NULL);
>> +if (!indexes)
>> +return false;
>> +
>> +for (i = 0; i < indexes[0]; i++) {
> 
> should this be:
> 
> for (i = 0; i < be32_to_cpu(indexes[0]); i++) {
> ?

Yes!

> 
> 
>> +if (be32_to_cpu(indexes[i + 1]) == drc_index)
>> +return true;
>> +}
>> +
>> +return false;
>>  }
> 
> It looks like this rewrites valid_cpu_drc_index()'s existing code for
> parsing ibm,drc-indexes but I don't see the need for this.
> 
> This patch would be easier to review if that were dropped or split out.

Yeah, I'll split it out. There are multiple places where we iterate over the
drc_indexes, and it is implemented in several different ways. I basically picked
one implementation to use across the board. I think a better way would be just to
implement a for_each_drc_index(dn, drc_index) macro to abstract away the
iterator implementation (a rough sketch follows).
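
Roughly something like this, say (hypothetical sketch only; the helper
does not exist yet, and this version walks the raw property data rather
than taking a device_node as in the signature above):

	/* Cell 0 of ibm,drc-indexes is the entry count; cells 1..n
	 * hold the drc index values.
	 */
	#define for_each_drc_index(indexes, i, drc_index)		\
		for ((i) = 1;						\
		     (i) <= be32_to_cpu((indexes)[0]) &&		\
		     ((drc_index) = be32_to_cpu((indexes)[(i)]), 1);	\
		     (i)++)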

> 
>>  
>>  static ssize_t dlpar_cpu_add(u32 drc_index)
>> @@ -720,8 +756,11 @@ static int dlpar_cpu_remove_by_count(u32 cpus_to_remove)
>>  static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add)
>>  {
>>  struct device_node *parent;
>> +struct property *info;
>> +const __be32 *indexes;
>>  int cpus_found = 0;
>> -int index, rc;
>> +int i, j;
>> +u32 drc_index;
>>  
>>  parent = of_find_node_by_path("/cpus");
>>  if (!parent) {
>> @@ -730,24 +769,46 @@ static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 
>> cpus_to_add)
>>  return -1;
>>  }
>>  
>> -/* Search the ibm,drc-indexes array for possible CPU drcs to
>> - * add. Note that the format of the ibm,drc-indexes array is
>> - * the number of entries in the array followed by the array
>> - * of drc values so we start looking at index = 1.
>> - */
>> -index = 1;
>> -while (cpus_found < cpus_to_add) {
>> -u32 drc;
>> +info = of_find_property(parent, "ibm,drc-info", NULL);
>> +if (info) {
>> +struct of_drc_info drc;
>> +const __be32 *value;
>> +int count;
>>  
>> -rc = of_property_read_u32_index(parent, "ibm,drc-indexes",
>> -index++, );
>> -if (rc)
>> -break;
>> +value = of_prop_next_u32(info, NULL, );
>> +if (value)
>> +value++;
>>  
>> -if (dlpar_cpu_exists(parent, drc))
>> -continue;
>> +for (i = 0; i < count; i++) {
>> +of_read_drc_info_cell(, , );
>> +if (strncmp(drc.drc_type, "CPU", 3))
>> +break;
>> +
>> +for (j = 0; j < drc.num_sequential_elems; j++) {
>> +drc_index = drc.drc_index_start + 
>> (drc.sequential_inc * j);
>> +
>> +if (dlpar_cpu_exists(parent, drc_index))
>> +continue;
>>  
>> -cpu_drcs[cpus_found++] = drc;
>> +cpu_drcs[cpus_found++] = drc_index;
> 
> I am failing to see how this loop is limited by the cpus_to_add
> parameter as it was before this change. It looks like this will overflow
> the cpu_drcs array when cpus_to_add is less than the number of cpus
> found.

You are right. The code is picking every non-present drc_index which, as you
stated, will overflow the supplied buffer when there are more available indexes
than requested cpus. Will fix it to bound the search, along the lines of the
sketch below.
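
I.e. roughly this (a sketch against the quoted hunk above):

	for (j = 0; j < drc.num_sequential_elems &&
		    cpus_found < cpus_to_add; j++) {
		drc_index = drc.drc_index_start + (drc.sequential_inc * j);

		if (dlpar_cpu_exists(parent, drc_index))
			continue;

		cpu_drcs[cpus_found++] = drc_index;
	}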

> 
> As an aside I don't understand how the add_by_count()/dlpar_cpu_exists()
> algorithm could be correct as it currently stands. It seems to pick the
> first X indexes for which a corresponding cpu node is absent, but that
> set of indexes does not necessarily match the set that is available to
> configure. Something to address separately I suppose.

I'm not sure I follow?

> 
>> +}
>> +}
>> +} else {
>> +indexes = of_get_property(parent, "ibm,drc-indexes", NULL);
>> +
>> +/* Search the ibm,drc-indexes array for possible CPU drcs to
>> +* add. Note that the format of the ibm,drc-indexes array is
>> +* the number of entries in the array followed by the array
>> +* of drc values so we start looking at index = 1.
>> +*/
>> +for (i = 1; i < indexes[0]; i++) {
>> +drc_index = be32_to_cpu(indexes[i]);
>> +
>> +if (dlpar_cpu_exists(parent, drc_index))
>> +continue;
>> +
>> +

Re: [PATCH] powerpc/tools: Don't quote $objdump in scripts

2019-10-30 Thread Segher Boessenkool
On Wed, Oct 30, 2019 at 10:55:03PM +1100, Michael Ellerman wrote:
> Segher Boessenkool  writes:
> > On Thu, Oct 24, 2019 at 11:47:30AM +1100, Michael Ellerman wrote:
> >> Some of our scripts are passed $objdump and then call it as
> >> "$objdump". This doesn't work if it contains spaces because we're
> >> using ccache, for example you get errors such as:
> >> 
> >>   ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: 
> >> No such file or directory
> >>   ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache 
> >> ppc64le-objdump: No such file or directory
> >> 
> >> Fix it by not quoting the string when we expand it, allowing the shell
> >> to do the right thing for us.
> >
> > This breaks things for people with spaces in their paths.
> 
> Spaces in their what? Who does that? :)

I know, right?

> Also we don't support it:
> 
>   $ pwd
>   $ /home/michael/foo bar
>   $ make clean
>   Makefile:147: *** source directory cannot contain spaces or colons.  Stop.

Of course.  But it's shell scripting 101 that you *do* quote all variable
expansions.  Do you want to set a bad example?  ;-)


Segher


Re: [PATCH 14/19] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion

2019-10-30 Thread John Hubbard
On 10/30/19 3:49 PM, John Hubbard wrote:
> This also fixes one or two likely bugs.

Well, actually just one...

> 
> 1. Change vfio from get_user_pages(FOLL_LONGTERM), to
> pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.
> 
> Note that this is a change in behavior, because the
> get_user_pages_remote() call was not setting FOLL_LONGTERM, but the
> new pin_user_pages_remote() call that replaces it, *is* setting
> FOLL_LONGTERM. It is important to set FOLL_LONGTERM, because the
> DMA case requires it. Please see the FOLL_PIN documentation in
> include/linux/mm.h, and Documentation/pin_user_pages.rst for details.

Correction: the above comment is stale and wrong. I wrote it before 
getting further into the details, and the patch doesn't do this. 

Instead, it keeps exactly the old behavior: pin_longterm_pages_remote()
is careful to avoid setting FOLL_LONGTERM. Instead of setting that flag,
it drops in a "TODO" comment nearby. :)

I'll update the commit description in the next version of the series.


thanks,

John Hubbard
NVIDIA

> 
> 2. Because all FOLL_PIN-acquired pages must be released via
> put_user_page(), also convert the put_page() call over to
> put_user_pages().
> 
> Note that this effectively changes the code's behavior in
> vfio_iommu_type1.c: put_pfn(): it now ultimately calls
> set_page_dirty_lock(), instead of set_page_dirty(). This is
> probably more accurate.
> 
> As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
> dealing with a file backed page where we have reference on the inode it
> hangs off." [1]
> 
> [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de
> 
> Cc: Alex Williamson 
> Signed-off-by: John Hubbard 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d864277ea16f..795e13f3ef08 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot)
>  {
>   if (!is_invalid_reserved_pfn(pfn)) {
>   struct page *page = pfn_to_page(pfn);
> - if (prot & IOMMU_WRITE)
> - SetPageDirty(page);
> - put_page(page);
> +
> + put_user_pages_dirty_lock(, 1, prot & IOMMU_WRITE);
>   return 1;
>   }
>   return 0;
> @@ -349,11 +348,11 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>  
>   down_read(>mmap_sem);
>   if (mm == current->mm) {
> - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> -  vmas);
> + ret = pin_longterm_pages(vaddr, 1, flags, page, vmas);
>   } else {
> - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> - vmas, NULL);
> + ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> + flags, page, vmas,
> + NULL);
>   /*
>* The lifetime of a vaddr_get_pfn() page pin is
>* userspace-controlled. In the fs-dax case this could
> @@ -363,7 +362,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
>*/
>   if (ret > 0 && vma_is_fsdax(vmas[0])) {
>   ret = -EOPNOTSUPP;
> - put_page(page[0]);
> + put_user_page(page[0]);
>   }
>   }
>   up_read(>mmap_sem);
> 


[PATCH 17/19] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

2019-10-30 Thread John Hubbard
It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Signed-off-by: John Hubbard 
---
 tools/testing/selftests/vm/run_vmtests | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 951c507a27f7..93e8dc9a7cad 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -104,6 +104,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage.  Use"
 echo "  https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "  hugetlb regression testing."
 
+echo ""
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo ""
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
+echo "--"
+echo "running gup_benchmark -c (pin_user_pages)"
+echo "--"
+./gup_benchmark -c
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
 echo "---"
 echo "running userfaultfd"
 echo "---"
-- 
2.23.0



[PATCH 19/19] Documentation/vm: add pin_user_pages.rst

2019-10-30 Thread John Hubbard
Document the new pin_user_pages() and related calls
and behavior.

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded on it slightly.)

Cc: Jonathan Corbet 
Signed-off-by: John Hubbard 
---
 Documentation/vm/index.rst  |   1 +
 Documentation/vm/pin_user_pages.rst | 213 
 2 files changed, 214 insertions(+)
 create mode 100644 Documentation/vm/pin_user_pages.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index e8d943b21cf9..7194efa3554a 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -44,6 +44,7 @@ descriptions of data structures and algorithms.
page_migration
page_frags
page_owner
+   pin_user_pages
remap_file_pages
slub
split_page_table_lock
diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
new file mode 100644
index ..7110bca3f188
--- /dev/null
+++ b/Documentation/vm/pin_user_pages.rst
@@ -0,0 +1,213 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+pin_user_pages() and related calls
+
+
+.. contents:: :local:
+
+Overview
+
+
+This document describes the following functions: ::
+
+ pin_user_pages
+ pin_user_pages_fast
+ pin_user_pages_remote
+
+ pin_longterm_pages
+ pin_longterm_pages_fast
+ pin_longterm_pages_remote
+
+Basic description of FOLL_PIN
+=
+
+A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
+significant interactions and interdependencies with FOLL_LONGTERM, so both are
+covered here.
+
+Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither
+FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
+the associated wrapper functions (pin_user_pages and others) to set the correct
+combination of these flags, and to check for problems as well.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===
+
+Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
+whatever flags the caller provides::
+
+ Function                      gup flags (FOLL_PIN or FOLL_LONGTERM only)
+ --------------------------------------------------------------------------
+ pin_user_pages  FOLL_PIN
+ pin_user_pages_fast FOLL_PIN
+ pin_user_pages_remote   FOLL_PIN
+
+ pin_longterm_pages  FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_fast FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_remote   FOLL_PIN | FOLL_LONGTERM
+
+Tracking dma-pinned pages
+=
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+  are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+  used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+  the upper bits in that field for a dma-pinned count. "Sort of" means that,
+  rather than dividing page->_refcount into bit fields, we simply add a medium-
+  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
+  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
+  on it 1024 times, then it will appear to have a single dma-pinned count.
+  And again, that's acceptable.
+
+This also leads to limitations: there are only 32-10==22 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 22 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+  10 bits at a time)
+* There are 22 - 18 == 4 bits available to count. Except that there aren't,
+  because you need to allow for a few normal get_page() calls on the head page,
+  as well. Fortunately, the approach of 

[PATCH 15/19] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()

2019-10-30 Thread John Hubbard
1. Convert from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

3. Release each page in mem->hpages[] (instead of mem->hpas[]), because
that is the array that pin_longterm_pages() filled in. This is more
accurate and should be a little safer from a maintenance point of
view.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Signed-off-by: John Hubbard 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..69d79cb50d47 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,9 +103,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
 
-   ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
-   FOLL_WRITE | FOLL_LONGTERM,
-   mem->hpages + entry, NULL);
+   ret = pin_longterm_pages(ua + (entry << PAGE_SHIFT), n,
+FOLL_WRITE, mem->hpages + entry, NULL);
if (ret == n) {
pinned += n;
continue;
@@ -167,9 +166,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
return 0;
 
 free_exit:
-   /* free the reference taken */
-   for (i = 0; i < pinned; i++)
-   put_page(mem->hpages[i]);
+   /* free the references taken */
+   put_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -212,10 +210,9 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
if (!page)
continue;
 
-   if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
-   SetPageDirty(page);
+   put_user_pages_dirty_lock(>hpages[i], 1,
+ MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);
 
-   put_page(page);
mem->hpas[i] = 0;
}
 }
-- 
2.23.0



[PATCH 16/19] mm/gup_benchmark: support pin_user_pages() and related calls

2019-10-30 Thread John Hubbard
Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages(): via the '-c' command line option
* pin_longterm_pages(): via the '-b' command line option
* pin_user_pages_fast(): via the '-a' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the three commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new is_dma_pinned() routine.
Those commands are:

PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_LONGTERM_BENCHMARK : calls pin_longterm_pages()
PIN_BENCHMARK  : calls pin_user_pages()

In between the calls to pin_*() and put_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.

Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 74 --
 tools/testing/selftests/vm/gup_benchmark.c | 23 ++-
 2 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..2bb0f5df4803 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,9 @@
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark)
+#define PIN_BENCHMARK  _IOWR('g', 6, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -19,6 +22,44 @@ struct gup_benchmark {
__u64 expansion[10];/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case GUP_FAST_BENCHMARK:
+   case GUP_LONGTERM_BENCHMARK:
+   case GUP_BENCHMARK:
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   break;
+
+   case PIN_FAST_BENCHMARK:
+   case PIN_LONGTERM_BENCHMARK:
+   case PIN_BENCHMARK:
+   put_user_pages(pages, nr_pages);
+   break;
+   }
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case PIN_FAST_BENCHMARK:
+   case PIN_LONGTERM_BENCHMARK:
+   case PIN_BENCHMARK:
+   for (i = 0; i < nr_pages; i++) {
+   if (WARN(!page_dma_pinned(pages[i]),
+"pages[%d] is NOT dma-pinned\n", i))
+   break;
+   }
+   break;
+   }
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
struct gup_benchmark *gup)
 {
@@ -62,6 +103,19 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
NULL);
break;
+   case PIN_FAST_BENCHMARK:
+   nr = pin_user_pages_fast(addr, nr, gup->flags & 1,
+pages + i);
+   break;
+   case PIN_LONGTERM_BENCHMARK:
+   nr = pin_longterm_pages(addr, nr,
+   (gup->flags & 1),
+   pages + i, NULL);
+   break;
+   case PIN_BENCHMARK:
+   nr = pin_user_pages(addr, nr, gup->flags & 1, pages + i,
+   NULL);
+   break;
default:
return -1;
}
@@ -72,15 +126,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();
 
+   /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+   nr_pages = i;
+
gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;
 
+   /*
+* Take an un-benchmark-timed moment to verify DMA pinned
+* state: print a warning if any non-dma-pinned pages are found:
+*/
+   verify_dma_pinned(cmd, pages, nr_pages);
+
start_time = ktime_get();
-   for (i = 0; i < nr_pages; i++) {
-   if (!pages[i])
-   break;
-   put_page(pages[i]);
-   }
+
+   put_back_pages(cmd, pages, nr_pages);
+

[PATCH 18/19] mm/gup: remove support for gup(FOLL_LONGTERM)

2019-10-30 Thread John Hubbard
Now that all other kernel callers of get_user_pages(FOLL_LONGTERM)
have been converted to pin_longterm_pages(), lock it down:

1) Add an assertion to get_user_pages(), preventing callers from
   passing FOLL_LONGTERM (in addition to the existing assertion that
   prevents FOLL_PIN).

2) Remove the associated GUP_LONGTERM_BENCHMARK test.

Signed-off-by: John Hubbard 
---
 mm/gup.c   | 8 
 mm/gup_benchmark.c | 9 +
 tools/testing/selftests/vm/gup_benchmark.c | 7 ++-
 3 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e51b3820a995..9a28935a2cb1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1744,11 +1744,11 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
struct vm_area_struct **vmas)
 {
/*
-* As detailed above, FOLL_PIN must only be set internally by the
-* pin_user_page*() and pin_longterm_*() APIs, never directly by the
-* caller, so enforce that with an assertion:
+* As detailed above, FOLL_PIN and FOLL_LONGTERM must only be set
+* internally by the pin_user_page*() and pin_longterm_*() APIs, never
+* directly by the caller, so enforce that with an assertion:
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+   if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))
return -EINVAL;
 
return __gup_longterm_locked(current, current->mm, start, nr_pages,
diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 2bb0f5df4803..de6941855b7e 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,7 +6,7 @@
 #include 
 
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+/* Command 2 has been deleted. */
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 #define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
 #define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark)
@@ -28,7 +28,6 @@ static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages)
 
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
@@ -94,11 +93,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages_fast(addr, nr, gup->flags & 1,
 pages + i);
break;
-   case GUP_LONGTERM_BENCHMARK:
-   nr = get_user_pages(addr, nr,
-   (gup->flags & 1) | FOLL_LONGTERM,
-   pages + i, NULL);
-   break;
case GUP_BENCHMARK:
nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
NULL);
@@ -157,7 +151,6 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
case PIN_FAST_BENCHMARK:
case PIN_LONGTERM_BENCHMARK:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index c5c934c0f402..5ef3cf8f3da5 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,7 +15,7 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+/* Command 2 has been deleted. */
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
 /*
@@ -46,7 +46,7 @@ int main(int argc, char **argv)
char *file = "/dev/zero";
char *p;
 
-   while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) {
+   while ((opt = getopt(argc, argv, "m:r:n:f:abctTUuwSH")) != -1) {
switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -72,9 +72,6 @@ int main(int argc, char **argv)
case 'T':
thp = 0;
break;
-   case 'L':
-   cmd = GUP_LONGTERM_BENCHMARK;
-   break;
case 'U':
cmd = GUP_BENCHMARK;
break;
-- 
2.23.0



[PATCH 14/19] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion

2019-10-30 Thread John Hubbard
This also fixes one or two likely bugs.

1. Change vfio from get_user_pages(FOLL_LONGTERM), to
pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

Note that this is a change in behavior, because the
get_user_pages_remote() call was not setting FOLL_LONGTERM, but the
new pin_user_pages_remote() call that replaces it, *is* setting
FOLL_LONGTERM. It is important to set FOLL_LONGTERM, because the
DMA case requires it. Please see the FOLL_PIN documentation in
include/linux/mm.h, and Documentation/pin_user_pages.rst for details.

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Cc: Alex Williamson 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d864277ea16f..795e13f3ef08 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
if (!is_invalid_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
-   if (prot & IOMMU_WRITE)
-   SetPageDirty(page);
-   put_page(page);
+
+   put_user_pages_dirty_lock(, 1, prot & IOMMU_WRITE);
return 1;
}
return 0;
@@ -349,11 +348,11 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
down_read(>mmap_sem);
if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
+   ret = pin_longterm_pages(vaddr, 1, flags, page, vmas);
} else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
+   ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
+   flags, page, vmas,
+   NULL);
/*
 * The lifetime of a vaddr_get_pfn() page pin is
 * userspace-controlled. In the fs-dax case this could
@@ -363,7 +362,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 */
if (ret > 0 && vma_is_fsdax(vmas[0])) {
ret = -EOPNOTSUPP;
-   put_page(page[0]);
+   put_user_page(page[0]);
}
}
up_read(>mmap_sem);
-- 
2.23.0



[PATCH 13/19] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion

2019-10-30 Thread John Hubbard
1. Change v4l2 from get_user_pages(FOLL_LONGTERM), to
pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Cc: Mauro Carvalho Chehab 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..9b9c5b37bf59 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
-flags | FOLL_LONGTERM, dma->pages, NULL);
+   err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages,
+flags, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+   dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++) {
-   if (dma->direction == DMA_FROM_DEVICE)
-   set_page_dirty_lock(dma->pages[i]);
-   put_page(dma->pages[i]);
-   }
+   put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.23.0



[PATCH 12/19] mm/gup: track FOLL_PIN pages

2019-10-30 Thread John Hubbard
Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via put_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page is left to later
patchsets; there is discussion about this in [1]. A caller-side check
might eventually look like the sketch below.
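
(Hypothetical illustration only; what callers will actually do is
exactly what later patchsets have to decide:)

	if (page_dma_pinned(page)) {
		/* The page is (or recently was) pinned via FOLL_PIN
		 * for DMA, so e.g. writeback could choose to wait or
		 * back off here.
		 */
	}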

This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().

This also makes a couple of trivial, non-functional changes to
try_get_compound_head(). That function got moved to the top of the
file.

This includes the following fix from Ira Weiny:

DAX requires detection of a page crossing to a ref count of 1.  Fix this
for GUP pages by introducing put_devmap_managed_user_page() which
accounts for GUP_PIN_COUNTING_BIAS now used by GUP.

[1] https://lwn.net/Articles/784574/ "Some slow progress on
get_user_pages()"

Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h   |  80 +++
 include/linux/mmzone.h   |   2 +
 include/linux/page_ref.h |  10 ++
 mm/gup.c | 213 +++
 mm/huge_memory.c |  32 +-
 mm/hugetlb.c |  28 -
 mm/memremap.c|   4 +-
 mm/vmstat.c  |   2 +
 8 files changed, 300 insertions(+), 71 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 62c838a3e6c7..882fda919c81 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void __put_devmap_managed_page(struct page *page, int count);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
if (!static_branch_unlikely(_managed_key))
return false;
@@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page)
return false;
 }
 
+static inline bool put_devmap_managed_page(struct page *page)
+{
+   bool is_devmap = page_is_devmap_managed(page);
+
+   if (is_devmap) {
+   int count = page_ref_dec_return(page);
+
+   __put_devmap_managed_page(page, count);
+   }
+
+   return is_devmap;
+}
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline bool put_devmap_managed_page(struct page *page)
 {
@@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
return true;
 }
 
+__must_check bool user_page_ref_inc(struct page *page);
+
 static inline void put_page(struct page *page)
 {
page = compound_head(page);
@@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
-/**
- * put_user_page() - release a gup-pinned page
- * @page:pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many get_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, get_user_pages() becomes special: such pages are marked
+ * as distinct from normal pages. As such, the new put_user_page() call (and
+ * its variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
  *
- * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to get_user_pages() and put_user_page() becomes simpler,
+ * due to the fact that adding an even power of two to the page refcount has
+ * the effect of using only the upper N bits, for the code that counts up using
+ * the bias value. This means that the lower bits are left for the exclusive
+ * use of the original code that increments and decrements by one (or at least,
+ * by much smaller values than the bias value).
  *
- * 
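
To make the bias arithmetic concrete, a minimal illustration of the
two-counters-in-one-refcount idea described above, assuming a bias value
of 1024 (illustrative only; the constant's value is not shown in this
excerpt):

    #define GUP_PIN_COUNTING_BIAS 1024  /* assumed, for illustration */

    static void example_pin(struct page *page)
    {
            /* Each gup pin adds the bias; the low bits stay untouched. */
            page_ref_add(page, GUP_PIN_COUNTING_BIAS);
    }

    static void example_unpin(struct page *page)
    {
            page_ref_sub(page, GUP_PIN_COUNTING_BIAS);
    }

    static int example_pin_count(struct page *page)
    {
            /* The upper bits approximate the outstanding gup pins. */
            return page_ref_count(page) / GUP_PIN_COUNTING_BIAS;
    }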

[PATCH 11/19] net/xdp: set FOLL_PIN via pin_user_pages()

2019-10-30 Thread John Hubbard
Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

Signed-off-by: John Hubbard 
---
 net/xdp/xdp_umem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 16d5f353163a..4d56dfb1139a 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -285,8 +285,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
 
down_read(&current->mm->mmap_sem);
-   npgs = get_user_pages(umem->address, umem->npgs,
- gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
+   npgs = pin_longterm_pages(umem->address, umem->npgs, gup_flags,
+ &umem->pgs[0], NULL);
up_read(&current->mm->mmap_sem);
 
if (npgs != umem->npgs) {
-- 
2.23.0



[PATCH 09/19] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-10-30 Thread John Hubbard
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c 
b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  
drm_via_dmablit_t *xfer)
vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
if (NULL == vsg->pages)
return -ENOMEM;
-   ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+   ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
vsg->pages);
-- 
2.23.0



[PATCH 10/19] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-10-30 Thread John Hubbard
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

Signed-off-by: John Hubbard 
---
 fs/io_uring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a30c4f622cb3..d3924b1760eb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3431,9 +3431,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx 
*ctx, void __user *arg,
 
ret = 0;
down_read(&current->mm->mmap_sem);
-   pret = get_user_pages(ubuf, nr_pages,
- FOLL_WRITE | FOLL_LONGTERM,
- pages, vmas);
+   pret = pin_longterm_pages(ubuf, nr_pages, FOLL_WRITE, pages,
+ vmas);
if (pret == nr_pages) {
/* don't support file backed memory */
for (j = 0; j < nr_pages; j++) {
-- 
2.23.0



[PATCH 08/19] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-10-30 Thread John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().

Signed-off-by: John Hubbard 
---
 mm/process_vm_access.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
if (copy > len)
copy = len;
 
-   if (vm_write) {
+   if (vm_write)
copied = copy_page_from_iter(page, offset, copy, iter);
-   set_page_dirty_lock(page);
-   } else {
+   else
copied = copy_page_to_iter(page, offset, copy, iter);
-   }
+
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
flags |= FOLL_WRITE;
 
while (!rc && nr_pages && iov_iter_count(iter)) {
-   int pages = min(nr_pages, max_pages_per_loop);
+   int pinned_pages = min(nr_pages, max_pages_per_loop);
int locked = 1;
size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 * current/current->mm
 */
down_read(&mm->mmap_sem);
-   pages = get_user_pages_remote(task, mm, pa, pages, flags,
- process_pages, NULL, );
+   pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+flags, process_pages,
+NULL, );
if (locked)
up_read(&mm->mmap_sem);
-   if (pages <= 0)
+   if (pinned_pages <= 0)
return -EFAULT;
 
-   bytes = pages * PAGE_SIZE - start_offset;
+   bytes = pinned_pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 vm_write);
len -= bytes;
start_offset = 0;
-   nr_pages -= pages;
-   pa += pages * PAGE_SIZE;
-   while (pages)
-   put_page(process_pages[--pages]);
+   nr_pages -= pinned_pages;
+   pa += pinned_pages * PAGE_SIZE;
+
+   /* If vm_write is set, the pages need to be made dirty: */
+   put_user_pages_dirty_lock(process_pages, pinned_pages,
+ vm_write);
}
 
return rc;
-- 
2.23.0



[PATCH 05/19] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-10-30 Thread John Hubbard
Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

These variants all set FOLL_PIN, which is also introduced, and
basically documented. (An upcoming patch provides more extensive
documentation.) The second set (pin_longterm*) also sets
FOLL_LONGTERM:

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

pin_longterm_pages()
pin_longterm_pages_remote()
pin_longterm_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* These are gup-internal flags, so the call sites should not directly
set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
assertions for the new FOLL_PIN flag. However, for the pre-existing
FOLL_LONGTERM flag, which has some call sites that still directly
set FOLL_LONGTERM, there is no assertion yet.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
* Start with "pin_user_pages" instead of "get_user_pages". That
  makes it easy to find and audit the call sites.
* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().
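
The wrapper bodies are not shown in this excerpt. As a sketch of the
shape such a wrapper can take (illustrative, not the patch's actual
mm/gup.c code), reusing the get_user_pages() signature declared below:

    long pin_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas)
    {
            /* FOLL_PIN is gup-internal: reject call sites that set it. */
            if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
                    return -EINVAL;

            return get_user_pages(start, nr_pages, gup_flags | FOLL_PIN,
                                  pages, vmas);
    }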

Signed-off-by: John Hubbard 
---
 include/linux/mm.h |  53 -
 mm/gup.c   | 284 +
 2 files changed, 311 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..62c838a3e6c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1526,9 +1526,23 @@ long get_user_pages_remote(struct task_struct *tsk, 
struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *locked);
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+  unsigned long start, unsigned long nr_pages,
+  unsigned int gup_flags, struct page **pages,
+  struct vm_area_struct **vmas, int *locked);
+long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+  unsigned long start, unsigned long nr_pages,
+  unsigned int gup_flags, struct page **pages,
+  struct vm_area_struct **vmas, int *locked);
 long get_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas);
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+   unsigned int gup_flags, struct page **pages,
+   struct vm_area_struct **vmas);
+long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
+   unsigned int gup_flags, struct page **pages,
+   struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
@@ -1536,6 +1550,10 @@ long get_user_pages_unlocked(unsigned long start, 
unsigned long nr_pages,
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+   unsigned int gup_flags, struct page **pages);
+int pin_longterm_pages_fast(unsigned long start, int nr_pages,
+   unsigned int gup_flags, struct page **pages);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2594,13 +2612,15 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_ANON  0x8000  /* don't do file mappings */
 #define FOLL_LONGTERM  0x1 /* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
+#define FOLL_PIN   0x4 /* pages must be released via put_user_page() */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control.  This is in contrast to
+ * iov_iter_get_pages(), where usages are transient.
  *
  * FIXME: For 

[PATCH 07/19] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()

2019-10-30 Thread John Hubbard
Convert infiniband to use the new wrapper calls, and stop
explicitly setting FOLL_LONGTERM at the call sites.

The new pin_longterm_*() calls replace get_user_pages*()
calls, and set both FOLL_LONGTERM and a new FOLL_PIN
flag. The FOLL_PIN flag requires that the caller must
return the pages via put_user_page*() calls, but
infiniband was already doing that as part of an earlier
commit.

Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  5 ++---
 drivers/infiniband/core/umem_odp.c  | 10 +-
 drivers/infiniband/hw/hfi1/user_pages.c |  4 ++--
 drivers/infiniband/hw/mthca/mthca_memfree.c |  3 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  8 
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  9 -
 drivers/infiniband/sw/siw/siw_mem.c |  5 ++---
 8 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 24244a2f68cc..c5a78d3e674b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
 
while (npages) {
down_read(&mm->mmap_sem);
-   ret = get_user_pages(cur_base,
+   ret = pin_longterm_pages(cur_base,
 min_t(unsigned long, npages,
   PAGE_SIZE / sizeof (struct page *)),
-gup_flags | FOLL_LONGTERM,
-page_list, NULL);
+gup_flags, page_list, NULL);
if (ret < 0) {
up_read(&mm->mmap_sem);
goto umem_release;
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 163ff7ba92b7..a38b67b83db5 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -534,7 +534,7 @@ static int ib_umem_odp_map_dma_single_page(
} else if (umem_odp->page_list[page_index] == page) {
umem_odp->dma_list[page_index] |= access_mask;
} else {
-   pr_err("error: got different pages in IB device and from 
get_user_pages. IB device page: %p, gup page: %p\n",
+   pr_err("error: got different pages in IB device and from 
pin_longterm_pages. IB device page: %p, gup page: %p\n",
   umem_odp->page_list[page_index], page);
/* Better remove the mapping now, to prevent any further
 * damage. */
@@ -639,11 +639,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp 
*umem_odp, u64 user_virt,
/*
* Note: this might result in redundant page getting. We can
 * avoid this by checking dma_list to be 0 before calling
-* get_user_pages. However, this make the code much more
-* complex (and doesn't gain us much performance in most use
-* cases).
+* pin_longterm_pages. However, this makes the code much
+* more complex (and doesn't gain us much performance in most
+* use cases).
 */
-   npages = get_user_pages_remote(owning_process, owning_mm,
+   npages = pin_longterm_pages_remote(owning_process, owning_mm,
user_virt, gup_num_pages,
flags, local_page_list, NULL, NULL);
up_read(&owning_mm->mmap_sem);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9b55b0a73e29 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -104,9 +104,9 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
bool writable, struct page **pages)
 {
int ret;
-   unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
+   unsigned int gup_flags = (writable ? FOLL_WRITE : 0);
 
-   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+   ret = pin_longterm_pages_fast(vaddr, npages, gup_flags, pages);
if (ret < 0)
return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..beec7e4b8a96 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
goto out;
}
 
-   ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
- FOLL_WRITE | FOLL_LONGTERM, pages);
+   ret = pin_longterm_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages);
if (ret < 0)
goto 

[PATCH 06/19] goldfish_pipe: convert to pin_user_pages() and put_user_page()

2019-10-30 Thread John Hubbard
1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index 7ed2a21a0bac..635a8bc1b480 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int pin_goldfish_pages(unsigned long first_page,
*iter_last_page_size = last_page_size;
}
 
-   ret = get_user_pages_fast(first_page, requested_pages,
+   ret = pin_user_pages_fast(first_page, requested_pages,
  !is_write ? FOLL_WRITE : 0,
  pages);
if (ret <= 0)
@@ -285,18 +285,6 @@ static int pin_goldfish_pages(unsigned long first_page,
return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-  int is_write, s32 consumed_size)
-{
-   int i;
-
-   for (i = 0; i < pages_count; i++) {
-   if (!is_write && consumed_size > 0)
-   set_page_dirty(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
   int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+   put_user_pages_dirty_lock(pipe->pages, pages_count,
+ !is_write && *consumed_size > 0);
 
mutex_unlock(&pipe->lock);
return 0;
-- 
2.23.0



[PATCH 04/19] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-10-30 Thread John Hubbard
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Cc: Mauro Carvalho Chehab 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++)
+   for (i = 0; i < dma->nr_pages; i++) {
+   if (dma->direction == DMA_FROM_DEVICE)
+   set_page_dirty_lock(dma->pages[i]);
put_page(dma->pages[i]);
+   }
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.23.0



[PATCH 03/19] goldfish_pipe: rename local pin_user_pages() routine

2019-10-30 Thread John Hubbard
1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "pin_goldfish_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..7ed2a21a0bac 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
}
 }
 
-static int pin_user_pages(unsigned long first_page,
- unsigned long last_page,
- unsigned int last_page_size,
- int is_write,
- struct page *pages[MAX_BUFFERS_PER_COMMAND],
- unsigned int *iter_last_page_size)
+static int pin_goldfish_pages(unsigned long first_page,
+ unsigned long last_page,
+ unsigned int last_page_size,
+ int is_write,
+ struct page *pages[MAX_BUFFERS_PER_COMMAND],
+ unsigned int *iter_last_page_size)
 {
int ret;
int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
if (mutex_lock_interruptible(&pipe->lock))
return -ERESTARTSYS;
 
-   pages_count = pin_user_pages(first_page, last_page,
-last_page_size, is_write,
-pipe->pages, &iter_last_page_size);
+   pages_count = pin_goldfish_pages(first_page, last_page,
+last_page_size, is_write,
+pipe->pages, &iter_last_page_size);
if (pages_count < 0) {
mutex_unlock(&pipe->lock);
return pages_count;
-- 
2.23.0



[PATCH 02/19] mm/gup: factor out duplicate code from four routines

2019-10-30 Thread John Hubbard
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.

Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 113 ++-
 1 file changed, 46 insertions(+), 67 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 85caf76b3012..8fb0d9cdfaf5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1969,6 +1969,35 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, 
unsigned long addr,
 }
 #endif
 
+static int __record_subpages(struct page *page, unsigned long addr,
+unsigned long end, struct page **pages, int nr)
+{
+   int nr_recorded_pages = 0;
+
+   do {
+   pages[nr] = page;
+   nr++;
+   page++;
+   nr_recorded_pages++;
+   } while (addr += PAGE_SIZE, addr != end);
+   return nr_recorded_pages;
+}
+
+static void __remove_refs_from_head(struct page *page, int refs)
+{
+   /* Do a get_page() first, in case refs == page->_refcount */
+   get_page(page);
+   page_ref_sub(page, refs);
+   put_page(page);
+}
+
+static int __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
+{
+   *nr += nr_recorded_pages;
+   SetPageReferenced(head);
+   return 1;
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
@@ -1998,34 +2027,19 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   refs = 0;
head = pte_page(pte);
-
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-   do {
-   VM_BUG_ON(compound_head(page) != head);
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = __record_subpages(page, addr, end, pages, *nr);
 
head = try_get_compound_head(head, refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-   /* Could be optimized better */
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   __remove_refs_from_head(head, refs);
return 0;
}
-
-   SetPageReferenced(head);
-   return 1;
+   return __huge_pt_done(head, refs, nr);
 }
 
 static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
@@ -2071,30 +2085,18 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
unsigned long addr,
 pages, nr);
}
 
-   refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = __record_subpages(page, addr, end, pages, *nr);
 
head = try_get_compound_head(pmd_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   __remove_refs_from_head(head, refs);
return 0;
}
-
-   SetPageReferenced(head);
-   return 1;
+   return __huge_pt_done(head, refs, nr);
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
@@ -2114,30 +2116,18 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
unsigned long addr,
 pages, nr);
}
 
-   refs = 0;
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = __record_subpages(page, addr, end, pages, *nr);
 
head = try_get_compound_head(pud_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pud_val(orig) != 

[PATCH 00/19] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-10-30 Thread John Hubbard
Hi,

This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.

This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()

Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN was applied. The change was from this:

get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()

to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
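
At a typical call site, the conversion is mechanical. A sketch with
illustrative names (error handling elided), using the existing
put_user_pages() bulk-release helper:

    /* Before: plain gup references, released one page at a time. */
    nr = get_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);
    for (i = 0; i < nr; i++)
            put_page(pages[i]);

    /* After: FOLL_PIN tracking, released in bulk via put_user_page*(). */
    nr = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);
    put_user_pages(pages, nr);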


Patch summary:

* Patches 1-4: refactoring and preparatory cleanup, independent fixes
(Patch 4: V4L2-core bug fix (can be separately applied))

* Patch 5: introduce pin_user_pages(), FOLL_PIN, but no functional
   changes yet
* Patches 6-11: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 12: Activate tracking of FOLL_PIN pages.
* Patches 13-15: convert FOLL_LONGTERM callers
* Patches: 16-17: gup_benchmark and run_vmtests support
* Patch 18: enforce FOLL_LONGTERM as a gup-internal (only) flag
* Patch 19: Documentation/vm/pin_user_pages.rst


Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've been
  able to runtime test the core get_user_pages() and pin_user_pages() and
  related routines, but not so much on several of the call sites--but those
  are generally just a couple of lines changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Also, my runtime testing for the call sites so far is very weak:

* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a
  good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
  not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...


Next:

* Get the block/bio_vec sites converted to use pin_user_pages().

* Work with Ira and Dave Chinner to weave this together with the
  layout lease stuff.



[1] Some slow progress on get_user_pages() (Apr 2, 2019): 
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): 
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): 
https://lwn.net/Articles/753027/

John Hubbard (19):
  mm/gup: pass flags arg to __gup_device_* functions
  mm/gup: factor out duplicate code from four routines
  goldfish_pipe: rename local pin_user_pages() routine
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldfish_pipe: convert to pin_user_pages() and put_user_page()
  infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  mm/gup: track FOLL_PIN pages
  media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
conversion
  vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() 

[PATCH 01/19] mm/gup: pass flags arg to __gup_device_* functions

2019-10-30 Thread John Hubbard
A subsequent patch requires access to gup flags, so
pass the flags argument through to the __gup_device_*
functions.

Also placate checkpatch.pl by shortening a nearby line.

Cc: Kirill A. Shutemov 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..85caf76b3012 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
int nr_start = *nr;
struct dev_pagemap *pgmap = NULL;
@@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, 
unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
unsigned long fault_pfn;
int nr_start = *nr;
 
fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+   if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
return 0;
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t 
*pmdp, unsigned long addr,
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
unsigned long fault_pfn;
int nr_start = *nr;
 
fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+   if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
return 0;
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t 
*pudp, unsigned long addr,
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
BUILD_BUG();
return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
BUILD_BUG();
return 0;
@@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned 
long addr,
if (pmd_devmap(orig)) {
if (unlikely(flags & FOLL_LONGTERM))
return 0;
-   return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+   return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+pages, nr);
}
 
refs = 0;
@@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned 
long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-   unsigned long end, unsigned int flags, struct page **pages, int 
*nr)
+   unsigned long end, unsigned int flags,
+   struct page **pages, int *nr)
 {
struct page *head, *page;
int refs;
@@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned 
long addr,
if (pud_devmap(orig)) {
if (unlikely(flags & FOLL_LONGTERM))
return 0;
-   return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+   return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+pages, nr);
}
 
refs = 0;
-- 
2.23.0



Re: [PATCH v3 3/8] powerpc: Fix vDSO clock_getres()

2019-10-30 Thread Sasha Levin
Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: a7f290dad32ee [PATCH] powerpc: Merge vdso's and add vdso support 
to 32 bits kernel.

The bot has tested the following trees: v5.3.8, v4.19.81, v4.14.151, v4.9.198, 
v4.4.198.

v5.3.8: Build OK!
v4.19.81: Build OK!
v4.14.151: Failed to apply! Possible dependencies:
5c929885f1bb4 ("powerpc/vdso64: Add support for 
CLOCK_{REALTIME/MONOTONIC}_COARSE")
b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across 
Y2038")

v4.9.198: Failed to apply! Possible dependencies:
4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c")
5c929885f1bb4 ("powerpc/vdso64: Add support for 
CLOCK_{REALTIME/MONOTONIC}_COARSE")
5d451a87e5ebb ("powerpc/64: Retrieve number of L1 cache sets from 
device-tree")
7c5b06cadf274 ("KVM: PPC: Book3S HV: Adapt TLB invalidations to work on 
POWER9")
83677f551e0a6 ("KVM: PPC: Book3S HV: Adjust host/guest context switch for 
POWER9")
902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per 
task")
b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across 
Y2038")
bd067f83b0840 ("powerpc/64: Fix naming of cache block vs. cache line")
e2827fe5c1566 ("powerpc/64: Clean up ppc64_caches using a struct per cache")
e9cf1e085647b ("KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs")
f4c51f841d2ac ("KVM: PPC: Book3S HV: Modify guest entry/exit paths to 
handle radix guests")

v4.4.198: Failed to apply! Possible dependencies:
153086644fd1f ("powerpc/ftrace: Add support for -mprofile-kernel ftrace 
ABI")
3eb5d5888dc68 ("powerpc: Add ppc_strict_facility_enable boot option")
4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c")
579e633e764e6 ("powerpc: create flush_all_to_thread()")
5c929885f1bb4 ("powerpc/vdso64: Add support for 
CLOCK_{REALTIME/MONOTONIC}_COARSE")
70fe3d980f5f1 ("powerpc: Restore FPU/VEC/VSX if previously used")
85baa095497f3 ("powerpc/livepatch: Add live patching support on ppc64le")
902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per 
task")
b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across 
Y2038")
bf76f73c5f655 ("powerpc: enable UBSAN support")
c208505900b23 ("powerpc: create giveup_all()")
d1e1cf2e38def ("powerpc: clean up asm/switch_to.h")
dc4fbba11e466 ("powerpc: Create disable_kernel_{fp,altivec,vsx,spe}()")
f17c4e01e906c ("powerpc/module: Mark module stubs with a magic value")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks,
Sasha


Re: [PATCH RFC 1/5] dma/direct: turn ARCH_ZONE_DMA_BITS into a variable

2019-10-30 Thread Christoph Hellwig
On Mon, Oct 14, 2019 at 08:31:03PM +0200, Nicolas Saenz Julienne wrote:
> Some architectures, notably ARM, are interested in tweaking this
> depending on their runtime DMA addressing limitations.
> 
> Signed-off-by: Nicolas Saenz Julienne 

Do you want me to pick this up for the 5.5 dma-mapping tree, or do you
want me to wait for the rest to settle?


[PATCH 03/12] powerpc: Replace cpu_up/down with device_online/offline

2019-10-30 Thread Qais Yousef
The core device API performs extra housekeeping that is missing when
calling cpu_up/down directly.

See commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and
serialization during LPM") for an example description of what might go
wrong.

This also prepares to make cpu_up/down a private interface for anything
but the cpu subsystem.

Signed-off-by: Qais Yousef 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Enrico Weigelt 
CC: Ram Pai 
CC: Nicholas Piggin 
CC: Thiago Jung Bauermann 
CC: Christophe Leroy 
CC: Thomas Gleixner 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-ker...@vger.kernel.org
---
 arch/powerpc/kernel/machine_kexec_64.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/machine_kexec_64.c 
b/arch/powerpc/kernel/machine_kexec_64.c
index 04a7cba58eff..ebf8cc7acc4d 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -208,13 +208,15 @@ static void wake_offline_cpus(void)
 {
int cpu = 0;
 
+   lock_device_hotplug();
for_each_present_cpu(cpu) {
if (!cpu_online(cpu)) {
printk(KERN_INFO "kexec: Waking offline cpu %d.\n",
   cpu);
-   WARN_ON(cpu_up(cpu));
+   WARN_ON(device_online(get_cpu_device(cpu)));
}
}
+   unlock_device_hotplug();
 }
 
 static void kexec_prepare_cpus(void)
-- 
2.17.1



[PATCH 00/12] Convert cpu_up/down to device_online/offline

2019-10-30 Thread Qais Yousef
Using cpu_up/down directly to bring cpus online/offline loses synchronization
with sysfs and could suffer from a race similar to what is described in
commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization
during LPM").

cpu_up/down seem to be more of an internal implementation detail for the cpu
subsystem to use to boot up cpus, perform suspend/resume and low level hotplug
operations. Users outside of the cpu subsystem would be better off using the
device core API to bring a cpu online/offline, which is the interface used to
hotplug memory and other system devices.
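
For example, the pattern used by the powerpc patch in this series is:

    /* Instead of the raw hotplug call: */
    WARN_ON(cpu_up(cpu));

    /* ... go through the device core, which keeps sysfs in sync: */
    lock_device_hotplug();
    WARN_ON(device_online(get_cpu_device(cpu)));
    unlock_device_hotplug();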

Several users have already migrated to use the device core API, this series
converts the remaining users and hides cpu_up/down from internal users at the
end.

I still need to update the documentation to remove references to cpu_up/down
and advocate for device_online/offline instead, if this series makes its way
through.

I noticed this problem while working on a hack to disable offlining
a particular CPU: setting the offline_disabled attribute in the device
struct isn't enough, because users can easily bypass the device core.
While my hack isn't a valid use case, it did highlight the inconsistency
in the way cpus are being onlined/offlined, and this attempt hopefully
improves on that.

The first 6 patches fix arch users.

The next 5 patches fix generic code users, in particular creating a new
special exported API for the device core to use instead of cpu_up/down.
Maybe we can do something more restrictive than that.

The last patch removes cpu_up/down from cpu.h and unexports the functions.

In some cases where the use of cpu_up/down seemed legitimate, I encapsulated
the logic in a higher-level, special-purpose function, and converted the code
to use that instead.

I did run the rcu torture, lock torture and psci checker tests and no problem
was noticed. I did perform build tests on all affected arches except parisc.

Hopefully I got the CC list right for all the patches. Apologies in advance if
some people were omitted from some patches but they should have been CCed.

CC: Armijn Hemel 
CC: Benjamin Herrenschmidt 
CC: Bjorn Helgaas 
CC: Borislav Petkov 
CC: Boris Ostrovsky 
CC: Catalin Marinas 
CC: Christophe Leroy 
CC: Daniel Lezcano 
CC: Davidlohr Bueso 
CC: "David S. Miller" 
CC: Eiichi Tsukata 
CC: Enrico Weigelt 
CC: Fenghua Yu 
CC: Greg Kroah-Hartman 
CC: Helge Deller 
CC: "H. Peter Anvin" 
CC: Ingo Molnar 
CC: "James E.J. Bottomley" 
CC: James Morse 
CC: Jiri Kosina 
CC: Josh Poimboeuf 
CC: Josh Triplett 
CC: Juergen Gross 
CC: Lorenzo Pieralisi 
CC: Mark Rutland 
CC: Michael Ellerman 
CC: Nadav Amit 
CC: Nicholas Piggin 
CC: "Paul E. McKenney" 
CC: Paul Mackerras 
CC: Pavankumar Kondeti 
CC: "Peter Zijlstra (Intel)" 
CC: "Rafael J. Wysocki" 
CC: Ram Pai 
CC: Richard Fontana 
CC: Sakari Ailus 
CC: Stefano Stabellini 
CC: Steve Capper 
CC: Thiago Jung Bauermann 
CC: Thomas Gleixner 
CC: Tony Luck 
CC: Will Deacon 
CC: Zhenzhong Duan 
CC: linux-arm-ker...@lists.infradead.org
CC: linux-i...@vger.kernel.org
CC: linux-ker...@vger.kernel.org
CC: linux-par...@vger.kernel.org
CC: linuxppc-dev@lists.ozlabs.org
CC: sparcli...@vger.kernel.org
CC: x...@kernel.org
CC: xen-de...@lists.xenproject.org


Qais Yousef (12):
  arm64: hibernate.c: create a new function to handle cpu_up(sleep_cpu)
  x86: Replace cpu_up/down with device_online/offline
  powerpc: Replace cpu_up/down with device_online/offline
  ia64: Replace cpu_down with freeze_secondary_cpus
  sparc: Replace cpu_up/down with device_online/offline
  parisc: Replace cpu_up/down with device_online/offline
  driver: base: cpu: export device_online/offline
  driver: xen: Replace cpu_up/down with device_online/offline
  firmware: psci: Replace cpu_up/down with device_online/offline
  torture: Replace cpu_up/down with device_online/offline
  smp: Create a new function to bringup nonboot cpus online
  cpu: Hide cpu_up/down

 arch/arm64/kernel/hibernate.c  | 13 +++
 arch/ia64/kernel/process.c |  8 +---
 arch/parisc/kernel/processor.c |  4 +-
 arch/powerpc/kernel/machine_kexec_64.c |  4 +-
 arch/sparc/kernel/ds.c |  8 +++-
 arch/x86/kernel/topology.c |  4 +-
 arch/x86/mm/mmio-mod.c |  8 +++-
 arch/x86/xen/smp.c |  4 +-
 drivers/base/core.c|  4 ++
 drivers/base/cpu.c |  4 +-
 drivers/firmware/psci/psci_checker.c   |  6 ++-
 drivers/xen/cpu_hotplug.c  |  2 +-
 include/linux/cpu.h|  6 ++-
 kernel/cpu.c   | 53 --
 kernel/smp.c   |  9 +
 kernel/torture.c   | 15 ++--
 16 files changed, 106 insertions(+), 46 deletions(-)

-- 
2.17.1



Section mismatch warnings on powerpc

2019-10-30 Thread Qian Cai
Still see those,

WARNING: vmlinux.o(.text+0x2d04): Section mismatch in reference from the
variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references
the function __init prom_init().
This is often because __boot_from_prom lacks a __init
annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x2ec8): Section mismatch in reference from the
variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references
the function __init start_kernel().
This is often because start_here_common lacks a __init
annotation or the annotation of start_kernel is wrong.

There is a patch around,

http://patchwork.ozlabs.org/patch/895442/

Is it still waiting for Michael to come up with some better names?


Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc

2019-10-30 Thread Christophe Leroy




On 30/10/2019 at 19:30, Kees Cook wrote:

On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote:



On 30/10/2019 at 08:31, Russell Currey wrote:

v4 cover letter: 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html
v3 cover letter: 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html

Changes since v4:
[1/5]: Addressed review comments from Michael Ellerman (thanks!)
[4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on
   ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid
   STRICT_MODULE_RWX being *on by default* in cases where
   STRICT_KERNEL_RWX is *unavailable*
[5/5]: split skiroot_defconfig changes out into its own patch

The whole Kconfig situation is really weird and confusing, I believe the
correct resolution is to change arch/Kconfig but the consequences are so
minor that I don't think it's worth it, especially given that I expect
powerpc to have mandatory strict RWX Soon(tm).


I'm not sure strict RWX can be made mandatory due to the impact it has on
some subarches:
- On the 8xx, unless all areas are 8Mbytes aligned, there is a significant
overhead on TLB misses. And aligning everything to 8M is a waste of RAM, which
is not acceptable on systems having very little RAM.
- On hash book3s32, we are able to map the kernel BATs. With a few alignment
constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to
provide exec protection on page granularity. Only on 256Mbytes segments. So
for modules, we have to have the vmspace X. It is also not possible to have
a kernel area RO. Only user areas can be made RO.


As I understand it, the idea was for it to be mandatory (or at least
default-on) only for the subarches where it wasn't totally insane to
accomplish. :) (I'm not familiar with all the details on the subarchs,
but it sounded like the more modern systems would be the targets for
this?)



Yes I guess so.

I was just worried by "I expect powerpc to have mandatory strict RWX"

By the way, we have an open issue #223, namely 'Make strict kernel RWX 
the default on 64-bit', so no worries, as 32-bit is set aside for now.


Christophe


Re: [PATCH v4 0/4] Implement STRICT_MODULE_RWX for powerpc

2019-10-30 Thread Kees Cook
On Wed, Oct 30, 2019 at 11:16:22AM +1100, Michael Ellerman wrote:
> Kees Cook  writes:
> > On Mon, Oct 14, 2019 at 04:13:16PM +1100, Russell Currey wrote:
> >> v3 cover letter here:
> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html
> >> 
> >> Only minimal changes since then:
> >> 
> >> - patch 2/4 commit message update thanks to Andrew Donnellan
> >> - patch 3/4 made neater thanks to Christophe Leroy
> >> - patch 3/4 updated Kconfig description thanks to Daniel Axtens
> >
> > I continue to be excited about this work. :) Is there anything holding
> > it back from landing in linux-next?
> 
> I had some concerns, which I stupidly posted in reply to v3:
> 
>   https://lore.kernel.org/linuxppc-dev/87pnio5fva@mpe.ellerman.id.au/

Ah-ha! Thanks; I missed that. :)

-- 
Kees Cook


Re: [PATCH v2] powerpc/imc: Dont create debugfs files for cpu-less nodes

2019-10-30 Thread Qian Cai
On Tue, 2019-07-23 at 16:57 +0530, Anju T Sudhakar wrote:
> Hi Qian,
> 
> On 7/16/19 12:11 AM, Qian Cai wrote:
> > On Thu, 2019-07-11 at 14:53 +1000, Michael Ellerman wrote:
> > > Hi Maddy,
> > > 
> > > Madhavan Srinivasan  writes:
> > > > diff --git a/arch/powerpc/platforms/powernv/opal-imc.c
> > > > b/arch/powerpc/platforms/powernv/opal-imc.c
> > > > index 186109bdd41b..e04b20625cb9 100644
> > > > --- a/arch/powerpc/platforms/powernv/opal-imc.c
> > > > +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> > > > @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct 
> > > > device_node
> > > > *node,
> > > >     if (of_property_read_u32(node, "cb_offset", &cb_offset))
> > > >     cb_offset = IMC_CNTL_BLK_OFFSET;
> > > >   
> > > > -   for_each_node(nid) {
> > > > -   loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset;
> > > > +   while (ptr->vbase != NULL) {
> > > 
> > > This means you'll bail out as soon as you find a node with no vbase, but
> > > it's possible we could have a CPU-less node intermingled with other
> > > nodes.
> > > 
> > > So I think you want to keep the for loop, but continue if you see a NULL
> > > vbase?
> > 
> > Not sure if this will also take care of some of those messages during the 
> > boot
> > on today's linux-next even without this patch.
> > 
> > 
> > [   18.077780][T1] debugfs: Directory 'imc' with parent 'powerpc' 
> > already
> > present!
> > 
> > 
> 
> This is introduced by a recent commit: c33d442328f55 (debugfs: make 
> error message a bit more verbose).
> 
> So basically, the debugfs imc_* file is created per node, and is created 
> by the first nest unit which is
> 
> being registered. For the subsequent nest units, debugfs_create_dir() 
> will just return since the imc_* file already
> 
> exists.
> 
> The commit "c33d442328f55 (debugfs: make error message a bit more 
> verbose)", prints
> 
> a message if the debugfs file already exists in debugfs_create_dir(). 
> That is why we are encountering these
> 
> messages now.
> 
> 
> This patch (i.e, powerpc/imc: Dont create debugfs files for cpu-less 
> nodes) will address the initial issue, i.e
> 
> "numa crash while reading imc_* debugfs files for cpu less nodes", and 
> will not address these debugfs messages.
> 
> 
> But yeah this is a good catch. We can have some checks to avoid these 
> debugfs messages.

Anju, do you still plan to fix those "Directory 'imc' with parent 'powerpc'
already present!" warnings as they are still there in the latest linux-next?

> 
> 
> Hi Michael,
> 
> Do we need to have a separate patch to address these debugfs messages, 
> or can we address the same
> 
> in the next version of this patch itself?
> 
> 
> Thanks,
> 
> Anju
> 
> 
> 
> 


Re: [PATCH 00/11] powerpv/powernv: Restore pnv_npu_try_dma_set_bypass()

2019-10-30 Thread Christoph Hellwig
On Wed, Oct 30, 2019 at 01:32:01PM -0500, Reza Arbab wrote:
> Aha, it's this again. Didn't catch your meaning at first. Point taken.

It's not _me_.  It's that you (plural) keep ignoring how Linux development
works.


Re: [PATCH 00/11] powerpv/powernv: Restore pnv_npu_try_dma_set_bypass()

2019-10-30 Thread Reza Arbab

On Wed, Oct 30, 2019 at 07:13:59PM +0100, Christoph Hellwig wrote:

On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote:

On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote:

How do you even use this code?  Nothing in the kernel even calls
dma_set_mask for NPU devices, as we only support vfio pass through.


You use it by calling dma_set_mask() for the *GPU* device. The purpose of
pnv_npu_try_dma_set_bypass() is to then propagate the same bypass
configuration to all the NPU devices associated with that GPU.


Which in-kernel driver, which PCI ID?


Aha, it's this again. Didn't catch your meaning at first. Point taken.

--
Reza Arbab


Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc

2019-10-30 Thread Kees Cook
On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote:
> 
> 
> On 30/10/2019 at 08:31, Russell Currey wrote:
> > v4 cover letter: 
> > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html
> > v3 cover letter: 
> > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html
> > 
> > Changes since v4:
> > [1/5]: Addressed review comments from Michael Ellerman (thanks!)
> > [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on
> >ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid
> >STRICT_MODULE_RWX being *on by default* in cases where
> >STRICT_KERNEL_RWX is *unavailable*
> > [5/5]: split skiroot_defconfig changes out into its own patch
> > 
> > The whole Kconfig situation is really weird and confusing, I believe the
> > correct resolution is to change arch/Kconfig but the consequences are so
> > minor that I don't think it's worth it, especially given that I expect
> > powerpc to have mandatory strict RWX Soon(tm).
> 
> I'm not sure strict RWX can be made mandatory due to the impact it has on
> some subarches:
> - On the 8xx, unless all areas are 8Mbytes aligned, there is a significant
> overhead on TLB misses. And aligning everything to 8M is a waste of RAM, which
> is not acceptable on systems having very little RAM.
> - On hash book3s32, we are able to map the kernel BATs. With a few alignment
> constraints, we are able to provide STRICT_KERNEL_RWX. But we are unable to
> provide exec protection on page granularity. Only on 256Mbytes segments. So
> for modules, we have to have the vmspace X. It is also not possible to have
> a kernel area RO. Only user areas can be made RO.

As I understand it, the idea was for it to be mandatory (or at least
default-on) only for the subarches where it wasn't totally insane to
accomplish. :) (I'm not familiar with all the details on the subarchs,
but it sounded like the more modern systems would be the targets for
this?)

-- 
Kees Cook


Re: [PATCH 00/11] powerpv/powernv: Restore pnv_npu_try_dma_set_bypass()

2019-10-30 Thread Christoph Hellwig
On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote:
> On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote:
>> How do you even use this code?  Nothing in the kernel even calls
>> dma_set_mask for NPU devices, as we only support vfio pass through.
>
> You use it by calling dma_set_mask() for the *GPU* device. The purpose of 
> pnv_npu_try_dma_set_bypass() is to then propagate the same bypass 
> configuration to all the NPU devices associated with that GPU.

Which in-kernel driver, which PCI ID?


Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()

2019-10-30 Thread Reza Arbab

On Wed, Oct 30, 2019 at 06:55:18PM +0100, Christoph Hellwig wrote:

On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote:

Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by
separating the part of the function that determines if bypass is
supported from the part that actually attempts to configure it.

Move the latter to a controller-specific dma_set_mask() callback.


Nak, the dma_set_mask overrides are going away.  But as said in the
reply to the cover letter I don't even see how you could end up calling
this code.


Okay. As mentioned in the cover letter these last few patches could be 
omitted if that's the case.


--
Reza Arbab


Re: [PATCH 00/11] powerpv/powernv: Restore pnv_npu_try_dma_set_bypass()

2019-10-30 Thread Reza Arbab

On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote:

How do you even use this code?  Nothing in the kernel even calls
dma_set_mask for NPU devices, as we only support vfio pass through.


You use it by calling dma_set_mask() for the *GPU* device. The purpose 
of pnv_npu_try_dma_set_bypass() is to then propagate the same bypass 
configuration to all the NPU devices associated with that GPU.
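
For reference, a sketch of the triggering call ("gpu_pdev" is a
hypothetical GPU pci_dev; illustrative only):

    static int example_enable_bypass(struct pci_dev *gpu_pdev)
    {
            /*
             * Setting a 64-bit mask on the GPU is what (indirectly)
             * reaches pnv_npu_try_dma_set_bypass(), which then mirrors
             * the bypass decision onto the linked NPU devices.
             */
            return dma_set_mask(&gpu_pdev->dev, DMA_BIT_MASK(64));
    }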


--
Reza Arbab


Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()

2019-10-30 Thread Christoph Hellwig
On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote:
> Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by
> separating the part of the function that determines if bypass is
> supported from the part that actually attempts to configure it.
> 
> Move the latter to a controller-specific dma_set_mask() callback.

Nak, the dma_set_mask overrides are going away.  But as said in the
reply to the cover letter I don't even see how you could end up calling
this code.


Re: [PATCH 00/11] powerpv/powernv: Restore pnv_npu_try_dma_set_bypass()

2019-10-30 Thread Christoph Hellwig
On Wed, Oct 30, 2019 at 11:59:49AM -0500, Reza Arbab wrote:
> With recent kernels, TCE tables for NPU devices are no longer being
> configured. That task was performed by pnv_npu_try_dma_set_bypass(), a
> function that got swept away in recent overhauling of dma code.
> 
> Patches 1-4 here bring the lost function back and reintegrate it with
> the updated generic iommu bypass infrastructure.
> 
> Patch 5 fixes a regression in behavior when a requested dma mask can not
> be fulfilled.
> 
> Patches 6-8 are cleanup. I put these later in the set because they
> aren't bisectable until after the restored code is wired back in.
> 
> Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems
> wrong for a boolean *_supported() function to have side effects. They
> reintroduce a pci controller based dma_set_mask() hook. If that's
> undesirable, these last three patches can be dropped.

How do you even use this code?  Nothing in the kernel even calls
dma_set_mask for NPU devices, as we only support vfio pass through.


[PATCH 03/11] powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument

2019-10-30 Thread Reza Arbab
To enable simpler calling code, change this function to find the value
of bypass instead of taking it as an argument.

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 12 +---
 arch/powerpc/platforms/powernv/pci.h |  2 +-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 5a8313654033..a6b8c7ad36e4 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -258,13 +258,21 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
return rc;
 }
 
-void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
+void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask)
 {
+   struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev);
int i;
struct pnv_phb *phb;
struct pci_dn *pdn;
struct pnv_ioda_pe *npe;
struct pci_dev *npdev;
+   bool bypass;
+
+   if (!gpe)
+   return;
+
+   /* We only do bypass if it's enabled on the linked device */
+   bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask);
 
for (i = 0; ; ++i) {
npdev = pnv_pci_get_npu_dev(gpdev, i);
@@ -277,8 +285,6 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool 
bypass)
return;
 
phb = pci_bus_to_host(npdev->bus)->private_data;
-
-   /* We only do bypass if it's enabled on the linked device */
npe = &phb->ioda.pe_array[pdn->pe_number];
 
if (bypass) {
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index 41f7dec3aee5..21db0f4cfb11 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -211,7 +211,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, 
const char *level,
pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__)
 
 /* Nvlink functions */
-extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
+extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
 extern struct iommu_table_group *pnv_try_setup_npu_table_group(
-- 
1.8.3.1



[PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()

2019-10-30 Thread Reza Arbab
With recent kernels, TCE tables for NPU devices are no longer being
configured. That task was performed by pnv_npu_try_dma_set_bypass(), a
function that got swept away in the recent overhaul of the DMA code.

Patches 1-4 here bring the lost function back and reintegrate it with
the updated generic iommu bypass infrastructure.

Patch 5 fixes a regression in behavior when a requested dma mask cannot
be fulfilled.

Patches 6-8 are cleanup. I put these later in the set because they
aren't bisectable until after the restored code is wired back in.

Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems
wrong for a boolean *_supported() function to have side effects. They
reintroduce a pci controller based dma_set_mask() hook. If that's
undesirable, these last three patches can be dropped.

Reza Arbab (11):
  Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass()
function"
  powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported()
  powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument
  powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass()
  powerpc/powernv: Return failure for some uses of dma_set_mask()
  powerpc/powernv: Remove intermediate variable
  powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop
  powerpc/powernv: Replace open coded pnv_ioda_get_pe()s
  Revert "powerpc/pci: remove the dma_set_mask pci_controller ops
methods"
  powerpc/powernv: Add pnv_phb3_iommu_bypass_supported()
  powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()

 arch/powerpc/include/asm/pci-bridge.h |   2 +
 arch/powerpc/kernel/dma-iommu.c   |  19 --
 arch/powerpc/kernel/dma-mask.c|   9 +++
 arch/powerpc/platforms/powernv/Kconfig|   1 +
 arch/powerpc/platforms/powernv/npu-dma.c  | 106 +++---
 arch/powerpc/platforms/powernv/pci-ioda.c |  71 
 arch/powerpc/platforms/powernv/pci.h  |  10 ++-
 7 files changed, 177 insertions(+), 41 deletions(-)

-- 
1.8.3.1



[PATCH 09/11] Revert "powerpc/pci: remove the dma_set_mask pci_controller ops methods"

2019-10-30 Thread Reza Arbab
Bring back the pci controller based hook in dma_set_mask(), as it will
have a user again.

This reverts commit 662acad4067a ("powerpc/pci: remove the dma_set_mask
pci_controller ops methods"). The callback signature has been adjusted
with void return to fit its caller.

Signed-off-by: Reza Arbab 
Cc: Christoph Hellwig 
---
 arch/powerpc/include/asm/pci-bridge.h | 2 ++
 arch/powerpc/kernel/dma-mask.c| 9 +
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index ea6ec65970ef..8512dcd053c5 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -43,6 +43,8 @@ struct pci_controller_ops {
void(*teardown_msi_irqs)(struct pci_dev *pdev);
 #endif
 
+   void(*dma_set_mask)(struct pci_dev *pdev, u64 dma_mask);
+
void(*shutdown)(struct pci_controller *hose);
 };
 
diff --git a/arch/powerpc/kernel/dma-mask.c b/arch/powerpc/kernel/dma-mask.c
index ffbbbc432612..35b5fd1b03a6 100644
--- a/arch/powerpc/kernel/dma-mask.c
+++ b/arch/powerpc/kernel/dma-mask.c
@@ -2,11 +2,20 @@
 
 #include 
 #include 
+#include 
 #include 
 
 void arch_dma_set_mask(struct device *dev, u64 dma_mask)
 {
if (ppc_md.dma_set_mask)
ppc_md.dma_set_mask(dev, dma_mask);
+
+   if (dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_controller *phb = pci_bus_to_host(pdev->bus);
+
+   if (phb->controller_ops.dma_set_mask)
+   phb->controller_ops.dma_set_mask(pdev, dma_mask);
+   }
 }
 EXPORT_SYMBOL(arch_dma_set_mask);
-- 
1.8.3.1



[PATCH 02/11] powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported()

2019-10-30 Thread Reza Arbab
This little calculation will be needed in other places. Move it to a
convenience function.

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++-
 arch/powerpc/platforms/powernv/pci.h  | 8 
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index c28d0d9b7ee0..8849218187d7 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1838,11 +1838,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 
pe = >ioda.pe_array[pdn->pe_number];
-   if (pe->tce_bypass_enabled) {
-   u64 top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
-   if (dma_mask >= top)
-   return true;
-   }
+
+   if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask))
+   return true;
 
/*
 * If the device can't set the TCE bypass bit but still wants
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index f914f0b14e4e..41f7dec3aee5 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -4,6 +4,7 @@
 
 #include /* for __printf */
 #include 
+#include 
 #include 
 #include 
 
@@ -247,4 +248,11 @@ extern void pnv_pci_setup_iommu_table(struct iommu_table 
*tbl,
void *tce_mem, u64 tce_size,
u64 dma_offset, unsigned int page_shift);
 
+static inline bool pnv_ioda_pe_iommu_bypass_supported(struct pnv_ioda_pe *pe,
+ u64 mask)
+{
+   return pe->tce_bypass_enabled &&
+  mask >= pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
+}
+
 #endif /* __POWERNV_PCI_H */
-- 
1.8.3.1



[PATCH 10/11] powerpc/powernv: Add pnv_phb3_iommu_bypass_supported()

2019-10-30 Thread Reza Arbab
Move this code to its own function for reuse. As a side benefit,
rearrange the comments and spread things out for readability.

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 37 +--
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 6b932cfc0deb..57e6a43d9a3a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1826,6 +1826,30 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct 
pnv_ioda_pe *pe)
return -EIO;
 }
 
+/*
+ * If the device can't set the TCE bypass bit but still wants
+ * to access 4GB or more, on PHB3 we can reconfigure TVE#0 to
+ * bypass the 32-bit region and be usable for 64-bit DMAs.
+ */
+static bool pnv_phb3_iommu_bypass_supported(struct pnv_ioda_pe *pe, u64 mask)
+{
+   if (pe->phb->model != PNV_PHB_MODEL_PHB3)
+   return false;
+
+   /* pe->pdev should be set if it's a single device, pe->pbus if not */
+   if (pe->pbus && pe->device_count != 1)
+   return false;
+
+   if (!(mask >> 32))
+   return false;
+
+   /* The device needs to be able to address all of this space. */
+   if (mask <= memory_hotplug_max() + (1ULL << 32))
+   return false;
+
+   return true;
+}
+
 static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev,
u64 dma_mask)
 {
@@ -1837,18 +1861,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
 
bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask);
 
-   /*
-* If the device can't set the TCE bypass bit but still wants
-* to access 4GB or more, on PHB3 we can reconfigure TVE#0 to
-* bypass the 32-bit region and be usable for 64-bit DMAs.
-* The device needs to be able to address all of this space.
-*/
-   if (!bypass &&
-   dma_mask >> 32 &&
-   dma_mask > (memory_hotplug_max() + (1ULL << 32)) &&
-   /* pe->pdev should be set if it's a single device, pe->pbus if not 
*/
-   (pe->device_count == 1 || !pe->pbus) &&
-   pe->phb->model == PNV_PHB_MODEL_PHB3) {
+   if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) {
/* Configure the bypass mode */
if (pnv_pci_ioda_dma_64bit_bypass(pe))
return false;
-- 
1.8.3.1



[PATCH 07/11] powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop

2019-10-30 Thread Reza Arbab
Write this loop more compactly to improve readability.

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index a6b8c7ad36e4..a77ce7d71634 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -261,12 +261,12 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask)
 {
struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev);
-   int i;
struct pnv_phb *phb;
struct pci_dn *pdn;
struct pnv_ioda_pe *npe;
struct pci_dev *npdev;
bool bypass;
+   int i = 0;
 
if (!gpe)
return;
@@ -274,12 +274,7 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 
mask)
/* We only do bypass if it's enabled on the linked device */
bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask);
 
-   for (i = 0; ; ++i) {
-   npdev = pnv_pci_get_npu_dev(gpdev, i);
-
-   if (!npdev)
-   break;
-
+   while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) {
pdn = pci_get_pdn(npdev);
if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
return;
-- 
1.8.3.1



[PATCH 01/11] Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function"

2019-10-30 Thread Reza Arbab
Revert commit b4d37a7b6934 ("powerpc/powernv: Remove unused
pnv_npu_try_dma_set_bypass() function") so that this function can be
reintegrated.

Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code")
Signed-off-by: Reza Arbab 
Cc: Christoph Hellwig 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 99 
 1 file changed, 99 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index b95b9e3c4c98..5a8313654033 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -193,6 +193,105 @@ static long pnv_npu_unset_window(struct iommu_table_group 
*table_group, int num)
return 0;
 }
 
+/*
+ * Enables 32 bit DMA on NPU.
+ */
+static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
+{
+   struct pci_dev *gpdev;
+   struct pnv_ioda_pe *gpe;
+   int64_t rc;
+
+   /*
+* Find the associated PCI devices and get the dma window
+* information from there.
+*/
+   if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV))
+   return;
+
+   gpe = get_gpu_pci_dev_and_pe(npe, );
+   if (!gpe)
+   return;
+
+   rc = pnv_npu_set_window(>table_group, 0,
+   gpe->table_group.tables[0]);
+
+   /*
+* NVLink devices use the same TCE table configuration as
+* their parent device so drivers shouldn't be doing DMA
+* operations directly on these devices.
+*/
+   set_dma_ops(>pdev->dev, _dummy_ops);
+}
+
+/*
+ * Enables bypass mode on the NPU. The NPU only supports one
+ * window per link, so bypass needs to be explicitly enabled or
+ * disabled. Unlike for a PHB3, bypass and non-bypass modes can't be
+ * active at the same time.
+ */
+static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
+{
+   struct pnv_phb *phb = npe->phb;
+   int64_t rc = 0;
+   phys_addr_t top = memblock_end_of_DRAM();
+
+   if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev)
+   return -EINVAL;
+
+   rc = pnv_npu_unset_window(>table_group, 0);
+   if (rc != OPAL_SUCCESS)
+   return rc;
+
+   /* Enable the bypass window */
+
+   top = roundup_pow_of_two(top);
+   dev_info(>pdev->dev, "Enabling bypass for PE %x\n",
+   npe->pe_number);
+   rc = opal_pci_map_pe_dma_window_real(phb->opal_id,
+   npe->pe_number, npe->pe_number,
+   0 /* bypass base */, top);
+
+   if (rc == OPAL_SUCCESS)
+   pnv_pci_ioda2_tce_invalidate_entire(phb, false);
+
+   return rc;
+}
+
+void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass)
+{
+   int i;
+   struct pnv_phb *phb;
+   struct pci_dn *pdn;
+   struct pnv_ioda_pe *npe;
+   struct pci_dev *npdev;
+
+   for (i = 0; ; ++i) {
+   npdev = pnv_pci_get_npu_dev(gpdev, i);
+
+   if (!npdev)
+   break;
+
+   pdn = pci_get_pdn(npdev);
+   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+   return;
+
+   phb = pci_bus_to_host(npdev->bus)->private_data;
+
+   /* We only do bypass if it's enabled on the linked device */
+   npe = >ioda.pe_array[pdn->pe_number];
+
+   if (bypass) {
+   dev_info(>dev,
+   "Using 64-bit DMA iommu bypass\n");
+   pnv_npu_dma_set_bypass(npe);
+   } else {
+   dev_info(>dev, "Using 32-bit DMA via iommu\n");
+   pnv_npu_dma_set_32(npe);
+   }
+   }
+}
+
 /* Switch ownership from platform code to external user (e.g. VFIO) */
 static void pnv_npu_take_ownership(struct iommu_table_group *table_group)
 {
-- 
1.8.3.1



[PATCH 08/11] powerpc/powernv: Replace open coded pnv_ioda_get_pe()s

2019-10-30 Thread Reza Arbab
Collapse several open coded instances of pnv_ioda_get_pe().

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/npu-dma.c  | 22 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 +++---
 2 files changed, 8 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index a77ce7d71634..0010b21d45b8 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -97,24 +97,17 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, 
int index)
 static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe,
  struct pci_dev **gpdev)
 {
-   struct pnv_phb *phb;
-   struct pci_controller *hose;
struct pci_dev *pdev;
struct pnv_ioda_pe *pe;
-   struct pci_dn *pdn;
 
pdev = pnv_pci_get_gpu_dev(npe->pdev);
if (!pdev)
return NULL;
 
-   pdn = pci_get_pdn(pdev);
-   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+   pe = pnv_ioda_get_pe(pdev);
+   if (WARN_ON(!pe))
return NULL;
 
-   hose = pci_bus_to_host(pdev->bus);
-   phb = hose->private_data;
-   pe = >ioda.pe_array[pdn->pe_number];
-
if (gpdev)
*gpdev = pdev;
 
@@ -261,9 +254,6 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
 void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask)
 {
struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev);
-   struct pnv_phb *phb;
-   struct pci_dn *pdn;
-   struct pnv_ioda_pe *npe;
struct pci_dev *npdev;
bool bypass;
int i = 0;
@@ -275,12 +265,10 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, 
u64 mask)
bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask);
 
while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) {
-   pdn = pci_get_pdn(npdev);
-   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
-   return;
+   struct pnv_ioda_pe *npe = pnv_ioda_get_pe(npdev);
 
-   phb = pci_bus_to_host(npdev->bus)->private_data;
-   npe = >ioda.pe_array[pdn->pe_number];
+   if (WARN_ON(!npe))
+   return;
 
if (bypass) {
dev_info(>dev,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 319152d30bc3..6b932cfc0deb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1829,16 +1829,12 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct 
pnv_ioda_pe *pe)
 static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev,
u64 dma_mask)
 {
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct pci_dn *pdn = pci_get_pdn(pdev);
-   struct pnv_ioda_pe *pe;
+   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev);
bool bypass;
 
-   if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+   if (WARN_ON(!pe))
return false;
 
-   pe = >ioda.pe_array[pdn->pe_number];
bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask);
 
/*
@@ -1852,7 +1848,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
dma_mask > (memory_hotplug_max() + (1ULL << 32)) &&
/* pe->pdev should be set if it's a single device, pe->pbus if not 
*/
(pe->device_count == 1 || !pe->pbus) &&
-   phb->model == PNV_PHB_MODEL_PHB3) {
+   pe->phb->model == PNV_PHB_MODEL_PHB3) {
/* Configure the bypass mode */
if (pnv_pci_ioda_dma_64bit_bypass(pe))
return false;
-- 
1.8.3.1



[PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()

2019-10-30 Thread Reza Arbab
Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by
separating the part of the function that determines if bypass is
supported from the part that actually attempts to configure it.

Move the latter to a controller-specific dma_set_mask() callback.

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/Kconfig|  1 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 30 --
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index 938803eab0ad..6e6e27841764 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -17,6 +17,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
+   select ARCH_HAS_DMA_SET_MASK
default y
 
 config OPAL_PRD
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 57e6a43d9a3a..5291464930ed 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1854,32 +1854,33 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
u64 dma_mask)
 {
struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev);
-   bool bypass;
 
if (WARN_ON(!pe))
return false;
 
-   bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask);
+   return pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask) ||
+  pnv_phb3_iommu_bypass_supported(pe, dma_mask);
+}
+
+static void pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 mask)
+{
+   struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev);
+
+   if (!pe)
+   return;
 
-   if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) {
+   if (!pnv_ioda_pe_iommu_bypass_supported(pe, mask) &&
+   pnv_phb3_iommu_bypass_supported(pe, mask)) {
/* Configure the bypass mode */
if (pnv_pci_ioda_dma_64bit_bypass(pe))
-   return false;
+   return;
 
/* 4GB offset bypasses 32-bit space */
pdev->dev.archdata.dma_offset = (1ULL << 32);
-
-   bypass = true;
}
 
-   /*
-* Update peer npu devices. We also do this for the special case where
-* a 64-bit dma mask can't be fulfilled and falls back to default.
-*/
-   if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64))
-   pnv_npu_try_dma_set_bypass(pdev, dma_mask);
-
-   return bypass;
+   /* Update peer npu devices */
+   pnv_npu_try_dma_set_bypass(pdev, mask);
 }
 
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
@@ -3612,6 +3613,7 @@ static void pnv_pci_ioda_shutdown(struct pci_controller 
*hose)
 static const struct pci_controller_ops pnv_pci_ioda_controller_ops = {
.dma_dev_setup  = pnv_pci_dma_dev_setup,
.dma_bus_setup  = pnv_pci_dma_bus_setup,
+   .dma_set_mask   = pnv_pci_ioda_dma_set_mask,
.iommu_bypass_supported = pnv_pci_ioda_iommu_bypass_supported,
.setup_msi_irqs = pnv_setup_msi_irqs,
.teardown_msi_irqs  = pnv_teardown_msi_irqs,
-- 
1.8.3.1



[PATCH 05/11] powerpc/powernv: Return failure for some uses of dma_set_mask()

2019-10-30 Thread Reza Arbab
Rework of pnv_pci_ioda_dma_set_mask() effectively reverted commit
253fd51e2f53 ("powerpc/powernv/pci: Return failure for some uses of
dma_set_mask()").

Reintroduce the desired behavior that an unfulfilled request for a DMA
mask between 32 and 64 bits will return an error instead of silently
falling back to 32 bits.
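
For illustration, the driver-visible semantics being restored
(hypothetical driver code, not part of the patch):

	/* With this behavior, a 33..63-bit mask the platform can't satisfy
	 * makes dma_set_mask() fail instead of quietly leaving the device
	 * on 32-bit ops, so the driver can fall back explicitly. */
	if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(48))) {
		dev_warn(&pdev->dev, "48-bit DMA unavailable, trying 32-bit\n");
		if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)))
			return -EIO;
	}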

Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code")
Signed-off-by: Reza Arbab 
Cc: Christoph Hellwig 
---
 arch/powerpc/kernel/dma-iommu.c   | 19 +++
 arch/powerpc/platforms/powernv/pci-ioda.c |  8 ++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index e486d1d78de2..e1693760654b 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -122,10 +122,21 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
 
-   if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
-   dev->archdata.iommu_bypass = true;
-   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
-   return 1;
+   if (dev_is_pci(dev)) {
+   if (dma_iommu_bypass_supported(dev, mask)) {
+   dev->archdata.iommu_bypass = true;
+   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
+   return 1;
+   }
+
+   if (mask >> 32) {
+   dev_err(dev, "Warning: IOMMU dma bypass not supported: 
mask 0x%016llx\n",
+   mask);
+
+   /* A 64-bit request falls back to default ops */
+   if (mask != DMA_BIT_MASK(64))
+   return 0;
+   }
}
 
if (!tbl) {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 70e834635971..b78b5e81f941 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1863,8 +1863,12 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
bypass = true;
}
 
-   /* Update peer npu devices */
-   pnv_npu_try_dma_set_bypass(pdev, dma_mask);
+   /*
+* Update peer npu devices. We also do this for the special case where
+* a 64-bit dma mask can't be fulfilled and falls back to default.
+*/
+   if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64))
+   pnv_npu_try_dma_set_bypass(pdev, dma_mask);
 
return bypass;
 }
-- 
1.8.3.1



[PATCH 04/11] powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass()

2019-10-30 Thread Reza Arbab
Rework of pnv_pci_ioda_iommu_bypass_supported() dropped a call to
pnv_npu_try_dma_set_bypass(). Reintroduce this call, so that the DMA
bypass configuration of a GPU device is propagated to its corresponding
NPU devices.

Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code")
Signed-off-by: Reza Arbab 
Cc: Christoph Hellwig 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8849218187d7..70e834635971 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1833,14 +1833,13 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
struct pnv_phb *phb = hose->private_data;
struct pci_dn *pdn = pci_get_pdn(pdev);
struct pnv_ioda_pe *pe;
+   bool bypass;
 
if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
return false;
 
pe = >ioda.pe_array[pdn->pe_number];
-
-   if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask))
-   return true;
+   bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask);
 
/*
 * If the device can't set the TCE bypass bit but still wants
@@ -1848,7 +1847,8 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
 * bypass the 32-bit region and be usable for 64-bit DMAs.
 * The device needs to be able to address all of this space.
 */
-   if (dma_mask >> 32 &&
+   if (!bypass &&
+   dma_mask >> 32 &&
dma_mask > (memory_hotplug_max() + (1ULL << 32)) &&
/* pe->pdev should be set if it's a single device, pe->pbus if not 
*/
(pe->device_count == 1 || !pe->pbus) &&
@@ -1859,10 +1859,14 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
/* 4GB offset bypasses 32-bit space */
pdev->dev.archdata.dma_offset = (1ULL << 32);
-   return true;
+
+   bypass = true;
}
 
-   return false;
+   /* Update peer npu devices */
+   pnv_npu_try_dma_set_bypass(pdev, dma_mask);
+
+   return bypass;
 }
 
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
-- 
1.8.3.1



[PATCH 06/11] powerpc/powernv: Remove intermediate variable

2019-10-30 Thread Reza Arbab
Trim the pointless temporary variable.

Signed-off-by: Reza Arbab 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index b78b5e81f941..319152d30bc3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1854,9 +1854,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
(pe->device_count == 1 || !pe->pbus) &&
phb->model == PNV_PHB_MODEL_PHB3) {
/* Configure the bypass mode */
-   s64 rc = pnv_pci_ioda_dma_64bit_bypass(pe);
-   if (rc)
+   if (pnv_pci_ioda_dma_64bit_bypass(pe))
return false;
+
/* 4GB offset bypasses 32-bit space */
pdev->dev.archdata.dma_offset = (1ULL << 32);
 
-- 
1.8.3.1



Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic

2019-10-30 Thread Mimi Zohar
On Wed, 2019-10-30 at 08:22 -0700, Lakshmi Ramasubramanian wrote:
> On 10/23/19 8:47 PM, Nayna Jain wrote:
> 
> Hi Nayna,
> 
> > process_buffer_measurement() is limited to measuring the kexec boot
> > command line. This patch makes process_buffer_measurement() more
> > generic, allowing it to measure other types of buffer data (e.g.
> > blacklisted binary hashes or key hashes).
> 
> Now that process_buffer_measurement() is being made generic to measure 
> any buffer, it would be good to add a tag to indicate what type of 
> buffer is being measured.
> 
> For example, if the buffer is the kexec command line, the log could look like:
> 
>   "kexec_cmdline: <boot command line>"
> 
> Similarly, if the buffer is a blacklisted binary hash:
> 
>   "blacklist hash: <binary hash>".
> 
> If the buffer is a key hash:
> 
>   "<keyring name>: key data".
> 
> This would greatly help the consumer of the IMA log to know the type of 
> data represented in each IMA log entry.

Both the existing kexec command line and the new blacklist buffer
measurement pass that information in the eventname.   The [PATCH 7/8]
"ima: check against blacklisted hashes for files with modsig" patch
description includes an example.
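
For illustration, measurement list entries using the ima-buf template
would then look roughly like this (a sketch only; the exact field layout
depends on the configured template):

  10 <template-hash> ima-buf sha256:<buffer-hash> kexec-cmdline <hex-encoded buffer>
  10 <template-hash> ima-buf sha256:<buffer-hash> blacklisted-hash <hex-encoded buffer>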

Mimi



Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic

2019-10-30 Thread Lakshmi Ramasubramanian

On 10/23/19 8:47 PM, Nayna Jain wrote:

Hi Nayna,


process_buffer_measurement() is limited to measuring the kexec boot
command line. This patch makes process_buffer_measurement() more
generic, allowing it to measure other types of buffer data (e.g.
blacklisted binary hashes or key hashes).


Now that process_buffer_measurement() is being made generic to measure 
any buffer, it would be good to add a tag to indicate what type of 
buffer is being measured.


For example, if the buffer is the kexec command line, the log could look like:

 "kexec_cmdline: <boot command line>"

Similarly, if the buffer is a blacklisted binary hash:

 "blacklist hash: <binary hash>".

If the buffer is a key hash:

 "<keyring name>: key data".

This would greatly help the consumer of the IMA log to know the type of 
data represented in each IMA log entry.


thanks,
 -lakshmi


Re: [PATCH v10 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-30 Thread Uladzislau Rezki
Hello, Daniel

>  
> @@ -1294,14 +1299,19 @@ static bool __purge_vmap_area_lazy(unsigned long 
> start, unsigned long end)
>   spin_lock(_vmap_area_lock);
>   llist_for_each_entry_safe(va, n_va, valist, purge_list) {
>   unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
> + unsigned long orig_start = va->va_start;
> + unsigned long orig_end = va->va_end;
>  
>   /*
>* Finally insert or merge lazily-freed area. It is
>* detached and there is no need to "unlink" it from
>* anything.
>*/
> - merge_or_add_vmap_area(va,
> - _vmap_area_root, _vmap_area_list);
> + va = merge_or_add_vmap_area(va, _vmap_area_root,
> + _vmap_area_list);
> +
> + kasan_release_vmalloc(orig_start, orig_end,
> +   va->va_start, va->va_end);
>  
I have some questions here. I have not analyzed the kasan_release_vmalloc()
logic in detail, sorry if I miss something. __purge_vmap_area_lazy()
deals with a big address space, so it is not only vmalloc addresses it frees
here; basically it can be any address from 1 to ULONG_MAX, whereas vmalloc
space spans VMALLOC_START - VMALLOC_END:

1) Should it be checked that only a vmalloc address is freed, or do you
handle that somewhere else?

if (is_vmalloc_addr((void *)va->va_start))
kasan_release_vmalloc(...)

2) Have you run any benchmarking just to see how much overhead it adds?
I am asking because it probably makes sense to add those figures to the
changelog (commit message). For example you can run:


sudo ./test_vmalloc.sh performance
and
sudo ./test_vmalloc.sh sequential_test_order=1


Thanks!

--
Vlad Rezki


Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC

2019-10-30 Thread Daniel Axtens
Andrey Ryabinin  writes:

> On 10/30/19 4:50 PM, Daniel Axtens wrote:
>> Andrey Ryabinin  writes:
>> 
>>> On 10/29/19 7:20 AM, Daniel Axtens wrote:
 In the case where KASAN directly allocates memory to back vmalloc
 space, don't map the early shadow page over it.

 We prepopulate pgds/p4ds for the range that would otherwise be empty.
 This is required to get it synced to hardware on boot, allowing the
 lower levels of the page tables to be filled dynamically.

 Acked-by: Dmitry Vyukov 
 Signed-off-by: Daniel Axtens 

 ---
>>>
 +static void __init kasan_shallow_populate_pgds(void *start, void *end)
 +{
 +  unsigned long addr, next;
 +  pgd_t *pgd;
 +  void *p;
 +  int nid = early_pfn_to_nid((unsigned long)start);
>>>
>>> This doesn't make sense. start is not even a pfn. With linear mapping 
>>> we try to identify nid to have the shadow on the same node as memory. But 
>>> in this case we don't have memory or the corresponding shadow (yet),
>>> we only install pgd/p4d.
>>> I guess we could just use NUMA_NO_NODE.
>> 
>> Ah wow, that's quite the clanger on my part.
>> 
>> There are a couple of other invocations of early_pfn_to_nid in that file
>> that use an address directly, but at least they reference actual memory.
>> I'll send a separate patch to fix those up.
>
> I see only one incorrect one, in kasan_init(): early_pfn_to_nid(__pa(_stext)).
> It should be wrapped with PFN_DOWN().
> Other usages in map_range() seem to be correct; range->start,end are pfns.
>

Oh, right, I didn't realise map_range was already using pfns.

>
>> 
>>> The rest looks ok, so with that fixed:
>>>
>>> Reviewed-by: Andrey Ryabinin 
>> 
>> Thanks heaps! I've fixed up the nit you identified in the first patch,
>> and I agree that the last patch probably isn't needed. I'll respin the
>> series shortly.
>> 
>
> Hold on a sec, just spotted another thing to fix.
>
>> @@ -352,9 +397,24 @@ void __init kasan_init(void)
>>  shadow_cpu_entry_end = (void *)round_up(
>>  (unsigned long)shadow_cpu_entry_end, PAGE_SIZE);
>>  
>> +/*
>> + * If we're in full vmalloc mode, don't back vmalloc space with early
>> + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to
>> + * the global table and we can populate the lower levels on demand.
>> + */
>> +#ifdef CONFIG_KASAN_VMALLOC
>> +kasan_shallow_populate_pgds(
>> +kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
>
> This should be VMALLOC_START; there is no point in allocating pgds for the
> hole between the linear mapping and vmalloc, that's just a waste of memory.
> It makes sense to map the early shadow for that hole, because if code
> dereferences an address in that hole we will see a page fault on that
> address instead of a fault on the shadow.
>
> So something like this might work:
>
>   kasan_populate_early_shadow(
>   kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
>   kasan_mem_to_shadow((void *)VMALLOC_START));
>
>   if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
>   kasan_shallow_populate_pgds(kasan_mem_to_shadow((void *)VMALLOC_START),
>   kasan_mem_to_shadow((void *)VMALLOC_END));
>   else
>   kasan_populate_early_shadow(kasan_mem_to_shadow((void *)VMALLOC_START),
>   kasan_mem_to_shadow((void *)VMALLOC_END));
>
>   kasan_populate_early_shadow(
>   kasan_mem_to_shadow((void *)VMALLOC_END + 1),
>   shadow_cpu_entry_begin);

Sounds good. It's getting late for me so I'll change and test that and
send a respin tomorrow my time.

Regards,
Daniel


Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC

2019-10-30 Thread Andrey Ryabinin



On 10/30/19 4:50 PM, Daniel Axtens wrote:
> Andrey Ryabinin  writes:
> 
>> On 10/29/19 7:20 AM, Daniel Axtens wrote:
>>> In the case where KASAN directly allocates memory to back vmalloc
>>> space, don't map the early shadow page over it.
>>>
>>> We prepopulate pgds/p4ds for the range that would otherwise be empty.
>>> This is required to get it synced to hardware on boot, allowing the
>>> lower levels of the page tables to be filled dynamically.
>>>
>>> Acked-by: Dmitry Vyukov 
>>> Signed-off-by: Daniel Axtens 
>>>
>>> ---
>>
>>> +static void __init kasan_shallow_populate_pgds(void *start, void *end)
>>> +{
>>> +   unsigned long addr, next;
>>> +   pgd_t *pgd;
>>> +   void *p;
>>> +   int nid = early_pfn_to_nid((unsigned long)start);
>>
>> This doesn't make sense. start is not even a pfn. With linear mapping 
>> we try to identify nid to have the shadow on the same node as memory. But 
>> in this case we don't have memory or the corresponding shadow (yet),
>> we only install pgd/p4d.
>> I guess we could just use NUMA_NO_NODE.
> 
> Ah wow, that's quite the clanger on my part.
> 
> There are a couple of other invocations of early_pfn_to_nid in that file
> that use an address directly, but at least they reference actual memory.
> I'll send a separate patch to fix those up.

I see only one incorrect one, in kasan_init(): early_pfn_to_nid(__pa(_stext)).
It should be wrapped with PFN_DOWN().
Other usages in map_range() seem to be correct; range->start,end are pfns.


> 
>> The rest looks ok, so with that fixed:
>>
>> Reviewed-by: Andrey Ryabinin 
> 
> Thanks heaps! I've fixed up the nit you identified in the first patch,
> and I agree that the last patch probably isn't needed. I'll respin the
> series shortly.
> 

Hold on a sec, just spotted another thing to fix.

> @@ -352,9 +397,24 @@ void __init kasan_init(void)
>   shadow_cpu_entry_end = (void *)round_up(
>   (unsigned long)shadow_cpu_entry_end, PAGE_SIZE);
>  
> + /*
> +  * If we're in full vmalloc mode, don't back vmalloc space with early
> +  * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to
> +  * the global table and we can populate the lower levels on demand.
> +  */
> +#ifdef CONFIG_KASAN_VMALLOC
> + kasan_shallow_populate_pgds(
> + kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),

This should be VMALLOC_START; there is no point in allocating pgds for the
hole between the linear mapping and vmalloc, that's just a waste of memory.
It makes sense to map the early shadow for that hole, because if code
dereferences an address in that hole we will see a page fault on that
address instead of a fault on the shadow.

So something like this might work:

kasan_populate_early_shadow(
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
kasan_mem_to_shadow((void *)VMALLOC_START));

if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
kasan_shallow_populate_pgds(kasan_mem_to_shadow((void *)VMALLOC_START),
kasan_mem_to_shadow((void *)VMALLOC_END));
else
kasan_populate_early_shadow(kasan_mem_to_shadow((void *)VMALLOC_START),
kasan_mem_to_shadow((void *)VMALLOC_END));

kasan_populate_early_shadow(
kasan_mem_to_shadow((void *)VMALLOC_END + 1),
shadow_cpu_entry_begin);


Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC

2019-10-30 Thread Daniel Axtens
Andrey Ryabinin  writes:

> On 10/29/19 7:20 AM, Daniel Axtens wrote:
>> In the case where KASAN directly allocates memory to back vmalloc
>> space, don't map the early shadow page over it.
>> 
>> We prepopulate pgds/p4ds for the range that would otherwise be empty.
>> This is required to get it synced to hardware on boot, allowing the
>> lower levels of the page tables to be filled dynamically.
>> 
>> Acked-by: Dmitry Vyukov 
>> Signed-off-by: Daniel Axtens 
>> 
>> ---
>
>> +static void __init kasan_shallow_populate_pgds(void *start, void *end)
>> +{
>> +unsigned long addr, next;
>> +pgd_t *pgd;
>> +void *p;
>> +int nid = early_pfn_to_nid((unsigned long)start);
>
> This doesn't make sense. start is not even a pfn. With linear mapping 
> we try to identify nid to have the shadow on the same node as memory. But 
> in this case we don't have memory or the corresponding shadow (yet),
> we only install pgd/p4d.
> I guess we could just use NUMA_NO_NODE.

Ah wow, that's quite the clanger on my part.

There are a couple of other invocations of early_pfn_to_nid in that file
that use an address directly, but at least they reference actual memory.
I'll send a separate patch to fix those up.

> The rest looks ok, so with that fixed:
>
> Reviewed-by: Andrey Ryabinin 

Thanks heaps! I've fixed up the nit you identified in the first patch,
and I agree that the last patch probably isn't needed. I'll respin the
series shortly.

Regards,
Daniel


Re: [RFC PATCH 00/27] current interrupt series plus scv syscall

2019-10-30 Thread Michal Suchánek
Hello,

On Wed, Oct 02, 2019 at 01:13:52PM +1000, Nicholas Piggin wrote:
> Michal Suchánek's on September 24, 2019 7:33 pm:
> > Hello,
> > 
> > can you mark the individual patches with RFC rather than the whole
> > series?
> 
> Hey, thanks for the reviews. I'll resend all but the last two patches
> soon for merge in the next window.

Will you resend these or should I cherry-pick the part I need for the
!COMPAT?

Thanks

Michal


Re: [PATCH v2 17/23] soc: fsl: qe: make qe_ic_cascade_* static

2019-10-30 Thread Rasmus Villemoes
On 30/10/2019 11.50, Christophe Leroy wrote:
> 
> 
> Le 25/10/2019 à 14:40, Rasmus Villemoes a écrit :
>> Now that the references from arch/powerpc/ are gone, these are only
>> referenced from inside qe_ic.c, so make them static.
> 
> Why do that in two steps?
> I think patch 9 could remain until here, and then you could squash patch
> 9 and patch 17 together here.

Agreed, I should rearrange stuff to avoid touching the same lines again
and again.

Thanks,
Rasmus


Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-30 Thread Qian Cai



> On Oct 30, 2019, at 6:28 AM, Peter Zijlstra  wrote:
> 
> It only makes 'wild' guesses when the BIOS is shit and it complains
> about that.
> 
> Or do you like your BIOS broken?

Agreed. It is garbage in, garbage out. No need to complicate the existing
code further.

Re: [PATCH] powerpc/powernv: Fix CPU idle to be called with IRQs disabled

2019-10-30 Thread Michael Ellerman
On Tue, 2019-10-22 at 11:58:14 UTC, Nicholas Piggin wrote:
> Commit e78a7614f3876 ("idle: Prevent late-arriving interrupts from
> disrupting offline") changes arch_cpu_idle_dead to be called with
> interrupts disabled, which triggers the WARN in pnv_smp_cpu_kill_self.
> 
> Fix this by fixing up irq_happened after hard disabling, rather than
> requiring there are no pending interrupts, similarly to what was done
> until commit 2525db04d1cc5 ("powerpc/powernv: Simplify lazy IRQ
> handling in CPU offline").
> 
> Fixes: e78a7614f3876 ("idle: Prevent late-arriving interrupts from disrupting 
> offline")
> Reported-by: Paul Mackerras 
> Signed-off-by: Nicholas Piggin 
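
(For illustration, a rough sketch of the described approach; the exact
placement in pnv_smp_cpu_kill_self() is assumed:)

	/* Hard-disable at the MSR level, then reconcile the lazy-IRQ
	 * bookkeeping so a stale pending-interrupt flag can't trip the
	 * WARN, instead of requiring that nothing is pending. */
	hard_irq_disable();
	local_paca->irq_happened = PACA_IRQ_HARD_DIS;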

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/7d6475051fb3d9339c5c760ed9883bc0a9048b21

cheers


Re: [PATCH] powernv/eeh: Fix oops when probing cxl devices

2019-10-30 Thread Michael Ellerman
On Wed, 2019-10-16 at 16:28:33 UTC, Frederic Barrat wrote:
> Recent cleanup in the way EEH support is added to a device causes a
> kernel oops when the cxl driver probes a device and creates virtual
> devices discovered on the FPGA:
> 
> BUG: Kernel NULL pointer dereference at 0x00a0
> Faulting instruction address: 0xc0048070
> Oops: Kernel access of bad area, sig: 7 [#1]
> ...
> NIP [c0048070] eeh_add_device_late.part.9+0x50/0x1e0
> LR [c004805c] eeh_add_device_late.part.9+0x3c/0x1e0
> Call Trace:
> [c000200e43983900] [c079e250] _dev_info+0x5c/0x6c (unreliable)
> [c000200e43983980] [c00d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0
> [c000200e439839f0] [c00606d0] pcibios_bus_add_device+0x40/0x60
> [c000200e43983a10] [c06aa3a0] pci_bus_add_device+0x30/0x100
> [c000200e43983a80] [c06aa4d4] pci_bus_add_devices+0x64/0xd0
> [c000200e43983ac0] [c0081c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl]
> [c000200e43983b00] [c0081c4242ac] cxl_probe+0x504/0x5b0 [cxl]
> [c000200e43983bb0] [c06bba1c] local_pci_probe+0x6c/0x110
> [c000200e43983c30] [c0159278] work_for_cpu_fn+0x38/0x60
> 
> The root cause is that those cxl virtual devices don't have a
> representation in the device tree and therefore no associated pci_dn
> structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
> we oops.
> 
> We never had explicit support for EEH for those virtual
> devices. Instead, EEH events are reported to the (real) pci device and
> handled by the cxl driver, which can then forward them to the virtual
> devices and handle dependencies. The fact that we try adding EEH
> support for the virtual devices is new and a side-effect of the recent
> cleanup.
> 
> This patch fixes it by skipping adding EEH support on powernv for
> devices which don't have a pci_dn structure.
> 
> The cxl driver doesn't create virtual devices on pseries, so this patch
> intentionally doesn't fix it there.
> 
> Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
> Signed-off-by: Frederic Barrat 
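
(For illustration, a minimal sketch of the described guard; the exact
placement in the powernv EEH path is assumed:)

	/* cxl-created virtual devices have no device-tree node and hence
	 * no pci_dn; skip EEH setup for them. */
	struct pci_dn *pdn = pci_get_pdn(pdev);

	if (!pdn)
		return;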

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/a8a30219ba78b1abb92091102b632f8e9bbdbf03

cheers


Re: [PATCH] powerpc/prom_init: Undo relocation before entering secure mode

2019-10-30 Thread Michael Ellerman
On Wed, 2019-09-11 at 16:34:33 UTC, Thiago Jung Bauermann wrote:
> The ultravisor will do an integrity check of the kernel image, but we
> relocated it, so the check will fail. Restore the original image by
> relocating it back to the kernel virtual base address.
> 
> This works because during build vmlinux is linked with an expected virtual
> runtime address of KERNELBASE.
> 
> Fixes: 6a9c930bd775 ("powerpc/prom_init: Add the ESM call to prom_init")
> Signed-off-by: Thiago Jung Bauermann 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/05d9a952832cb206a32e3705eff6edebdb2207e7

cheers


Re: [PATCH v3] powerpc/powernv: Add queue mechanism for early messages

2019-10-30 Thread Michael Ellerman
On Mon, 2018-05-21 at 02:04:38 UTC, Deb McLemore wrote:
> The problem being solved is that when issuing a BMC soft poweroff during
> IPL, the poweroff was being lost, so the machine would not power off.
> 
> Opal messages were being received before the opal-power code
> registered its notifiers.
> 
> Alternatives discussed (option #3 was chosen):
> 
> 1 - Have opal_message_init() explicitly call opal_power_control_init()
> before it dequeues any OPAL messages (i.e. before we register the
> opal-msg IRQ handler).
> 
> 2 - Introduce concept of critical message types and when we register
> handlers we track which message types have a registered handler,
> then defer the opal-msg IRQ registration until we have a handler
> registered for all the critical types.
> 
> 3 - Buffering messages, if we receive a message and do not yet
> have a handler for that type, store the message and replay when
> a handler for that type is registered.
> 
> Signed-off-by: Deb McLemore 
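
(For illustration, a minimal sketch of option 3, using hypothetical
helper names:)

	/* Stash messages that arrive before a handler for their type is
	 * registered; replay the stash at registration time. */
	struct deferred_msg {
		struct list_head	list;
		struct opal_msg		msg;
	};

	static LIST_HEAD(deferred_msgs);

	static void dispatch_or_defer(struct opal_msg *msg)
	{
		unsigned int type = be32_to_cpu(msg->msg_type);

		if (!handler_registered(type)) {	/* hypothetical */
			struct deferred_msg *dm = kmalloc(sizeof(*dm), GFP_ATOMIC);

			if (dm) {
				dm->msg = *msg;
				list_add_tail(&dm->list, &deferred_msgs);
			}
			return;
		}

		deliver_msg(type, msg);			/* hypothetical */
	}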

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a9336ddf448b1cba3080195cec2287af3907236c

cheers


Re: [PATCH v2 1/3] powerpc/pseries: Don't opencode HPTE_V_BOLTED

2019-10-30 Thread Michael Ellerman
On Thu, 2019-10-24 at 09:35:40 UTC, "Aneesh Kumar K.V" wrote:
> No functional change in this patch.
> 
> Signed-off-by: Aneesh Kumar K.V 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/82ce028ad26dd075b06285ef61a854a564d910fb

cheers


Re: [PATCH] powerpc/pseries: Mark accumulate_stolen_time() as notrace

2019-10-30 Thread Michael Ellerman
On Thu, 2019-10-24 at 05:59:32 UTC, Michael Ellerman wrote:
> accumulate_stolen_time() is called prior to interrupt state being
> reconciled, which can trip the warning in arch_local_irq_restore():
> 
>   WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 
> .arch_local_irq_restore+0x9c/0x130
>   ...
>   NIP .arch_local_irq_restore+0x9c/0x130
>   LR  .rb_start_commit+0x38/0x80
>   Call Trace:
> .ring_buffer_lock_reserve+0xe4/0x620
> .trace_function+0x44/0x210
> .function_trace_call+0x148/0x170
> .ftrace_ops_no_ops+0x180/0x1d0
> ftrace_call+0x4/0x8
> .accumulate_stolen_time+0x1c/0xb0
> decrementer_common+0x124/0x160
> 
> For now just mark it as notrace. We may change the ordering to call it
> after interrupt state has been reconciled, but that is a larger
> change.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/eb8e20f89093b64f48975c74ccb114e6775cee22

cheers


Re: [PATCH] selftests/powerpc: Reduce sigfuz runtime to ~60s

2019-10-30 Thread Michael Ellerman
On Sun, 2019-10-13 at 23:46:43 UTC, Michael Ellerman wrote:
> The defaults for the sigfuz test is to run for 4000 iterations, but
> that can take quite a while and the test harness may kill the test.
> Reduce the number of iterations to 600, which gives a runtime of
> roughly 1 minute on a Power8 system.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/4f5c5b76cc00ccf5be89a2b9883feba3baf6eb2e

cheers


Re: [PATCH] powerpc: make syntax for FADump config options in kernel/Makefile readable

2019-10-30 Thread Michael Ellerman
On Wed, 2019-10-09 at 15:27:20 UTC, Hari Bathini wrote:
> The arch/powerpc/kernel/fadump.c file needs to be compiled in if 'config
> FA_DUMP' or 'config PRESERVE_FA_DUMP' is set. The current syntax
> achieves that but looks a bit odd. Fix it for better readability.
> 
> Signed-off-by: Hari Bathini 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/cd1d55f16d48d97d681d9534170ce712ac1d09e7

cheers


Re: [PATCH] powerpc/configs: add FADump awareness to skiroot_defconfig

2019-10-30 Thread Michael Ellerman
On Wed, 2019-10-09 at 14:04:29 UTC, Hari Bathini wrote:
> FADump is supported on the PowerNV platform. For this support to work, the
> petitboot kernel must be FADump aware. Enable config PRESERVE_FA_DUMP
> to make the petitboot kernel FADump aware.
> 
> Signed-off-by: Hari Bathini 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/aaa351504449c4babb80753c72920e4b25fbd8a9

cheers


Re: [PATCH] powerpc/pkeys: remove unused pkey_allows_readwrite

2019-10-30 Thread Michael Ellerman
On Tue, 2019-09-17 at 15:22:30 UTC, Qian Cai wrote:
> pkey_allows_readwrite() was first introduced in the commit 5586cf61e108
> ("powerpc: introduce execute-only pkey"), but the usage was removed
> entirely in the commit a4fcc877d4e1 ("powerpc/pkeys: Preallocate
> execute-only key").
> 
> Found by the "-Wunused-function" compiler warning flag.
> 
> Fixes: a4fcc877d4e1 ("powerpc/pkeys: Preallocate execute-only key")
> Signed-off-by: Qian Cai 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/29674a1c71be710f8418468aa6a8addd6d1aba1c

cheers


Re: [PATCH v2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range

2019-10-30 Thread Michael Ellerman
On Tue, 2019-09-17 at 12:38:51 UTC, "Aneesh Kumar K.V" wrote:
> With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap")
> pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we
> can map the memmap area using 16MB hugepage size and that can cover
> a memory range of 16G.
> 
> While enabling new pmem namespaces, since memory is added in sub-section
> chunks, before creating a new memmap mapping, the kernel should check whether
> there is an existing memmap mapping covering the new pmem namespace.
> Currently, this is validated by checking whether the section covering the
> range is already initialized or not. Considering there can be multiple
> namespaces in the same section, this can result in wrong validation. Update
> this to check for sub-sections in the range. This is done by checking all
> pfns in the range we are mapping.
> 
> We could optimize this by checking just one pfn in each sub-section. But
> since this is not a fast path, we keep this simple.
> 
> Signed-off-by: Aneesh Kumar K.V 
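
(For illustration, a rough sketch of the described per-pfn check, with a
hypothetical helper name:)

	/* Report the memmap range as populated only if some pfn in it is
	 * already valid; pfn_valid() is sub-section aware with the
	 * sub-section remap series. */
	static bool memmap_range_populated(unsigned long start_pfn,
					   unsigned long nr_pfn)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < start_pfn + nr_pfn; pfn++)
			if (pfn_valid(pfn))
				return true;

		return false;
	}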

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/5f5d6e40a01e70b731df843d8b5a61b4b28b19d9

cheers


  1   2   >