Re: [RFC PATCH 1/4] libnvdimm/namespace: Make namespace size validation arch dependent
On 10/29/19 11:00 AM, Dan Williams wrote:
> On Mon, Oct 28, 2019 at 9:35 PM Aneesh Kumar K.V wrote:
>> On 10/29/19 4:38 AM, Dan Williams wrote:
>>> On Mon, Oct 28, 2019 at 2:48 AM Aneesh Kumar K.V wrote:
>>>> The page size used to map the namespace is arch dependent. For example,
>>>> architectures like ppc64 use a 16MB page size for the direct mapping. If
>>>> the namespace size is not aligned to the mapping page size, we can
>>>> observe a kernel crash during namespace init and destroy. This is due to
>>>> the kernel doing a partial map/unmap of the resource range:
>>>>
>>>>   BUG: Unable to handle kernel data access at 0xc00100040600
>>>>   Faulting instruction address: 0xc0090790
>>>>   NIP [c0090790] arch_add_memory+0xc0/0x130
>>>>   LR [c0090744] arch_add_memory+0x74/0x130
>>>>   Call Trace:
>>>>    arch_add_memory+0x74/0x130 (unreliable)
>>>>    memremap_pages+0x74c/0xa30
>>>>    devm_memremap_pages+0x3c/0xa0
>>>>    pmem_attach_disk+0x188/0x770
>>>>    nvdimm_bus_probe+0xd8/0x470
>>>>    really_probe+0x148/0x570
>>>>    driver_probe_device+0x19c/0x1d0
>>>>    device_driver_attach+0xcc/0x100
>>>>    bind_store+0x134/0x1c0
>>>>    drv_attr_store+0x44/0x60
>>>>    sysfs_kf_write+0x74/0xc0
>>>>    kernfs_fop_write+0x1b4/0x290
>>>>    __vfs_write+0x3c/0x70
>>>>    vfs_write+0xd0/0x260
>>>>    ksys_write+0xdc/0x130
>>>>    system_call+0x5c/0x68
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V
>>>> ---
>>>>  arch/arm64/mm/flush.c     | 11 ++++++++++
>>>>  arch/powerpc/lib/pmem.c   | 21 ++++++++++++++++---
>>>>  arch/x86/mm/pageattr.c    | 12 ++++++++++++
>>>>  include/linux/libnvdimm.h |  1 +
>>>>  4 files changed, 43 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>>>> index ac485163a4a7..90c54c600023 100644
>>>> --- a/arch/arm64/mm/flush.c
>>>> +++ b/arch/arm64/mm/flush.c
>>>> @@ -91,4 +91,15 @@ void arch_invalidate_pmem(void *addr, size_t size)
>>>>  	__inval_dcache_area(addr, size);
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>>>> +
>>>> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size)
>>>> +{
>>>> +	u32 remainder;
>>>> +
>>>> +	div_u64_rem(size, PAGE_SIZE * ndr_mappings, &remainder);
>>>> +	if (remainder)
>>>> +		return PAGE_SIZE * ndr_mappings;
>>>> +	return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(arch_validate_namespace_size);
>>>>  #endif
>>>>
>>>> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
>>>> index 377712e85605..2e661a08dae5 100644
>>>> --- a/arch/powerpc/lib/pmem.c
>>>> +++ b/arch/powerpc/lib/pmem.c
>>>> @@ -17,14 +17,31 @@ void arch_wb_cache_pmem(void *addr, size_t size)
>>>>  	unsigned long start = (unsigned long) addr;
>>>>  	flush_dcache_range(start, start + size);
>>>>  }
>>>> -EXPORT_SYMBOL(arch_wb_cache_pmem);
>>>> +EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
>>>>
>>>>  void arch_invalidate_pmem(void *addr, size_t size)
>>>>  {
>>>>  	unsigned long start = (unsigned long) addr;
>>>>  	flush_dcache_range(start, start + size);
>>>>  }
>>>> -EXPORT_SYMBOL(arch_invalidate_pmem);
>>>> +EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
>>>> +
>>>> +unsigned long arch_validate_namespace_size(unsigned int ndr_mappings, unsigned long size)
>>>> +{
>>>> +	u32 remainder;
>>>> +	unsigned long linear_map_size;
>>>> +
>>>> +	if (radix_enabled())
>>>> +		linear_map_size = PAGE_SIZE;
>>>> +	else
>>>> +		linear_map_size = (1UL << mmu_psize_defs[mmu_linear_psize].shift);
>>>
>>> This seems more a "supported_alignments" problem, and less a namespace
>>> size or PAGE_SIZE problem, because if the starting address is misaligned
>>> this size validation can still succeed when it shouldn't.
>>
>> Isn't supported_alignments an indication of how the user wants the
>> namespace to be mapped to applications? I.e., with the above restrictions
>> we can still do both 64K and 16M mapping of the namespace to userspace.
>
> True, for the pfn device and the device-dax mapping size, but I'm
> suggesting adding another instance of alignment control at the raw
> namespace level. That would need to be disconnected from the device-dax
> page mapping granularity.

Can you explain what you mean by raw namespace level? We don't have
multiple values against which we need to check the alignment of the
namespace start and namespace size. If you can outline how and where you
would like to enforce that check, I can start working on it.

-aneesh
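The alignment check being debated above can be sketched in userspace C as follows. This is a minimal sketch, not the kernel patch itself: the function name, the fixed 16MB granularity, and the calling convention (return 0 on success, otherwise the required alignment) are illustrative assumptions mirroring the RFC's arch_validate_namespace_size().

```c
#include <assert.h>

/* Illustrative stand-in for the ppc64 hash-MMU linear map page size (16MB). */
#define LINEAR_MAP_SIZE (16UL * 1024 * 1024)

/*
 * Returns 0 if 'size' is an acceptable namespace size, otherwise the
 * alignment the size must satisfy: the mapping granularity multiplied by
 * the number of interleaved DIMM mappings backing the region.
 */
unsigned long validate_namespace_size(unsigned int ndr_mappings,
				      unsigned long size)
{
	unsigned long align = LINEAR_MAP_SIZE * ndr_mappings;

	if (size % align)
		return align;	/* caller can report the required alignment */
	return 0;
}
```

A 4KB namespace on a single mapping would fail here with a required alignment of 16MB, which is exactly the case that crashed arch_add_memory() in the trace above.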
[PATCH v3] cpufreq: powernv: fix stack bloat and hard limit on num cpus
The following build warning occurred on powerpc 64-bit builds:

  drivers/cpufreq/powernv-cpufreq.c: In function 'init_chip_info':
  drivers/cpufreq/powernv-cpufreq.c:1070:1: warning: the frame size of 1040
  bytes is larger than 1024 bytes [-Wframe-larger-than=]

This is with a cross-compiler based on gcc 8.1.0, which I got from:

  https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/

The warning is due to putting 1024 bytes on the stack:

  unsigned int chip[256];

...and it's also undesirable to have a hard limit on the number of CPUs
here. Fix both problems by dynamically allocating based on
num_possible_cpus(), as recommended by Michael Ellerman.

Fixes: 053819e0bf840 ("cpufreq: powernv: Handle throttling due to Pmax capping at chip level")
Cc: Michael Ellerman
Cc: Shilpasri G Bhat
Cc: Preeti U Murthy
Cc: Viresh Kumar
Cc: Rafael J. Wysocki
Cc: linux...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: John Hubbard
---
Changes since v2: applied fixes from Michael Ellerman's review:
  * Changed from CONFIG_NR_CPUS to num_possible_cpus()
  * Fixed up the commit description: added a note about exactly which
    compiler generates the warning, and softened the wording about the
    limitation on the number of CPUs.

Changes since v1: includes Viresh's review commit fixes.
thanks,
John Hubbard
NVIDIA

 drivers/cpufreq/powernv-cpufreq.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 6061850e59c9..56f4bc0d209e 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -1041,9 +1041,14 @@ static struct cpufreq_driver powernv_cpufreq_driver = {

 static int init_chip_info(void)
 {
-	unsigned int chip[256];
+	unsigned int *chip;
 	unsigned int cpu, i;
 	unsigned int prev_chip_id = UINT_MAX;
+	int ret = 0;
+
+	chip = kcalloc(num_possible_cpus(), sizeof(*chip), GFP_KERNEL);
+	if (!chip)
+		return -ENOMEM;

 	for_each_possible_cpu(cpu) {
 		unsigned int id = cpu_to_chip_id(cpu);
@@ -1055,8 +1060,10 @@ static int init_chip_info(void)
 	}

 	chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
-	if (!chips)
-		return -ENOMEM;
+	if (!chips) {
+		ret = -ENOMEM;
+		goto free_and_return;
+	}

 	for (i = 0; i < nr_chips; i++) {
 		chips[i].id = chip[i];
@@ -1066,7 +1073,9 @@ static int init_chip_info(void)
 		per_cpu(chip_info, cpu) = &chips[i];
 	}

-	return 0;
+free_and_return:
+	kfree(chip);
+	return ret;
 }

 static inline void clean_chip_info(void)
--
2.23.0
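The shape of the fix above — replace a large on-stack array with a heap allocation sized by the possible-CPU count, and unwind every error path through a single cleanup label — can be shown as a userspace sketch. This is illustrative only: num_possible_cpus() is stubbed, calloc/free stand in for kcalloc/kfree, and the per-chip bookkeeping is elided.

```c
#include <assert.h>
#include <stdlib.h>

/* Stub standing in for the kernel's num_possible_cpus(). */
static unsigned int num_possible_cpus(void) { return 2048; }

#define DEMO_ENOMEM 12

int init_chip_info_demo(void)
{
	unsigned int *chip;
	int ret = 0;

	/* Heap allocation replaces "unsigned int chip[256]" on the stack. */
	chip = calloc(num_possible_cpus(), sizeof(*chip));
	if (!chip)
		return -DEMO_ENOMEM;

	/* ... scan CPUs, fill chip[], allocate per-chip state; on failure
	 * set ret and fall through to the single cleanup point ... */

	free(chip);	/* always reached, success or failure */
	return ret;
}
```

The single exit label ("free_and_return" in the patch) is what makes the temporary buffer safe to add: the earlier version could simply return, because a stack array needs no cleanup.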
Re: [PATCH v2] cpufreq: powernv: fix stack bloat and NR_CPUS limitation
On 10/30/19 7:39 PM, Michael Ellerman wrote:
> Hi John,
>
> Sorry I didn't reply to this sooner, too many patches :/
>
> John Hubbard writes:
>> The following build warning occurred on powerpc 64-bit builds:
>>
>>   drivers/cpufreq/powernv-cpufreq.c: In function 'init_chip_info':
>>   drivers/cpufreq/powernv-cpufreq.c:1070:1: warning: the frame size of
>>   1040 bytes is larger than 1024 bytes [-Wframe-larger-than=]
>
> Oddly I don't see that warning in my builds, eg. with GCC 9:
> https://travis-ci.org/linuxppc/linux/jobs/604870722

This is with a cross-compiler based on gcc 8.1.0, which I got from:
https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/

I'll put that in the v3 commit description.

>> This is due to putting 1024 bytes on the stack:
>>
>>   unsigned int chip[256];
>>
>> ...and while looking at this, it also has a bug: it fails with a stack
>> overrun, if CONFIG_NR_CPUS > 256.
>
> It _probably_ doesn't, because it only increments the index when the
> chip_id of the CPU changes, ie. it doesn't create a chip for every CPU.
> But I agree it's flaky the way it's written.

I'll soften up the wording accordingly.

>> Fix both problems by dynamically allocating based on CONFIG_NR_CPUS.
>
> Shouldn't it use num_possible_cpus()? Given the for loop is over possible
> CPUs, that seems like the upper bound. In practice it should be lower
> because some CPUs will share a chip.

OK, I see, that's more consistent with the code, I'll change to that.

thanks,
--
John Hubbard
NVIDIA

>> Fixes: 053819e0bf840 ("cpufreq: powernv: Handle throttling due to Pmax capping at chip level")
>> Cc: Shilpasri G Bhat
>> Cc: Preeti U Murthy
>> Cc: Viresh Kumar
>> Cc: Rafael J. Wysocki
>> Cc: linux...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Signed-off-by: John Hubbard
>> ---
>> Changes since v1: includes Viresh's review commit fixes.
 drivers/cpufreq/powernv-cpufreq.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 6061850e59c9..5b2e968cb5ea 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -1041,9 +1041,14 @@ static struct cpufreq_driver powernv_cpufreq_driver = {

 static int init_chip_info(void)
 {
-	unsigned int chip[256];
+	unsigned int *chip;
 	unsigned int cpu, i;
 	unsigned int prev_chip_id = UINT_MAX;
+	int ret = 0;
+
+	chip = kcalloc(CONFIG_NR_CPUS, sizeof(*chip), GFP_KERNEL);
+	if (!chip)
+		return -ENOMEM;

 	for_each_possible_cpu(cpu) {
 		unsigned int id = cpu_to_chip_id(cpu);
@@ -1055,8 +1060,10 @@ static int init_chip_info(void)
 	}

 	chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
-	if (!chips)
-		return -ENOMEM;
+	if (!chips) {
+		ret = -ENOMEM;
+		goto free_and_return;
+	}

 	for (i = 0; i < nr_chips; i++) {
 		chips[i].id = chip[i];
@@ -1066,7 +1073,9 @@ static int init_chip_info(void)
 		per_cpu(chip_info, cpu) = &chips[i];
 	}

-	return 0;
+free_and_return:
+	kfree(chip);
+	return ret;
 }

 static inline void clean_chip_info(void)
--
2.23.0
[RFC PATCH v10 9/9] powerpc/ima: indicate kernel modules appended signatures are enforced
The arch-specific kernel module policy rule requires kernel modules to be
signed, either as an IMA signature, stored as an xattr, or as an appended
signature. As a result, kernel module appended signatures could be
enforced without "sig_enforce" being set or reflected in
/sys/module/module/parameters/sig_enforce. This patch sets "sig_enforce".

Signed-off-by: Mimi Zohar
Cc: Jessica Yu
---
 arch/powerpc/kernel/ima_arch.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
index b9de0fb45bb9..e34116255ced 100644
--- a/arch/powerpc/kernel/ima_arch.c
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -62,13 +62,17 @@ static const char *const secure_and_trusted_rules[] = {
  */
 const char *const *arch_get_ima_policy(void)
 {
-	if (is_ppc_secureboot_enabled())
+	if (is_ppc_secureboot_enabled()) {
+		if (IS_ENABLED(CONFIG_MODULE_SIG))
+			set_module_sig_enforced();
+
 		if (is_ppc_trustedboot_enabled())
 			return secure_and_trusted_rules;
 		else
 			return secure_rules;
-	else if (is_ppc_trustedboot_enabled())
+	} else if (is_ppc_trustedboot_enabled()) {
 		return trusted_rules;
+	}

 	return NULL;
 }
--
2.7.5
[PATCH v10 8/9] powerpc/ima: update ima arch policy to check for blacklist
From: Nayna Jain

This patch updates the arch-specific policies for PowerNV systems to make
sure that the binary hash is not blacklisted.

Signed-off-by: Nayna Jain
Cc: Jessica Yu
Signed-off-by: Mimi Zohar
---
 arch/powerpc/kernel/ima_arch.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
index 0ef5956c9753..b9de0fb45bb9 100644
--- a/arch/powerpc/kernel/ima_arch.c
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -23,9 +23,9 @@ bool arch_ima_get_secureboot(void)
  * is not enabled.
  */
 static const char *const secure_rules[] = {
-	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+	"appraise func=KEXEC_KERNEL_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #ifndef CONFIG_MODULE_SIG_FORCE
-	"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+	"appraise func=MODULE_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #endif
 	NULL
 };
@@ -49,9 +49,9 @@ static const char *const trusted_rules[] = {
 static const char *const secure_and_trusted_rules[] = {
 	"measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
 	"measure func=MODULE_CHECK template=ima-modsig",
-	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+	"appraise func=KEXEC_KERNEL_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #ifndef CONFIG_MODULE_SIG_FORCE
-	"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+	"appraise func=MODULE_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig",
 #endif
 	NULL
 };
--
2.7.5
[PATCH v10 7/9] ima: check against blacklisted hashes for files with modsig
From: Nayna Jain

Asymmetric private keys are used to sign multiple files. The kernel
currently supports checking against blacklisted keys. However, if the
public key is blacklisted, any file signed by the blacklisted key will
automatically fail signature verification. Blacklisting the public key is
not fine enough granularity, as we might want to blacklist only a
particular file.

This patch adds support for checking against the blacklisted hash of the
file, without the appended signature, based on the IMA policy. It defines
a new policy option "appraise_flag=check_blacklist".

In addition to the blacklisted binary hashes stored in the firmware "dbx"
variable, the Linux kernel may be configured to load blacklisted binary
hashes onto the .blacklist keyring as well. The following example shows
how to blacklist a specific kernel module hash.

  $ sha256sum kernel/kheaders.ko
  77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3  kernel/kheaders.ko

  $ grep BLACKLIST .config
  CONFIG_SYSTEM_BLACKLIST_KEYRING=y
  CONFIG_SYSTEM_BLACKLIST_HASH_LIST="blacklist-hash-list"

  $ cat certs/blacklist-hash-list
  "bin:77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3"

Update the IMA custom measurement and appraisal policy rules
(/etc/ima-policy):

  measure func=MODULE_CHECK template=ima-modsig
  appraise func=MODULE_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig

After building, installing, and rebooting the kernel:

  545660333 ---lswrv   0  0  \_ blacklist: bin:77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3

  measure func=MODULE_CHECK template=ima-modsig
  appraise func=MODULE_CHECK appraise_flag=check_blacklist appraise_type=imasig|modsig

  modprobe: ERROR: could not insert 'kheaders': Permission denied

  10 0c9834db5a0182c1fb0cdc5d3adcf11a11fd83dd ima-sig sha256:3bc6ed4f0b4d6e31bc1dbc9ef844605abc7afdc6d81a57d77a1ec9407997c402 /usr/lib/modules/5.4.0-rc3+/kernel/kernel/kheaders.ko
  10 82aad2bcc3fa8ed94762356b5c14838f3bcfa6a0 ima-modsig sha256:3bc6ed4f0b4d6e31bc1dbc9ef844605abc7afdc6d81a57d77a1ec9407997c402 /usr/lib/modules/5.4.0-rc3+/kernel/kernel/kheaders.ko sha256:77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3 3082029a06092a864886f70d010702a082028b30820287020101310d300b0609608648016503040201300b06092a864886f70d01070131820264
  10 25b72217cc1152b44b134ce2cd68f12dfb71acb3 ima-buf sha256:8b58427fedcf8f4b20bc8dc007f2e232bf7285d7b93a66476321f9c2a3aa132b blacklisted-hash 77fa889b35a05338ec52e51591c1b89d4c8d1c99a21251d7c22b1a8642a6bad3

Signed-off-by: Nayna Jain
Cc: Jessica Yu
Cc: David Howells
[zo...@linux.ibm.com: updated patch description]
Signed-off-by: Mimi Zohar
---
 Documentation/ABI/testing/ima_policy  |  4 ++++
 security/integrity/ima/ima.h          |  8 ++++++++
 security/integrity/ima/ima_appraise.c | 33 +++++++++++++++++++++++++++
 security/integrity/ima/ima_main.c     | 12 ++++++++++
 security/integrity/ima/ima_policy.c   | 12 ++++++++--
 security/integrity/integrity.h        |  1 +
 6 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy
index 29ebe9afdac4..29aaedf33246 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -25,6 +25,7 @@ Description:
 			lsm:	[[subj_user=] [subj_role=] [subj_type=]
 				 [obj_user=] [obj_role=] [obj_type=]]
 			option:	[[appraise_type=]] [template=] [permit_directio]
+				[appraise_flag=]
 		base:	func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
 				[FIRMWARE_CHECK] [KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
@@ -38,6 +39,9 @@ Description:
 			fowner:= decimal value
 		lsm:	are LSM specific
 		option:	appraise_type:= [imasig] [imasig|modsig]
+			appraise_flag:= [check_blacklist]
+			Currently, blacklist check is only for files signed with appended
+			signature.
 			template:= name of a defined IMA template type
 			(eg, ima-ng). Only valid when action is "measure".
 			pcr:= decimal value

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index a65772ffa427..df4ca482fb53 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -256,6 +256,8 @@ int ima_policy_show(struct seq_file *m, void *v);
 #define IMA_APPRAISE_KEXEC	0x40

 #ifdef CONFIG_IMA_APPRAISE
+int ima_check_blacklist(struct integrity_iint_cache *iint,
+			const struct modsig *modsig, int pcr);
 int ima_appraise_measurement(enum ima_hooks func,
 			     struct integrity_iint_cache *iint,
 			     struct file *file, const unsigned
[PATCH v10 6/9] certs: add wrapper function to check blacklisted binary hash
From: Nayna Jain

The -EKEYREJECTED error returned by the existing is_hash_blacklisted() is
misleading when called to check against the blacklisted hash of a binary.
This patch adds a wrapper function is_binary_blacklisted() that returns
-EPERM if the binary is blacklisted.

Signed-off-by: Nayna Jain
Cc: David Howells
Reviewed-by: Mimi Zohar
---
 certs/blacklist.c             | 9 +++++++++
 include/keys/system_keyring.h | 6 ++++++
 2 files changed, 15 insertions(+)

diff --git a/certs/blacklist.c b/certs/blacklist.c
index ec00bf337eb6..6514f9ebc943 100644
--- a/certs/blacklist.c
+++ b/certs/blacklist.c
@@ -135,6 +135,15 @@ int is_hash_blacklisted(const u8 *hash, size_t hash_len, const char *type)
 }
 EXPORT_SYMBOL_GPL(is_hash_blacklisted);

+int is_binary_blacklisted(const u8 *hash, size_t hash_len)
+{
+	if (is_hash_blacklisted(hash, hash_len, "bin") == -EKEYREJECTED)
+		return -EPERM;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(is_binary_blacklisted);
+
 /*
  * Initialise the blacklist
  */
diff --git a/include/keys/system_keyring.h b/include/keys/system_keyring.h
index c1a96fdf598b..fb8b07daa9d1 100644
--- a/include/keys/system_keyring.h
+++ b/include/keys/system_keyring.h
@@ -35,12 +35,18 @@ extern int restrict_link_by_builtin_and_secondary_trusted(
 extern int mark_hash_blacklisted(const char *hash);
 extern int is_hash_blacklisted(const u8 *hash, size_t hash_len,
 			       const char *type);
+extern int is_binary_blacklisted(const u8 *hash, size_t hash_len);
 #else
 static inline int is_hash_blacklisted(const u8 *hash, size_t hash_len,
 				      const char *type)
 {
 	return 0;
 }
+
+static inline int is_binary_blacklisted(const u8 *hash, size_t hash_len)
+{
+	return 0;
+}
 #endif

 #ifdef CONFIG_IMA_BLACKLIST_KEYRING
--
2.7.5
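The wrapper's error translation is small enough to demonstrate with a userspace sketch: the keyring lookup is stubbed out, and only the -EKEYREJECTED → -EPERM folding that the patch introduces is kept. The stub hash table and error macros below are illustrative, not the kernel keyring API.

```c
#include <assert.h>
#include <string.h>

#ifndef EPERM
#define EPERM 1
#endif
#ifndef EKEYREJECTED
#define EKEYREJECTED 129
#endif

/* Stub standing in for the keyring lookup done by is_hash_blacklisted(). */
static int is_hash_blacklisted_stub(const char *hash, const char *type)
{
	if (!strcmp(type, "bin") && !strcmp(hash, "deadbeef"))
		return -EKEYREJECTED;	/* hash found on the blacklist */
	return 0;
}

/*
 * The wrapper: a blacklist hit becomes -EPERM (the natural "load denied"
 * error), and anything else is treated as "not blacklisted".
 */
int is_binary_blacklisted_demo(const char *hash)
{
	if (is_hash_blacklisted_stub(hash, "bin") == -EKEYREJECTED)
		return -EPERM;
	return 0;
}
```

The point of the wrapper is the caller-facing semantics: IMA's appraisal path can propagate -EPERM directly to modprobe/kexec, whereas -EKEYREJECTED would suggest a key problem rather than a blacklisted file.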
[PATCH v10 5/9] ima: make process_buffer_measurement() generic
From: Nayna Jain

process_buffer_measurement() is limited to measuring the kexec boot
command line. This patch makes process_buffer_measurement() more generic,
allowing it to measure other types of buffer data (e.g. blacklisted
binary hashes or key hashes).

process_buffer_measurement() may be called directly from an IMA hook or
as an auxiliary measurement record. In both cases the buffer measurement
is based on policy. This patch modifies the function to conditionally
retrieve the policy-defined PCR and template for the IMA hook case.

Signed-off-by: Nayna Jain
[zo...@linux.ibm.com: added comment in process_buffer_measurement()]
Signed-off-by: Mimi Zohar
---
 security/integrity/ima/ima.h      |  3 ++
 security/integrity/ima/ima_main.c | 58 +++++++++++++++++++++----------
 2 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 3689081aaf38..a65772ffa427 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -217,6 +217,9 @@ void ima_store_measurement(struct integrity_iint_cache *iint, struct file *file,
 			   struct evm_ima_xattr_data *xattr_value,
 			   int xattr_len, const struct modsig *modsig, int pcr,
 			   struct ima_template_desc *template_desc);
+void process_buffer_measurement(const void *buf, int size,
+				const char *eventname, enum ima_hooks func,
+				int pcr);
 void ima_audit_measurement(struct integrity_iint_cache *iint,
 			   const unsigned char *filename);
 int ima_alloc_init_template(struct ima_event_data *event_data,
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 60027c643ecd..a26e3ad4e886 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -626,14 +626,14 @@ int ima_load_data(enum kernel_load_data_id id)
  * @buf: pointer to the buffer that needs to be added to the log.
  * @size: size of buffer(in bytes).
  * @eventname: event name to be used for the buffer entry.
- * @cred: a pointer to a credentials structure for user validation.
- * @secid: the secid of the task to be validated.
+ * @func: IMA hook
+ * @pcr: pcr to extend the measurement
  *
  * Based on policy, the buffer is measured into the ima log.
  */
-static void process_buffer_measurement(const void *buf, int size,
-				       const char *eventname,
-				       const struct cred *cred, u32 secid)
+void process_buffer_measurement(const void *buf, int size,
+				const char *eventname, enum ima_hooks func,
+				int pcr)
 {
 	int ret = 0;
 	struct ima_template_entry *entry = NULL;
@@ -642,19 +642,45 @@ static void process_buffer_measurement(const void *buf, int size,
 					    .filename = eventname,
 					    .buf = buf,
 					    .buf_len = size};
-	struct ima_template_desc *template_desc = NULL;
+	struct ima_template_desc *template = NULL;
 	struct {
 		struct ima_digest_data hdr;
 		char digest[IMA_MAX_DIGEST_SIZE];
 	} hash = {};
 	int violation = 0;
-	int pcr = CONFIG_IMA_MEASURE_PCR_IDX;
 	int action = 0;
+	u32 secid;

-	action = ima_get_action(NULL, cred, secid, 0, KEXEC_CMDLINE, &pcr,
-				&template_desc);
-	if (!(action & IMA_MEASURE))
-		return;
+	/*
+	 * Both LSM hooks and auxiliary based buffer measurements are
+	 * based on policy. To avoid code duplication, differentiate
+	 * between the LSM hooks and auxiliary buffer measurements,
+	 * retrieving the policy rule information only for the LSM hook
+	 * buffer measurements.
+	 */
+	if (func) {
+		security_task_getsecid(current, &secid);
+		action = ima_get_action(NULL, current_cred(), secid, 0, func,
+					&pcr, &template);
+		if (!(action & IMA_MEASURE))
+			return;
+	}
+
+	if (!pcr)
+		pcr = CONFIG_IMA_MEASURE_PCR_IDX;
+
+	if (!template) {
+		template = lookup_template_desc("ima-buf");
+		ret = template_desc_init_fields(template->fmt,
+						&(template->fields),
+						&(template->num_fields));
+		if (ret < 0) {
+			pr_err("template %s init failed, result: %d\n",
+			       (strlen(template->name) ?
+				template->name : template->fmt), ret);
+			return;
+		}
+	}

 	iint.ima_hash = &hash.hdr;
 	iint.ima_hash->algo = ima_hash_algo;
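The fallback logic the patch introduces — policy decides the PCR on the hook path, an explicit or default PCR applies on the auxiliary path — can be isolated as a small pure function. This is a sketch of the control flow only; the function name and the encoding of "no hook" as 0 are illustrative assumptions, and the real code also resolves the template the same way.

```c
#include <assert.h>

/* Mirrors CONFIG_IMA_MEASURE_PCR_IDX's common default. */
#define DEFAULT_MEASURE_PCR 10

/*
 * hook_func != 0 models an LSM-hook caller: the policy-supplied PCR wins.
 * hook_func == 0 models an auxiliary record: the caller's PCR is used if
 * given, otherwise the compile-time default.
 */
int resolve_pcr(int hook_func, int policy_pcr, int requested_pcr)
{
	int pcr = requested_pcr;

	if (hook_func)			/* policy rule supplies the PCR */
		pcr = policy_pcr;
	if (!pcr)			/* no explicit PCR: fall back */
		pcr = DEFAULT_MEASURE_PCR;
	return pcr;
}
```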
[PATCH v10 4/9] powerpc/ima: define trusted boot policy
From: Nayna Jain

This patch defines an arch-specific trusted boot only policy and a
combined secure and trusted boot policy.

Signed-off-by: Nayna Jain
Signed-off-by: Mimi Zohar
---
 arch/powerpc/kernel/ima_arch.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
index d88913dc0da7..0ef5956c9753 100644
--- a/arch/powerpc/kernel/ima_arch.c
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -31,13 +31,44 @@ static const char *const secure_rules[] = {
 };

 /*
+ * The "trusted_rules" are enabled only on "trustedboot" enabled systems.
+ * These rules add the kexec kernel image and kernel modules file hashes to
+ * the IMA measurement list.
+ */
+static const char *const trusted_rules[] = {
+	"measure func=KEXEC_KERNEL_CHECK",
+	"measure func=MODULE_CHECK",
+	NULL
+};
+
+/*
+ * The "secure_and_trusted_rules" contains rules for both the secure boot and
+ * trusted boot. The "template=ima-modsig" option includes the appended
+ * signature, when available, in the IMA measurement list.
+ */
+static const char *const secure_and_trusted_rules[] = {
+	"measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
+	"measure func=MODULE_CHECK template=ima-modsig",
+	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#ifndef CONFIG_MODULE_SIG_FORCE
+	"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+	NULL
+};
+
+/*
  * Returns the relevant IMA arch-specific policies based on the system secure
  * boot state.
  */
 const char *const *arch_get_ima_policy(void)
 {
 	if (is_ppc_secureboot_enabled())
-		return secure_rules;
+		if (is_ppc_trustedboot_enabled())
+			return secure_and_trusted_rules;
+		else
+			return secure_rules;
+	else if (is_ppc_trustedboot_enabled())
+		return trusted_rules;

 	return NULL;
 }
--
2.7.5
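The selection logic above collapses the four (secure, trusted) states into three rule tables or none. A self-contained userspace sketch of that decision, with the rule strings abbreviated and the boot-state queries reduced to plain booleans, looks like this:

```c
#include <assert.h>
#include <stddef.h>

/* Abbreviated stand-ins for the kernel's policy rule tables. */
static const char *const secure_rules[] = {
	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig", NULL };
static const char *const trusted_rules[] = {
	"measure func=KEXEC_KERNEL_CHECK", NULL };
static const char *const secure_and_trusted_rules[] = {
	"measure func=KEXEC_KERNEL_CHECK template=ima-modsig", NULL };

/* Mirrors arch_get_ima_policy(): secure boot dominates, trusted boot
 * selects the combined table when both are on, and neither means no
 * arch policy at all. */
const char *const *get_policy(int secureboot, int trustedboot)
{
	if (secureboot)
		return trustedboot ? secure_and_trusted_rules : secure_rules;
	if (trustedboot)
		return trusted_rules;
	return NULL;
}
```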
[PATCH v10 1/9] powerpc: detect the secure boot mode of the system
From: Nayna Jain

This patch defines a function to detect the secure boot state of a
PowerNV system.

The PPC_SECURE_BOOT config represents the base enablement of secure boot
for powerpc.

Signed-off-by: Nayna Jain
---
 arch/powerpc/Kconfig                   | 10 ++++++++++
 arch/powerpc/include/asm/secure_boot.h | 23 +++++++++++++++++++++
 arch/powerpc/kernel/Makefile           |  2 ++
 arch/powerpc/kernel/secure_boot.c      | 32 ++++++++++++++++++++++++++
 4 files changed, 67 insertions(+)
 create mode 100644 arch/powerpc/include/asm/secure_boot.h
 create mode 100644 arch/powerpc/kernel/secure_boot.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16e..56ea0019b616 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -934,6 +934,16 @@ config PPC_MEM_KEYS

 	  If unsure, say y.

+config PPC_SECURE_BOOT
+	prompt "Enable secure boot support"
+	bool
+	depends on PPC_POWERNV
+	help
+	  Systems with firmware secure boot enabled need to define security
+	  policies to extend secure boot to the OS. This config allows a user
+	  to enable OS secure boot on systems that have firmware support for
+	  it. If in doubt say N.
+
 endmenu

 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/secure_boot.h b/arch/powerpc/include/asm/secure_boot.h
new file mode 100644
index ..07d0fe0ca81f
--- /dev/null
+++ b/arch/powerpc/include/asm/secure_boot.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Secure boot definitions
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ */
+#ifndef _ASM_POWER_SECURE_BOOT_H
+#define _ASM_POWER_SECURE_BOOT_H
+
+#ifdef CONFIG_PPC_SECURE_BOOT
+
+bool is_ppc_secureboot_enabled(void);
+
+#else
+
+static inline bool is_ppc_secureboot_enabled(void)
+{
+	return false;
+}
+
+#endif
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index a7ca8fe62368..e2a54fa240ac 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -161,6 +161,8 @@ ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),)
 obj-y				+= ucall.o
 endif

+obj-$(CONFIG_PPC_SECURE_BOOT)	+= secure_boot.o
+
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
 KCOV_INSTRUMENT_prom_init.o := n
diff --git a/arch/powerpc/kernel/secure_boot.c b/arch/powerpc/kernel/secure_boot.c
new file mode 100644
index ..63dc82c50862
--- /dev/null
+++ b/arch/powerpc/kernel/secure_boot.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ */
+#include <linux/types.h>
+#include <linux/of.h>
+#include <asm/secure_boot.h>
+
+bool is_ppc_secureboot_enabled(void)
+{
+	struct device_node *node;
+	bool enabled = false;
+
+	node = of_find_compatible_node(NULL, NULL, "ibm,secvar-v1");
+	if (!of_device_is_available(node)) {
+		pr_err("Cannot find secure variable node in device tree; failing to secure state\n");
+		goto out;
+	}
+
+	/*
+	 * secureboot is enabled if the os-secure-enforcing property exists,
+	 * else disabled.
+	 */
+	enabled = of_property_read_bool(node, "os-secure-enforcing");
+
+out:
+	of_node_put(node);
+
+	pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
+	return enabled;
+}
--
2.7.5
[PATCH v10 3/9] powerpc: detect the trusted boot state of the system
From: Nayna Jain

While secure boot permits only properly verified signed kernels to be
booted, trusted boot calculates the file hash of the kernel image and
stores the measurement prior to boot, which can subsequently be compared
against known good values via attestation services.

This patch reads the trusted boot state of a PowerNV system. The state is
used to conditionally enable additional measurement rules in the IMA
arch-specific policies.

Signed-off-by: Nayna Jain
---
 arch/powerpc/include/asm/secure_boot.h |  6 ++++++
 arch/powerpc/kernel/secure_boot.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/secure_boot.h b/arch/powerpc/include/asm/secure_boot.h
index 07d0fe0ca81f..a2ff556916c6 100644
--- a/arch/powerpc/include/asm/secure_boot.h
+++ b/arch/powerpc/include/asm/secure_boot.h
@@ -11,6 +11,7 @@
 #ifdef CONFIG_PPC_SECURE_BOOT

 bool is_ppc_secureboot_enabled(void);
+bool is_ppc_trustedboot_enabled(void);

 #else

@@ -19,5 +20,10 @@ static inline bool is_ppc_secureboot_enabled(void)
 	return false;
 }

+static inline bool is_ppc_trustedboot_enabled(void)
+{
+	return false;
+}
+
 #endif
 #endif
diff --git a/arch/powerpc/kernel/secure_boot.c b/arch/powerpc/kernel/secure_boot.c
index 63dc82c50862..a6a5f17ede03 100644
--- a/arch/powerpc/kernel/secure_boot.c
+++ b/arch/powerpc/kernel/secure_boot.c
@@ -7,6 +7,17 @@
 #include <linux/of.h>
 #include <asm/secure_boot.h>

+static struct device_node *get_ppc_fw_sb_node(void)
+{
+	static const struct of_device_id ids[] = {
+		{ .compatible = "ibm,secureboot-v1", },
+		{ .compatible = "ibm,secureboot-v2", },
+		{},
+	};
+
+	return of_find_matching_node(NULL, ids);
+}
+
 bool is_ppc_secureboot_enabled(void)
 {
 	struct device_node *node;
@@ -30,3 +41,18 @@ bool is_ppc_secureboot_enabled(void)
 	pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
 	return enabled;
 }
+
+bool is_ppc_trustedboot_enabled(void)
+{
+	struct device_node *node;
+	bool enabled = false;
+
+	node = get_ppc_fw_sb_node();
+	enabled = of_property_read_bool(node, "trusted-enabled");
+
+	of_node_put(node);
+
+	pr_info("Trusted boot mode %s\n", enabled ? "enabled" : "disabled");
+
+	return enabled;
+}
--
2.7.5
[PATCH v10 2/9] powerpc/ima: add support to initialize ima policy rules
From: Nayna Jain

PowerNV systems use a Linux-based bootloader, which relies on the IMA
subsystem to enforce different secure boot modes. Since the verification
policy may differ based on the secure boot mode of the system, the
policies must be defined at runtime.

This patch implements arch-specific support to define IMA policy rules
based on the runtime secure boot mode of the system. It provides
arch-specific IMA policies if the PPC_SECURE_BOOT config is enabled.

Signed-off-by: Nayna Jain
Signed-off-by: Mimi Zohar
---
 arch/powerpc/Kconfig           |  1 +
 arch/powerpc/kernel/Makefile   |  2 +-
 arch/powerpc/kernel/ima_arch.c | 43 ++++++++++++++++++++++++++++++++++
 include/linux/ima.h            |  3 ++-
 4 files changed, 47 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kernel/ima_arch.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 56ea0019b616..c795039bdc73 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -938,6 +938,7 @@ config PPC_SECURE_BOOT
 	prompt "Enable secure boot support"
 	bool
 	depends on PPC_POWERNV
+	depends on IMA_ARCH_POLICY
 	help
 	  Systems with firmware secure boot enabled need to define security
 	  policies to extend secure boot to the OS. This config allows a user
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index e2a54fa240ac..e8eb2955b7d5 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -161,7 +161,7 @@ ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),)
 obj-y				+= ucall.o
 endif

-obj-$(CONFIG_PPC_SECURE_BOOT)	+= secure_boot.o
+obj-$(CONFIG_PPC_SECURE_BOOT)	+= secure_boot.o ima_arch.o

 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
new file mode 100644
index ..d88913dc0da7
--- /dev/null
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ */
+
+#include <linux/ima.h>
+#include <asm/secure_boot.h>
+
+bool arch_ima_get_secureboot(void)
+{
+	return is_ppc_secureboot_enabled();
+}
+
+/*
+ * The "secure_rules" are enabled only on "secureboot" enabled systems.
+ * These rules verify the file signatures against known good values.
+ * The "appraise_type=imasig|modsig" option allows the known good signature
+ * to be stored as an xattr or as an appended signature.
+ *
+ * To avoid duplicate signature verification as much as possible, the IMA
+ * policy rule for module appraisal is added only if CONFIG_MODULE_SIG_FORCE
+ * is not enabled.
+ */
+static const char *const secure_rules[] = {
+	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#ifndef CONFIG_MODULE_SIG_FORCE
+	"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+	NULL
+};
+
+/*
+ * Returns the relevant IMA arch-specific policies based on the system secure
+ * boot state.
+ */
+const char *const *arch_get_ima_policy(void)
+{
+	if (is_ppc_secureboot_enabled())
+		return secure_rules;
+
+	return NULL;
+}
diff --git a/include/linux/ima.h b/include/linux/ima.h
index 1c37f17f7203..6d904754d858 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -29,7 +29,8 @@ extern void ima_kexec_cmdline(const void *buf, int size);
 extern void ima_add_kexec_buffer(struct kimage *image);
 #endif

-#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390)
+#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \
+	|| defined(CONFIG_PPC_SECURE_BOOT)
 extern bool arch_ima_get_secureboot(void);
 extern const char * const *arch_get_ima_policy(void);
 #else
--
2.7.5
[PATCH v10 0/9] powerpc: Enabling IMA arch specific secure boot policies
This patchset extends the previous version[1] by adding support for checking against a blacklist of binary hashes.

The IMA subsystem supports custom, built-in, arch-specific policies to define the files to be measured and appraised. These policies are honored based on priority, where the arch-specific policy is the highest and custom is the lowest.

PowerNV systems use a Linux-based bootloader to kexec the OS. The bootloader kernel relies on IMA for signature verification of the OS kernel before doing the kexec. This patchset adds support for powerpc arch-specific IMA policies that are conditionally defined based on a system's secure boot and trusted boot states. The OS secure boot and trusted boot states are determined via device-tree properties.

The verification needs to be performed only for binaries that are not blacklisted. The kernel currently only checks against a blacklist of keys. However, doing so results in blacklisting all the binaries that are signed by the same key. In order to prevent just one particular binary from being loaded, it must be checked against a blacklist of binary hashes. This patchset therefore also adds support to IMA for checking files signed with an appended signature against a hash blacklist.

[1] http://patchwork.ozlabs.org/cover/1149262/

Changelog:

v10: (Mimi posting patch set on Nayna's behalf)
 - Minor patch description changes
 - Include comment in process_buffer_measurement()
 - Additional patch: Enforcing kernel module appended signatures should be reflected in "/sys/module/module/parameters/sig_enforce".
 - Trimmed Cc list.

v9:
 * Includes feedback from Michael
   - fix the missing of_node_put()
 * Includes Mimi's feedback
   - fix the policy show() function to display check_blacklist
   - fix the other comment-related issues and the patch description
   - add the example of blacklist in Patch 7/8
 Note: Patch 7/8 is giving errors when checkpatch.pl is run because of the format of showing the measurement record as part of the example.
 I am not very sure if that can be fixed, as we need to represent the measurements as is.

v8:
 * Updates the patch descriptions as per Michael's and Mimi's feedback
 * Includes feedback from Michael for the device tree and policies
   - removes the arch-policy hack by defining three arrays
   - fixes related to device-tree calls
   - other code-specific feedback
 * Includes feedback from Mimi on the blacklist
   - the generic blacklist function is modified from the previous version
   - other coding fixes

v7:
 * Removes the patch related to dt-bindings as per input from Rob Herring.
 * fixes Patch 1/8 to use the new device-tree updates as per Oliver's feedback to the device-tree documentation on the skiboot mailing list. (https://lists.ozlabs.org/pipermail/skiboot/2019-September/015329.html)
 * Includes feedback from Mimi and Thiago
   - moves function get_powerpc_fw_sb_node() from Patch 1 to Patch 3
   - fixes Patch 2/8 to use CONFIG_MODULE_SIG_FORCE.
   - updates the patch description of Patch 5/8
   - adds a new patch to add the wrapper is_binary_blacklisted()
   - removes the patch that deprecated permit_directio

v6:
 * includes feedback from Michael Ellerman on patchset v5
   - removed email ids from comments
   - add the doc for the device-tree
   - renames secboot.c to secure_boot.c and secboot.h to secure_boot.h
   - other code-specific fixes
 * split the patches to differentiate between the secureboot and trustedboot state of the system
 * adds the patches to support the blacklisting of binary hashes

v5:
 * secureboot state is now read via a device-tree entry rather than OPAL secure variables
 * ima arch policies are updated to use a policy-based template for measurement rules

v4:
 * Fixed the build issue as reported by Satheesh Rajendran.

v3:
 * OPAL APIs in Patch 1 are updated to provide a generic interface based on key/keylen. This patchset updates kernel OPAL APIs to be compatible with the generic interface.
 * Patch 2 is cleaned up to use the new OPAL APIs.
 * Since OPAL can support different types of backend, which can vary in the variable interpretation, Patch 2 is updated to add a check for the backend version
 * The OPAL API now expects the consumer to first check the supported backend version before calling other secvar OPAL APIs. This check is now added in Patch 2.
 * IMA policies in Patch 3 are updated to specify appended signatures and a per-policy template.
 * The patches are now free of any EFIisms.

v2:
 * Removed Patch 1: powerpc/include: Override unneeded early ioremap functions
 * Updated the Subject line and patch description of Patch 1 of this series
 * Removed the dependency of OPAL_SECVAR on EFI, CPU_BIG_ENDIAN and UCS2_STRING
 * Changed OPAL APIs from static to non-static. Added opal-secvar.h for the same
 * Removed EFI hooks from opal_secvar.c
 * Removed the opal_secvar_get_next(), opal_secvar_enqueue() and opal_query_variable_info() functions
 * get_powerpc_sb_mode() in secboot.c now directly calls the OPAL runtime API rather than going via EFI hooks.
 * Fixed log messages in the get_powerpc_sb_mode() function.
 * Added
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On 2019/10/30 18:14, Peter Zijlstra wrote: > On Wed, Oct 30, 2019 at 05:34:28PM +0800, Yunsheng Lin wrote: >> When passing the return value of dev_to_node() to cpumask_of_node() >> without checking if the device's node id is NUMA_NO_NODE, there is >> global-out-of-bounds detected by KASAN. >> >> From the discussion [1], NUMA_NO_NODE really means no node affinity, >> which also means all cpus should be usable. So the cpumask_of_node() >> should always return all cpus online when user passes the node id as >> NUMA_NO_NODE, just like similar semantic that page allocator handles >> NUMA_NO_NODE. >> >> But we cannot really copy the page allocator logic. Simply because the >> page allocator doesn't enforce the near node affinity. It just picks it >> up as a preferred node but then it is free to fallback to any other numa >> node. This is not the case here and node_to_cpumask_map will only restrict >> to the particular node's cpus which would have really non deterministic >> behavior depending on where the code is executed. So in fact we really >> want to return cpu_online_mask for NUMA_NO_NODE. >> >> Also there is a debugging version of node_to_cpumask_map() for x86 and >> arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is defined, this >> patch changes it to handle NUMA_NO_NODE as normal node_to_cpumask_map(). >> >> [1] https://lkml.org/lkml/2019/9/11/66 >> Signed-off-by: Yunsheng Lin >> Suggested-by: Michal Hocko >> Acked-by: Michal Hocko >> Acked-by: Paul Burton # MIPS bits > > Still: > > Nacked-by: Peter Zijlstra (Intel) It seems I still misunderstood your meaning by "We must not silently accept NO_NODE there" in [1]. I am not sure if there is still disagreement that the NO_NODE state for dev->numa_node should exist at all. 
From the previous discussion [2], you seem to propose doing a "wild guess" or "fixup" for all devices (including virtual and physical ones) with NO_NODE, which means NO_NODE is not needed anymore and should be removed once the "wild guess" or "fixup" is done. So maybe the reason for your nack here is that there should be no other NO_NODE handling or fixes related to NO_NODE before the "wild guess" or "fixup" process is finished, which would make a NUMA_NO_NODE-aware node_to_cpumask_map() unnecessary. Or is your reason for the nack still specific to PCIe devices without a NUMA node, where the "wild guess" needs to be done before making node_to_cpumask_map() NUMA_NO_NODE aware? Please help clarify the reason for the nack. Or is there still some other reason for the nack that I missed from the previous discussion? Thanks. [1] https://lore.kernel.org/lkml/2019101539.gx2...@hirez.programming.kicks-ass.net/ [2] https://lore.kernel.org/lkml/20191014094912.gy2...@hirez.programming.kicks-ass.net/
Re: [PATCH v2] cpufreq: powernv: fix stack bloat and NR_CPUS limitation
Hi John, Sorry I didn't reply to this sooner, too many patches :/ John Hubbard writes: > The following build warning occurred on powerpc 64-bit builds: > > drivers/cpufreq/powernv-cpufreq.c: In function 'init_chip_info': > drivers/cpufreq/powernv-cpufreq.c:1070:1: warning: the frame size of 1040 > bytes is larger than 1024 bytes [-Wframe-larger-than=] Oddly I don't see that warning in my builds, eg with GCC9: https://travis-ci.org/linuxppc/linux/jobs/604870722 > This is due to putting 1024 bytes on the stack: > > unsigned int chip[256]; > > ...and while looking at this, it also has a bug: it fails with a stack > overrun, if CONFIG_NR_CPUS > 256. It _probably_ doesn't, because it only increments the index when the chip_id of the CPU changes, ie. it doesn't create a chip for every CPU. But I agree it's flaky the way it's written. > Fix both problems by dynamically allocating based on CONFIG_NR_CPUS. Shouldn't it use num_possible_cpus() ? Given the for loop is over possible CPUs that seems like the upper bound. In practice it should be lower because some CPUs will share a chip. cheers > Fixes: 053819e0bf840 ("cpufreq: powernv: Handle throttling due to Pmax > capping at chip level") > Cc: Shilpasri G Bhat > Cc: Preeti U Murthy > Cc: Viresh Kumar > Cc: Rafael J. Wysocki > Cc: linux...@vger.kernel.org > Cc: linuxppc-dev@lists.ozlabs.org > Signed-off-by: John Hubbard > --- > > Changes since v1: includes Viresh's review commit fixes. 
>
>  drivers/cpufreq/powernv-cpufreq.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 6061850e59c9..5b2e968cb5ea 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -1041,9 +1041,14 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>
>  static int init_chip_info(void)
>  {
> -	unsigned int chip[256];
> +	unsigned int *chip;
>  	unsigned int cpu, i;
>  	unsigned int prev_chip_id = UINT_MAX;
> +	int ret = 0;
> +
> +	chip = kcalloc(CONFIG_NR_CPUS, sizeof(*chip), GFP_KERNEL);
> +	if (!chip)
> +		return -ENOMEM;
>
>  	for_each_possible_cpu(cpu) {
>  		unsigned int id = cpu_to_chip_id(cpu);
> @@ -1055,8 +1060,10 @@ static int init_chip_info(void)
>  	}
>
>  	chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
> -	if (!chips)
> -		return -ENOMEM;
> +	if (!chips) {
> +		ret = -ENOMEM;
> +		goto free_and_return;
> +	}
>
>  	for (i = 0; i < nr_chips; i++) {
>  		chips[i].id = chip[i];
> @@ -1066,7 +1073,9 @@ static int init_chip_info(void)
>  		per_cpu(chip_info, cpu) = &chips[i];
>  	}
>
> -	return 0;
> +free_and_return:
> +	kfree(chip);
> +	return ret;
>  }
>
>  static inline void clean_chip_info(void)
> --
> 2.23.0
Re: Pull request: scottwood/linux.git next
Hi Michael,

Can you pull this to linux-next so that we can test it on linux-next for some time?

Thanks,
Jason

On 2019/10/23 7:21, Scott Wood wrote:

This contains KASLR support for book3e 32-bit.

The following changes since commit 612ee81b9461475b5a5612c2e8d71559dd3c7920:

  powerpc/papr_scm: Fix an off-by-one check in papr_scm_meta_{get, set} (2019-10-10 20:15:53 +1100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next

for you to fetch changes up to 9df1ef3f1376ec5d3a1b51a4546c94279bcd88ca:

  powerpc/fsl_booke/32: Document KASLR implementation (2019-10-21 16:09:16 -0500)

Jason Yan (12):
      powerpc: unify definition of M_IF_NEEDED
      powerpc: move memstart_addr and kernstart_addr to init-common.c
      powerpc: introduce kernstart_virt_addr to store the kernel base
      powerpc/fsl_booke/32: introduce create_kaslr_tlb_entry() helper
      powerpc/fsl_booke/32: introduce reloc_kernel_entry() helper
      powerpc/fsl_booke/32: implement KASLR infrastructure
      powerpc/fsl_booke/32: randomize the kernel image offset
      powerpc/fsl_booke/kaslr: clear the original kernel if randomized
      powerpc/fsl_booke/kaslr: support nokaslr cmdline parameter
      powerpc/fsl_booke/kaslr: dump out kernel offset information on panic
      powerpc/fsl_booke/kaslr: export offset in VMCOREINFO ELF notes
      powerpc/fsl_booke/32: Document KASLR implementation

 Documentation/powerpc/kaslr-booke32.rst       |  42 +++
 arch/powerpc/Kconfig                          |  11 +
 arch/powerpc/include/asm/nohash/mmu-book3e.h  |  11 +-
 arch/powerpc/include/asm/page.h               |   7 +
 arch/powerpc/kernel/early_32.c                |   5 +-
 arch/powerpc/kernel/exceptions-64e.S          |  12 +-
 arch/powerpc/kernel/fsl_booke_entry_mapping.S |  25 +-
 arch/powerpc/kernel/head_fsl_booke.S          |  61 +++-
 arch/powerpc/kernel/machine_kexec.c           |   1 +
 arch/powerpc/kernel/misc_64.S                 |   7 +-
 arch/powerpc/kernel/setup-common.c            |  20 ++
 arch/powerpc/mm/init-common.c                 |   7 +
 arch/powerpc/mm/init_32.c                     |   5 -
 arch/powerpc/mm/init_64.c                     |   5 -
 arch/powerpc/mm/mmu_decl.h                    |  11 +
 arch/powerpc/mm/nohash/Makefile               |   1 +
 arch/powerpc/mm/nohash/fsl_booke.c            |   8 +-
 arch/powerpc/mm/nohash/kaslr_booke.c          | 401 ++
 18 files changed, 587 insertions(+), 53 deletions(-)
 create mode 100644 Documentation/powerpc/kaslr-booke32.rst
 create mode 100644 arch/powerpc/mm/nohash/kaslr_booke.c
Re: [PATCH 1/3] arch: ipcbuf.h: make uapi asm/ipcbuf.h self-contained
On Thu, 31 Oct 2019 10:33:00 +0900 Masahiro Yamada wrote: > Hi Andrew, > > I think this patch has already been picked up to your tree, > but I noticed a typo in the commit message just now. > Please see below. > > ... > > > Include to make it self-contained, and add it to > > Include to make ... > > Could you please fix it up locally? No probs, done.
Re: [PATCH 1/3] arch: ipcbuf.h: make uapi asm/ipcbuf.h self-contained
Hi Andrew, I think this patch has already been picked up to your tree, but I noticed a typo in the commit message just now. Please see below. On Wed, Oct 30, 2019 at 3:40 PM Masahiro Yamada wrote: > > The user-space cannot compile due to some missing type > definitions. For example, building it for x86 fails as follows: > > CC usr/include/asm/ipcbuf.h.s > In file included from ./usr/include/asm/ipcbuf.h:1:0, > from :32: > ./usr/include/asm-generic/ipcbuf.h:21:2: error: unknown type name > ‘__kernel_key_t’ > __kernel_key_t key; > ^~ > ./usr/include/asm-generic/ipcbuf.h:22:2: error: unknown type name > ‘__kernel_uid32_t’ > __kernel_uid32_t uid; > ^~~~ > ./usr/include/asm-generic/ipcbuf.h:23:2: error: unknown type name > ‘__kernel_gid32_t’ > __kernel_gid32_t gid; > ^~~~ > ./usr/include/asm-generic/ipcbuf.h:24:2: error: unknown type name > ‘__kernel_uid32_t’ > __kernel_uid32_t cuid; > ^~~~ > ./usr/include/asm-generic/ipcbuf.h:25:2: error: unknown type name > ‘__kernel_gid32_t’ > __kernel_gid32_t cgid; > ^~~~ > ./usr/include/asm-generic/ipcbuf.h:26:2: error: unknown type name > ‘__kernel_mode_t’ > __kernel_mode_t mode; > ^~~ > ./usr/include/asm-generic/ipcbuf.h:28:35: error: ‘__kernel_mode_t’ undeclared > here (not in a function) > unsigned char __pad1[4 - sizeof(__kernel_mode_t)]; >^~~ > ./usr/include/asm-generic/ipcbuf.h:31:2: error: unknown type name > ‘__kernel_ulong_t’ > __kernel_ulong_t __unused1; > ^~~~ > ./usr/include/asm-generic/ipcbuf.h:32:2: error: unknown type name > ‘__kernel_ulong_t’ > __kernel_ulong_t __unused2; > ^~~~ > > It is just a matter of missing include directive. > > Include to make it self-contained, and add it to Include to make ... Could you please fix it up locally? Thank you. Masahiro Yamada > the compile-test coverage. 
> > Signed-off-by: Masahiro Yamada > --- > > arch/s390/include/uapi/asm/ipcbuf.h | 2 ++ > arch/sparc/include/uapi/asm/ipcbuf.h | 2 ++ > arch/xtensa/include/uapi/asm/ipcbuf.h | 2 ++ > include/uapi/asm-generic/ipcbuf.h | 2 ++ > usr/include/Makefile | 1 - > 5 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/arch/s390/include/uapi/asm/ipcbuf.h > b/arch/s390/include/uapi/asm/ipcbuf.h > index 5b1c4f47c656..1030cd186899 100644 > --- a/arch/s390/include/uapi/asm/ipcbuf.h > +++ b/arch/s390/include/uapi/asm/ipcbuf.h > @@ -2,6 +2,8 @@ > #ifndef __S390_IPCBUF_H__ > #define __S390_IPCBUF_H__ > > +#include > + > /* > * The user_ipc_perm structure for S/390 architecture. > * Note extra padding because this structure is passed back and forth > diff --git a/arch/sparc/include/uapi/asm/ipcbuf.h > b/arch/sparc/include/uapi/asm/ipcbuf.h > index 9d0d125500e2..5b933a598a33 100644 > --- a/arch/sparc/include/uapi/asm/ipcbuf.h > +++ b/arch/sparc/include/uapi/asm/ipcbuf.h > @@ -2,6 +2,8 @@ > #ifndef __SPARC_IPCBUF_H > #define __SPARC_IPCBUF_H > > +#include > + > /* > * The ipc64_perm structure for sparc/sparc64 architecture. 
> * Note extra padding because this structure is passed back and forth > diff --git a/arch/xtensa/include/uapi/asm/ipcbuf.h > b/arch/xtensa/include/uapi/asm/ipcbuf.h > index a57afa0b606f..3bd0642f6660 100644 > --- a/arch/xtensa/include/uapi/asm/ipcbuf.h > +++ b/arch/xtensa/include/uapi/asm/ipcbuf.h > @@ -12,6 +12,8 @@ > #ifndef _XTENSA_IPCBUF_H > #define _XTENSA_IPCBUF_H > > +#include > + > /* > * Pad space is left for: > * - 32-bit mode_t and seq > diff --git a/include/uapi/asm-generic/ipcbuf.h > b/include/uapi/asm-generic/ipcbuf.h > index 7d80dbd336fb..41a01b494fc7 100644 > --- a/include/uapi/asm-generic/ipcbuf.h > +++ b/include/uapi/asm-generic/ipcbuf.h > @@ -2,6 +2,8 @@ > #ifndef __ASM_GENERIC_IPCBUF_H > #define __ASM_GENERIC_IPCBUF_H > > +#include > + > /* > * The generic ipc64_perm structure: > * Note extra padding because this structure is passed back and forth > diff --git a/usr/include/Makefile b/usr/include/Makefile > index 57b20f7b6729..70f8fe256aed 100644 > --- a/usr/include/Makefile > +++ b/usr/include/Makefile > @@ -16,7 +16,6 @@ override c_flags = $(UAPI_CFLAGS) -Wp,-MD,$(depfile) > -I$(objtree)/usr/include > # Please consider to fix the header first. > # > # Sorted alphabetically. > -header-test- += asm/ipcbuf.h > header-test- += asm/msgbuf.h > header-test- += asm/sembuf.h > header-test- += asm/shmbuf.h > -- > 2.17.1 > -- Best Regards Masahiro Yamada
[Bug 205327] kmemleak reports various leaks in "swapper/0"
https://bugzilla.kernel.org/show_bug.cgi?id=205327 --- Comment #4 from Erhard F. (erhar...@mailbox.org) --- (In reply to Michael Ellerman from comment #3) > That looks like a pretty straight forward memory leak here: [...] > Do you want to send a patch? Thanks for pointing out! Yes, in this case writing a patch is within reach of my coding skills. ;) Have to check the procedure of how to submit a patch to the kernel first, as I have not done this yet. -- You are receiving this mail because: You are watching the assignee of the bug.
[RFC PATCH 5/5] powerpc: make iowrite32 and friends static inline when no indirection
When porting a powerpc-only driver to work on another architecture, one has to change e.g. out_be32 to iowrite32be. Unfortunately, while the other target architecture (in my case arm) may have static inline definitions of iowrite32 and friends, this change pessimizes the existing powerpc users of that driver since out_be32() is inline while iowrite32be() is out-of-line. When neither CONFIG_PPC_INDIRECT_PIO or CONFIG_PPC_INDIRECT_MMIO are set (e.g. all of PPC32), there's no reason for those to be out-of-line as they compile to just two or three instructions. So copy the definitions from iomap.c into io.h, make them static inline, and add the self-define macro boilerplate to prevent asm-generic/iomap.h from providing extern declarations. This means that kernel/iomap.c is now only compiled when !CONFIG_PPC_INDIRECT_PIO && CONFIG_PPC_INDIRECT_MMIO - a combination I don't think currently exists. So it's possible that file could simply be deleted. Signed-off-by: Rasmus Villemoes --- arch/powerpc/include/asm/io.h | 172 ++ arch/powerpc/kernel/Makefile | 2 +- 2 files changed, 173 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index a63ec938636d..a59310620067 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -638,6 +638,178 @@ static inline void name at \ #define writel_relaxed(v, addr)writel(v, addr) #define writeq_relaxed(v, addr)writeq(v, addr) +#if !defined(CONFIG_PPC_INDIRECT_PIO) && !defined(CONFIG_PPC_INDIRECT_MMIO) + +#define ioread8 ioread8 +static inline unsigned int ioread8(void __iomem *addr) +{ + return readb(addr); +} +#define ioread16 ioread16 +static inline unsigned int ioread16(void __iomem *addr) +{ + return readw(addr); +} +#define ioread16be ioread16be +static inline unsigned int ioread16be(void __iomem *addr) +{ + return readw_be(addr); +} +#define ioread32 ioread32 +static inline unsigned int ioread32(void __iomem *addr) +{ + return readl(addr); +} +#define 
ioread32be ioread32be +static inline unsigned int ioread32be(void __iomem *addr) +{ + return readl_be(addr); +} +#ifdef __powerpc64__ +#define ioread64 ioread64 +static inline u64 ioread64(void __iomem *addr) +{ + return readq(addr); +} +#define ioread64_lo_hi ioread64_lo_hi +static inline u64 ioread64_lo_hi(void __iomem *addr) +{ + return readq(addr); +} +#define ioread64_hi_lo ioread64_hi_lo +static inline u64 ioread64_hi_lo(void __iomem *addr) +{ + return readq(addr); +} +#define ioread64be ioread64be +static inline u64 ioread64be(void __iomem *addr) +{ + return readq_be(addr); +} +#define ioread64be_lo_hi ioread64be_lo_hi +static inline u64 ioread64be_lo_hi(void __iomem *addr) +{ + return readq_be(addr); +} +#define ioread64be_hi_lo ioread64be_hi_lo +static inline u64 ioread64be_hi_lo(void __iomem *addr) +{ + return readq_be(addr); +} +#endif /* __powerpc64__ */ + +#define iowrite8 iowrite8 +static inline void iowrite8(u8 val, void __iomem *addr) +{ + writeb(val, addr); +} +#define iowrite16 iowrite16 +static inline void iowrite16(u16 val, void __iomem *addr) +{ + writew(val, addr); +} +#define iowrite16be iowrite16be +static inline void iowrite16be(u16 val, void __iomem *addr) +{ + writew_be(val, addr); +} +#define iowrite32 iowrite32 +static inline void iowrite32(u32 val, void __iomem *addr) +{ + writel(val, addr); +} +#define iowrite32be iowrite32be +static inline void iowrite32be(u32 val, void __iomem *addr) +{ + writel_be(val, addr); +} +#ifdef __powerpc64__ +#define iowrite64 iowrite64 +static inline void iowrite64(u64 val, void __iomem *addr) +{ + writeq(val, addr); +} +#define iowrite64_lo_hi iowrite64_lo_hi +static inline void iowrite64_lo_hi(u64 val, void __iomem *addr) +{ + writeq(val, addr); +} +#define iowrite64_hi_lo iowrite64_hi_lo +static inline void iowrite64_hi_lo(u64 val, void __iomem *addr) +{ + writeq(val, addr); +} +#define iowrite64be iowrite64be +static inline void iowrite64be(u64 val, void __iomem *addr) +{ + writeq_be(val, addr); +} 
+#define iowrite64be_lo_hi iowrite64be_lo_hi +static inline void iowrite64be_lo_hi(u64 val, void __iomem *addr) +{ + writeq_be(val, addr); +} +#define iowrite64be_hi_lo iowrite64be_hi_lo +static inline void iowrite64be_hi_lo(u64 val, void __iomem *addr) +{ + writeq_be(val, addr); +} +#endif /* __powerpc64__ */ + +/* + * These are the "repeat read/write" functions. Note the + * non-CPU byte order. We do things in "IO byteorder" + * here. + * + * FIXME! We could make these do EEH handling if we really + * wanted. Not clear if we do. + */ +#define ioread8_rep ioread8_rep +static inline void ioread8_rep(void __iomem *addr, void *dst, unsigned long count) +{ + readsb(addr, dst, count); +} +#define ioread16_rep ioread16_rep +static inline void ioread16_rep(void
[RFC PATCH 4/5] powerpc: make pcibios_vaddr_is_ioport() static
The only caller of pcibios_vaddr_is_ioport() is in pci-common.c, so we can make it static and move it into the same #ifndef block as its caller. Signed-off-by: Rasmus Villemoes --- arch/powerpc/include/asm/pci-bridge.h | 9 - arch/powerpc/kernel/pci-common.c | 4 ++-- 2 files changed, 2 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index ea6ec65970ef..deb29a1c9708 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -283,14 +283,5 @@ extern struct pci_controller *pcibios_alloc_controller(struct device_node *dev); extern void pcibios_free_controller(struct pci_controller *phb); extern void pcibios_free_controller_deferred(struct pci_host_bridge *bridge); -#ifdef CONFIG_PCI -extern int pcibios_vaddr_is_ioport(void __iomem *address); -#else -static inline int pcibios_vaddr_is_ioport(void __iomem *address) -{ - return 0; -} -#endif /* CONFIG_PCI */ - #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_PCI_BRIDGE_H */ diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index d89a2426b405..928d7576c6c2 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -277,7 +277,8 @@ static resource_size_t pcibios_io_size(const struct pci_controller *hose) #endif } -int pcibios_vaddr_is_ioport(void __iomem *address) +#ifndef CONFIG_PPC_INDIRECT_PIO +static int pcibios_vaddr_is_ioport(void __iomem *address) { int ret = 0; struct pci_controller *hose; @@ -296,7 +297,6 @@ int pcibios_vaddr_is_ioport(void __iomem *address) return ret; } -#ifndef CONFIG_PPC_INDIRECT_PIO void pci_iounmap(struct pci_dev *dev, void __iomem *addr) { if (isa_vaddr_is_ioport(addr)) -- 2.23.0
[RFC PATCH 0/5] powerpc: make iowrite32be etc. inline
When trying to make the QUICC Engine drivers compile on arm, I mechanically (with coccinelle) changed out_be32() to iowrite32be() etc. Christophe pointed out [1][2] that that would pessimize the powerpc SOCs, since the IO accesses now incur a function call overhead. He asked that I try to make those IO accessors inline on ppc, and this is the best I could come up with.

At first I tried something that wouldn't need to touch anything outside arch/powerpc/, but I ended up with conditional inclusion of asm-generic headers and/or duplicating a lot of their contents. The diffstat may become a little better if kernel/iomap.c can indeed be removed (due to !CONFIG_PPC_INDIRECT_PIO && CONFIG_PPC_INDIRECT_MMIO never happening).

[1] https://lore.kernel.org/lkml/6ee121cf-0e3d-4aa0-2593-fcb00995e...@c-s.fr/
[2] https://lore.kernel.org/lkml/886d5218-6d6b-824c-3ab9-63aafe41f...@c-s.fr/

Rasmus Villemoes (5):
  asm-generic: move pci_iounmap from iomap.h to pci_iomap.h
  asm-generic: employ "ifndef foo; define foo foo" idiom in iomap.h
  powerpc: move pci_iounmap() from iomap.c to pci-common.c
  powerpc: make pcibios_vaddr_is_ioport() static
  powerpc: make iowrite32 and friends static inline when no indirection

 arch/powerpc/include/asm/io.h         | 172 ++
 arch/powerpc/include/asm/pci-bridge.h |   9 --
 arch/powerpc/kernel/Makefile          |   2 +-
 arch/powerpc/kernel/iomap.c           |  13 --
 arch/powerpc/kernel/pci-common.c      |  15 ++-
 include/asm-generic/iomap.h           | 104 +---
 include/asm-generic/pci_iomap.h       |   7 ++
 7 files changed, 282 insertions(+), 40 deletions(-)

--
2.23.0
[RFC PATCH 2/5] asm-generic: employ "ifndef foo; define foo foo" idiom in iomap.h
Make it possible for an architecture to include asm-generic/iomap.h without necessarily getting all the external declarations for iowrite32 and friends. For example, the architecture could (maybe just in some configurations) provide static inline versions of some or all of these, but still use asm-generic/iomap.h for the ARCH_HAS_IOREMAP_WT/WC logic. This will be used on powerpc. Signed-off-by: Rasmus Villemoes --- include/asm-generic/iomap.h | 94 ++--- 1 file changed, 88 insertions(+), 6 deletions(-) diff --git a/include/asm-generic/iomap.h b/include/asm-generic/iomap.h index 5f8321e8fea9..1b247d3b9fbb 100644 --- a/include/asm-generic/iomap.h +++ b/include/asm-generic/iomap.h @@ -26,47 +26,105 @@ * in the low address range. Architectures for which this is not * true can't use this generic implementation. */ +#ifndef ioread8 +#define ioread8 ioread8 extern unsigned int ioread8(void __iomem *); +#endif +#ifndef ioread16 +#define ioread16 ioread16 extern unsigned int ioread16(void __iomem *); +#endif +#ifndef ioread16be +#define ioread16be ioread16be extern unsigned int ioread16be(void __iomem *); +#endif +#ifndef ioread32 +#define ioread32 ioread32 extern unsigned int ioread32(void __iomem *); +#endif +#ifndef ioread32be +#define ioread32be ioread32be extern unsigned int ioread32be(void __iomem *); +#endif #ifdef CONFIG_64BIT +#ifndef ioread64 +#define ioread64 ioread64 extern u64 ioread64(void __iomem *); +#endif +#ifndef ioread64be +#define ioread64be ioread64be extern u64 ioread64be(void __iomem *); #endif +#endif /* CONFIG_64BIT */ #ifdef readq +#ifndef ioread64_lo_hi #define ioread64_lo_hi ioread64_lo_hi -#define ioread64_hi_lo ioread64_hi_lo -#define ioread64be_lo_hi ioread64be_lo_hi -#define ioread64be_hi_lo ioread64be_hi_lo extern u64 ioread64_lo_hi(void __iomem *addr); +#endif +#ifndef ioread64_hi_lo +#define ioread64_hi_lo ioread64_hi_lo extern u64 ioread64_hi_lo(void __iomem *addr); +#endif +#ifndef ioread64be_lo_hi +#define ioread64be_lo_hi 
ioread64be_lo_hi extern u64 ioread64be_lo_hi(void __iomem *addr); +#endif +#ifndef ioread64be_hi_lo +#define ioread64be_hi_lo ioread64be_hi_lo extern u64 ioread64be_hi_lo(void __iomem *addr); #endif +#endif /* readq */ +#ifndef iowrite8 +#define iowrite8 iowrite8 extern void iowrite8(u8, void __iomem *); +#endif +#ifndef iowrite16 +#define iowrite16 iowrite16 extern void iowrite16(u16, void __iomem *); +#endif +#ifndef iowrite16be +#define iowrite16be iowrite16be extern void iowrite16be(u16, void __iomem *); +#endif +#ifndef iowrite32 +#define iowrite32 iowrite32 extern void iowrite32(u32, void __iomem *); +#endif +#ifndef iowrite32be +#define iowrite32be iowrite32be extern void iowrite32be(u32, void __iomem *); +#endif #ifdef CONFIG_64BIT +#ifndef iowrite64 +#define iowrite64 iowrite64 extern void iowrite64(u64, void __iomem *); +#endif +#ifndef iowrite64be +#define iowrite64be iowrite64be extern void iowrite64be(u64, void __iomem *); #endif +#endif /* CONFIG_64BIT */ #ifdef writeq +#ifndef iowrite64_lo_hi #define iowrite64_lo_hi iowrite64_lo_hi -#define iowrite64_hi_lo iowrite64_hi_lo -#define iowrite64be_lo_hi iowrite64be_lo_hi -#define iowrite64be_hi_lo iowrite64be_hi_lo extern void iowrite64_lo_hi(u64 val, void __iomem *addr); +#endif +#ifndef iowrite64_hi_lo +#define iowrite64_hi_lo iowrite64_hi_lo extern void iowrite64_hi_lo(u64 val, void __iomem *addr); +#endif +#ifndef iowrite64be_lo_hi +#define iowrite64be_lo_hi iowrite64be_lo_hi extern void iowrite64be_lo_hi(u64 val, void __iomem *addr); +#endif +#ifndef iowrite64be_hi_lo +#define iowrite64be_hi_lo iowrite64be_hi_lo extern void iowrite64be_hi_lo(u64 val, void __iomem *addr); #endif +#endif /* writeq */ /* * "string" versions of the above. Note that they @@ -79,19 +137,43 @@ extern void iowrite64be_hi_lo(u64 val, void __iomem *addr); * memory across multiple ports, use "memcpy_toio()" * and friends. 
*/ +#ifndef ioread8_rep +#define ioread8_rep ioread8_rep extern void ioread8_rep(void __iomem *port, void *buf, unsigned long count); +#endif +#ifndef ioread16_rep +#define ioread16_rep ioread16_rep extern void ioread16_rep(void __iomem *port, void *buf, unsigned long count); +#endif +#ifndef ioread32_rep +#define ioread32_rep ioread32_rep extern void ioread32_rep(void __iomem *port, void *buf, unsigned long count); +#endif +#ifndef iowrite8_rep +#define iowrite8_rep iowrite8_rep extern void iowrite8_rep(void __iomem *port, const void *buf, unsigned long count); +#endif +#ifndef iowrite16_rep +#define iowrite16_rep iowrite16_rep extern void iowrite16_rep(void __iomem *port, const void *buf, unsigned long count); +#endif +#ifndef iowrite32_rep +#define iowrite32_rep iowrite32_rep extern void iowrite32_rep(void __iomem *port, const void *buf, unsigned long count); +#endif #ifdef CONFIG_HAS_IOPORT_MAP /* Create a virtual mapping cookie for an IO port range */ +#ifndef ioport_map +#define ioport_map
[RFC PATCH 1/5] asm-generic: move pci_iounmap from iomap.h to pci_iomap.h
pci_iounmap seems misplaced in iomap.h. Move it to where its cousins live. Since iomap.h includes pci_iomap.h, anybody relying on getting the declaration through iomap.h will still get it.

It would be natural to put the static inline version inside the #elif defined(CONFIG_GENERIC_PCI_IOMAP). Since GENERIC_IOMAP selects GENERIC_PCI_IOMAP, that would be OK for those who select GENERIC_IOMAP. However, I fear there are some that select GENERIC_PCI_IOMAP without GENERIC_IOMAP, and which perhaps have their own pci_iounmap stub definition somewhere. So for now at least, define the static inline version under the same conditions as it was in iomap.h.

Signed-off-by: Rasmus Villemoes
---
 include/asm-generic/iomap.h     | 10 ----------
 include/asm-generic/pci_iomap.h |  7 +++++++
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/include/asm-generic/iomap.h b/include/asm-generic/iomap.h
index a008f504a2d0..5f8321e8fea9 100644
--- a/include/asm-generic/iomap.h
+++ b/include/asm-generic/iomap.h
@@ -101,16 +101,6 @@ extern void ioport_unmap(void __iomem *);
 #define ioremap_wt ioremap_nocache
 #endif
 
-#ifdef CONFIG_PCI
-/* Destroy a virtual mapping cookie for a PCI BAR (memory or IO) */
-struct pci_dev;
-extern void pci_iounmap(struct pci_dev *dev, void __iomem *);
-#elif defined(CONFIG_GENERIC_IOMAP)
-struct pci_dev;
-static inline void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
-{ }
-#endif
-
 #include <asm-generic/pci_iomap.h>
 
 #endif

diff --git a/include/asm-generic/pci_iomap.h b/include/asm-generic/pci_iomap.h
index d4f16dcc2ed7..c2f78db2420e 100644
--- a/include/asm-generic/pci_iomap.h
+++ b/include/asm-generic/pci_iomap.h
@@ -18,6 +18,8 @@ extern void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
 extern void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
                                        unsigned long offset,
                                        unsigned long maxlen);
+/* Destroy a virtual mapping cookie for a PCI BAR (memory or IO) */
+extern void pci_iounmap(struct pci_dev *dev, void __iomem *);
 /* Create a virtual mapping cookie for a port on a given PCI device.
  * Do not call this directly, it exists to make it easier for architectures
  * to override */
@@ -52,4 +54,9 @@ static inline void __iomem *pci_iomap_wc_range(struct pci_dev *dev, int bar,
 }
 #endif
 
+#if !defined(CONFIG_PCI) && defined(CONFIG_GENERIC_IOMAP)
+static inline void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
+{ }
+#endif
+
 #endif /* __ASM_GENERIC_IO_H */
--
2.23.0
[RFC PATCH 3/5] powerpc: move pci_iounmap() from iomap.c to pci-common.c
As preparation for making iowrite32 and friends static inlines, move the definition of pci_iounmap() from iomap.c to pci-common.c. This definition of pci_iounmap() is compiled in when !CONFIG_PPC_INDIRECT_PIO && CONFIG_PCI - we're just interchanging which condition is in the Kbuild logic and which is in the .c file.

Signed-off-by: Rasmus Villemoes
---
 arch/powerpc/kernel/iomap.c      | 13 -------------
 arch/powerpc/kernel/pci-common.c | 13 +++++++++++++
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/iomap.c b/arch/powerpc/kernel/iomap.c
index 5ac84efc6ede..b22fa8db5068 100644
--- a/arch/powerpc/kernel/iomap.c
+++ b/arch/powerpc/kernel/iomap.c
@@ -182,16 +182,3 @@ void ioport_unmap(void __iomem *addr)
 }
 EXPORT_SYMBOL(ioport_map);
 EXPORT_SYMBOL(ioport_unmap);
-
-#ifdef CONFIG_PCI
-void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
-{
-        if (isa_vaddr_is_ioport(addr))
-                return;
-        if (pcibios_vaddr_is_ioport(addr))
-                return;
-        iounmap(addr);
-}
-
-EXPORT_SYMBOL(pci_iounmap);
-#endif /* CONFIG_PCI */

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 1c448cf25506..d89a2426b405 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -295,6 +296,18 @@ int pcibios_vaddr_is_ioport(void __iomem *address)
        return ret;
 }
 
+#ifndef CONFIG_PPC_INDIRECT_PIO
+void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
+{
+        if (isa_vaddr_is_ioport(addr))
+                return;
+        if (pcibios_vaddr_is_ioport(addr))
+                return;
+        iounmap(addr);
+}
+EXPORT_SYMBOL(pci_iounmap);
+#endif /* CONFIG_PPC_INDIRECT_PIO */
+
 unsigned long pci_address_to_pio(phys_addr_t address)
 {
        struct pci_controller *hose;
--
2.23.0
Re: [RFC PATCH 0/9] Fixes and Enablement of ibm,drc-info property
On 10/1/19 1:02 PM, Bjorn Helgaas wrote: > On Tue, Oct 01, 2019 at 01:12:05AM -0500, Tyrel Datwyler wrote: >> There was an initial previous effort yo add support for the PAPR >> architected ibm,drc-info property. This property provides a more >> memory compact representation of a paritions Dynamic Reconfig >> Connectors (DRC). These can otherwise be thought of the currently >> partitioned, or available, but yet to be partitioned, system resources >> such as cpus, memory, and physical/logical IOA devices. >> >> The initial implementation proved buggy and was fully turned of by >> disabling the bit in the appropriate CAS support vector. We now have >> PowerVM firmware in the field that supports this new property, and >> further to suppport partitions with 24TB+ or possible memory this >> property is required to perform platform migration. >> >> This serious fixup the short comings of the previous implementation >> in the areas of general implementation, cpu hotplug, and IOA hotplug. >> >> Tyrel Datwyler (9): >> powerpc/pseries: add cpu DLPAR support for drc-info property >> powerpc/pseries: fix bad drc_index_start value parsing of drc-info >> entry >> powerpc/pseries: fix drc-info mappings of logical cpus to drc-index >> PCI: rpaphp: fix up pointer to first drc-info entry >> PCI: rpaphp: don't rely on firmware feature to imply drc-info support >> PCI: rpaphp: add drc-info support for hotplug slot registration >> PCI: rpaphp: annotate and correctly byte swap DRC properties >> PCI: rpaphp: correctly match ibm,my-drc-index to drc-name when using >> drc-info >> powerpc: Enable support for ibm,drc-info property >> >> arch/powerpc/kernel/prom_init.c | 2 +- >> arch/powerpc/platforms/pseries/hotplug-cpu.c| 117 -- >> arch/powerpc/platforms/pseries/of_helpers.c | 8 +- >> arch/powerpc/platforms/pseries/pseries_energy.c | 23 ++--- >> drivers/pci/hotplug/rpaphp_core.c | 124 >> +--- >> 5 files changed, 191 insertions(+), 83 deletions(-) > > Michael, I assume you'll take care of 
this. If I were applying, I > would capitalize the commit subjects and fix the typos in the commit > logs, e.g., > > s/the this/the/ > s/the the/that the/ > s/short coming/shortcoming/ > s/seperate/separate/ > s/bid endian/big endian/ > s/were appropriate/where appropriate/ > s/name form/name from/ > > etc. git am also complains about space before tab whitespace errors. > And it adds a few lines >80 chars. > I'll fix all those up in the next spin. -Tyrel
Re: [RFC PATCH 2/9] powerpc/pseries: fix bad drc_index_start value parsing of drc-info entry
On 10/10/19 12:04 PM, Nathan Lynch wrote: > Tyrel Datwyler writes: >> The ibm,drc-info property is an array property that contains drc-info >> entries such that each entry is made up of 2 string encoded elements >> followed by 5 int encoded elements. The of_read_drc_info_cell() >> helper contains comments that correctly name the expected elements >> and their encoding. However, the usage of of_prop_next_string() and >> of_prop_next_u32() introduced a subtle skippage of the first u32. >> This is a result of of_prop_next_string() returns a pointer to the >> next property value which is not a string, but actually a (__be32 *). >> As, a result the following call to of_prop_next_u32() passes over the >> current int encoded value and actually stores the next one wrongly. >> >> Simply endian swap the current value in place after reading the first >> two string values. The remaining int encoded values can then be read >> correctly using of_prop_next_u32(). > > Good but I think it would make more sense for a fix for > of_read_drc_info_cell() to precede any patch in the series which > introduces new callers, such as patch #1. > Not sure it matters that much since everything in the series is required to properly enable a working drc-info implementation and the call already exists so it doesn't break bisectability. It ended up second in the series since testing patch 1 exposed the flaw. I'll go ahead and move it first, and add the appropriate fixes tag as well which is currently missing. -Tyrel
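[Editorial note] The entry layout and the "swap the current value in place" fix described above can be modeled in plain userspace C. This is a sketch only: the struct and function names mirror the kernel's of_drc_info / of_read_drc_info_cell() but are re-created here from the commit text, not copied from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohl(): device-tree integers are big-endian */

/* Userspace stand-in for the kernel's struct of_drc_info; field names
 * follow the commit text, not the exact kernel definition. */
struct drc_info {
	const char *drc_type;
	const char *drc_name_prefix;
	uint32_t drc_index_start;
	uint32_t drc_name_suffix_start;
	uint32_t num_sequential_elems;
	uint32_t sequential_inc;
	uint32_t last_drc_index;
};

/* Parse one ibm,drc-info entry: two NUL-terminated strings followed by
 * five big-endian u32s. The subtlety the patch fixes: after the second
 * string, the cursor already points at the first u32, so that value must
 * be byte-swapped and consumed in place, not skipped over. */
static const uint8_t *read_drc_info_cell(const uint8_t *p, struct drc_info *d)
{
	uint32_t be;
	uint32_t *fields[] = {
		&d->drc_index_start, &d->drc_name_suffix_start,
		&d->num_sequential_elems, &d->sequential_inc,
		&d->last_drc_index,
	};
	int i;

	d->drc_type = (const char *)p;
	p += strlen(d->drc_type) + 1;
	d->drc_name_prefix = (const char *)p;
	p += strlen(d->drc_name_prefix) + 1;

	for (i = 0; i < 5; i++) {
		memcpy(&be, p, sizeof(be));   /* may be unaligned */
		*fields[i] = ntohl(be);       /* endian-swap the current value */
		p += sizeof(be);
	}
	return p;
}
```

Skipping the first u32 (what the buggy code effectively did) would shift every remaining field up by one, which is exactly the mis-parse described in the commit message.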
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
On Wed, 2019-10-30 at 09:58 +0100, Christophe Leroy wrote: > > Le 30/10/2019 à 08:31, Russell Currey a écrit : > > v4 cover letter: > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html > > v3 cover letter: > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html > > > > Changes since v4: > > [1/5]: Addressed review comments from Michael Ellerman > > (thanks!) > > [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on > >ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid > >STRICT_MODULE_RWX being *on by default* in cases where > >STRICT_KERNEL_RWX is *unavailable* > > [5/5]: split skiroot_defconfig changes out into its own patch > > > > The whole Kconfig situation is really weird and confusing, I > > believe the > > correct resolution is to change arch/Kconfig but the consequences > > are so > > minor that I don't think it's worth it, especially given that I > > expect > > powerpc to have mandatory strict RWX Soon(tm). > > I'm not such strict RWX can be made mandatory due to the impact it > has > on some subarches: > - On the 8xx, unless all areas are 8Mbytes aligned, there is a > significant overhead on TLB misses. And Aligning everthing to 8M is > a > waste of RAM which is not acceptable on systems having very few RAM. > - On hash book3s32, we are able to map the kernel BATs. With a few > alignment constraints, we are able to provide STRICT_KERNEL_RWX. But > we > are unable to provide exec protection on page granularity. Only on > 256Mbytes segments. So for modules, we have to have the vmspace X. It > is > also not possible to have a kernel area RO. Only user areas can be > made RO. > Yes, sorry, this was thoughtless from me, since in my mind I was just thinking about the platforms I primarily work on (book3s64). 
> Christophe > > > Russell Currey (5): > >powerpc/mm: Implement set_memory() routines > >powerpc/kprobes: Mark newly allocated probes as RO > >powerpc/mm/ptdump: debugfs handler for W+X checks at runtime > >powerpc: Set ARCH_HAS_STRICT_MODULE_RWX > >powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig > > > > arch/powerpc/Kconfig | 2 + > > arch/powerpc/Kconfig.debug | 6 +- > > arch/powerpc/configs/skiroot_defconfig | 1 + > > arch/powerpc/include/asm/set_memory.h | 32 +++ > > arch/powerpc/kernel/kprobes.c | 3 + > > arch/powerpc/mm/Makefile | 1 + > > arch/powerpc/mm/pageattr.c | 77 > > ++ > > arch/powerpc/mm/ptdump/ptdump.c| 21 ++- > > 8 files changed, 140 insertions(+), 3 deletions(-) > > create mode 100644 arch/powerpc/include/asm/set_memory.h > > create mode 100644 arch/powerpc/mm/pageattr.c > >
Re: [PATCH v5 5/5] powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig
On Wed, 30 Oct 2019 at 07:31, Russell Currey wrote:
>
> skiroot_defconfig is the only powerpc defconfig with STRICT_KERNEL_RWX
> enabled, and if you want memory protection for kernel text you'd want it
> for modules too, so enable STRICT_MODULE_RWX there.
>
> Signed-off-by: Russell Currey

Acked-by: Joel Stanley

> ---
>  arch/powerpc/configs/skiroot_defconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/configs/skiroot_defconfig b/arch/powerpc/configs/skiroot_defconfig
> index 1253482a67c0..719d899081b3 100644
> --- a/arch/powerpc/configs/skiroot_defconfig
> +++ b/arch/powerpc/configs/skiroot_defconfig
> @@ -31,6 +31,7 @@ CONFIG_PERF_EVENTS=y
>  CONFIG_SLAB_FREELIST_HARDENED=y
>  CONFIG_JUMP_LABEL=y
>  CONFIG_STRICT_KERNEL_RWX=y
> +CONFIG_STRICT_MODULE_RWX=y
>  CONFIG_MODULES=y
>  CONFIG_MODULE_UNLOAD=y
>  CONFIG_MODULE_SIG=y
> --
> 2.23.0
Re: [RFC PATCH 1/9] powerpc/pseries: add cpu DLPAR support for drc-info property
On 10/10/19 11:56 AM, Nathan Lynch wrote: > Hi Tyrel, > > Tyrel Datwyler writes: >> +static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index) >> +{ >> +const __be32 *indexes; >> +int i; >> + >> +if (of_find_property(parent, "ibm,drc-info", NULL)) >> +return drc_info_valid_index(parent, drc_index); >> + >> +indexes = of_get_property(parent, "ibm,drc-indexes", NULL); >> +if (!indexes) >> +return false; >> + >> +for (i = 0; i < indexes[0]; i++) { > > should this be: > > for (i = 0; i < be32_to_cpu(indexes[0]); i++) { > ? Yes! > > >> +if (be32_to_cpu(indexes[i + 1]) == drc_index) >> +return true; >> +} >> + >> +return false; >> } > > It looks like this rewrites valid_cpu_drc_index()'s existing code for > parsing ibm,drc-indexes but I don't see the need for this. > > This patch would be easier to review if that were dropped or split out. Yeah, I'll split it out. There are multiple places where we iterate over the drc_indexes, and it is implemented several different ways. I basically picked an implementation to use across the board. I think a better way would be just to implement a for_each_drc_index(dn, drc_index) macro to abstract away iterator implementation. > >> >> static ssize_t dlpar_cpu_add(u32 drc_index) >> @@ -720,8 +756,11 @@ static int dlpar_cpu_remove_by_count(u32 cpus_to_remove) >> static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 cpus_to_add) >> { >> struct device_node *parent; >> +struct property *info; >> +const __be32 *indexes; >> int cpus_found = 0; >> -int index, rc; >> +int i, j; >> +u32 drc_index; >> >> parent = of_find_node_by_path("/cpus"); >> if (!parent) { >> @@ -730,24 +769,46 @@ static int find_dlpar_cpus_to_add(u32 *cpu_drcs, u32 >> cpus_to_add) >> return -1; >> } >> >> -/* Search the ibm,drc-indexes array for possible CPU drcs to >> - * add. Note that the format of the ibm,drc-indexes array is >> - * the number of entries in the array followed by the array >> - * of drc values so we start looking at index = 1. 
>> - */ >> -index = 1; >> -while (cpus_found < cpus_to_add) { >> -u32 drc; >> +info = of_find_property(parent, "ibm,drc-info", NULL); >> +if (info) { >> +struct of_drc_info drc; >> +const __be32 *value; >> +int count; >> >> -rc = of_property_read_u32_index(parent, "ibm,drc-indexes", >> -index++, ); >> -if (rc) >> -break; >> +value = of_prop_next_u32(info, NULL, ); >> +if (value) >> +value++; >> >> -if (dlpar_cpu_exists(parent, drc)) >> -continue; >> +for (i = 0; i < count; i++) { >> +of_read_drc_info_cell(, , ); >> +if (strncmp(drc.drc_type, "CPU", 3)) >> +break; >> + >> +for (j = 0; j < drc.num_sequential_elems; j++) { >> +drc_index = drc.drc_index_start + >> (drc.sequential_inc * j); >> + >> +if (dlpar_cpu_exists(parent, drc_index)) >> +continue; >> >> -cpu_drcs[cpus_found++] = drc; >> +cpu_drcs[cpus_found++] = drc_index; > > I am failing to see how this loop is limited by the cpus_to_add > parameter as it was before this change. It looks like this will overflow > the cpu_drcs array when cpus_to_add is less than the number of cpus > found. You are right. The code is picking every non-present drc_index which will overflow the supplied buffer as you stated when there are more available indexes than requested cpus. Will fix to bound the search. > > As an aside I don't understand how the add_by_count()/dlpar_cpu_exists() > algorithm could be correct as it currently stands. It seems to pick the > first X indexes for which a corresponding cpu node is absent, but that > set of indexes does not necessarily match the set that is available to > configure. Something to address separately I suppose. I'm not sure I follow? > >> +} >> +} >> +} else { >> +indexes = of_get_property(parent, "ibm,drc-indexes", NULL); >> + >> +/* Search the ibm,drc-indexes array for possible CPU drcs to >> +* add. Note that the format of the ibm,drc-indexes array is >> +* the number of entries in the array followed by the array >> +* of drc values so we start looking at index = 1. 
>> +*/ >> +for (i = 1; i < indexes[0]; i++) { >> +drc_index = be32_to_cpu(indexes[i]); >> + >> +if (dlpar_cpu_exists(parent, drc_index)) >> +continue; >> + >> +
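[Editorial note] The `be32_to_cpu(indexes[0])` fix Nathan points out above can be demonstrated with a small userspace model. The helper name below is invented for illustration; `ntohl()` stands in for the kernel's `be32_to_cpu()` on a little-endian host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>   /* ntohl()/htonl() stand in for be32_to_cpu()/cpu_to_be32() */

/* ibm,drc-indexes layout: a big-endian entry count followed by that many
 * big-endian drc-index values. The count itself must be byte-swapped
 * before it is used as the loop bound -- the bug spotted in review was
 * using the raw big-endian word directly. */
static bool drc_index_present(const uint32_t *indexes, uint32_t drc_index)
{
	uint32_t count = ntohl(indexes[0]);   /* was: i < indexes[0] */
	uint32_t i;

	for (i = 0; i < count; i++)
		if (ntohl(indexes[i + 1]) == drc_index)
			return true;
	return false;
}
```

On a little-endian host, a raw count of `htonl(3)` is 0x03000000, so the unswapped loop bound would walk ~50 million words past the property; swapping first keeps the iteration within the three real entries.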
Re: [PATCH] powerpc/tools: Don't quote $objdump in scripts
On Wed, Oct 30, 2019 at 10:55:03PM +1100, Michael Ellerman wrote: > Segher Boessenkool writes: > > On Thu, Oct 24, 2019 at 11:47:30AM +1100, Michael Ellerman wrote: > >> Some of our scripts are passed $objdump and then call it as > >> "$objdump". This doesn't work if it contains spaces because we're > >> using ccache, for example you get errors such as: > >> > >> ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: > >> No such file or directory > >> ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache > >> ppc64le-objdump: No such file or directory > >> > >> Fix it by not quoting the string when we expand it, allowing the shell > >> to do the right thing for us. > > > > This breaks things for people with spaces in their paths. > > Spaces in their what? Who does that? :) I know, right? > Also we don't support it: > > $ pwd > $ /home/michael/foo bar > $ make clean > Makefile:147: *** source directory cannot contain spaces or colons. Stop. Of course. But it's shell scripting 101 that you *do* quote all variable expansions. Do you want to set a bad example? ;-) Segher
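[Editorial note] The word-splitting behavior being debated can be demonstrated outside the kernel tree. A sketch, using "env echo" as a stand-in for "ccache ppc64le-objdump"; the function names are invented for illustration:

```shell
#!/bin/bash
# Quoting "$objdump" keeps the value as ONE word, so the shell looks for a
# program literally named "env echo" -- the same failure mode as
# "ccache ppc64le-objdump: No such file or directory" in relocs_check.sh.
objdump="env echo"

run_quoted() {
        # One word: tries to exec a program named "env echo" -> fails.
        "$objdump" version 2>/dev/null || echo "quoted: not found"
}

run_unquoted() {
        # Michael's fix: let the shell word-split, running: env echo version
        $objdump version
}

run_array() {
        # Alternative that keeps Segher's "quote every expansion" rule:
        # split once into an array, then quote the expansion.
        local -a od=($objdump)
        "${od[@]}" version
}
```

The array form is the usual compromise when a variable must hold a command plus arguments: splitting happens exactly once, at assignment, and every later expansion stays quoted.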
Re: [PATCH 14/19] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
On 10/30/19 3:49 PM, John Hubbard wrote: > This also fixes one or two likely bugs. Well, actually just one... > > 1. Change vfio from get_user_pages(FOLL_LONGTERM), to > pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN. > > Note that this is a change in behavior, because the > get_user_pages_remote() call was not setting FOLL_LONGTERM, but the > new pin_user_pages_remote() call that replaces it, *is* setting > FOLL_LONGTERM. It is important to set FOLL_LONGTERM, because the > DMA case requires it. Please see the FOLL_PIN documentation in > include/linux/mm.h, and Documentation/pin_user_pages.rst for details. Correction: the above comment is stale and wrong. I wrote it before getting further into the details, and the patch doesn't do this. Instead, it keeps exactly the old behavior: pin_longterm_pages_remote() is careful to avoid setting FOLL_LONGTERM. Instead of setting that flag, it drops in a "TODO" comment nearby. :) I'll update the commit description in the next version of the series. thanks, John Hubbard NVIDIA > > 2. Because all FOLL_PIN-acquired pages must be released via > put_user_page(), also convert the put_page() call over to > put_user_pages(). > > Note that this effectively changes the code's behavior in > vfio_iommu_type1.c: put_pfn(): it now ultimately calls > set_page_dirty_lock(), instead of set_page_dirty(). This is > probably more accurate. > > As Christoph Hellwig put it, "set_page_dirty() is only safe if we are > dealing with a file backed page where we have reference on the inode it > hangs off." 
[1] > > [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de > > Cc: Alex Williamson > Signed-off-by: John Hubbard > --- > drivers/vfio/vfio_iommu_type1.c | 15 +++ > 1 file changed, 7 insertions(+), 8 deletions(-) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index d864277ea16f..795e13f3ef08 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot) > { > if (!is_invalid_reserved_pfn(pfn)) { > struct page *page = pfn_to_page(pfn); > - if (prot & IOMMU_WRITE) > - SetPageDirty(page); > - put_page(page); > + > + put_user_pages_dirty_lock(, 1, prot & IOMMU_WRITE); > return 1; > } > return 0; > @@ -349,11 +348,11 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned > long vaddr, > > down_read(>mmap_sem); > if (mm == current->mm) { > - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page, > - vmas); > + ret = pin_longterm_pages(vaddr, 1, flags, page, vmas); > } else { > - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page, > - vmas, NULL); > + ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1, > + flags, page, vmas, > + NULL); > /* >* The lifetime of a vaddr_get_pfn() page pin is >* userspace-controlled. In the fs-dax case this could > @@ -363,7 +362,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned > long vaddr, >*/ > if (ret > 0 && vma_is_fsdax(vmas[0])) { > ret = -EOPNOTSUPP; > - put_page(page[0]); > + put_user_page(page[0]); > } > } > up_read(>mmap_sem); >
[PATCH 17/19] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage
It's good to have basic unit test coverage of the new FOLL_PIN behavior. Fortunately, the gup_benchmark unit test is extremely fast (a few milliseconds), so adding it to the run_vmtests suite is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script, because with test suites, it's usually preferable to put the shorter, faster tests first, all other things being equal.

Signed-off-by: John Hubbard
---
 tools/testing/selftests/vm/run_vmtests | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 951c507a27f7..93e8dc9a7cad 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -104,6 +104,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage. Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
 
+echo "--------------------------------------------"
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo "--------------------------------------------"
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+        echo "[FAIL]"
+        exitcode=1
+else
+        echo "[PASS]"
+fi
+
+echo "------------------------------------------"
+echo "running gup_benchmark -c (pin_user_pages)"
+echo "------------------------------------------"
+./gup_benchmark -c
+if [ $? -ne 0 ]; then
+        echo "[FAIL]"
+        exitcode=1
+else
+        echo "[PASS]"
+fi
+
 echo "---------------------------"
 echo "running userfaultfd"
 echo "---------------------------"
--
2.23.0
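[Editorial note] The banner/run/PASS-FAIL stanza repeated throughout run_vmtests could, hypothetically, be factored into a helper along these lines. This is an editorial sketch, not part of the patch; the `run_test` name is invented:

```shell
#!/bin/bash
exitcode=0

# run_test "description" command [args...]
# Prints a banner, runs the command, and reports PASS/FAIL, recording
# any failure in $exitcode for the script's final exit status.
run_test() {
        local banner="$1"; shift
        echo "--------------------------------------------"
        echo "running $banner"
        echo "--------------------------------------------"
        if "$@"; then
                echo "[PASS]"
        else
                echo "[FAIL]"
                exitcode=1
        fi
}

# Usage matching the two invocations added by the patch:
# run_test "'gup_benchmark -U' (normal/slow gup)" ./gup_benchmark -U
# run_test "gup_benchmark -c (pin_user_pages)"    ./gup_benchmark -c
```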
[PATCH 19/19] Documentation/vm: add pin_user_pages.rst
Document the new pin_user_pages() and related calls and behavior.

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in this documentation. (I've reworded it and expanded on it slightly.)

Cc: Jonathan Corbet
Signed-off-by: John Hubbard
---
 Documentation/vm/index.rst          |   1 +
 Documentation/vm/pin_user_pages.rst | 213 ++++++++++++++++++++++++++++
 2 files changed, 214 insertions(+)
 create mode 100644 Documentation/vm/pin_user_pages.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index e8d943b21cf9..7194efa3554a 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -44,6 +44,7 @@ descriptions of data structures and algorithms.
    page_migration
    page_frags
    page_owner
+   pin_user_pages
    remap_file_pages
    slub
    split_page_table_lock

diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
new file mode 100644
index 000000000000..7110bca3f188
--- /dev/null
+++ b/Documentation/vm/pin_user_pages.rst
@@ -0,0 +1,213 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+pin_user_pages() and related calls
+====================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions: ::
+
+ pin_user_pages
+ pin_user_pages_fast
+ pin_user_pages_remote
+
+ pin_longterm_pages
+ pin_longterm_pages_fast
+ pin_longterm_pages_remote
+
+Basic description of FOLL_PIN
+=============================
+
+A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
+significant interactions and interdependencies with FOLL_LONGTERM, so both are
+covered here.
+
+Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither
+FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
+the associated wrapper functions (pin_user_pages and others) to set the correct
+combination of these flags, and to check for problems as well.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
+whatever flags the caller provides::
+
+     Function                    gup flags (FOLL_PIN or FOLL_LONGTERM only)
+     --------                    ------------------------------------------
+     pin_user_pages              FOLL_PIN
+     pin_user_pages_fast         FOLL_PIN
+     pin_user_pages_remote       FOLL_PIN
+
+     pin_longterm_pages          FOLL_PIN | FOLL_LONGTERM
+     pin_longterm_pages_fast     FOLL_PIN | FOLL_LONGTERM
+     pin_longterm_pages_remote   FOLL_PIN | FOLL_LONGTERM
+
+Tracking dma-pinned pages
+=========================
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+  are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+  used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+  the upper bits in that field for a dma-pinned count. "Sort of", means that,
+  rather than dividing page->_refcount into bit fields, we simply add a medium-
+  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
+  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
+  on it 1024 times, then it will appear to have a single dma-pinned count.
+  And again, that's acceptable.
+
+This also leads to limitations: there are only 32-10==22 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 22 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+  10 bits at a time)
+* There are 22 - 18 == 4 bits available to count. Except that there aren't,
+  because you need to allow for a few normal get_page() calls on the head page,
+  as well. Fortunately, the approach of
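[Editorial note] The bias arithmetic in that TODO can be sanity-checked with a few lines of standalone C. The constants come from the patch text; the macro and function names below are illustrative, not taken from kernel headers.

```c
#include <assert.h>

/* Illustrative model of the FOLL_PIN refcount-bias arithmetic. */
#define GUP_PIN_COUNTING_BIAS   1024u   /* 2^10, added per pin */
#define REFCOUNT_BITS           32u     /* page->_refcount is 32 bits */
#define BIAS_BITS               10u     /* log2(GUP_PIN_COUNTING_BIAS) */

/* Bits of pin-count headroom left after reserving 'tail_page_bits' worth
 * of per-tail-page increments (18 bits for a 1GB page of 4KB tail pages). */
static unsigned int pin_headroom_bits(unsigned int tail_page_bits)
{
        unsigned int counter_bits = REFCOUNT_BITS - BIAS_BITS;  /* 22 */

        return counter_bits - tail_page_bits;
}
```

Plugging in the 1GB-huge-page numbers from the text reproduces the "22 - 18 == 4 bits" headroom figure, which is why the TODO calls it "cutting it close".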
[PATCH 15/19] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
1. Convert from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages().

2. As required by pin_user_pages(), release these pages via put_user_page(). In this case, do so via put_user_pages_dirty_lock().

   That has the side effect of calling set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate.

   As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1]

3. Release each page in mem->hpages[] (instead of mem->hpas[]), because that is the array that pin_longterm_pages() filled in. This is more accurate and should be a little safer from a maintenance point of view.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Signed-off-by: John Hubbard
---
 arch/powerpc/mm/book3s64/iommu_api.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..69d79cb50d47 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,9 +103,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
        for (entry = 0; entry < entries; entry += chunk) {
                unsigned long n = min(entries - entry, chunk);
 
-               ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
-                               FOLL_WRITE | FOLL_LONGTERM,
-                               mem->hpages + entry, NULL);
+               ret = pin_longterm_pages(ua + (entry << PAGE_SHIFT), n,
+                                        FOLL_WRITE, mem->hpages + entry, NULL);
                if (ret == n) {
                        pinned += n;
                        continue;
@@ -167,9 +166,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
        return 0;
 
 free_exit:
-       /* free the reference taken */
-       for (i = 0; i < pinned; i++)
-               put_page(mem->hpages[i]);
+       /* free the references taken */
+       put_user_pages(mem->hpages, pinned);
 
        vfree(mem->hpas);
        kfree(mem);
@@ -212,10 +210,9 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
                if (!page)
                        continue;
 
-               if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
-                       SetPageDirty(page);
+               put_user_pages_dirty_lock(&mem->hpages[i], 1,
+                                         MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);
 
-               put_page(page);
                mem->hpas[i] = 0;
        }
 }
--
2.23.0
[PATCH 16/19] mm/gup_benchmark: support pin_user_pages() and related calls
Up until now, gup_benchmark supported testing of the following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages(): via the '-c' command line option
* pin_longterm_pages(): via the '-b' command line option
* pin_user_pages_fast(): via the '-a' command line option

Also, add an option for clarity: '-u' for what is now (still) the default choice: get_user_pages_fast().

Also, for the three commands that set FOLL_PIN, verify that the pages really are dma-pinned, via the new is_dma_pinned() routine. Those commands are:

    PIN_FAST_BENCHMARK     : calls pin_user_pages_fast()
    PIN_LONGTERM_BENCHMARK : calls pin_longterm_pages()
    PIN_BENCHMARK          : calls pin_user_pages()

In between the calls to pin_*() and put_user_pages(), check each page: if page_dma_pinned() returns false, then WARN and return. Do this outside of the benchmark timestamps, so that it doesn't affect reported times.
Signed-off-by: John Hubbard
---
 mm/gup_benchmark.c                         | 74 ++++++++++++++++++++--
 tools/testing/selftests/vm/gup_benchmark.c | 23 ++++++-
 2 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..2bb0f5df4803 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,9 @@
 #define GUP_FAST_BENCHMARK     _IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK          _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK     _IOWR('g', 4, struct gup_benchmark)
+#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark)
+#define PIN_BENCHMARK          _IOWR('g', 6, struct gup_benchmark)
 
 struct gup_benchmark {
        __u64 get_delta_usec;
@@ -19,6 +22,44 @@ struct gup_benchmark {
        __u64 expansion[10];    /* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages)
+{
+       int i;
+
+       switch (cmd) {
+       case GUP_FAST_BENCHMARK:
+       case GUP_LONGTERM_BENCHMARK:
+       case GUP_BENCHMARK:
+               for (i = 0; i < nr_pages; i++)
+                       put_page(pages[i]);
+               break;
+
+       case PIN_FAST_BENCHMARK:
+       case PIN_LONGTERM_BENCHMARK:
+       case PIN_BENCHMARK:
+               put_user_pages(pages, nr_pages);
+               break;
+       }
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+                             unsigned long nr_pages)
+{
+       int i;
+
+       switch (cmd) {
+       case PIN_FAST_BENCHMARK:
+       case PIN_LONGTERM_BENCHMARK:
+       case PIN_BENCHMARK:
+               for (i = 0; i < nr_pages; i++) {
+                       if (WARN(!page_dma_pinned(pages[i]),
+                                "pages[%d] is NOT dma-pinned\n", i))
+                               break;
+               }
+               break;
+       }
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
                                 struct gup_benchmark *gup)
 {
@@ -62,6 +103,19 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
                        nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
                                            NULL);
                        break;
+               case PIN_FAST_BENCHMARK:
+                       nr = pin_user_pages_fast(addr, nr, gup->flags & 1,
+                                                pages + i);
+                       break;
+               case PIN_LONGTERM_BENCHMARK:
+                       nr = pin_longterm_pages(addr, nr,
+                                               (gup->flags & 1),
+                                               pages + i, NULL);
+                       break;
+               case PIN_BENCHMARK:
+                       nr = pin_user_pages(addr, nr, gup->flags & 1, pages + i,
+                                           NULL);
+                       break;
                default:
                        return -1;
                }
@@ -72,15 +126,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
        }
        end_time = ktime_get();
 
+       /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+       nr_pages = i;
+
        gup->get_delta_usec = ktime_us_delta(end_time, start_time);
        gup->size = addr - gup->addr;
 
+       /*
+        * Take an un-benchmark-timed moment to verify DMA pinned
+        * state: print a warning if any non-dma-pinned pages are found:
+        */
+       verify_dma_pinned(cmd, pages, nr_pages);
+
        start_time = ktime_get();
-       for (i = 0; i < nr_pages; i++) {
-               if (!pages[i])
-                       break;
-               put_page(pages[i]);
-       }
+
+       put_back_pages(cmd, pages, nr_pages);
+
[PATCH 18/19] mm/gup: remove support for gup(FOLL_LONGTERM)
Now that all other kernel callers of get_user_pages(FOLL_LONGTERM) have been converted to pin_longterm_pages(), lock it down: 1) Add an assertion to get_user_pages(), preventing callers from passing FOLL_LONGTERM (in addition to the existing assertion that prevents FOLL_PIN). 2) Remove the associated GUP_LONGTERM_BENCHMARK test. Signed-off-by: John Hubbard --- mm/gup.c | 8 mm/gup_benchmark.c | 9 + tools/testing/selftests/vm/gup_benchmark.c | 7 ++- 3 files changed, 7 insertions(+), 17 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index e51b3820a995..9a28935a2cb1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1744,11 +1744,11 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, struct vm_area_struct **vmas) { /* -* As detailed above, FOLL_PIN must only be set internally by the -* pin_user_page*() and pin_longterm_*() APIs, never directly by the -* caller, so enforce that with an assertion: +* As detailed above, FOLL_PIN and FOLL_LONGTERM must only be set +* internally by the pin_user_page*() and pin_longterm_*() APIs, never +* directly by the caller, so enforce that with an assertion: */ - if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) + if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM))) return -EINVAL; return __gup_longterm_locked(current, current->mm, start, nr_pages, diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c index 2bb0f5df4803..de6941855b7e 100644 --- a/mm/gup_benchmark.c +++ b/mm/gup_benchmark.c @@ -6,7 +6,7 @@ #include #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) +/* Command 2 has been deleted. 
*/ #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) #define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark) #define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark) @@ -28,7 +28,6 @@ static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages) switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: for (i = 0; i < nr_pages; i++) put_page(pages[i]); @@ -94,11 +93,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd, nr = get_user_pages_fast(addr, nr, gup->flags & 1, pages + i); break; - case GUP_LONGTERM_BENCHMARK: - nr = get_user_pages(addr, nr, - (gup->flags & 1) | FOLL_LONGTERM, - pages + i, NULL); - break; case GUP_BENCHMARK: nr = get_user_pages(addr, nr, gup->flags & 1, pages + i, NULL); @@ -157,7 +151,6 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd, switch (cmd) { case GUP_FAST_BENCHMARK: - case GUP_LONGTERM_BENCHMARK: case GUP_BENCHMARK: case PIN_FAST_BENCHMARK: case PIN_LONGTERM_BENCHMARK: diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c index c5c934c0f402..5ef3cf8f3da5 100644 --- a/tools/testing/selftests/vm/gup_benchmark.c +++ b/tools/testing/selftests/vm/gup_benchmark.c @@ -15,7 +15,7 @@ #define PAGE_SIZE sysconf(_SC_PAGESIZE) #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) -#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark) +/* Command 2 has been deleted. */ #define GUP_BENCHMARK _IOWR('g', 3, struct gup_benchmark) /* @@ -46,7 +46,7 @@ int main(int argc, char **argv) char *file = "/dev/zero"; char *p; - while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:f:abctTUuwSH")) != -1) { switch (opt) { case 'a': cmd = PIN_FAST_BENCHMARK; @@ -72,9 +72,6 @@ int main(int argc, char **argv) case 'T': thp = 0; break; - case 'L': - cmd = GUP_LONGTERM_BENCHMARK; - break; case 'U': cmd = GUP_BENCHMARK; break; -- 2.23.0
[PATCH 14/19] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
This also fixes one or two likely bugs.

1. Change vfio from get_user_pages(FOLL_LONGTERM), to pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

Note that this is a change in behavior, because the get_user_pages_remote() call was not setting FOLL_LONGTERM, but the new pin_longterm_pages_remote() call that replaces it, *is* setting FOLL_LONGTERM. It is important to set FOLL_LONGTERM, because the DMA case requires it. Please see the FOLL_PIN documentation in include/linux/mm.h, and Documentation/pin_user_pages.rst for details.

2. Because all FOLL_PIN-acquired pages must be released via put_user_page(), also convert the put_page() call over to put_user_pages().

Note that this effectively changes the code's behavior in vfio_iommu_type1.c: put_pfn(): it now ultimately calls set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off."
[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Cc: Alex Williamson Signed-off-by: John Hubbard --- drivers/vfio/vfio_iommu_type1.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index d864277ea16f..795e13f3ef08 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot) { if (!is_invalid_reserved_pfn(pfn)) { struct page *page = pfn_to_page(pfn); - if (prot & IOMMU_WRITE) - SetPageDirty(page); - put_page(page); + + put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE); return 1; } return 0; @@ -349,11 +348,11 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, down_read(&mm->mmap_sem); if (mm == current->mm) { - ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page, -vmas); + ret = pin_longterm_pages(vaddr, 1, flags, page, vmas); } else { - ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page, - vmas, NULL); + ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1, + flags, page, vmas, + NULL); /* * The lifetime of a vaddr_get_pfn() page pin is * userspace-controlled. In the fs-dax case this could @@ -363,7 +362,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, */ if (ret > 0 && vma_is_fsdax(vmas[0])) { ret = -EOPNOTSUPP; - put_page(page[0]); + put_user_page(page[0]); } } up_read(&mm->mmap_sem); -- 2.23.0
[PATCH 13/19] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
1. Change v4l2 from get_user_pages(FOLL_LONGTERM), to pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN. 2. Because all FOLL_PIN-acquired pages must be released via put_user_page(), also convert the put_page() call over to put_user_pages_dirty_lock(). Cc: Mauro Carvalho Chehab Signed-off-by: John Hubbard --- drivers/media/v4l2-core/videobuf-dma-sg.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 28262190c3ab..9b9c5b37bf59 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma, dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n", data, size, dma->nr_pages); - err = get_user_pages(data & PAGE_MASK, dma->nr_pages, -flags | FOLL_LONGTERM, dma->pages, NULL); + err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages, +flags, dma->pages, NULL); if (err != dma->nr_pages) { dma->nr_pages = (err >= 0) ? err : 0; - dprintk(1, "get_user_pages: err=%d [%d]\n", err, + dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err, dma->nr_pages); return err < 0 ? err : -EINVAL; } @@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma) BUG_ON(dma->sglen); if (dma->pages) { - for (i = 0; i < dma->nr_pages; i++) { - if (dma->direction == DMA_FROM_DEVICE) - set_page_dirty_lock(dma->pages[i]); - put_page(dma->pages[i]); - } + put_user_pages_dirty_lock(dma->pages, dma->nr_pages, + dma->direction == DMA_FROM_DEVICE); kfree(dma->pages); dma->pages = NULL; } -- 2.23.0
[PATCH 12/19] mm/gup: track FOLL_PIN pages
Add tracking of pages that were pinned via FOLL_PIN. As mentioned in the FOLL_PIN documentation, callers who effectively set FOLL_PIN are required to ultimately free such pages via put_user_page(). The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use". Pages that have been pinned via FOLL_PIN are identifiable via a new function call: bool page_dma_pinned(struct page *page); What to do in response to encountering such a page, is left to later patchsets. There is discussion about this in [1]. This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). This also has a couple of trivial, non-functional change fixes to try_get_compound_head(). That function got moved to the top of the file. This includes the following fix from Ira Weiny: DAX requires detection of a page crossing to a ref count of 1. Fix this for GUP pages by introducing put_devmap_managed_user_page() which accounts for GUP_PIN_COUNTING_BIAS now used by GUP. [1] https://lwn.net/Articles/784574/ "Some slow progress on get_user_pages()" Suggested-by: Jan Kara Suggested-by: Jérôme Glisse Signed-off-by: Ira Weiny Signed-off-by: John Hubbard --- include/linux/mm.h | 80 +++ include/linux/mmzone.h | 2 + include/linux/page_ref.h | 10 ++ mm/gup.c | 213 +++ mm/huge_memory.c | 32 +- mm/hugetlb.c | 28 - mm/memremap.c| 4 +- mm/vmstat.c | 2 + 8 files changed, 300 insertions(+), 71 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 62c838a3e6c7..882fda919c81 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page) #endif #ifdef CONFIG_DEV_PAGEMAP_OPS -void __put_devmap_managed_page(struct page *page); +void __put_devmap_managed_page(struct page *page, int count); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -static inline bool put_devmap_managed_page(struct page *page) + +static inline bool page_is_devmap_managed(struct page *page) { if 
(!static_branch_unlikely(&devmap_managed_key)) return false; @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page) switch (page->pgmap->type) { case MEMORY_DEVICE_PRIVATE: case MEMORY_DEVICE_FS_DAX: - __put_devmap_managed_page(page); return true; default: break; @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page) return false; } +static inline bool put_devmap_managed_page(struct page *page) +{ + bool is_devmap = page_is_devmap_managed(page); + + if (is_devmap) { + int count = page_ref_dec_return(page); + + __put_devmap_managed_page(page, count); + } + + return is_devmap; +} + #else /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool put_devmap_managed_page(struct page *page) { @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page) return true; } +__must_check bool user_page_ref_inc(struct page *page); + static inline void put_page(struct page *page) { page = compound_head(page); @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page) __put_page(page); } -/** - * put_user_page() - release a gup-pinned page - * @page: pointer to page to be released +/* + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload + * the page's refcount so that two separate items are tracked: the original page + * reference count, and also a new count of how many get_user_pages() calls were + * made against the page. ("gup-pinned" is another term for the latter). + * + * With this scheme, get_user_pages() becomes special: such pages are marked + * as distinct from normal pages. As such, the new put_user_page() call (and + * its variants) must be used in order to release gup-pinned pages. + * + * Choice of value: * - * Pages that were pinned via get_user_pages*() must be released via - * either put_user_page(), or one of the put_user_pages*() routines - * below.
This is so that eventually, pages that are pinned via - * get_user_pages*() can be separately tracked and uniquely handled. In - * particular, interactions with RDMA and filesystems need special - * handling. + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference + * counts with respect to get_user_pages() and put_user_page() becomes simpler, + * due to the fact that adding an even power of two to the page refcount has + * the effect of using only the upper N bits, for the code that counts up using + * the bias value. This means that the lower bits are left for the exclusive + * use of the original code that increments and decrements by one (or at least, + * by much smaller values than the bias value). * - *
[PATCH 11/19] net/xdp: set FOLL_PIN via pin_user_pages()
Convert net/xdp to use the new pin_longterm_pages() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Signed-off-by: John Hubbard --- net/xdp/xdp_umem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 16d5f353163a..4d56dfb1139a 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -285,8 +285,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem) return -ENOMEM; down_read(&current->mm->mmap_sem); - npgs = get_user_pages(umem->address, umem->npgs, - gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL); + npgs = pin_longterm_pages(umem->address, umem->npgs, gup_flags, + &umem->pgs[0], NULL); up_read(&current->mm->mmap_sem); if (npgs != umem->npgs) { -- 2.23.0
[PATCH 09/19] drm/via: set FOLL_PIN via pin_user_pages_fast()
Convert drm/via to use the new pin_user_pages_fast() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages, and therefore for any code that calls put_user_page(). Signed-off-by: John Hubbard --- drivers/gpu/drm/via/via_dmablit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c index 3db000aacd26..37c5e572993a 100644 --- a/drivers/gpu/drm/via/via_dmablit.c +++ b/drivers/gpu/drm/via/via_dmablit.c @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer) vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages)); if (NULL == vsg->pages) return -ENOMEM; - ret = get_user_pages_fast((unsigned long)xfer->mem_addr, + ret = pin_user_pages_fast((unsigned long)xfer->mem_addr, vsg->num_pages, vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0, vsg->pages); -- 2.23.0
[PATCH 10/19] fs/io_uring: set FOLL_PIN via pin_user_pages()
Convert fs/io_uring to use the new pin_user_pages() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages, and therefore for any code that calls put_user_page(). Signed-off-by: John Hubbard --- fs/io_uring.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index a30c4f622cb3..d3924b1760eb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3431,9 +3431,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, ret = 0; down_read(&current->mm->mmap_sem); - pret = get_user_pages(ubuf, nr_pages, - FOLL_WRITE | FOLL_LONGTERM, - pages, vmas); + pret = pin_longterm_pages(ubuf, nr_pages, FOLL_WRITE, pages, + vmas); if (pret == nr_pages) { /* don't support file backed memory */ for (j = 0; j < nr_pages; j++) { -- 2.23.0
[PATCH 08/19] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
Convert process_vm_access to use the new pin_user_pages_remote() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Also, release the pages via put_user_page*(). Also, rename "pages" to "pinned_pages", as this makes for easier reading of process_vm_rw_single_vec(). Signed-off-by: John Hubbard --- mm/process_vm_access.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 357aa7bef6c0..fd20ab675b85 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages, if (copy > len) copy = len; - if (vm_write) { + if (vm_write) copied = copy_page_from_iter(page, offset, copy, iter); - set_page_dirty_lock(page); - } else { + else copied = copy_page_to_iter(page, offset, copy, iter); - } + len -= copied; if (copied < copy && iov_iter_count(iter)) return -EFAULT; @@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr, flags |= FOLL_WRITE; while (!rc && nr_pages && iov_iter_count(iter)) { - int pages = min(nr_pages, max_pages_per_loop); + int pinned_pages = min(nr_pages, max_pages_per_loop); int locked = 1; size_t bytes; @@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr, * current/current->mm */ down_read(&mm->mmap_sem); - pages = get_user_pages_remote(task, mm, pa, pages, flags, - process_pages, NULL, &locked); + pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages, +flags, process_pages, +NULL, &locked); if (locked) up_read(&mm->mmap_sem); - if (pages <= 0) + if (pinned_pages <= 0) return -EFAULT; - bytes = pages * PAGE_SIZE - start_offset; + bytes = pinned_pages * PAGE_SIZE - start_offset; if (bytes > len) bytes = len; @@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr, vm_write); len -= bytes; start_offset = 0; - nr_pages -= pages; - pa += pages * PAGE_SIZE; - while (pages) - put_page(process_pages[--pages]); +
nr_pages -= pinned_pages; + pa += pinned_pages * PAGE_SIZE; + + /* If vm_write is set, the pages need to be made dirty: */ + put_user_pages_dirty_lock(process_pages, pinned_pages, + vm_write); } return rc; -- 2.23.0
[PATCH 05/19] mm/gup: introduce pin_user_pages*() and FOLL_PIN
Introduce pin_user_pages*() variations of get_user_pages*() calls, and also pin_longterm_pages*() variations. These variants all set FOLL_PIN, which is also introduced, and basically documented. (An upcoming patch provides more extensive documentation.) The second set (pin_longterm*) also sets FOLL_LONGTERM: pin_user_pages() pin_user_pages_remote() pin_user_pages_fast() pin_longterm_pages() pin_longterm_pages_remote() pin_longterm_pages_fast() All pages that are pinned via the above calls, must be unpinned via put_user_page(). The underlying rules are: * These are gup-internal flags, so the call sites should not directly set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with assertions, for the new FOLL_PIN flag. However, for the pre-existing FOLL_LONGTERM flag, which has some call sites that still directly set FOLL_LONGTERM, there is no assertion yet. * Call sites that want to indicate that they are going to do DirectIO ("DIO") or something with similar characteristics, should call a get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers will: * Start with "pin_user_pages" instead of "get_user_pages". That makes it easy to find and audit the call sites. * Set FOLL_PIN * For pages that are received via FOLL_PIN, those pages must be returned via put_user_page(). 
Signed-off-by: John Hubbard --- include/linux/mm.h | 53 - mm/gup.c | 284 + 2 files changed, 311 insertions(+), 26 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index cc292273e6ba..62c838a3e6c7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1526,9 +1526,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked); +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked); +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked); long get_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas); +long pin_user_pages(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); +long pin_longterm_pages(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); long get_user_pages_locked(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, int *locked); long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, @@ -1536,6 +1550,10 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, int get_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); +int pin_user_pages_fast(unsigned long start, int nr_pages, + unsigned int gup_flags, struct page **pages); +int pin_longterm_pages_fast(unsigned long start, int nr_pages, + unsigned int gup_flags, struct page **pages); 
int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc); int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc, @@ -2594,13 +2612,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_ANON 0x8000 /* don't do file mappings */ #define FOLL_LONGTERM 0x1 /* mapping lifetime is indefinite: see below */ #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */ +#define FOLL_PIN 0x4 /* pages must be released via put_user_page() */ /* - * NOTE on FOLL_LONGTERM: + * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each + * other. Here is what they mean, and how to use them: * * FOLL_LONGTERM indicates that the page will be held for an indefinite time - * period _often_ under userspace control. This is contrasted with - * iov_iter_get_pages() where usages which are transient. + * period _often_ under userspace control. This is in contrast to + * iov_iter_get_pages(), where usages which are transient. * * FIXME: For
[PATCH 07/19] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
Convert infiniband to use the new wrapper calls, and stop explicitly setting FOLL_LONGTERM at the call sites. The new pin_longterm_*() calls replace get_user_pages*() calls, and set both FOLL_LONGTERM and a new FOLL_PIN flag. The FOLL_PIN flag requires that the caller must return the pages via put_user_page*() calls, but infiniband was already doing that as part of an earlier commit. Signed-off-by: John Hubbard --- drivers/infiniband/core/umem.c | 5 ++--- drivers/infiniband/core/umem_odp.c | 10 +- drivers/infiniband/hw/hfi1/user_pages.c | 4 ++-- drivers/infiniband/hw/mthca/mthca_memfree.c | 3 +-- drivers/infiniband/hw/qib/qib_user_pages.c | 8 drivers/infiniband/hw/qib/qib_user_sdma.c | 2 +- drivers/infiniband/hw/usnic/usnic_uiom.c| 9 - drivers/infiniband/sw/siw/siw_mem.c | 5 ++--- 8 files changed, 21 insertions(+), 25 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 24244a2f68cc..c5a78d3e674b 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr, while (npages) { down_read(&mm->mmap_sem); - ret = get_user_pages(cur_base, + ret = pin_longterm_pages(cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), -gup_flags | FOLL_LONGTERM, -page_list, NULL); +gup_flags, page_list, NULL); if (ret < 0) { up_read(&mm->mmap_sem); goto umem_release; diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 163ff7ba92b7..a38b67b83db5 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -534,7 +534,7 @@ static int ib_umem_odp_map_dma_single_page( } else if (umem_odp->page_list[page_index] == page) { umem_odp->dma_list[page_index] |= access_mask; } else { - pr_err("error: got different pages in IB device and from get_user_pages.
IB device page: %p, gup page: %p\n", + pr_err("error: got different pages in IB device and from pin_longterm_pages. IB device page: %p, gup page: %p\n", umem_odp->page_list[page_index], page); /* Better remove the mapping now, to prevent any further * damage. */ @@ -639,11 +639,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt, /* * Note: this might result in redundent page getting. We can * avoid this by checking dma_list to be 0 before calling -* get_user_pages. However, this make the code much more -* complex (and doesn't gain us much performance in most use -* cases). +* pin_longterm_pages. However, this makes the code much +* more complex (and doesn't gain us much performance in most +* use cases). */ - npages = get_user_pages_remote(owning_process, owning_mm, + npages = pin_longterm_pages_remote(owning_process, owning_mm, user_virt, gup_num_pages, flags, local_page_list, NULL, NULL); up_read(&owning_mm->mmap_sem); diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c index 469acb961fbd..9b55b0a73e29 100644 --- a/drivers/infiniband/hw/hfi1/user_pages.c +++ b/drivers/infiniband/hw/hfi1/user_pages.c @@ -104,9 +104,9 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np bool writable, struct page **pages) { int ret; - unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0); + unsigned int gup_flags = (writable ?
FOLL_WRITE : 0); - ret = get_user_pages_fast(vaddr, npages, gup_flags, pages); + ret = pin_longterm_pages_fast(vaddr, npages, gup_flags, pages); if (ret < 0) return ret; diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index edccfd6e178f..beec7e4b8a96 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar, goto out; } - ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, - FOLL_WRITE | FOLL_LONGTERM, pages); + ret = pin_longterm_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages); if (ret < 0) goto
[PATCH 06/19] goldish_pipe: convert to pin_user_pages() and put_user_page()
1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages(). 2. As required by pin_user_pages(), release these pages via put_user_page(). In this case, do so via put_user_pages_dirty_lock(). That has the side effect of calling set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] Another side effect is that the release code is simplified because the page[] loop is now in gup.c instead of here, so just delete the local release_user_pages() entirely, and call put_user_pages_dirty_lock() directly, instead. [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de Signed-off-by: John Hubbard --- drivers/platform/goldfish/goldfish_pipe.c | 17 +++-- 1 file changed, 3 insertions(+), 14 deletions(-) diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index 7ed2a21a0bac..635a8bc1b480 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -274,7 +274,7 @@ static int pin_goldfish_pages(unsigned long first_page, *iter_last_page_size = last_page_size; } - ret = get_user_pages_fast(first_page, requested_pages, + ret = pin_user_pages_fast(first_page, requested_pages, !is_write ? 
FOLL_WRITE : 0, pages); if (ret <= 0) @@ -285,18 +285,6 @@ static int pin_goldfish_pages(unsigned long first_page, return ret; } -static void release_user_pages(struct page **pages, int pages_count, - int is_write, s32 consumed_size) -{ - int i; - - for (i = 0; i < pages_count; i++) { - if (!is_write && consumed_size > 0) - set_page_dirty(pages[i]); - put_page(pages[i]); - } -} - /* Populate the call parameters, merging adjacent pages together */ static void populate_rw_params(struct page **pages, int pages_count, @@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, *consumed_size = pipe->command_buffer->rw_params.consumed_size; - release_user_pages(pipe->pages, pages_count, is_write, *consumed_size); + put_user_pages_dirty_lock(pipe->pages, pages_count, + !is_write && *consumed_size > 0); mutex_unlock(&pipe->lock); return 0; -- 2.23.0
[PATCH 04/19] media/v4l2-core: set pages dirty upon releasing DMA buffers
After DMA is complete, and the device and CPU caches are synchronized, it's still required to mark the CPU pages as dirty, if the data was coming from the device. However, this driver was just issuing a bare put_page() call, without any set_page_dirty*() call. Fix the problem, by calling set_page_dirty_lock() if the CPU pages were potentially receiving data from the device. Cc: Mauro Carvalho Chehab Signed-off-by: John Hubbard --- drivers/media/v4l2-core/videobuf-dma-sg.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 66a6c6c236a7..28262190c3ab 100644 --- a/drivers/media/v4l2-core/videobuf-dma-sg.c +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c @@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma) BUG_ON(dma->sglen); if (dma->pages) { - for (i = 0; i < dma->nr_pages; i++) + for (i = 0; i < dma->nr_pages; i++) { + if (dma->direction == DMA_FROM_DEVICE) + set_page_dirty_lock(dma->pages[i]); put_page(dma->pages[i]); + } kfree(dma->pages); dma->pages = NULL; } -- 2.23.0
[PATCH 03/19] goldish_pipe: rename local pin_user_pages() routine
1. Avoid naming conflicts: rename local static function from "pin_user_pages()" to "pin_goldfish_pages()". An upcoming patch will introduce a global pin_user_pages() function. Signed-off-by: John Hubbard --- drivers/platform/goldfish/goldfish_pipe.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c index cef0133aa47a..7ed2a21a0bac 100644 --- a/drivers/platform/goldfish/goldfish_pipe.c +++ b/drivers/platform/goldfish/goldfish_pipe.c @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status) } } -static int pin_user_pages(unsigned long first_page, - unsigned long last_page, - unsigned int last_page_size, - int is_write, - struct page *pages[MAX_BUFFERS_PER_COMMAND], - unsigned int *iter_last_page_size) +static int pin_goldfish_pages(unsigned long first_page, + unsigned long last_page, + unsigned int last_page_size, + int is_write, + struct page *pages[MAX_BUFFERS_PER_COMMAND], + unsigned int *iter_last_page_size) { int ret; int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1; @@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe, if (mutex_lock_interruptible(&pipe->lock)) return -ERESTARTSYS; - pages_count = pin_user_pages(first_page, last_page, -last_page_size, is_write, -pipe->pages, &iter_last_page_size); + pages_count = pin_goldfish_pages(first_page, last_page, +last_page_size, is_write, +pipe->pages, &iter_last_page_size); if (pages_count < 0) { mutex_unlock(&pipe->lock); return pages_count; -- 2.23.0
[PATCH 02/19] mm/gup: factor out duplicate code from four routines
There are four locations in gup.c that have a fair amount of code duplication. This means that changing one requires making the same changes in four places, not to mention reading the same code four times, and wondering if there are subtle differences. Factor out the common code into static functions, thus reducing the overall line count and the code's complexity. Also, take the opportunity to slightly improve the efficiency of the error cases, by doing a mass subtraction of the refcount, surrounded by get_page()/put_page(). Also, further simplify (slightly), by waiting until the successful end of each routine, to increment *nr.

Cc: Christoph Hellwig
Cc: Aneesh Kumar K.V
Signed-off-by: John Hubbard
---
 mm/gup.c | 113 ++-
 1 file changed, 46 insertions(+), 67 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 85caf76b3012..8fb0d9cdfaf5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1969,6 +1969,35 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
+static int __record_subpages(struct page *page, unsigned long addr,
+			     unsigned long end, struct page **pages, int nr)
+{
+	int nr_recorded_pages = 0;
+
+	do {
+		pages[nr] = page;
+		nr++;
+		page++;
+		nr_recorded_pages++;
+	} while (addr += PAGE_SIZE, addr != end);
+	return nr_recorded_pages;
+}
+
+static void __remove_refs_from_head(struct page *page, int refs)
+{
+	/* Do a get_page() first, in case refs == page->_refcount */
+	get_page(page);
+	page_ref_sub(page, refs);
+	put_page(page);
+}
+
+static int __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
+{
+	*nr += nr_recorded_pages;
+	SetPageReferenced(head);
+	return 1;
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
@@ -1998,34 +2027,19 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	/* hugepages are never "special" */
 	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-	refs = 0;
 	head = pte_page(pte);
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = __record_subpages(page, addr, end, pages, *nr);
 
 	head = try_get_compound_head(head, refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		__remove_refs_from_head(head, refs);
 		return 0;
 	}
-
-	SetPageReferenced(head);
-	return 1;
+	return __huge_pt_done(head, refs, nr);
 }
 
 static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
@@ -2071,30 +2085,18 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 					     pages, nr);
 	}
 
-	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = __record_subpages(page, addr, end, pages, *nr);
 
 	head = try_get_compound_head(pmd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		__remove_refs_from_head(head, refs);
 		return 0;
 	}
-
-	SetPageReferenced(head);
-	return 1;
+	return __huge_pt_done(head, refs, nr);
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
@@ -2114,30 +2116,18 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 					     pages, nr);
 	}
 
-	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = __record_subpages(page, addr, end, pages, *nr);
 
 	head = try_get_compound_head(pud_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pud_val(orig) !=
[PATCH 00/19] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM
Hi,

This applies cleanly to linux-next and mmotm, and also to linux.git if linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB case") is first applied there.

This provides tracking of dma-pinned pages. This is a prerequisite to solving the larger problem of proper interactions between file-backed pages and [R]DMA activities, as discussed in [1], [2], [3], and in a remarkable number of email threads since about 2017. :)

A new internal gup flag, FOLL_PIN, is introduced, and thoroughly documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the layout lease work that Ira Weiny has been working on. That's because these new wrapper functions provide a clean, constrained, systematically named set of functionality that, again, is required in order to even know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt in by changing from this:

    get_user_pages()    (sets FOLL_GET)
    put_page()

to this:

    pin_user_pages()    (sets FOLL_PIN)
    put_user_page()

Because there are interdependencies with FOLL_LONGTERM, a similar conversion as for FOLL_PIN was applied. The change was from this:

    get_user_pages(FOLL_LONGTERM)    (also sets FOLL_GET)
    put_page()

to this:

    pin_longterm_pages()    (sets FOLL_PIN | FOLL_LONGTERM)
    put_user_page()

Patch summary:

* Patches 1-4: refactoring and preparatory cleanup, independent fixes
  (Patch 4: V4L2-core bug fix (can be separately applied))
* Patch 5: introduce pin_user_pages(), FOLL_PIN, but no functional changes yet
* Patches 6-11: Convert existing put_user_page() callers to use the new pin*()
* Patch 12: Activate tracking of FOLL_PIN pages.
* Patches 13-15: convert FOLL_LONGTERM callers
* Patches 16-17: gup_benchmark and run_vmtests support
* Patch 18: enforce FOLL_LONGTERM as a gup-internal (only) flag
* Patch 19: Documentation/vm/pin_user_pages.rst

Testing:

* I've done some overall kernel testing (LTP, and a few other goodies), and some directed testing to exercise some of the changes. And as you can see, gup_benchmark is enhanced to exercise this. Basically, I've been able to runtime test the core get_user_pages() and pin_user_pages() and related routines, but not so much on several of the call sites--but those are generally just a couple of lines changed, each.

Not much of the kernel is actually using this, which on one hand reduces risk quite a lot. But on the other hand, testing coverage is low. So I'd love it if, in particular, the Infiniband and PowerPC folks could do a smoke test of this series for me.

Also, my runtime testing for the call sites so far is very weak:

* io_uring: Some directed tests from liburing exercise this, and they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and passes.
* infiniband (still only have crude "IB pingpong" working, on a good day: it's not exercising my conversions at runtime...)
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...

Next:

* Get the block/bio_vec sites converted to use pin_user_pages().
* Work with Ira and Dave Chinner to weave this together with the layout lease stuff.
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

John Hubbard (19):
  mm/gup: pass flags arg to __gup_device_* functions
  mm/gup: factor out duplicate code from four routines
  goldfish_pipe: rename local pin_user_pages() routine
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldfish_pipe: convert to pin_user_pages() and put_user_page()
  infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  mm/gup: track FOLL_PIN pages
  media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
  vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page()
[PATCH 01/19] mm/gup: pass flags arg to __gup_device_* functions
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions. Also placate checkpatch.pl by shortening a nearby line.

Cc: Kirill A. Shutemov
Signed-off-by: John Hubbard
---
 mm/gup.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..85caf76b3012 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+			     unsigned long end, unsigned int flags,
+			     struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
@@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+					     pages, nr);
 	}
 
 	refs = 0;
@@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+			unsigned long end, unsigned int flags,
+			struct page **pages, int *nr)
 {
 	struct page *head, *page;
 	int refs;
@@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+					     pages, nr);
 	}
 
 	refs = 0;
-- 
2.23.0
Re: [PATCH v3 3/8] powerpc: Fix vDSO clock_getres()
Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag, fixing commit: a7f290dad32ee [PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel.

The bot has tested the following trees: v5.3.8, v4.19.81, v4.14.151, v4.9.198, v4.4.198.

v5.3.8: Build OK!
v4.19.81: Build OK!

v4.14.151: Failed to apply! Possible dependencies:
    5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE")
    b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038")

v4.9.198: Failed to apply! Possible dependencies:
    4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c")
    5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE")
    5d451a87e5ebb ("powerpc/64: Retrieve number of L1 cache sets from device-tree")
    7c5b06cadf274 ("KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9")
    83677f551e0a6 ("KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9")
    902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per task")
    b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038")
    bd067f83b0840 ("powerpc/64: Fix naming of cache block vs. cache line")
    e2827fe5c1566 ("powerpc/64: Clean up ppc64_caches using a struct per cache")
    e9cf1e085647b ("KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs")
    f4c51f841d2ac ("KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests")

v4.4.198: Failed to apply! Possible dependencies:
    153086644fd1f ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI")
    3eb5d5888dc68 ("powerpc: Add ppc_strict_facility_enable boot option")
    4546561551106 ("powerpc/asm: Use OFFSET macro in asm-offsets.c")
    579e633e764e6 ("powerpc: create flush_all_to_thread()")
    5c929885f1bb4 ("powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE")
    70fe3d980f5f1 ("powerpc: Restore FPU/VEC/VSX if previously used")
    85baa095497f3 ("powerpc/livepatch: Add live patching support on ppc64le")
    902e06eb86cd6 ("powerpc/32: Change the stack protector canary value per task")
    b5b4453e7912f ("powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038")
    bf76f73c5f655 ("powerpc: enable UBSAN support")
    c208505900b23 ("powerpc: create giveup_all()")
    d1e1cf2e38def ("powerpc: clean up asm/switch_to.h")
    dc4fbba11e466 ("powerpc: Create disable_kernel_{fp,altivec,vsx,spe}()")
    f17c4e01e906c ("powerpc/module: Mark module stubs with a magic value")

NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks,
Sasha
Re: [PATCH RFC 1/5] dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
On Mon, Oct 14, 2019 at 08:31:03PM +0200, Nicolas Saenz Julienne wrote:
> Some architectures, notably ARM, are interested in tweaking this
> depending on their runtime DMA addressing limitations.
>
> Signed-off-by: Nicolas Saenz Julienne

Do you want me to pick this up for the 5.5 dma-mapping tree, or do you want me to wait for the rest to settle?
[PATCH 03/12] powerpc: Replace cpu_up/down with device_online/offline
The core device API performs extra housekeeping bits that are missing from directly calling cpu_up/down. See commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") for an example description of what might go wrong. This also prepares to make cpu_up/down a private interface for anything but the cpu subsystem.

Signed-off-by: Qais Yousef
CC: Benjamin Herrenschmidt
CC: Paul Mackerras
CC: Michael Ellerman
CC: Enrico Weigelt
CC: Ram Pai
CC: Nicholas Piggin
CC: Thiago Jung Bauermann
CC: Christophe Leroy
CC: Thomas Gleixner
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-ker...@vger.kernel.org
---
 arch/powerpc/kernel/machine_kexec_64.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c
index 04a7cba58eff..ebf8cc7acc4d 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -208,13 +208,15 @@ static void wake_offline_cpus(void)
 {
 	int cpu = 0;
 
+	lock_device_hotplug();
 	for_each_present_cpu(cpu) {
 		if (!cpu_online(cpu)) {
 			printk(KERN_INFO "kexec: Waking offline cpu %d.\n", cpu);
-			WARN_ON(cpu_up(cpu));
+			WARN_ON(device_online(get_cpu_device(cpu)));
 		}
 	}
+	unlock_device_hotplug();
 }
 
 static void kexec_prepare_cpus(void)
-- 
2.17.1
[PATCH 00/12] Convert cpu_up/down to device_online/offline
Using cpu_up/down directly to bring cpus online/offline loses synchronization with sysfs and could suffer from a race similar to what is described in commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM").

cpu_up/down seem to be more of an internal implementation detail for the cpu subsystem to use to boot up cpus, perform suspend/resume and low level hotplug operations. Users outside of the cpu subsystem would be better off using the device core API to bring a cpu online/offline, which is the interface used to hotplug memory and other system devices. Several users have already migrated to use the device core API; this series converts the remaining users and hides cpu_up/down from internal users at the end.

I still need to update the documentation to remove references to cpu_up/down and advocate for device_online/offline instead, if this series makes its way through.

I noticed this problem while working on a hack to disable offlining a particular CPU, but noticed that setting the offline_disabled attribute in the device struct isn't enough because users can easily bypass the device core. While my hack isn't a valid use case, it did highlight the inconsistency in the way cpus are being onlined/offlined, and this attempt hopefully improves on that.

The first 6 patches fix arch users. The next 5 patches fix generic code users, particularly creating a new special-purpose exported API for the device core to use instead of cpu_up/down. Maybe we can do something more restrictive than that. The last patch removes cpu_up/down from cpu.h and unexports the functions.

In some cases where the use of cpu_up/down seemed legitimate, I encapsulated the logic in a higher level, special-purpose function, and converted the code to use that instead.

I did run the rcu torture, lock torture and psci checker tests and no problem was noticed. I did perform build tests on all affected arches except for parisc.

Hopefully I got the CC list right for all the patches.
Apologies in advance if some people who should have been CCed were omitted from some patches.

CC: Armijn Hemel CC: Benjamin Herrenschmidt CC: Bjorn Helgaas CC: Borislav Petkov CC: Boris Ostrovsky CC: Catalin Marinas CC: Christophe Leroy CC: Daniel Lezcano CC: Davidlohr Bueso CC: "David S. Miller" CC: Eiichi Tsukata CC: Enrico Weigelt CC: Fenghua Yu CC: Greg Kroah-Hartman CC: Helge Deller CC: "H. Peter Anvin" CC: Ingo Molnar CC: "James E.J. Bottomley" CC: James Morse CC: Jiri Kosina CC: Josh Poimboeuf CC: Josh Triplett CC: Juergen Gross CC: Lorenzo Pieralisi CC: Mark Rutland CC: Michael Ellerman CC: Nadav Amit CC: Nicholas Piggin CC: "Paul E. McKenney" CC: Paul Mackerras CC: Pavankumar Kondeti CC: "Peter Zijlstra (Intel)" CC: "Rafael J. Wysocki" CC: Ram Pai CC: Richard Fontana CC: Sakari Ailus CC: Stefano Stabellini CC: Steve Capper CC: Thiago Jung Bauermann CC: Thomas Gleixner CC: Tony Luck CC: Will Deacon CC: Zhenzhong Duan CC: linux-arm-ker...@lists.infradead.org CC: linux-i...@vger.kernel.org CC: linux-ker...@vger.kernel.org CC: linux-par...@vger.kernel.org CC: linuxppc-dev@lists.ozlabs.org CC: sparcli...@vger.kernel.org CC: x...@kernel.org CC: xen-de...@lists.xenproject.org

Qais Yousef (12):
  arm64: hibernate.c: create a new function to handle cpu_up(sleep_cpu)
  x86: Replace cpu_up/down with device_online/offline
  powerpc: Replace cpu_up/down with device_online/offline
  ia64: Replace cpu_down with freeze_secondary_cpus
  sparc: Replace cpu_up/down with device_online/offline
  parisc: Replace cpu_up/down with device_online/offline
  driver: base: cpu: export device_online/offline
  driver: xen: Replace cpu_up/down with device_online/offline
  firmware: psci: Replace cpu_up/down with device_online/offline
  torture: Replace cpu_up/down with device_online/offline
  smp: Create a new function to bringup nonboot cpus online
  cpu: Hide cpu_up/down

 arch/arm64/kernel/hibernate.c          | 13 +++
 arch/ia64/kernel/process.c             |  8 +---
 arch/parisc/kernel/processor.c         |  4 +-
 arch/powerpc/kernel/machine_kexec_64.c |  4 +-
 arch/sparc/kernel/ds.c                 |  8 +++-
 arch/x86/kernel/topology.c             |  4 +-
 arch/x86/mm/mmio-mod.c                 |  8 +++-
 arch/x86/xen/smp.c                     |  4 +-
 drivers/base/core.c                    |  4 ++
 drivers/base/cpu.c                     |  4 +-
 drivers/firmware/psci/psci_checker.c   |  6 ++-
 drivers/xen/cpu_hotplug.c              |  2 +-
 include/linux/cpu.h                    |  6 ++-
 kernel/cpu.c                           | 53 --
 kernel/smp.c                           |  9 +
 kernel/torture.c                       | 15 ++--
 16 files changed, 106 insertions(+), 46 deletions(-)

-- 
2.17.1
Section mismatch warnings on powerpc
Still see those,

WARNING: vmlinux.o(.text+0x2d04): Section mismatch in reference from the variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references the function __init prom_init().
This is often because __boot_from_prom lacks a __init annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x2ec8): Section mismatch in reference from the variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references the function __init start_kernel().
This is often because start_here_common lacks a __init annotation or the annotation of start_kernel is wrong.

There is a patch around, http://patchwork.ozlabs.org/patch/895442/

Does it still wait for Michael to come up with some better names?
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
On 30/10/2019 at 19:30, Kees Cook wrote:
> On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote:
>> On 30/10/2019 at 08:31, Russell Currey wrote:
>>> v4 cover letter:
>>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html
>>> v3 cover letter:
>>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html
>>>
>>> Changes since v4:
>>> [1/5]: Addressed review comments from Michael Ellerman (thanks!)
>>> [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on
>>>        ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid
>>>        STRICT_MODULE_RWX being *on by default* in cases where
>>>        STRICT_KERNEL_RWX is *unavailable*
>>> [5/5]: split skiroot_defconfig changes out into its own patch
>>>
>>> The whole Kconfig situation is really weird and confusing, I believe the
>>> correct resolution is to change arch/Kconfig but the consequences are so
>>> minor that I don't think it's worth it, especially given that I expect
>>> powerpc to have mandatory strict RWX Soon(tm).
>>
>> I'm not sure strict RWX can be made mandatory due to the impact it has on
>> some subarches:
>> - On the 8xx, unless all areas are 8Mbytes aligned, there is a significant
>>   overhead on TLB misses. And aligning everything to 8M is a waste of RAM
>>   which is not acceptable on systems having very little RAM.
>> - On hash book3s32, we are able to map the kernel BATs. With a few
>>   alignment constraints, we are able to provide STRICT_KERNEL_RWX. But we
>>   are unable to provide exec protection on page granularity. Only on
>>   256Mbytes segments. So for modules, we have to have the vmspace X. It is
>>   also not possible to have a kernel area RO. Only user areas can be made
>>   RO.
>
> As I understand it, the idea was for it to be mandatory (or at least
> default-on) only for the subarches where it wasn't totally insane to
> accomplish. :) (I'm not familiar with all the details on the subarchs, but
> it sounded like the more modern systems would be the targets for this?)

Yes I guess so.
I was just worried by "I expect powerpc to have mandatory strict RWX".

By the way, we have an open issue #223, namely 'Make strict kernel RWX the default on 64-bit', so no worry, as 32-bit is aside for now.

Christophe
Re: [PATCH v4 0/4] Implement STRICT_MODULE_RWX for powerpc
On Wed, Oct 30, 2019 at 11:16:22AM +1100, Michael Ellerman wrote:
> Kees Cook writes:
>> On Mon, Oct 14, 2019 at 04:13:16PM +1100, Russell Currey wrote:
>>> v3 cover letter here:
>>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html
>>>
>>> Only minimal changes since then:
>>>
>>> - patch 2/4 commit message update thanks to Andrew Donnellan
>>> - patch 3/4 made neater thanks to Christophe Leroy
>>> - patch 3/4 updated Kconfig description thanks to Daniel Axtens
>>
>> I continue to be excited about this work. :) Is there anything holding
>> it back from landing in linux-next?
>
> I had some concerns, which I stupidly posted in reply to v3:
>
> https://lore.kernel.org/linuxppc-dev/87pnio5fva@mpe.ellerman.id.au/

Ah-ha! Thanks; I missed that. :)

-- 
Kees Cook
Re: [PATCH v2] powerpc/imc: Dont create debugfs files for cpu-less nodes
On Tue, 2019-07-23 at 16:57 +0530, Anju T Sudhakar wrote:
> Hi Qian,
>
> On 7/16/19 12:11 AM, Qian Cai wrote:
>> On Thu, 2019-07-11 at 14:53 +1000, Michael Ellerman wrote:
>>> Hi Maddy,
>>>
>>> Madhavan Srinivasan writes:
>>>> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c
>>>> b/arch/powerpc/platforms/powernv/opal-imc.c
>>>> index 186109bdd41b..e04b20625cb9 100644
>>>> --- a/arch/powerpc/platforms/powernv/opal-imc.c
>>>> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
>>>> @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct device_node *node,
>>>>  	if (of_property_read_u32(node, "cb_offset", &cb_offset))
>>>>  		cb_offset = IMC_CNTL_BLK_OFFSET;
>>>>
>>>> -	for_each_node(nid) {
>>>> -		loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset;
>>>> +	while (ptr->vbase != NULL) {
>>>
>>> This means you'll bail out as soon as you find a node with no vbase, but
>>> it's possible we could have a CPU-less node intermingled with other
>>> nodes.
>>>
>>> So I think you want to keep the for loop, but continue if you see a NULL
>>> vbase?
>>
>> Not sure if this will also take care of some of those messages during the
>> boot on today's linux-next even without this patch.
>>
>> [   18.077780][    T1] debugfs: Directory 'imc' with parent 'powerpc' already present!
>
> This is introduced by a recent commit: c33d442328f55 ("debugfs: make
> error message a bit more verbose").
>
> So basically, the debugfs imc_* file is created per node, and is created
> by the first nest unit which is being registered. For the subsequent nest
> units, debugfs_create_dir() will just return since the imc_* file already
> exists.
>
> The commit c33d442328f55 ("debugfs: make error message a bit more
> verbose") prints a message if the debugfs file already exists in
> debugfs_create_dir(). That is why we are encountering these messages now.
>
> This patch (i.e, powerpc/imc: Dont create debugfs files for cpu-less
> nodes) will address the initial issue, i.e "numa crash while reading
> imc_* debugfs files for cpu less nodes", and will not address these
> debugfs messages.
>
> But yeah this is a good catch. We can have some checks to avoid these
> debugfs messages.

Anju, do you still plan to fix those "Directory 'imc' with parent 'powerpc' already present!" warnings as they are still there in the latest linux-next?

> Hi Michael,
>
> Do we need to have a separate patch to address these debugfs messages,
> or can we address the same in the next version of this patch itself?
>
> Thanks,
> Anju
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 01:32:01PM -0500, Reza Arbab wrote:
> Aha, it's this again. Didn't catch your meaning at first. Point taken.

It's not _me_. It's that you (plural) keep ignoring how Linux development works.
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 07:13:59PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote:
>> On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote:
>>> How do you even use this code? Nothing in the kernel even calls
>>> dma_set_mask for NPU devices, as we only support vfio pass through.
>>
>> You use it by calling dma_set_mask() for the *GPU* device. The purpose of
>> pnv_npu_try_dma_set_bypass() is to then propagate the same bypass
>> configuration to all the NPU devices associated with that GPU.
>
> Which in-kernel driver, which PCI ID?

Aha, it's this again. Didn't catch your meaning at first. Point taken.

-- 
Reza Arbab
Re: [PATCH v5 0/5] Implement STRICT_MODULE_RWX for powerpc
On Wed, Oct 30, 2019 at 09:58:19AM +0100, Christophe Leroy wrote:
> On 30/10/2019 at 08:31, Russell Currey wrote:
>> v4 cover letter:
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html
>> v3 cover letter:
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html
>>
>> Changes since v4:
>> [1/5]: Addressed review comments from Michael Ellerman (thanks!)
>> [4/5]: make ARCH_HAS_STRICT_MODULE_RWX depend on
>>        ARCH_HAS_STRICT_KERNEL_RWX to simplify things and avoid
>>        STRICT_MODULE_RWX being *on by default* in cases where
>>        STRICT_KERNEL_RWX is *unavailable*
>> [5/5]: split skiroot_defconfig changes out into its own patch
>>
>> The whole Kconfig situation is really weird and confusing, I believe the
>> correct resolution is to change arch/Kconfig but the consequences are so
>> minor that I don't think it's worth it, especially given that I expect
>> powerpc to have mandatory strict RWX Soon(tm).
>
> I'm not sure strict RWX can be made mandatory due to the impact it has on
> some subarches:
> - On the 8xx, unless all areas are 8Mbytes aligned, there is a significant
>   overhead on TLB misses. And aligning everything to 8M is a waste of RAM
>   which is not acceptable on systems having very little RAM.
> - On hash book3s32, we are able to map the kernel BATs. With a few
>   alignment constraints, we are able to provide STRICT_KERNEL_RWX. But we
>   are unable to provide exec protection on page granularity. Only on
>   256Mbytes segments. So for modules, we have to have the vmspace X. It is
>   also not possible to have a kernel area RO. Only user areas can be made
>   RO.

As I understand it, the idea was for it to be mandatory (or at least default-on) only for the subarches where it wasn't totally insane to accomplish. :) (I'm not familiar with all the details on the subarchs, but it sounded like the more modern systems would be the targets for this?)

-- 
Kees Cook
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 01:08:51PM -0500, Reza Arbab wrote:
> On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote:
>> How do you even use this code? Nothing in the kernel even calls
>> dma_set_mask for NPU devices, as we only support vfio pass through.
>
> You use it by calling dma_set_mask() for the *GPU* device. The purpose of
> pnv_npu_try_dma_set_bypass() is to then propagate the same bypass
> configuration to all the NPU devices associated with that GPU.

Which in-kernel driver, which PCI ID?
Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
On Wed, Oct 30, 2019 at 06:55:18PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote:
>> Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by
>> separating the part of the function that determines if bypass is
>> supported from the part that actually attempts to configure it.
>>
>> Move the latter to a controller-specific dma_set_mask() callback.
>
> Nak, the dma_set_mask overrides are going away. But as said in the reply
> to the cover letter I don't even see how you could end up calling this
> code.

Okay. As mentioned in the cover letter, these last few patches could be omitted if that's the case.

-- 
Reza Arbab
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 06:53:41PM +0100, Christoph Hellwig wrote: How do you even use this code? Nothing in the kernel even calls dma_set_mask for NPU devices, as we only support vfio pass through. You use it by calling dma_set_mask() for the *GPU* device. The purpose of pnv_npu_try_dma_set_bypass() is to then propagate the same bypass configuration to all the NPU devices associated with that GPU. -- Reza Arbab
Re: [PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
On Wed, Oct 30, 2019 at 12:00:00PM -0500, Reza Arbab wrote: > Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by > separating the part of the function that determines if bypass is > supported from the part that actually attempts to configure it. > > Move the latter to a controller-specific dma_set_mask() callback. Nak, the dma_set_mask overrides are going away. But as said in the reply to the cover letter I don't even see how you could end up calling this code.
Re: [PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
On Wed, Oct 30, 2019 at 11:59:49AM -0500, Reza Arbab wrote: > With recent kernels, TCE tables for NPU devices are no longer being > configured. That task was performed by pnv_npu_try_dma_set_bypass(), a > function that got swept away in recent overhauling of dma code. > > Patches 1-4 here bring the lost function back and reintegrate it with > the updated generic iommu bypass infrastructure. > > Patch 5 fixes a regression in behavior when a requested dma mask can not > be fulfilled. > > Patches 6-8 are cleanup. I put these later in the set because they > aren't bisectable until after the restored code is wired back in. > > Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems > wrong for a boolean *_supported() function to have side effects. They > reintroduce a pci controller based dma_set_mask() hook. If that's > undesirable, these last three patches can be dropped. How do you even use this code? Nothing in the kernel even calls dma_set_mask for NPU devices, as we only support vfio pass through.
[PATCH 03/11] powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument
To enable simpler calling code, change this function to find the value of bypass instead of taking it as an argument. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 12 +--- arch/powerpc/platforms/powernv/pci.h | 2 +- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 5a8313654033..a6b8c7ad36e4 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -258,13 +258,21 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) return rc; } -void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) +void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { + struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); int i; struct pnv_phb *phb; struct pci_dn *pdn; struct pnv_ioda_pe *npe; struct pci_dev *npdev; + bool bypass; + + if (!gpe) + return; + + /* We only do bypass if it's enabled on the linked device */ + bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); for (i = 0; ; ++i) { npdev = pnv_pci_get_npu_dev(gpdev, i); @@ -277,8 +285,6 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) return; phb = pci_bus_to_host(npdev->bus)->private_data; - - /* We only do bypass if it's enabled on the linked device */ npe = &phb->ioda.pe_array[pdn->pe_number]; if (bypass) { diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index 41f7dec3aee5..21db0f4cfb11 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -211,7 +211,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, pe_level_printk(pe, KERN_INFO, fmt, ##__VA_ARGS__) /* Nvlink functions */ -extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass); +extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask); extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool 
rm); extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe); extern struct iommu_table_group *pnv_try_setup_npu_table_group( -- 1.8.3.1
[PATCH 00/11] powerpc/powernv: Restore pnv_npu_try_dma_set_bypass()
With recent kernels, TCE tables for NPU devices are no longer being configured. That task was performed by pnv_npu_try_dma_set_bypass(), a function that got swept away in recent overhauling of dma code. Patches 1-4 here bring the lost function back and reintegrate it with the updated generic iommu bypass infrastructure. Patch 5 fixes a regression in behavior when a requested dma mask can not be fulfilled. Patches 6-8 are cleanup. I put these later in the set because they aren't bisectable until after the restored code is wired back in. Patches 9-11 refactor pnv_pci_ioda_iommu_bypass_supported(). It seems wrong for a boolean *_supported() function to have side effects. They reintroduce a pci controller based dma_set_mask() hook. If that's undesirable, these last three patches can be dropped. Reza Arbab (11): Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function" powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported() powerpc/powernv/npu: Change pnv_npu_try_dma_set_bypass() argument powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass() powerpc/powernv: Return failure for some uses of dma_set_mask() powerpc/powernv: Remove intermediate variable powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop powerpc/powernv: Replace open coded pnv_ioda_get_pe()s Revert "powerpc/pci: remove the dma_set_mask pci_controller ops methods" powerpc/powernv: Add pnv_phb3_iommu_bypass_supported() powerpc/powernv: Add pnv_pci_ioda_dma_set_mask() arch/powerpc/include/asm/pci-bridge.h | 2 + arch/powerpc/kernel/dma-iommu.c | 19 -- arch/powerpc/kernel/dma-mask.c| 9 +++ arch/powerpc/platforms/powernv/Kconfig| 1 + arch/powerpc/platforms/powernv/npu-dma.c | 106 +++--- arch/powerpc/platforms/powernv/pci-ioda.c | 71 arch/powerpc/platforms/powernv/pci.h | 10 ++- 7 files changed, 177 insertions(+), 41 deletions(-) -- 1.8.3.1
[PATCH 09/11] Revert "powerpc/pci: remove the dma_set_mask pci_controller ops methods"
Bring back the pci controller based hook in dma_set_mask(), as it will have a user again. This reverts commit 662acad4067a ("powerpc/pci: remove the dma_set_mask pci_controller ops methods"). The callback signature has been adjusted with void return to fit its caller. Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/include/asm/pci-bridge.h | 2 ++ arch/powerpc/kernel/dma-mask.c| 9 + 2 files changed, 11 insertions(+) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index ea6ec65970ef..8512dcd053c5 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -43,6 +43,8 @@ struct pci_controller_ops { void(*teardown_msi_irqs)(struct pci_dev *pdev); #endif + void(*dma_set_mask)(struct pci_dev *pdev, u64 dma_mask); + void(*shutdown)(struct pci_controller *hose); }; diff --git a/arch/powerpc/kernel/dma-mask.c b/arch/powerpc/kernel/dma-mask.c index ffbbbc432612..35b5fd1b03a6 100644 --- a/arch/powerpc/kernel/dma-mask.c +++ b/arch/powerpc/kernel/dma-mask.c @@ -2,11 +2,20 @@ #include #include +#include #include void arch_dma_set_mask(struct device *dev, u64 dma_mask) { if (ppc_md.dma_set_mask) ppc_md.dma_set_mask(dev, dma_mask); + + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + struct pci_controller *phb = pci_bus_to_host(pdev->bus); + + if (phb->controller_ops.dma_set_mask) + phb->controller_ops.dma_set_mask(pdev, dma_mask); + } } EXPORT_SYMBOL(arch_dma_set_mask); -- 1.8.3.1
[PATCH 02/11] powerpc/powernv: Add pnv_ioda_pe_iommu_bypass_supported()
This little calculation will be needed in other places. Move it to a convenience function. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++- arch/powerpc/platforms/powernv/pci.h | 8 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index c28d0d9b7ee0..8849218187d7 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1838,11 +1838,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; pe = &phb->ioda.pe_array[pdn->pe_number]; - if (pe->tce_bypass_enabled) { - u64 top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1; - if (dma_mask >= top) - return true; - } + + if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask)) + return true; /* * If the device can't set the TCE bypass bit but still wants diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index f914f0b14e4e..41f7dec3aee5 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -4,6 +4,7 @@ #include /* for __printf */ #include +#include <linux/memblock.h> #include #include @@ -247,4 +248,11 @@ extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl, void *tce_mem, u64 tce_size, u64 dma_offset, unsigned int page_shift); +static inline bool pnv_ioda_pe_iommu_bypass_supported(struct pnv_ioda_pe *pe, + u64 mask) +{ + return pe->tce_bypass_enabled && + mask >= pe->tce_bypass_base + memblock_end_of_DRAM() - 1; +} + #endif /* __POWERNV_PCI_H */ -- 1.8.3.1
[PATCH 10/11] powerpc/powernv: Add pnv_phb3_iommu_bypass_supported()
Move this code to its own function for reuse. As a side benefit, rearrange the comments and spread things out for readability. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 37 +-- 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 6b932cfc0deb..57e6a43d9a3a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1826,6 +1826,30 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe) return -EIO; } +/* + * If the device can't set the TCE bypass bit but still wants + * to access 4GB or more, on PHB3 we can reconfigure TVE#0 to + * bypass the 32-bit region and be usable for 64-bit DMAs. + */ +static bool pnv_phb3_iommu_bypass_supported(struct pnv_ioda_pe *pe, u64 mask) +{ + if (pe->phb->model != PNV_PHB_MODEL_PHB3) + return false; + + /* pe->pdev should be set if it's a single device, pe->pbus if not */ + if (pe->pbus && pe->device_count != 1) + return false; + + if (!(mask >> 32)) + return false; + + /* The device needs to be able to address all of this space. */ + if (mask <= memory_hotplug_max() + (1ULL << 32)) + return false; + + return true; +} + static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { @@ -1837,18 +1861,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); - /* -* If the device can't set the TCE bypass bit but still wants -* to access 4GB or more, on PHB3 we can reconfigure TVE#0 to -* bypass the 32-bit region and be usable for 64-bit DMAs. -* The device needs to be able to address all of this space. 
-*/ - if (!bypass && - dma_mask >> 32 && - dma_mask > (memory_hotplug_max() + (1ULL << 32)) && - /* pe->pdev should be set if it's a single device, pe->pbus if not */ - (pe->device_count == 1 || !pe->pbus) && - pe->phb->model == PNV_PHB_MODEL_PHB3) { + if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; -- 1.8.3.1
[PATCH 07/11] powerpc/powernv/npu: Simplify pnv_npu_try_dma_set_bypass() loop
Write this loop more compactly to improve readability. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index a6b8c7ad36e4..a77ce7d71634 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -261,12 +261,12 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); - int i; struct pnv_phb *phb; struct pci_dn *pdn; struct pnv_ioda_pe *npe; struct pci_dev *npdev; bool bypass; + int i = 0; if (!gpe) return; @@ -274,12 +274,7 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) /* We only do bypass if it's enabled on the linked device */ bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); - for (i = 0; ; ++i) { - npdev = pnv_pci_get_npu_dev(gpdev, i); - - if (!npdev) - break; - + while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) { pdn = pci_get_pdn(npdev); if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) return; -- 1.8.3.1
[PATCH 01/11] Revert "powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function"
Revert commit b4d37a7b6934 ("powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function") so that this function can be reintegrated. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/platforms/powernv/npu-dma.c | 99 1 file changed, 99 insertions(+) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index b95b9e3c4c98..5a8313654033 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -193,6 +193,105 @@ static long pnv_npu_unset_window(struct iommu_table_group *table_group, int num) return 0; } +/* + * Enables 32 bit DMA on NPU. + */ +static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) +{ + struct pci_dev *gpdev; + struct pnv_ioda_pe *gpe; + int64_t rc; + + /* +* Find the associated PCI devices and get the dma window +* information from there. +*/ + if (!npe->pdev || !(npe->flags & PNV_IODA_PE_DEV)) + return; + + gpe = get_gpu_pci_dev_and_pe(npe, &gpdev); + if (!gpe) + return; + + rc = pnv_npu_set_window(&npe->table_group, 0, + gpe->table_group.tables[0]); + + /* +* NVLink devices use the same TCE table configuration as +* their parent device so drivers shouldn't be doing DMA +* operations directly on these devices. +*/ + set_dma_ops(&npe->pdev->dev, &dma_dummy_ops); +} + +/* + * Enables bypass mode on the NPU. The NPU only supports one + * window per link, so bypass needs to be explicitly enabled or + * disabled. Unlike for a PHB3 bypass and non-bypass modes can't be + * active at the same time. 
+ */ +static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) +{ + struct pnv_phb *phb = npe->phb; + int64_t rc = 0; + phys_addr_t top = memblock_end_of_DRAM(); + + if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev) + return -EINVAL; + + rc = pnv_npu_unset_window(&npe->table_group, 0); + if (rc != OPAL_SUCCESS) + return rc; + + /* Enable the bypass window */ + + top = roundup_pow_of_two(top); + dev_info(&npe->pdev->dev, "Enabling bypass for PE %x\n", + npe->pe_number); + rc = opal_pci_map_pe_dma_window_real(phb->opal_id, + npe->pe_number, npe->pe_number, + 0 /* bypass base */, top); + + if (rc == OPAL_SUCCESS) + pnv_pci_ioda2_tce_invalidate_entire(phb, false); + + return rc; +} + +void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) +{ + int i; + struct pnv_phb *phb; + struct pci_dn *pdn; + struct pnv_ioda_pe *npe; + struct pci_dev *npdev; + + for (i = 0; ; ++i) { + npdev = pnv_pci_get_npu_dev(gpdev, i); + + if (!npdev) + break; + + pdn = pci_get_pdn(npdev); + if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + return; + + phb = pci_bus_to_host(npdev->bus)->private_data; + + /* We only do bypass if it's enabled on the linked device */ + npe = &phb->ioda.pe_array[pdn->pe_number]; + + if (bypass) { + dev_info(&npdev->dev, + "Using 64-bit DMA iommu bypass\n"); + pnv_npu_dma_set_bypass(npe); + } else { + dev_info(&npdev->dev, "Using 32-bit DMA via iommu\n"); + pnv_npu_dma_set_32(npe); + } + } +} + /* Switch ownership from platform code to external user (e.g. VFIO) */ static void pnv_npu_take_ownership(struct iommu_table_group *table_group) { -- 1.8.3.1
[PATCH 08/11] powerpc/powernv: Replace open coded pnv_ioda_get_pe()s
Collapse several open coded instances of pnv_ioda_get_pe(). Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/npu-dma.c | 22 +- arch/powerpc/platforms/powernv/pci-ioda.c | 10 +++--- 2 files changed, 8 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index a77ce7d71634..0010b21d45b8 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -97,24 +97,17 @@ struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index) static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, struct pci_dev **gpdev) { - struct pnv_phb *phb; - struct pci_controller *hose; struct pci_dev *pdev; struct pnv_ioda_pe *pe; - struct pci_dn *pdn; pdev = pnv_pci_get_gpu_dev(npe->pdev); if (!pdev) return NULL; - pdn = pci_get_pdn(pdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + pe = pnv_ioda_get_pe(pdev); + if (WARN_ON(!pe)) return NULL; - hose = pci_bus_to_host(pdev->bus); - phb = hose->private_data; - pe = &phb->ioda.pe_array[pdn->pe_number]; - if (gpdev) *gpdev = pdev; @@ -261,9 +254,6 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) { struct pnv_ioda_pe *gpe = pnv_ioda_get_pe(gpdev); - struct pnv_phb *phb; - struct pci_dn *pdn; - struct pnv_ioda_pe *npe; struct pci_dev *npdev; bool bypass; int i = 0; @@ -275,12 +265,10 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, u64 mask) bypass = pnv_ioda_pe_iommu_bypass_supported(gpe, mask); while ((npdev = pnv_pci_get_npu_dev(gpdev, i++))) { - pdn = pci_get_pdn(npdev); - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) - return; + struct pnv_ioda_pe *npe = pnv_ioda_get_pe(npdev); - phb = pci_bus_to_host(npdev->bus)->private_data; - npe = &phb->ioda.pe_array[pdn->pe_number]; + if (WARN_ON(!npe)) + return; if (bypass) { dev_info(&npdev->dev, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c index 319152d30bc3..6b932cfc0deb 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1829,16 +1829,12 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe) static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { - struct pci_controller *hose = pci_bus_to_host(pdev->bus); - struct pnv_phb *phb = hose->private_data; - struct pci_dn *pdn = pci_get_pdn(pdev); - struct pnv_ioda_pe *pe; + struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); bool bypass; - if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) + if (WARN_ON(!pe)) return false; - pe = &phb->ioda.pe_array[pdn->pe_number]; bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); /* @@ -1852,7 +1848,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, dma_mask > (memory_hotplug_max() + (1ULL << 32)) && /* pe->pdev should be set if it's a single device, pe->pbus if not */ (pe->device_count == 1 || !pe->pbus) && - phb->model == PNV_PHB_MODEL_PHB3) { + pe->phb->model == PNV_PHB_MODEL_PHB3) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; -- 1.8.3.1
[PATCH 11/11] powerpc/powernv: Add pnv_pci_ioda_dma_set_mask()
Change pnv_pci_ioda_iommu_bypass_supported() to have no side effects, by separating the part of the function that determines if bypass is supported from the part that actually attempts to configure it. Move the latter to a controller-specific dma_set_mask() callback. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/Kconfig| 1 + arch/powerpc/platforms/powernv/pci-ioda.c | 30 -- 2 files changed, 17 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index 938803eab0ad..6e6e27841764 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -17,6 +17,7 @@ config PPC_POWERNV select PPC_DOORBELL select MMU_NOTIFIER select FORCE_SMP + select ARCH_HAS_DMA_SET_MASK default y config OPAL_PRD diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 57e6a43d9a3a..5291464930ed 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1854,32 +1854,33 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, u64 dma_mask) { struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); - bool bypass; if (WARN_ON(!pe)) return false; - bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); + return pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask) || + pnv_phb3_iommu_bypass_supported(pe, dma_mask); +} + +static void pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 mask) +{ + struct pnv_ioda_pe *pe = pnv_ioda_get_pe(pdev); + + if (!pe) + return; - if (!bypass && pnv_phb3_iommu_bypass_supported(pe, dma_mask)) { + if (!pnv_ioda_pe_iommu_bypass_supported(pe, mask) && + pnv_phb3_iommu_bypass_supported(pe, mask)) { /* Configure the bypass mode */ if (pnv_pci_ioda_dma_64bit_bypass(pe)) - return false; + return; /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); - - bypass = true; } - /* -* Update peer npu devices. 
We also do this for the special case where -* a 64-bit dma mask can't be fulfilled and falls back to default. -*/ - if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64)) - pnv_npu_try_dma_set_bypass(pdev, dma_mask); - - return bypass; + /* Update peer npu devices */ + pnv_npu_try_dma_set_bypass(pdev, mask); } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) @@ -3612,6 +3613,7 @@ static void pnv_pci_ioda_shutdown(struct pci_controller *hose) static const struct pci_controller_ops pnv_pci_ioda_controller_ops = { .dma_dev_setup = pnv_pci_dma_dev_setup, .dma_bus_setup = pnv_pci_dma_bus_setup, + .dma_set_mask = pnv_pci_ioda_dma_set_mask, .iommu_bypass_supported = pnv_pci_ioda_iommu_bypass_supported, .setup_msi_irqs = pnv_setup_msi_irqs, .teardown_msi_irqs = pnv_teardown_msi_irqs, -- 1.8.3.1
[PATCH 05/11] powerpc/powernv: Return failure for some uses of dma_set_mask()
Rework of pnv_pci_ioda_dma_set_mask() effectively reverted commit 253fd51e2f53 ("powerpc/powernv/pci: Return failure for some uses of dma_set_mask()"). Reintroduce the desired behavior that an unfulfilled request for a DMA mask between 32 and 64 bits will return error instead of silently falling back to 32 bits. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/kernel/dma-iommu.c | 19 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++-- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c index e486d1d78de2..e1693760654b 100644 --- a/arch/powerpc/kernel/dma-iommu.c +++ b/arch/powerpc/kernel/dma-iommu.c @@ -122,10 +122,21 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask) { struct iommu_table *tbl = get_iommu_table_base(dev); - if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) { - dev->archdata.iommu_bypass = true; - dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n"); - return 1; + if (dev_is_pci(dev)) { + if (dma_iommu_bypass_supported(dev, mask)) { + dev->archdata.iommu_bypass = true; + dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n"); + return 1; + } + + if (mask >> 32) { + dev_err(dev, "Warning: IOMMU dma bypass not supported: mask 0x%016llx\n", + mask); + + /* A 64-bit request falls back to default ops */ + if (mask != DMA_BIT_MASK(64)) + return 0; + } } if (!tbl) { diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 70e834635971..b78b5e81f941 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1863,8 +1863,12 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, bypass = true; } - /* Update peer npu devices */ - pnv_npu_try_dma_set_bypass(pdev, dma_mask); + /* +* Update peer npu devices. 
We also do this for the special case where +* a 64-bit dma mask can't be fulfilled and falls back to default. +*/ + if (bypass || !(dma_mask >> 32) || dma_mask == DMA_BIT_MASK(64)) + pnv_npu_try_dma_set_bypass(pdev, dma_mask); return bypass; } -- 1.8.3.1
[PATCH 04/11] powerpc/powernv/npu: Wire up pnv_npu_try_dma_set_bypass()
Rework of pnv_pci_ioda_iommu_bypass_supported() dropped a call to pnv_npu_try_dma_set_bypass(). Reintroduce this call, so that the DMA bypass configuration of a GPU device is propagated to its corresponding NPU devices. Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code") Signed-off-by: Reza Arbab Cc: Christoph Hellwig --- arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++-- 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 8849218187d7..70e834635971 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1833,14 +1833,13 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, struct pnv_phb *phb = hose->private_data; struct pci_dn *pdn = pci_get_pdn(pdev); struct pnv_ioda_pe *pe; + bool bypass; if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE)) return false; pe = &phb->ioda.pe_array[pdn->pe_number]; - - if (pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask)) - return true; + bypass = pnv_ioda_pe_iommu_bypass_supported(pe, dma_mask); /* * If the device can't set the TCE bypass bit but still wants @@ -1848,7 +1847,8 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, * bypass the 32-bit region and be usable for 64-bit DMAs. * The device needs to be able to address all of this space. 
*/ - if (dma_mask >> 32 && + if (!bypass && + dma_mask >> 32 && dma_mask > (memory_hotplug_max() + (1ULL << 32)) && /* pe->pdev should be set if it's a single device, pe->pbus if not */ (pe->device_count == 1 || !pe->pbus) && @@ -1859,10 +1859,14 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); - return true; + + bypass = true; } - return false; + /* Update peer npu devices */ + pnv_npu_try_dma_set_bypass(pdev, dma_mask); + + return bypass; } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) -- 1.8.3.1
[PATCH 06/11] powerpc/powernv: Remove intermediate variable
Trim the pointless temporary variable. Signed-off-by: Reza Arbab --- arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index b78b5e81f941..319152d30bc3 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1854,9 +1854,9 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, (pe->device_count == 1 || !pe->pbus) && phb->model == PNV_PHB_MODEL_PHB3) { /* Configure the bypass mode */ - s64 rc = pnv_pci_ioda_dma_64bit_bypass(pe); - if (rc) + if (pnv_pci_ioda_dma_64bit_bypass(pe)) return false; + /* 4GB offset bypasses 32-bit space */ pdev->dev.archdata.dma_offset = (1ULL << 32); -- 1.8.3.1
Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic
On Wed, 2019-10-30 at 08:22 -0700, Lakshmi Ramasubramanian wrote: > On 10/23/19 8:47 PM, Nayna Jain wrote: > > Hi Nayna, > > > process_buffer_measurement() is limited to measuring the kexec boot > > command line. This patch makes process_buffer_measurement() more > > generic, allowing it to measure other types of buffer data (e.g. > > blacklisted binary hashes or key hashes). > > Now that process_buffer_measurement() is being made generic to measure > any buffer, it would be good to add a tag to indicate what type of > buffer is being measured. > > For example, if the buffer is kexec command line the log could look like: > > "kexec_cmdline: " > > Similarly, if the buffer is blacklisted binary hash: > > "blacklist hash: ". > > If the buffer is key hash: > > ": key data". > > This would greatly help the consumer of the IMA log to know the type of > data represented in each IMA log entry. Both the existing kexec command line and the new blacklist buffer measurement pass that information in the eventname. The [PATCH 7/8] "ima: check against blacklisted hashes for files with modsig" patch description includes an example. Mimi
Re: [PATCH v9 5/8] ima: make process_buffer_measurement() generic
On 10/23/19 8:47 PM, Nayna Jain wrote: Hi Nayna, process_buffer_measurement() is limited to measuring the kexec boot command line. This patch makes process_buffer_measurement() more generic, allowing it to measure other types of buffer data (e.g. blacklisted binary hashes or key hashes). Now that process_buffer_measurement() is being made generic to measure any buffer, it would be good to add a tag to indicate what type of buffer is being measured. For example, if the buffer is kexec command line the log could look like: "kexec_cmdline: " Similarly, if the buffer is blacklisted binary hash: "blacklist hash: ". If the buffer is key hash: ": key data". This would greatly help the consumer of the IMA log to know the type of data represented in each IMA log entry. thanks, -lakshmi
Re: [PATCH v10 1/5] kasan: support backing vmalloc space with real shadow memory
Hello, Daniel > > @@ -1294,14 +1299,19 @@ static bool __purge_vmap_area_lazy(unsigned long > start, unsigned long end) > spin_lock(&vmap_area_lock); > llist_for_each_entry_safe(va, n_va, valist, purge_list) { > unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; > + unsigned long orig_start = va->va_start; > + unsigned long orig_end = va->va_end; > > /* >* Finally insert or merge lazily-freed area. It is >* detached and there is no need to "unlink" it from >* anything. >*/ > - merge_or_add_vmap_area(va, > - &free_vmap_area_root, &free_vmap_area_list); > + va = merge_or_add_vmap_area(va, &free_vmap_area_root, > + &free_vmap_area_list); > + > + kasan_release_vmalloc(orig_start, orig_end, > + va->va_start, va->va_end); > I have some questions here. I have not analyzed kasan_release_vmalloc() logic in detail, sorry for that if I miss something. __purge_vmap_area_lazy() deals with a big address space, so it does not free only vmalloc addresses here; basically it can be any address, starting from 1 until ULONG_MAX, whereas vmalloc space spans from VMALLOC_START - VMALLOC_END: 1) Should it be checked that only vmalloc addresses are freed, or do you handle it somewhere else? if (is_vmalloc_addr(va->va_start)) kasan_release_vmalloc(...) 2) Have you run any benchmarking just to see how much overhead it adds? I am asking because it probably makes sense to add those figures to the changelog (commit message). For example you can run: sudo ./test_vmalloc.sh performance and sudo ./test_vmalloc.sh sequential_test_order=1 Thanks! -- Vlad Rezki
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
Andrey Ryabinin writes: > On 10/30/19 4:50 PM, Daniel Axtens wrote: >> Andrey Ryabinin writes: >> >>> On 10/29/19 7:20 AM, Daniel Axtens wrote: In the case where KASAN directly allocates memory to back vmalloc space, don't map the early shadow page over it. We prepopulate pgds/p4ds for the range that would otherwise be empty. This is required to get it synced to hardware on boot, allowing the lower levels of the page tables to be filled dynamically. Acked-by: Dmitry Vyukov Signed-off-by: Daniel Axtens --- >>> +static void __init kasan_shallow_populate_pgds(void *start, void *end) +{ + unsigned long addr, next; + pgd_t *pgd; + void *p; + int nid = early_pfn_to_nid((unsigned long)start); >>> >>> This doesn't make sense. start is not even a pfn. With linear mapping >>> we try to identify nid to have the shadow on the same node as memory. But >>> in this case we don't have memory or the corresponding shadow (yet), >>> we only install pgd/p4d. >>> I guess we could just use NUMA_NO_NODE. >> >> Ah wow, that's quite the clanger on my part. >> >> There are a couple of other invocations of early_pfn_to_nid in that file >> that use an address directly, but at least they reference actual memory. >> I'll send a separate patch to fix those up. > > I see only one incorrect, in kasan_init(): early_pfn_to_nid(__pa(_stext)) > It should be wrapped with PFN_DOWN(). > Other usages in map_range() seems to be correct, range->start,end is pfns. > Oh, right, I didn't realise map_range was already using pfns. > >> >>> The rest looks ok, so with that fixed: >>> >>> Reviewed-by: Andrey Ryabinin >> >> Thanks heaps! I've fixed up the nit you identified in the first patch, >> and I agree that the last patch probably isn't needed. I'll respin the >> series shortly. >> > > Hold on a sec, just spotted another thing to fix. 
> >> @@ -352,9 +397,24 @@ void __init kasan_init(void) >> shadow_cpu_entry_end = (void *)round_up( >> (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); >> >> +/* >> + * If we're in full vmalloc mode, don't back vmalloc space with early >> + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to >> + * the global table and we can populate the lower levels on demand. >> + */ >> +#ifdef CONFIG_KASAN_VMALLOC >> +kasan_shallow_populate_pgds( >> +kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), > > This should be VMALLOC_START; there is no point in allocating pgds for the hole > between the linear mapping > and vmalloc, it is just a waste of memory. It makes sense to map the early shadow for that > hole, because if code > dereferences an address in that hole we will see the page fault on that address > instead of a fault on the shadow. > > So something like this might work: > > kasan_populate_early_shadow( > kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), > kasan_mem_to_shadow((void *)VMALLOC_START)); > > if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) > kasan_shallow_populate_pgds(kasan_mem_to_shadow(VMALLOC_START), > kasan_mem_to_shadow((void *)VMALLOC_END)); > else > kasan_populate_early_shadow(kasan_mem_to_shadow(VMALLOC_START), > kasan_mem_to_shadow((void *)VMALLOC_END)); > kasan_populate_early_shadow( > kasan_mem_to_shadow((void *)VMALLOC_END + 1), > shadow_cpu_entry_begin); Sounds good. It's getting late for me so I'll change and test that and send a respin tomorrow my time. Regards, Daniel
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
On 10/30/19 4:50 PM, Daniel Axtens wrote: > Andrey Ryabinin writes: > >> On 10/29/19 7:20 AM, Daniel Axtens wrote: >>> In the case where KASAN directly allocates memory to back vmalloc >>> space, don't map the early shadow page over it. >>> >>> We prepopulate pgds/p4ds for the range that would otherwise be empty. >>> This is required to get it synced to hardware on boot, allowing the >>> lower levels of the page tables to be filled dynamically. >>> >>> Acked-by: Dmitry Vyukov >>> Signed-off-by: Daniel Axtens >>> >>> --- >> >>> +static void __init kasan_shallow_populate_pgds(void *start, void *end) >>> +{ >>> + unsigned long addr, next; >>> + pgd_t *pgd; >>> + void *p; >>> + int nid = early_pfn_to_nid((unsigned long)start); >> >> This doesn't make sense. start is not even a pfn. With linear mapping >> we try to identify nid to have the shadow on the same node as memory. But >> in this case we don't have memory or the corresponding shadow (yet), >> we only install pgd/p4d. >> I guess we could just use NUMA_NO_NODE. > > Ah wow, that's quite the clanger on my part. > > There are a couple of other invocations of early_pfn_to_nid in that file > that use an address directly, but at least they reference actual memory. > I'll send a separate patch to fix those up. I see only one incorrect, in kasan_init(): early_pfn_to_nid(__pa(_stext)) It should be wrapped with PFN_DOWN(). Other usages in map_range() seem to be correct, range->start,end are pfns. > >> The rest looks ok, so with that fixed: >> >> Reviewed-by: Andrey Ryabinin > > Thanks heaps! I've fixed up the nit you identified in the first patch, > and I agree that the last patch probably isn't needed. I'll respin the > series shortly. > Hold on a sec, just spotted another thing to fix. 
> @@ -352,9 +397,24 @@ void __init kasan_init(void) > shadow_cpu_entry_end = (void *)round_up( > (unsigned long)shadow_cpu_entry_end, PAGE_SIZE); > > + /* > + * If we're in full vmalloc mode, don't back vmalloc space with early > + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to > + * the global table and we can populate the lower levels on demand. > + */ > +#ifdef CONFIG_KASAN_VMALLOC > + kasan_shallow_populate_pgds( > + kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), This should be VMALLOC_START; there is no point in allocating pgds for the hole between the linear mapping and vmalloc, it is just a waste of memory. It makes sense to map the early shadow for that hole, because if code dereferences an address in that hole we will see the page fault on that address instead of a fault on the shadow. So something like this might work: kasan_populate_early_shadow( kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), kasan_mem_to_shadow((void *)VMALLOC_START)); if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) kasan_shallow_populate_pgds(kasan_mem_to_shadow(VMALLOC_START), kasan_mem_to_shadow((void *)VMALLOC_END)); else kasan_populate_early_shadow(kasan_mem_to_shadow(VMALLOC_START), kasan_mem_to_shadow((void *)VMALLOC_END)); kasan_populate_early_shadow( kasan_mem_to_shadow((void *)VMALLOC_END + 1), shadow_cpu_entry_begin);
Re: [PATCH v10 4/5] x86/kasan: support KASAN_VMALLOC
Andrey Ryabinin writes: > On 10/29/19 7:20 AM, Daniel Axtens wrote: >> In the case where KASAN directly allocates memory to back vmalloc >> space, don't map the early shadow page over it. >> >> We prepopulate pgds/p4ds for the range that would otherwise be empty. >> This is required to get it synced to hardware on boot, allowing the >> lower levels of the page tables to be filled dynamically. >> >> Acked-by: Dmitry Vyukov >> Signed-off-by: Daniel Axtens >> >> --- > >> +static void __init kasan_shallow_populate_pgds(void *start, void *end) >> +{ >> +unsigned long addr, next; >> +pgd_t *pgd; >> +void *p; >> +int nid = early_pfn_to_nid((unsigned long)start); > > This doesn't make sense. start is not even a pfn. With linear mapping > we try to identify nid to have the shadow on the same node as memory. But > in this case we don't have memory or the corresponding shadow (yet), > we only install pgd/p4d. > I guess we could just use NUMA_NO_NODE. Ah wow, that's quite the clanger on my part. There are a couple of other invocations of early_pfn_to_nid in that file that use an address directly, but at least they reference actual memory. I'll send a separate patch to fix those up. > The rest looks ok, so with that fixed: > > Reviewed-by: Andrey Ryabinin Thanks heaps! I've fixed up the nit you identified in the first patch, and I agree that the last patch probably isn't needed. I'll respin the series shortly. Regards, Daniel
Re: [RFC PATCH 00/27] current interrupt series plus scv syscall
Hello, On Wed, Oct 02, 2019 at 01:13:52PM +1000, Nicholas Piggin wrote: > Michal Suchánek's on September 24, 2019 7:33 pm: > > Hello, > > > > can you mark the individual patches with RFC rather than the whole > > series? > > Hey, thanks for the reviews. I'll resend all but the last two patches > soon for merge in the next window. Will you resend these or should I cherry-pick the part I need for the !COMPAT? Thanks Michal
Re: [PATCH v2 17/23] soc: fsl: qe: make qe_ic_cascade_* static
On 30/10/2019 11.50, Christophe Leroy wrote: > > > On 25/10/2019 at 14:40, Rasmus Villemoes wrote: >> Now that the references from arch/powerpc/ are gone, these are only >> referenced from inside qe_ic.c, so make them static. > > Why do that in two steps? > I think patch 9 could remain until here, and then you could squash patch > 9 and patch 17 together here. Agreed, I should rearrange stuff to avoid touching the same lines again and again. Thanks, Rasmus
Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
> On Oct 30, 2019, at 6:28 AM, Peter Zijlstra wrote: > > It only makes 'wild' guesses when the BIOS is shit and it complains > about that. > > Or do you like your BIOS broken? Agreed. It is garbage in, garbage out. No need to complicate the existing code further.
Re: [PATCH] powerpc/powernv: Fix CPU idle to be called with IRQs disabled
On Tue, 2019-10-22 at 11:58:14 UTC, Nicholas Piggin wrote: > Commit e78a7614f3876 ("idle: Prevent late-arriving interrupts from > disrupting offline") changes arch_cpu_idle_dead to be called with > interrupts disabled, which triggers the WARN in pnv_smp_cpu_kill_self. > > Fix this by fixing up irq_happened after hard disabling, rather than > requiring there are no pending interrupts, similarly to what was > done until commit 2525db04d1cc5 ("powerpc/powernv: Simplify lazy IRQ > handling in CPU offline"). > > Fixes: e78a7614f3876 ("idle: Prevent late-arriving interrupts from disrupting > offline") > Reported-by: Paul Mackerras > Signed-off-by: Nicholas Piggin Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/7d6475051fb3d9339c5c760ed9883bc0a9048b21 cheers
Re: [PATCH] powernv/eeh: Fix oops when probing cxl devices
On Wed, 2019-10-16 at 16:28:33 UTC, Frederic Barrat wrote: > Recent cleanup in the way EEH support is added to a device causes a > kernel oops when the cxl driver probes a device and creates virtual > devices discovered on the FPGA: > > BUG: Kernel NULL pointer dereference at 0x00a0 > Faulting instruction address: 0xc0048070 > Oops: Kernel access of bad area, sig: 7 [#1] > ... > NIP [c0048070] eeh_add_device_late.part.9+0x50/0x1e0 > LR [c004805c] eeh_add_device_late.part.9+0x3c/0x1e0 > Call Trace: > [c000200e43983900] [c079e250] _dev_info+0x5c/0x6c (unreliable) > [c000200e43983980] [c00d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0 > [c000200e439839f0] [c00606d0] pcibios_bus_add_device+0x40/0x60 > [c000200e43983a10] [c06aa3a0] pci_bus_add_device+0x30/0x100 > [c000200e43983a80] [c06aa4d4] pci_bus_add_devices+0x64/0xd0 > [c000200e43983ac0] [c0081c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl] > [c000200e43983b00] [c0081c4242ac] cxl_probe+0x504/0x5b0 [cxl] > [c000200e43983bb0] [c06bba1c] local_pci_probe+0x6c/0x110 > [c000200e43983c30] [c0159278] work_for_cpu_fn+0x38/0x60 > > The root cause is that those cxl virtual devices don't have a > representation in the device tree and therefore no associated pci_dn > structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and > we oops. > > We never had explicit support for EEH for those virtual > devices. Instead, EEH events are reported to the (real) pci device and > handled by the cxl driver. Which can then forward to the virtual > devices and handle dependencies. The fact that we try adding EEH > support for the virtual devices is new and a side-effect of the recent > cleanup. > > This patch fixes it by skipping adding EEH support on powernv for > devices which don't have a pci_dn structure. > > The cxl driver doesn't create virtual devices on pseries so this patch > doesn't fix it there intentionally. 
> > Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug") > Signed-off-by: Frederic Barrat Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/a8a30219ba78b1abb92091102b632f8e9bbdbf03 cheers
Re: [PATCH] powerpc/prom_init: Undo relocation before entering secure mode
On Wed, 2019-09-11 at 16:34:33 UTC, Thiago Jung Bauermann wrote: > The ultravisor will do an integrity check of the kernel image but we > relocated it so the check will fail. Restore the original image by > relocating it back to the kernel virtual base address. > > This works because during build vmlinux is linked with an expected virtual > runtime address of KERNELBASE. > > Fixes: 6a9c930bd775 ("powerpc/prom_init: Add the ESM call to prom_init") > Signed-off-by: Thiago Jung Bauermann Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/05d9a952832cb206a32e3705eff6edebdb2207e7 cheers
Re: [PATCH v3] powerpc/powernv: Add queue mechanism for early messages
On Mon, 2018-05-21 at 02:04:38 UTC, Deb McLemore wrote: > The problem being solved is that when issuing a BMC soft poweroff during IPL, > the poweroff was being lost, so the machine would not power off. > > Opal messages were being received before the opal-power code > registered its notifiers. > > Alternatives discussed (option #3 was chosen): > > 1 - Have opal_message_init() explicitly call opal_power_control_init() > before it dequeues any OPAL messages (i.e. before we register the > opal-msg IRQ handler). > > 2 - Introduce concept of critical message types and when we register > handlers we track which message types have a registered handler, > then defer the opal-msg IRQ registration until we have a handler > registered for all the critical types. > > 3 - Buffer messages: if we receive a message and do not yet > have a handler for that type, store the message and replay it when > a handler for that type is registered. > > Signed-off-by: Deb McLemore Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/a9336ddf448b1cba3080195cec2287af3907236c cheers
Re: [PATCH v2 1/3] powerpc/pseries: Don't opencode HPTE_V_BOLTED
On Thu, 2019-10-24 at 09:35:40 UTC, "Aneesh Kumar K.V" wrote: > No functional change in this patch. > > Signed-off-by: Aneesh Kumar K.V Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/82ce028ad26dd075b06285ef61a854a564d910fb cheers
Re: [PATCH] powerpc/pseries: Mark accumulate_stolen_time() as notrace
On Thu, 2019-10-24 at 05:59:32 UTC, Michael Ellerman wrote: > accumulate_stolen_time() is called prior to interrupt state being > reconciled, which can trip the warning in arch_local_irq_restore(): > > WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 > .arch_local_irq_restore+0x9c/0x130 > ... > NIP .arch_local_irq_restore+0x9c/0x130 > LR .rb_start_commit+0x38/0x80 > Call Trace: > .ring_buffer_lock_reserve+0xe4/0x620 > .trace_function+0x44/0x210 > .function_trace_call+0x148/0x170 > .ftrace_ops_no_ops+0x180/0x1d0 > ftrace_call+0x4/0x8 > .accumulate_stolen_time+0x1c/0xb0 > decrementer_common+0x124/0x160 > > For now just mark it as notrace. We may change the ordering to call it > after interrupt state has been reconciled, but that is a larger > change. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/eb8e20f89093b64f48975c74ccb114e6775cee22 cheers
Re: [PATCH] selftests/powerpc: Reduce sigfuz runtime to ~60s
On Sun, 2019-10-13 at 23:46:43 UTC, Michael Ellerman wrote: > The defaults for the sigfuz test is to run for 4000 iterations, but > that can take quite a while and the test harness may kill the test. > Reduce the number of iterations to 600, which gives a runtime of > roughly 1 minute on a Power8 system. > > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/4f5c5b76cc00ccf5be89a2b9883feba3baf6eb2e cheers
Re: [PATCH] powerpc: make syntax for FADump config options in kernel/Makefile readable
On Wed, 2019-10-09 at 15:27:20 UTC, Hari Bathini wrote: > arch/powerpc/kernel/fadump.c file needs to be compiled in if 'config > FA_DUMP' or 'config PRESERVE_FA_DUMP' is set. The current syntax > achieves that but looks a bit odd. Fix it for better readability. > > Signed-off-by: Hari Bathini Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/cd1d55f16d48d97d681d9534170ce712ac1d09e7 cheers
Re: [PATCH] powerpc/configs: add FADump awareness to skiroot_defconfig
On Wed, 2019-10-09 at 14:04:29 UTC, Hari Bathini wrote: > FADump is supported on the PowerNV platform. For this support to work, the > petitboot kernel must be FADump aware. Enable config PRESERVE_FA_DUMP > to make the petitboot kernel FADump aware. > > Signed-off-by: Hari Bathini Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/aaa351504449c4babb80753c72920e4b25fbd8a9 cheers
Re: [PATCH] powerpc/pkeys: remove unused pkey_allows_readwrite
On Tue, 2019-09-17 at 15:22:30 UTC, Qian Cai wrote: > pkey_allows_readwrite() was first introduced in the commit 5586cf61e108 > ("powerpc: introduce execute-only pkey"), but the usage was removed > entirely in the commit a4fcc877d4e1 ("powerpc/pkeys: Preallocate > execute-only key"). > > Found by the "-Wunused-function" compiler warning flag. > > Fixes: a4fcc877d4e1 ("powerpc/pkeys: Preallocate execute-only key") > Signed-off-by: Qian Cai Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/29674a1c71be710f8418468aa6a8addd6d1aba1c cheers
Re: [PATCH v2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range
On Tue, 2019-09-17 at 12:38:51 UTC, "Aneesh Kumar K.V" wrote: > With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap") > pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we > can map the memmap area using a 16MB hugepage size, and that can cover > a memory range of 16G. > > While enabling new pmem namespaces, since memory is added in sub-section > chunks, > before creating a new memmap mapping, the kernel should check whether there is an > existing memmap mapping covering the new pmem namespace. Currently, this is > validated by checking whether the section covering the range is already > initialized or not. Considering there can be multiple namespaces in the same > section, this can result in incorrect validation. Update this to check for > sub-sections in the range. This is done by checking all pfns in the range we > are mapping. > > We could optimize this by checking just one pfn in each sub-section. But > since this is not a fast path, we keep it simple. > > Signed-off-by: Aneesh Kumar K.V Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/5f5d6e40a01e70b731df843d8b5a61b4b28b19d9 cheers