[PATCH v2] device-dax: use fallback nid when numa node is invalid
Previously, numa_off was set unconditionally in dummy_numa_init() even
with a fake numa node. Then ACPI sets the node id to NUMA_NO_NODE (-1)
after acpi_map_pxm_to_node() because it regards numa_off as turning off
the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 in
the fake numa case.

Without this patch, pmem can't be probed as a RAM device on arm64 if the
SRAT table isn't present:

  $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
  kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1
  kmem: probe of dax0.0 failed with error -22

This fixes it by using the fallback memory_add_physaddr_to_nid() as the
nid.

Suggested-by: David Hildenbrand
Signed-off-by: Jia He
---
v2: - rebase it based on David's "memory group" patch.
    - drop the changes in dev_dax_kmem_remove() since nid had been
      removed in remove_memory().

 drivers/dax/kmem.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..e4836eb7539e 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -47,20 +47,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	unsigned long total_len = 0;
 	struct dax_kmem_data *data;
 	int i, rc, mapped = 0;
-	int numa_node;
-
-	/*
-	 * Ensure good NUMA information for the persistent memory.
-	 * Without this check, there is a risk that slow memory
-	 * could be mixed in a node with faster memory, causing
-	 * unavoidable performance issues.
-	 */
-	numa_node = dev_dax->target_node;
-	if (numa_node < 0) {
-		dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
-				numa_node);
-		return -EINVAL;
-	}
+	int numa_node = dev_dax->target_node;

 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
@@ -71,6 +58,22 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 				i, range.start, range.end);
 			continue;
 		}
+
+		/*
+		 * Ensure good NUMA information for the persistent memory.
+		 * Without this check, there is a risk (though not a fatal
+		 * one) that slow memory could be mixed in a node with
+		 * faster memory, causing unavoidable performance issues.
+		 * Warn about this and use the fallback node id.
+		 */
+		if (numa_node < 0) {
+			int new_node = memory_add_physaddr_to_nid(range.start);
+
+			dev_info(dev, "changing nid from %d to %d for DAX region [%#llx-%#llx]\n",
+					numa_node, new_node, range.start, range.end);
+			numa_node = new_node;
+		}
+
 		total_len += range_len(&range);
 	}
-- 
2.17.1
[PATCH] KVM: arm64: Fix unaligned addr case in mmu walking
If the start addr is not aligned with the granule size of that level,
the loop step size should be adjusted to the boundary instead of a
simple kvm_granule_size(level) increment. Otherwise, some mmu entries
might miss the chance to be walked through.

E.g. assume the unmap range [data->addr, data->end] is
[0xff00ab2000, 0xff00cb2000] during level 2 walking and NOT a block
mapping, and the 1st part of that pmd entry is [0xff00ab2000, 0xff00c0].
The pmd value is 0x83fbd2c1002 (not a valid entry). In this case,
data->addr should be adjusted to 0xff00c0 instead of 0xff00cb2000.

Without this fix, a userspace "segment fault" error can be easily
triggered by running simple gVisor runsc cases on an Ampere Altra
server:

    docker run --runtime=runsc -it --rm ubuntu /bin/bash

In the container:

    for i in `seq 1 100`; do ls; done

Reported-by: Howard Zhang
Signed-off-by: Jia He
---
 arch/arm64/kvm/hyp/pgtable.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bdf8e55ed308..4d99d07c610c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -225,6 +225,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
 		goto out;

 	if (!table) {
+		data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
 		data->addr += kvm_granule_size(level);
 		goto out;
 	}
-- 
2.17.1
[RFC PATCH 1/2] arm64/cpuinfo: Move init_cpu_features() ahead of setup.c::early_fixmap_init()
Move init_cpu_features() ahead of setup_arch()->early_fixmap_init(),
which is preparation work for checking the condition to assign
arm64_use_ng_mappings as cpus_have_const_cap(ARM64_UNMAP_KERNEL_AT_EL0).

Besides, jump_label_init() is also moved ahead because
cpus_have_const_cap() depends on the static key enable api.

Percpu helpers should be avoided in cpuinfo_store_boot_cpu() before the
percpu init at main.c::setup_per_cpu_areas().

Signed-off-by: Jia He
---
 arch/arm64/include/asm/cpu.h |  1 +
 arch/arm64/kernel/cpuinfo.c  | 13 ++---
 arch/arm64/kernel/setup.c    | 14 +-
 arch/arm64/kernel/smp.c      |  3 +--
 4 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h
index 7faae6ff3ab4..59f36f5e3c04 100644
--- a/arch/arm64/include/asm/cpu.h
+++ b/arch/arm64/include/asm/cpu.h
@@ -63,6 +63,7 @@ DECLARE_PER_CPU(struct cpuinfo_arm64, cpu_data);

 void cpuinfo_store_cpu(void);
 void __init cpuinfo_store_boot_cpu(void);
+void __init save_boot_cpuinfo_data(void);

 void __init init_cpu_features(struct cpuinfo_arm64 *info);
 void update_cpu_features(int cpu, struct cpuinfo_arm64 *info,
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 77605aec25fe..f8de5b8bae20 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -413,9 +413,16 @@ void cpuinfo_store_cpu(void)

 void __init cpuinfo_store_boot_cpu(void)
 {
-	struct cpuinfo_arm64 *info = &per_cpu(cpu_data, 0);
-	__cpuinfo_store_cpu(info);
+	__cpuinfo_store_cpu(&boot_cpu_data);

-	boot_cpu_data = *info;
 	init_cpu_features(&boot_cpu_data);
 }
+
+void __init save_boot_cpuinfo_data(void)
+{
+	struct cpuinfo_arm64 *info;
+
+	set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
+	info = &per_cpu(cpu_data, 0);
+	*info = boot_cpu_data;
+}
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 1a57a76e1cc2..e078ab068f3b 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -297,16 +297,20 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	 */
 	arm64_use_ng_mappings = kaslr_requires_kpti();

-	early_fixmap_init();
-	early_ioremap_init();
-
-	setup_machine_fdt(__fdt_pointer);
-
 	/*
 	 * Initialise the static keys early as they may be enabled by the
 	 * cpufeature code and early parameters.
 	 */
 	jump_label_init();
+
+	/* Init the cpu feature codes for boot cpu */
+	cpuinfo_store_boot_cpu();
+
+	early_fixmap_init();
+	early_ioremap_init();
+
+	setup_machine_fdt(__fdt_pointer);
+
 	parse_early_param();

 	/*
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 2499b895efea..3df1f5b1da0b 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -449,8 +449,7 @@ void __init smp_cpus_done(unsigned int max_cpus)

 void __init smp_prepare_boot_cpu(void)
 {
-	set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
-	cpuinfo_store_boot_cpu();
+	save_boot_cpuinfo_data();

 	/*
 	 * We now know enough about the boot CPU to apply the
-- 
2.17.1
[RFC PATCH 0/2] Avoid booting stall caused by
There is a 10s stall in idmap_kpti_install_ng_mappings when the kernel
boots on an Ampere EMAG server.

Commit f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap
swapper using nG mappings") updates the nG bit at runtime if kpti is
required.

But things get worse if rodata=full in map_mem(). NO_BLOCK_MAPPINGS |
NO_CONT_MAPPINGS is required when creating the pagetable mapping, so all
ptes are fully mapped in this case. On an Ampere EMAG server with 256G
of memory (pagesize=4k), this causes the 10s stall.

After moving init_cpu_features() ahead of early_fixmap_init(), we can
use cpus_have_const_cap() earlier than before. Hence we can avoid this
stall by updating arm64_use_ng_mappings.

After this patch series, the kernel boot time is reduced from 14.7s to
4.1s:

Before:
[   14.757569] Freeing initrd memory: 60752K
After:
[    4.138819] Freeing initrd memory: 60752K

Set it as RFC because I want to resolve any points which I may have
misjudged.

Jia He (2):
  arm64/cpuinfo: Move init_cpu_features() ahead of early_fixmap_init()
  arm64: kpti: Update arm64_use_ng_mappings before pagetable mapping

 arch/arm64/include/asm/cpu.h |  1 +
 arch/arm64/kernel/cpuinfo.c  | 13 ++---
 arch/arm64/kernel/setup.c    | 18 +-
 arch/arm64/kernel/smp.c      |  3 +--
 4 files changed, 25 insertions(+), 10 deletions(-)

-- 
2.17.1
[RFC PATCH 2/2] arm64: kpti: Update arm64_use_ng_mappings before pagetable mapping
There is a 10s stall in idmap_kpti_install_ng_mappings when the kernel
boots on an Ampere EMAG server.

Commit f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap
swapper using nG mappings") updates the nG bit at runtime if kpti is
required.

But things get worse if rodata=full in map_mem(). NO_BLOCK_MAPPINGS |
NO_CONT_MAPPINGS is required when creating the pagetable mapping, so all
ptes are fully mapped in this case. On an Ampere EMAG server with 256G
of memory (pagesize=4k), this causes the 10s stall.

After the previous commit moving init_cpu_features(), we can use
cpus_have_const_cap() earlier than before. Hence we can avoid this stall
by updating arm64_use_ng_mappings.

Signed-off-by: Jia He
---
 arch/arm64/kernel/setup.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index e078ab068f3b..51098ceb7159 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -306,6 +306,10 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	/* Init the cpu feature codes for boot cpu */
 	cpuinfo_store_boot_cpu();

+	/* ARM64_UNMAP_KERNEL_AT_EL0 cap can be updated in cpuinfo_store_boot_cpu() */
+	if (!arm64_use_ng_mappings)
+		arm64_use_ng_mappings = cpus_have_const_cap(ARM64_UNMAP_KERNEL_AT_EL0);
+
 	early_fixmap_init();
 	early_ioremap_init();
-- 
2.17.1
[PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
The permission of the vfio iommu is different from, and incompatible
with, the vma permission. If the iotlb->perm is IOMMU_NONE (e.g. on the
qemu side), qemu will simply call the unmap ioctl() instead of mapping.
Hence vfio_dma_map() can't map a dma region with NONE permission.

This corner case will be exposed by the coming virtio_fs cache_size
commit [1]:
 - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
   memory_region_init_ram_ptr()
 - re-mmap the above area with read/write authority.
 - vfio_dma_map() will be invoked when the vfio device is hotplug added.

qemu:
  vfio_listener_region_add()
    vfio_dma_map(..., readonly=false)
      map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
      ioctl(VFIO_IOMMU_MAP_DMA)

kernel:
  vfio_dma_do_map()
    vfio_pin_map_dma()
      vfio_pin_pages_remote()
        vaddr_get_pfn()
          ...
          check_vma_flags() failed! because
          vm_flags hasn't VM_WRITE && gup_flags has FOLL_WRITE

It will report an error in the qemu log when hotplug adding (vfio) an
nvme disk to a qemu guest on an Ampere EMAG server:
  "VFIO_MAP_DMA failed: Bad address"

[1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-dev/hw/virtio/vhost-user-fs.c#L502

Signed-off-by: Jia He
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 67e827638995..33faa6b7dbd4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		flags |= FOLL_WRITE;

 	mmap_read_lock(mm);
-	ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
+	ret = pin_user_pages_remote(mm, vaddr, 1,
+				    flags | FOLL_LONGTERM | FOLL_FORCE,
 				    page, NULL, NULL);
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
-- 
2.17.1
[PATCH] gpio: dwapb: Fix missing conversion to GPIO-lib-based IRQ-chip
Commit 0ea683931adb ("gpio: dwapb: Convert driver to using the
GPIO-lib-based IRQ-chip") missed the case in dwapb_irq_set_wake().

Without this fix, probing the dwapb gpio driver will hit the error
"address between user and kernel address ranges" on an Ampere armv8a
server and cause a panic.

Fixes: 0ea683931adb ("gpio: dwapb: Convert driver to using the GPIO-lib-based IRQ-chip")
Signed-off-by: Jia He
---
 drivers/gpio/gpio-dwapb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpio/gpio-dwapb.c b/drivers/gpio/gpio-dwapb.c
index a5b326754124..2a9046c0fb16 100644
--- a/drivers/gpio/gpio-dwapb.c
+++ b/drivers/gpio/gpio-dwapb.c
@@ -343,8 +343,8 @@ static int dwapb_irq_set_type(struct irq_data *d, u32 type)
 #ifdef CONFIG_PM_SLEEP
 static int dwapb_irq_set_wake(struct irq_data *d, unsigned int enable)
 {
-	struct irq_chip_generic *igc = irq_data_get_irq_chip_data(d);
-	struct dwapb_gpio *gpio = igc->private;
+	struct gpio_chip *gc = irq_data_get_irq_chip_data(d);
+	struct dwapb_gpio *gpio = to_dwapb_gpio(gc);
 	struct dwapb_context *ctx = gpio->ports[0].ctx;
 	irq_hw_number_t bit = irqd_to_hwirq(d);
-- 
2.17.1
[RFC PATCH 1/6] mm/memory_hotplug: remove redundant memory block size alignment check
The alignment check has already been done by
check_hotplug_memory_range(). Hence the redundant one in
create_memory_block_devices() can be removed. The similar redundant
check is removed from remove_memory_block_devices().

Signed-off-by: Jia He
---
 drivers/base/memory.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b09b68b9f78..4a1691664c6c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -642,10 +642,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
 	unsigned long block_id;
 	int ret = 0;

-	if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
-			 !IS_ALIGNED(size, memory_block_size_bytes())))
-		return -EINVAL;
-
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
 		ret = init_memory_block(&mem, block_id, MEM_OFFLINE);
 		if (ret)
@@ -678,10 +674,6 @@ void remove_memory_block_devices(unsigned long start, unsigned long size)
 	struct memory_block *mem;
 	unsigned long block_id;

-	if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
-			 !IS_ALIGNED(size, memory_block_size_bytes())))
-		return;
-
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
 		mem = find_memory_block_by_id(block_id);
 		if (WARN_ON_ONCE(!mem))
-- 
2.17.1
[RFC PATCH 2/6] resource: export find_next_iomem_res() helper
The helper is used to find the lowest iomem resource that covers part of
[@start..@end]. It is useful when relaxing the alignment check for dax
pmem kmem.

Signed-off-by: Jia He
---
 include/linux/ioport.h | 3 +++
 kernel/resource.c      | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6c2b06fe8beb..203fd16c9f45 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -247,6 +247,9 @@ extern struct resource * __request_region(struct resource *,
 extern void __release_region(struct resource *, resource_size_t,
 				resource_size_t);
+extern int find_next_iomem_res(resource_size_t start, resource_size_t end,
+			       unsigned long flags, unsigned long desc,
+			       bool first_lvl, struct resource *res);

 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int release_mem_region_adjustable(struct resource *, resource_size_t,
 				resource_size_t);
diff --git a/kernel/resource.c b/kernel/resource.c
index 841737bbda9e..57e6a6802a3d 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -338,7 +338,7 @@ EXPORT_SYMBOL(release_resource);
  * @first_lvl:	walk only the first level children, if set
  * @res:	return ptr, if resource found
  */
-static int find_next_iomem_res(resource_size_t start, resource_size_t end,
+int find_next_iomem_res(resource_size_t start, resource_size_t end,
 			       unsigned long flags, unsigned long desc,
 			       bool first_lvl, struct resource *res)
 {
@@ -391,6 +391,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 	read_unlock(&resource_lock);
 	return p ? 0 : -ENODEV;
 }
+EXPORT_SYMBOL(find_next_iomem_res);

 static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
 				 unsigned long flags, unsigned long desc,
-- 
2.17.1
[RFC PATCH 3/6] mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
When dax pmem is probed as a RAM device on arm64, kmem_start in
dev_dax_kmem_probe() previously had to be aligned with the 1G memblock
size on arm64, due to SECTION_SIZE_BITS(30).

There is some meta data at the beginning/end of the iomem space, e.g.
namespace info and the nvdimm label:
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0
    28000-2bfff : System RAM

Hence the whole kmem space is not aligned with memory_block_size for
both the start addr and the end addr, and there is a big gap when kmem
is added into a memory block, which wastes a lot of memory space.

This changes it by relaxing the alignment check for dax pmem kmem in the
path of onlining/offlining memory blocks.

Signed-off-by: Jia He
---
 drivers/base/memory.c | 16 
 mm/memory_hotplug.c   | 39 ++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4a1691664c6c..3d2a94f3b1d9 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -334,6 +334,22 @@ static ssize_t valid_zones_show(struct device *dev,
 	 * online nodes otherwise the page_zone is not reliable
 	 */
 	if (mem->state == MEM_ONLINE) {
+#ifdef CONFIG_ZONE_DEVICE
+		struct resource res;
+		int ret;
+
+		/* adjust start_pfn for dax pmem kmem */
+		ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+					  ((start_pfn + nr_pages) << PAGE_SHIFT) - 1,
+					  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+					  IORES_DESC_PERSISTENT_MEMORY,
+					  false, &res);
+		if (!ret && PFN_UP(res.start) > start_pfn) {
+			nr_pages -= PFN_UP(res.start) - start_pfn;
+			start_pfn = PFN_UP(res.start);
+		}
+#endif
+
 		/*
 		 * The block contains more than one zone can not be offlined.
 		 * This can happen e.g. for ZONE_DMA and ZONE_DMA32
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a53103dc292b..25745f67b680 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -999,6 +999,20 @@ int try_online_node(int nid)

 static int check_hotplug_memory_range(u64 start, u64 size)
 {
+#ifdef CONFIG_ZONE_DEVICE
+	struct resource res;
+	int ret;
+
+	/* Allow pmem kmem not to align with block size */
+	ret = find_next_iomem_res(start, start + size - 1,
+				  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+				  IORES_DESC_PERSISTENT_MEMORY,
+				  false, &res);
+	if (!ret)
+		return 0;
+#endif
+
 	/* memory range must be block size aligned */
 	if (!size || !IS_ALIGNED(start, memory_block_size_bytes()) ||
 	    !IS_ALIGNED(size, memory_block_size_bytes())) {
@@ -1481,19 +1495,42 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	mem_hotplug_begin();

 	/*
-	 * Don't allow to offline memory blocks that contain holes.
+	 * Don't allow to offline memory blocks that contain holes except
+	 * for pmem.
 	 * Consequently, memory blocks with holes can never get onlined
 	 * via the hotplug path - online_pages() - as hotplugged memory has
 	 * no holes. This way, we e.g., don't have to worry about marking
 	 * memory holes PG_reserved, don't need pfn_valid() checks, and can
 	 * avoid using walk_system_ram_range() later.
+	 * When dax pmem is used as RAM (kmem), holes at the beginning are
+	 * allowed.
 	 */
 	walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
 			      count_system_ram_pages_cb);
 	if (nr_pages != end_pfn - start_pfn) {
+#ifdef CONFIG_ZONE_DEVICE
+		struct resource res;
+
+		/* Allow pmem kmem not to align with block size */
+		ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+					  (end_pfn << PAGE_SHIFT) - 1,
+					  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+					  IORES_DESC_PERSISTENT_MEMORY,
+					  false, &res);
+		if (ret) {
+			ret = -EINVAL;
+			reason = "memory holes";
+			goto failed_removal;
+		}
+
+		/* adjust start_pfn for dax pmem kmem */
+		start_pfn = PFN_UP(res.start);
+		end_pfn = PFN_DOWN(res.end + 1);
+#else
 		ret = -EINVAL;
 		reason = "memory holes";
 		goto failed_removal;
+#endif
 	}

 	/* This makes hotplug much easier...and readable.
-- 
2.17.1
[RFC PATCH 6/6] arm64: fall back to vmemmap_populate_basepages if not aligned with PMD_SIZE
In the dax pmem kmem (dax pmem used as a RAM device) case, the start
address might not be aligned with PMD_SIZE, e.g.
24000-33fdf : Persistent Memory
  24000-2421f : namespace0.0
  24240-2bfff : dax0.0
    24240-2bfff : System RAM (kmem)
pfn_to_page(0x24240) is fe0007e9.

Without this patch, vmemmap_populate(fe0007e9, ...) will incorrectly
create a pmd mapping [fe0007e0, fe000800] which contains fe0007e9.

This adds the check and then falls back to vmemmap_populate_basepages().

Signed-off-by: Jia He
---
 arch/arm64/mm/mmu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d69feb2cfb84..3b21bd47e801 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1102,6 +1102,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	do {
 		next = pmd_addr_end(addr, end);

+		if (next - addr < PMD_SIZE) {
+			vmemmap_populate_basepages(start, next, node, altmap);
+			continue;
+		}
 		pgdp = vmemmap_pgd_populate(addr, node);
 		if (!pgdp)
-- 
2.17.1
[RFC PATCH 4/6] mm/page_alloc: adjust the start,end in dax pmem kmem case
There are 3 cases when onlining pages:
- normal RAM, which should be aligned with the memory block size
- persistent memory with ZONE_DEVICE
- persistent memory used as normal RAM (kmem) with ZONE_NORMAL; this
  patch tries to adjust start_pfn/end_pfn after finding the
  corresponding resource range.

Without this patch, the check in __init_single_page() when onlining
memory fails because those pages haven't been mapped in the mmu (not
present from the mmu's point of view).

Signed-off-by: Jia He
---
 mm/page_alloc.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..13216ab3623f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5971,6 +5971,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		if (start_pfn == altmap->base_pfn)
 			start_pfn += altmap->reserve;
 		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+	} else {
+		struct resource res;
+		int ret;
+
+		/* adjust the start,end in dax pmem kmem case */
+		ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+					  (end_pfn << PAGE_SHIFT) - 1,
+					  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+					  IORES_DESC_PERSISTENT_MEMORY,
+					  false, &res);
+		if (!ret) {
+			start_pfn = PFN_UP(res.start);
+			end_pfn = PFN_DOWN(res.end + 1);
+		}
 	}
 #endif
-- 
2.17.1
[RFC PATCH 5/6] device-dax: relax the memblock size alignment for kmem_start
Previously, kmem_start in dev_dax_kmem_probe had to be aligned with
SECTION_SIZE_BITS(30), i.e. the 1G memblock size on arm64. Even with Dan
Williams' sub-section patch series, it was not helpful when adding the
dax pmem kmem to memblock:

$ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$cat /proc/iomem
...
23c00-23fff : System RAM
  23dd4-23fec : reserved
  23fed-23fff : reserved
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0   <- boundaries are aligned with 1G
    28000-2bfff : System RAM (kmem)

$ lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x4000-0x00023fff    8G online       yes   1-8
0x00028000-0x0002bfff    1G online       yes    10

Memory block size:       1G
Total online memory:     9G
Total offline memory:    0B
...

Hence there is a big gap between 0x2403f and 0x28000 due to the 1G
alignment on arm64. More than that, only 1G of memory is returned while
2G is requested. On x86, the gap is relatively small due to
SECTION_SIZE_BITS(27).

Besides decreasing SECTION_SIZE_BITS on arm64, we can relax the
alignment when adding the kmem. After this patch:
24000-33fdf : Persistent Memory
  24000-2421f : namespace0.0
  24240-2bfff : dax0.0
    24240-2bfff : System RAM (kmem)

$ lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x4000-0x0002bfff   10G online       yes  1-10

Memory block size:       1G
Total online memory:    10G
Total offline memory:    0B

Note, blocks 9-10 are the newly hotplug added ones.

This patch removes the tight alignment constraint of
memory_block_size_bytes(), but still keeps the constraint from
online_pages_range().

Signed-off-by: Jia He
---
 drivers/dax/kmem.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index d77786dc0d92..849d0706dfe0 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -30,9 +30,20 @@ int dev_dax_kmem_probe(struct device *dev)
 	const char *new_res_name;
 	int numa_node;
 	int rc;
+	int order;

-	/* Hotplug starting at the beginning of the next block: */
-	kmem_start = ALIGN(res->start, memory_block_size_bytes());
+	/*
+	 * kmem_start needn't be aligned with memory_block_size_bytes().
+	 * But given the constraint in online_pages_range(), adjust the
+	 * alignment of kmem_start and kmem_size.
+	 */
+	kmem_size = resource_size(res);
+	order = min_t(int, MAX_ORDER - 1, get_order(kmem_size));
+	kmem_start = ALIGN(res->start, 1ul << (order + PAGE_SHIFT));
+	/* Adjust the size down to compensate for moving up kmem_start: */
+	kmem_size -= kmem_start - res->start;
+	/* Align the size down to cover only complete blocks: */
+	kmem_size &= ~((1ul << (order + PAGE_SHIFT)) - 1);
+	kmem_end = kmem_start + kmem_size;

 	/*
 	 * Ensure good NUMA information for the persistent memory.
@@ -48,13 +59,6 @@ int dev_dax_kmem_probe(struct device *dev)
 			 numa_node, res);
 	}

-	kmem_size = resource_size(res);
-	/* Adjust the size down to compensate for moving up kmem_start: */
-	kmem_size -= kmem_start - res->start;
-	/* Align the size down to cover only complete blocks: */
-	kmem_size &= ~(memory_block_size_bytes() - 1);
-	kmem_end = kmem_start + kmem_size;
-
 	new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
 	if (!new_res_name)
 		return -ENOMEM;
-- 
2.17.1
[RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
When enabling dax pmem as a RAM device on arm64, I noticed that the
kmem_start addr in dev_dax_kmem_probe() should be aligned with
SECTION_SIZE_BITS(30), i.e. the 1G memblock size. Even though Dan
Williams' sub-section patch series [1] has been merged upstream, it was
not helpful due to the hard limitation of kmem_start:

$ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$cat /proc/iomem
...
23c00-23fff : System RAM
  23dd4-23fec : reserved
  23fed-23fff : reserved
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0   <- aligned with 1G boundary
    28000-2bfff : System RAM

Hence there is a big gap between 0x2403f and 0x28000 due to the 1G
alignment.

Without this series, if qemu creates a 4G nvdimm device, we can only use
2G for dax pmem (kmem) in the worst case, e.g.
24000-33fdf : Persistent Memory
We can only use the memblock between [24000, 2] due to the hard
limitation. It wastes too much memory space.

Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there
are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
SPARSEMEM_VMEMMAP, page bits in struct page ...

Besides decreasing SECTION_SIZE_BITS, we can also relax the kmem
alignment with memory_block_size_bytes().

Tested on an arm64 guest and an x86 guest; qemu creates a 4G pmem
device. dax pmem can be used as ram with a smaller gap. The kmem hotplug
add/remove paths were both tested on the arm64/x86 guest.

This patch series (mainly patch 6/6) is based on the fixing patch,
~v5.8-rc5 [2].

[1] https://lkml.org/lkml/2019/6/19/67
[2] https://lkml.org/lkml/2020/7/8/1546

Jia He (6):
  mm/memory_hotplug: remove redundant memory block size alignment check
  resource: export find_next_iomem_res() helper
  mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
  mm/page_alloc: adjust the start,end in dax pmem kmem case
  device-dax: relax the memblock size alignment for kmem_start
  arm64: fall back to vmemmap_populate_basepages if not aligned with PMD_SIZE

 arch/arm64/mm/mmu.c    |  4 ++++
 drivers/base/memory.c  | 24 ++++++++++++++++--------
 drivers/dax/kmem.c     | 22 +++++++++++++---------
 include/linux/ioport.h |  3 +++
 kernel/resource.c      |  3 ++-
 mm/memory_hotplug.c    | 39 ++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c        | 14 ++++++++++++++
 7 files changed, 90 insertions(+), 19 deletions(-)

-- 
2.17.1
[PATCH 1/2] mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()
This introduces a general dummy helper. memory_add_physaddr_to_nid() is
a fallback option to get the nid in case NUMA_NO_NODE is detected.

After this patch, arm64/sh/s390 can simply use the general dummy
version. PowerPC/x86/ia64 will still use their specific versions.

This is the preparation for setting a fallback value for
dev_dax->target_node.

Reviewed-by: David Hildenbrand
Signed-off-by: Jia He
---
 arch/arm64/mm/numa.c | 10 ----------
 arch/ia64/mm/numa.c  |  2 --
 arch/sh/mm/init.c    |  9 ---------
 arch/x86/mm/numa.c   |  1 -
 mm/memory_hotplug.c  | 10 ++++++++++
 5 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..73f8b49d485c 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -461,13 +461,3 @@ void __init arm64_numa_init(void)

 	numa_init(dummy_numa_init);
 }
-
-/*
- * We hope that we will be hotplugging memory on nodes we already know about,
- * such that acpi_get_node() succeeds and we never fall back to this...
- */
-int memory_add_physaddr_to_nid(u64 addr)
-{
-	pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
-	return 0;
-}
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index 5e1015eb6d0d..f34964271101 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -106,7 +106,5 @@ int memory_add_physaddr_to_nid(u64 addr)
 		return 0;
 	return nid;
 }
-
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 #endif
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index a70ba0fdd0b3..f75932ba87a6 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -430,15 +430,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return ret;
 }

-#ifdef CONFIG_NUMA
-int memory_add_physaddr_to_nid(u64 addr)
-{
-	/* Node 0 for now.. */
-	return 0;
-}
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
-#endif
-
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8ee952038c80..2a6e62af4636 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -929,5 +929,4 @@ int memory_add_physaddr_to_nid(u64 start)
 		nid = numa_meminfo.blk[0].nid;
 	return nid;
 }
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..b49ab743d914 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -350,6 +350,16 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	return err;
 }

+#ifdef CONFIG_NUMA
+int __weak memory_add_physaddr_to_nid(u64 start)
+{
+	pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+			start);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
+
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long start_pfn,
-- 
2.17.1
[PATCH 2/2] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done
When check_memblock_offlined_cb() returns a failed rc (e.g. the memblock
is online at that time), mem_hotplug_begin/done is unpaired in that
case. Therefore a warning:

Call Trace:
 percpu_up_write+0x33/0x40
 try_remove_memory+0x66/0x120
 ? _cond_resched+0x19/0x30
 remove_memory+0x2b/0x40
 dev_dax_kmem_remove+0x36/0x72 [kmem]
 device_release_driver_internal+0xf0/0x1c0
 device_release_driver+0x12/0x20
 bus_remove_device+0xe1/0x150
 device_del+0x17b/0x3e0
 unregister_dev_dax+0x29/0x60
 devm_action_release+0x15/0x20
 release_nodes+0x19a/0x1e0
 devres_release_all+0x3f/0x50
 device_release_driver_internal+0x100/0x1c0
 driver_detach+0x4c/0x8f
 bus_remove_driver+0x5c/0xd0
 driver_unregister+0x31/0x50
 dax_pmem_exit+0x10/0xfe0 [dax_pmem]

Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Cc: sta...@vger.kernel.org # v5.6+
Signed-off-by: Jia He
Reviewed-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Dan Williams
---
 mm/memory_hotplug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b49ab743d914..3e0645387daf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1752,7 +1752,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 	 */
 	rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
 	if (rc)
-		goto done;
+		return rc;

 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
@@ -1776,9 +1776,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)

 	try_offline_node(nid);

-done:
 	mem_hotplug_done();
-	return rc;
+	return 0;
 }

 /**
-- 
2.17.1
[PATCH v4 0/2] Fix and enable pmem as RAM device on arm64
This fixes a few issues I hit when trying to enable pmem as a RAM device on arm64. To use memory_add_physaddr_to_nid() as a fallback nid, it is better to implement a general (__weak) version in mm/memory_hotplug. After that, arm64/sh/s390 can simply use the general version, while PowerPC/ia64/x86 will use their arch-specific versions. Tested on ThunderX2 host/qemu "-M virt" guest with a nvdimm device. The memblocks from the dax pmem device can be either hot-added or hot-removed on arm64 guest. Also passed the compilation test on x86. Changes: v4: - remove "device-dax: use fallback nid when numa_node is invalid", wait for Dan Williams' phys_addr_to_target_node() patch - fold v3 patch1-4 into a single one, no functional changes v3: https://lkml.org/lkml/2020/7/8/1541 - introduce general version memory_add_physaddr_to_nid, refine the arch specific one - fix an uninitialized-variable bug in the v2 device-dax patch v2: https://lkml.org/lkml/2020/7/7/71 - Drop an unnecessary patch to harden try_offline_node - Use a new solution (by David) to fix dev->target_node=-1 during probing - Refine the mem_hotplug_begin/done patch v1: https://lkml.org/lkml/2020/7/5/381 Jia He (2): mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid() mm/memory_hotplug: fix unpaired mem_hotplug_begin/done arch/arm64/mm/numa.c | 10 -- arch/ia64/mm/numa.c | 2 -- arch/sh/mm/init.c | 9 - arch/x86/mm/numa.c | 1 - mm/memory_hotplug.c | 15 --- 5 files changed, 12 insertions(+), 25 deletions(-) -- 2.17.1
[PATCH v3 5/6] device-dax: use fallback nid when numa_node is invalid
numa_off is set unconditionally at the end of dummy_numa_init(), even with a fake numa node. ACPI detects node id as NUMA_NO_NODE(-1) in acpi_map_pxm_to_node() because it regards numa_off as turning off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa. Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table isn't present: $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1 kmem: probe of dax0.0 failed with error -22 This fixes it by using fallback memory_add_physaddr_to_nid() as nid. Suggested-by: David Hildenbrand Signed-off-by: Jia He --- drivers/dax/kmem.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 275aa5f87399..218f66057994 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -31,22 +31,23 @@ int dev_dax_kmem_probe(struct device *dev) int numa_node; int rc; + /* Hotplug starting at the beginning of the next block: */ + kmem_start = ALIGN(res->start, memory_block_size_bytes()); + /* * Ensure good NUMA information for the persistent memory. * Without this check, there is a risk that slow memory * could be mixed in a node with faster memory, causing -* unavoidable performance issues. +* unavoidable performance issues. Furthermore, fallback node +* id can be used when numa_node is invalid. 
*/ numa_node = dev_dax->target_node; if (numa_node < 0) { - dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n", -res, numa_node); - return -EINVAL; + numa_node = memory_add_physaddr_to_nid(kmem_start); + dev_info(dev, "using nid %d for DAX region with undefined nid %pR\n", + numa_node, res); } - /* Hotplug starting at the beginning of the next block: */ - kmem_start = ALIGN(res->start, memory_block_size_bytes()); - kmem_size = resource_size(res); /* Adjust the size down to compensate for moving up kmem_start: */ kmem_size -= kmem_start - res->start; @@ -100,15 +101,19 @@ static int dev_dax_kmem_remove(struct device *dev) resource_size_t kmem_start = res->start; resource_size_t kmem_size = resource_size(res); const char *res_name = res->name; + int numa_node = dev_dax->target_node; int rc; + if (numa_node < 0) + numa_node = memory_add_physaddr_to_nid(kmem_start); + /* * We have one shot for removing memory, if some memory blocks were not * offline prior to calling this function remove_memory() will fail, and * there is no way to hotremove this memory until reboot because device * unbind will succeed even if we return failure. */ - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); + rc = remove_memory(numa_node, kmem_start, kmem_size); if (rc) { any_hotremove_failed = true; dev_err(dev, -- 2.17.1
[PATCH v3 6/6] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done
When check_memblock_offlined_cb() returns a failed rc (e.g. because the memblock is online at that time), mem_hotplug_begin/done ends up unpaired, which triggers a warning: Call Trace: percpu_up_write+0x33/0x40 try_remove_memory+0x66/0x120 ? _cond_resched+0x19/0x30 remove_memory+0x2b/0x40 dev_dax_kmem_remove+0x36/0x72 [kmem] device_release_driver_internal+0xf0/0x1c0 device_release_driver+0x12/0x20 bus_remove_device+0xe1/0x150 device_del+0x17b/0x3e0 unregister_dev_dax+0x29/0x60 devm_action_release+0x15/0x20 release_nodes+0x19a/0x1e0 devres_release_all+0x3f/0x50 device_release_driver_internal+0x100/0x1c0 driver_detach+0x4c/0x8f bus_remove_driver+0x5c/0xd0 driver_unregister+0x31/0x50 dax_pmem_exit+0x10/0xfe0 [dax_pmem] Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat") Cc: sta...@vger.kernel.org # v5.6+ Signed-off-by: Jia He Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Dan Williams --- mm/memory_hotplug.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b49ab743d914..3e0645387daf 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1752,7 +1752,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) */ rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb); if (rc) - goto done; + return rc; /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); @@ -1776,9 +1776,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) try_offline_node(nid); -done: mem_hotplug_done(); - return rc; + return 0; } /** -- 2.17.1
[PATCH v3 4/6] mm: don't export memory_add_physaddr_to_nid in arch specific directory
After a general __weak version of memory_add_physaddr_to_nid() is implemented and exported, there is no need to export it again in the arch directories, even if e.g. ia64/x86 have their own specific versions. This suppresses the modpost warning: WARNING: modpost: vmlinux: 'memory_add_physaddr_to_nid' exported twice. Previous export was in vmlinux Suggested-by: David Hildenbrand Signed-off-by: Jia He --- arch/ia64/mm/numa.c | 2 -- arch/x86/mm/numa.c | 1 - 2 files changed, 3 deletions(-) diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c index 5e1015eb6d0d..f34964271101 100644 --- a/arch/ia64/mm/numa.c +++ b/arch/ia64/mm/numa.c @@ -106,7 +106,5 @@ int memory_add_physaddr_to_nid(u64 addr) return 0; return nid; } - -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif #endif diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 8ee952038c80..2a6e62af4636 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -929,5 +929,4 @@ int memory_add_physaddr_to_nid(u64 start) nid = numa_meminfo.blk[0].nid; return nid; } -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif -- 2.17.1
[PATCH v3 1/6] mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()
This introduces a general dummy helper. memory_add_physaddr_to_nid() is a fallback option to get the nid when NUMA_NO_NODE is detected. After this patch, arm64/sh/s390 can simply use the general dummy version; PowerPC/x86/ia64 will still use their specific versions. Signed-off-by: Jia He --- mm/memory_hotplug.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index da374cd3d45b..b49ab743d914 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -350,6 +350,16 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, return err; } +#ifdef CONFIG_NUMA +int __weak memory_add_physaddr_to_nid(u64 start) +{ + pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n", + start); + return 0; +} +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); +#endif + /* find the smallest valid pfn in the range [start_pfn, end_pfn) */ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, -- 2.17.1
[PATCH v3 2/6] arm64/mm: use default dummy memory_add_physaddr_to_nid()
After adding a default memory_add_physaddr_to_nid() in mm/memory_hotplug, there is no need to define a similar one in the arch-specific directory. Signed-off-by: Jia He --- arch/arm64/mm/numa.c | 10 -- 1 file changed, 10 deletions(-) diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index aafcee3e3f7e..73f8b49d485c 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -461,13 +461,3 @@ void __init arm64_numa_init(void) numa_init(dummy_numa_init); } - -/* - * We hope that we will be hotplugging memory on nodes we already know about, - * such that acpi_get_node() succeeds and we never fall back to this... - */ -int memory_add_physaddr_to_nid(u64 addr) -{ - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr); - return 0; -} -- 2.17.1
[PATCH v3 3/6] sh/mm: use default dummy memory_add_physaddr_to_nid()
After adding a default memory_add_physaddr_to_nid() in mm/memory_hotplug, there is no need to define a similar one in the arch-specific directory. Signed-off-by: Jia He --- arch/sh/mm/init.c | 9 - 1 file changed, 9 deletions(-) diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index a70ba0fdd0b3..f75932ba87a6 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -430,15 +430,6 @@ int arch_add_memory(int nid, u64 start, u64 size, return ret; } -#ifdef CONFIG_NUMA -int memory_add_physaddr_to_nid(u64 addr) -{ - /* Node 0 for now.. */ - return 0; -} -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); -#endif - void arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { -- 2.17.1
[PATCH v3 0/6] Fix and enable pmem as RAM device on arm64
This fixes a few issues I hit when trying to enable pmem as a RAM device on arm64. To use memory_add_physaddr_to_nid() as a fallback nid, it is better to implement a general (__weak) version in mm/memory_hotplug. After that, arm64/sh/s390 can simply use the general version, while PowerPC/ia64/x86 will use their arch-specific versions. Tested on ThunderX2 host/qemu "-M virt" guest with a nvdimm device. The memblocks from the dax pmem device can be either hot-added or hot-removed on arm64 guest. Also passed the compilation test on x86. Changes: v3: - introduce general version memory_add_physaddr_to_nid, refine the arch specific one - fix an uninitialized-variable bug in the v2 device-dax patch v2: https://lkml.org/lkml/2020/7/7/71 - Drop an unnecessary patch to harden try_offline_node - Use a new solution (by David) to fix dev->target_node=-1 during probing - Refine the mem_hotplug_begin/done patch v1: https://lkml.org/lkml/2020/7/5/381 Jia He (6): mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid() arm64/mm: use default dummy memory_add_physaddr_to_nid() sh/mm: use default dummy memory_add_physaddr_to_nid() mm: don't export memory_add_physaddr_to_nid in arch specific directory device-dax: use fallback nid when numa_node is invalid mm/memory_hotplug: fix unpaired mem_hotplug_begin/done arch/arm64/mm/numa.c | 10 -- arch/ia64/mm/numa.c | 2 -- arch/sh/mm/init.c | 9 - arch/x86/mm/numa.c | 1 - drivers/dax/kmem.c | 21 + mm/memory_hotplug.c | 15 --- 6 files changed, 25 insertions(+), 33 deletions(-) -- 2.17.1
[PATCH v2 0/3] Fix and enable pmem as RAM device on arm64
This fixes a few issues I hit when trying to enable pmem as a RAM device on arm64. Tested on ThunderX2 host/qemu "-M virt" guest with a nvdimm device. The memblocks from the dax pmem device can be either hot-added or hot-removed on arm64 guest. Changes: v2: - Drop an unnecessary patch to harden try_offline_node - Use a new solution (by David) to fix dev->target_node=-1 during probing - Refine the mem_hotplug_begin/done patch v1: https://lkml.org/lkml/2020/7/5/381 Jia He (3): arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL device-dax: use fallback nid when numa_node is invalid mm/memory_hotplug: fix unpaired mem_hotplug_begin/done arch/arm64/mm/numa.c | 5 +++-- drivers/dax/kmem.c | 22 ++ mm/memory_hotplug.c | 5 ++--- 3 files changed, 19 insertions(+), 13 deletions(-) -- 2.17.1
[RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid
Previously, numa_off is set unconditionally at the end of dummy_numa_init(), even with a fake numa node. Then ACPI detects node id as NUMA_NO_NODE(-1) in acpi_map_pxm_to_node() because it regards numa_off as turning off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa. Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table isn't present: $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1 kmem: probe of dax0.0 failed with error -22 This fixes it by using fallback memory_add_physaddr_to_nid() as nid. Suggested-by: David Hildenbrand Signed-off-by: Jia He --- I noticed that on powerpc memory_add_physaddr_to_nid is not exported for module driver. Set it to RFC due to this concern. drivers/dax/kmem.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 275aa5f87399..68e693ca6d59 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev) resource_size_t kmem_end; struct resource *new_res; const char *new_res_name; - int numa_node; + int numa_node, new_node; int rc; /* * Ensure good NUMA information for the persistent memory. -* Without this check, there is a risk that slow memory -* could be mixed in a node with faster memory, causing -* unavoidable performance issues. +* Without this check, there is a risk but not fatal that slow +* memory could be mixed in a node with faster memory, causing +* unavoidable performance issues. Furthermore, fallback node +* id can be used when numa_node is invalid. 
*/ numa_node = dev_dax->target_node; if (numa_node < 0) { - dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n", -res, numa_node); - return -EINVAL; + new_node = memory_add_physaddr_to_nid(kmem_start); + dev_info(dev, "changing nid from %d to %d for DAX region %pR\n", + numa_node, new_node, res); + numa_node = new_node; } /* Hotplug starting at the beginning of the next block: */ @@ -100,6 +102,7 @@ static int dev_dax_kmem_remove(struct device *dev) resource_size_t kmem_start = res->start; resource_size_t kmem_size = resource_size(res); const char *res_name = res->name; + int numa_node = dev_dax->target_node; int rc; /* @@ -108,7 +111,10 @@ static int dev_dax_kmem_remove(struct device *dev) * there is no way to hotremove this memory until reboot because device * unbind will succeed even if we return failure. */ - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); + if (numa_node < 0) + numa_node = memory_add_physaddr_to_nid(kmem_start); + + rc = remove_memory(numa_node, kmem_start, kmem_size); if (rc) { any_hotremove_failed = true; dev_err(dev, -- 2.17.1
[PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL
This exports memory_add_physaddr_to_nid() for module drivers to use. memory_add_physaddr_to_nid() is a fallback option to get the nid when NUMA_NO_NODE is detected. Suggested-by: David Hildenbrand Signed-off-by: Jia He --- arch/arm64/mm/numa.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index aafcee3e3f7e..7eeb31740248 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -464,10 +464,11 @@ void __init arm64_numa_init(void) /* * We hope that we will be hotplugging memory on nodes we already know about, - * such that acpi_get_node() succeeds and we never fall back to this... + * such that acpi_get_node() succeeds. But when SRAT is not present, the node + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option. */ int memory_add_physaddr_to_nid(u64 addr) { - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr); return 0; } +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); -- 2.17.1
[PATCH v2 3/3] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done
When check_memblock_offlined_cb() returns a failed rc (e.g. because the memblock is online at that time), mem_hotplug_begin/done ends up unpaired, which triggers a warning: Call Trace: percpu_up_write+0x33/0x40 try_remove_memory+0x66/0x120 ? _cond_resched+0x19/0x30 remove_memory+0x2b/0x40 dev_dax_kmem_remove+0x36/0x72 [kmem] device_release_driver_internal+0xf0/0x1c0 device_release_driver+0x12/0x20 bus_remove_device+0xe1/0x150 device_del+0x17b/0x3e0 unregister_dev_dax+0x29/0x60 devm_action_release+0x15/0x20 release_nodes+0x19a/0x1e0 devres_release_all+0x3f/0x50 device_release_driver_internal+0x100/0x1c0 driver_detach+0x4c/0x8f bus_remove_driver+0x5c/0xd0 driver_unregister+0x31/0x50 dax_pmem_exit+0x10/0xfe0 [dax_pmem] Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat") Cc: sta...@vger.kernel.org # v5.6+ Signed-off-by: Jia He --- mm/memory_hotplug.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index da374cd3d45b..76c75a599da3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1742,7 +1742,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) */ rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb); if (rc) - goto done; + return rc; /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); @@ -1766,9 +1766,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) try_offline_node(nid); -done: mem_hotplug_done(); - return rc; + return 0; } /** -- 2.17.1
[PATCH 3/3] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done
When check_memblock_offlined_cb() returns a failed rc (e.g. because the memblock is online at that time), mem_hotplug_begin/done ends up unpaired, which triggers a warning: Call Trace: percpu_up_write+0x33/0x40 try_remove_memory+0x66/0x120 ? _cond_resched+0x19/0x30 remove_memory+0x2b/0x40 dev_dax_kmem_remove+0x36/0x72 [kmem] device_release_driver_internal+0xf0/0x1c0 device_release_driver+0x12/0x20 bus_remove_device+0xe1/0x150 device_del+0x17b/0x3e0 unregister_dev_dax+0x29/0x60 devm_action_release+0x15/0x20 release_nodes+0x19a/0x1e0 devres_release_all+0x3f/0x50 device_release_driver_internal+0x100/0x1c0 driver_detach+0x4c/0x8f bus_remove_driver+0x5c/0xd0 driver_unregister+0x31/0x50 dax_pmem_exit+0x10/0xfe0 [dax_pmem] This fixes it by moving mem_hotplug_done() ahead of the "done" label, so the early-failure goto no longer calls mem_hotplug_done() without a matching mem_hotplug_begin(). Signed-off-by: Jia He --- mm/memory_hotplug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index e1e290577b45..86b36714342b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1769,8 +1769,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) try_offline_node(nid); -done: mem_hotplug_done(); +done: return rc; } -- 2.17.1
[PATCH 0/3] Fix and enable pmem as RAM on arm64
This fixes a few issues I hit when trying to enable pmem as a RAM device on arm64. Tested on ThunderX2 host/qemu "-M virt" guest. Jia He (3): arm64/numa: set numa_off to false when numa node is fake mm/memory_hotplug: harden try_offline_node against bogus nid mm/memory_hotplug: fix unpaired mem_hotplug_begin/done arch/arm64/mm/numa.c | 3 ++- mm/memory_hotplug.c | 5 ++++- 2 files changed, 6 insertions(+), 2 deletions(-) -- 2.17.1
[PATCH 1/3] arm64/numa: set numa_off to false when numa node is fake
Previously, numa_off was set to true unconditionally in dummy_numa_init(), even if there is a fake numa node. ACPI then translates the node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node() because it regards numa_off as turning off the numa node. Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table isn't present. $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1 kmem: probe of dax0.0 failed with error -22 This fixes it by setting numa_off to false. Signed-off-by: Jia He --- arch/arm64/mm/numa.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index aafcee3e3f7e..7689986020d9 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -440,7 +440,8 @@ static int __init dummy_numa_init(void) return ret; } - numa_off = true; + /* force numa_off to be false since we have a fake numa node here */ + numa_off = false; return 0; } -- 2.17.1
[PATCH 2/3] mm/memory_hotplug: harden try_offline_node against bogus nid
When testing the remove_memory path of dax pmem, there will be a panic with call trace: try_remove_memory+0x84/0x170 remove_memory+0x38/0x58 dev_dax_kmem_remove+0x3c/0x84 [kmem] device_release_driver_internal+0xfc/0x1c8 device_release_driver+0x28/0x38 bus_remove_device+0xd4/0x158 device_del+0x160/0x3a0 unregister_dev_dax+0x30/0x68 devm_action_release+0x20/0x30 release_nodes+0x150/0x240 devres_release_all+0x6c/0x1d0 device_release_driver_internal+0x10c/0x1c8 driver_detach+0xac/0x170 bus_remove_driver+0x64/0x130 driver_unregister+0x34/0x60 dax_pmem_exit+0x14/0xffc4 [dax_pmem] __arm64_sys_delete_module+0x18c/0x2d0 el0_svc_common.constprop.2+0x78/0x168 do_el0_svc+0x34/0xa0 el0_sync_handler+0xe0/0x188 el0_sync+0x164/0x180 It is caused by a bogus nid (-1). Although the root cause is that pmem dax translates pxm to node_id incorrectly due to numa_off, it is worth hardening the code in try_offline_node() by quitting early if !pgdat. Signed-off-by: Jia He --- mm/memory_hotplug.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index da374cd3d45b..e1e290577b45 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1680,6 +1680,9 @@ void try_offline_node(int nid) pg_data_t *pgdat = NODE_DATA(nid); int rc; + if (WARN_ON(!pgdat)) + return; + /* * If the node still spans pages (especially ZONE_DEVICE), don't * offline it. A node spans memory after move_pfn_range_to_zone(), -- 2.17.1
[PATCH v5] virtio_vsock: Fix race condition in virtio_transport_recv_pkt
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the server on the guest, there will be a panic on a ThunderX2 (armv8a server): [ 463.718844] Unable to handle kernel NULL pointer dereference at virtual address [ 463.718848] Mem abort info: [ 463.718849] ESR = 0x9644 [ 463.718852] EC = 0x25: DABT (current EL), IL = 32 bits [ 463.718853] SET = 0, FnV = 0 [ 463.718854] EA = 0, S1PTW = 0 [ 463.718855] Data abort info: [ 463.718856] ISV = 0, ISS = 0x0044 [ 463.718857] CM = 0, WnR = 1 [ 463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000 [ 463.718861] [] pgd= [ 463.718866] Internal error: Oops: 9644 [#1] SMP [...] [ 463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139 [ 463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018 [ 463.718982] pstate: 6049 (nZCv daif +PAN -UAO) [ 463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719000] sp : 80002dbe3c40 [...] [ 463.719025] Call trace: [ 463.719030] virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719034] vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock] [ 463.719041] vhost_worker+0x100/0x1a0 [vhost] [ 463.719048] kthread+0x128/0x130 [ 463.719052] ret_from_fork+0x10/0x18 The race condition is as follows: Task1Task2 == __sock_release virtio_transport_recv_pkt __vsock_release vsock_find_bound_socket (found sk) lock_sock_nested vsock_remove_sock sock_orphan sk_set_socket(sk, NULL) sk->sk_shutdown = SHUTDOWN_MASK ... release_sock lock_sock virtio_transport_recv_connecting sk->sk_socket->state (panic!) The root cause is that vsock_find_bound_socket can't hold the lock_sock, so there is a small race window between vsock_find_bound_socket() and lock_sock(). If __vsock_release() is running in another task, sk->sk_socket will be set to NULL inadvertently. 
Thus check the data structure member "sk_shutdown" (suggested by Stefano) after a call of the function "lock_sock" since this field is set to "SHUTDOWN_MASK" under the protection of "lock_sock_nested". Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko") Signed-off-by: Jia He Cc: sta...@vger.kernel.org Cc: Asias He Reviewed-by: Stefano Garzarella --- v5: sorry, MIME type in the previous commit message net/vmw_vsock/virtio_transport_common.c | 8 1 file changed, 8 insertions(+) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 69efc891885f..0edda1edf988 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport *t, lock_sock(sk); + /* Check if sk has been released before lock_sock */ + if (sk->sk_shutdown == SHUTDOWN_MASK) { + (void)virtio_transport_reset_no_sock(t, pkt); + release_sock(sk); + sock_put(sk); + goto free_pkt; + } + /* Update CID in case it has changed after a transport reset event */ vsk->local_addr.svm_cid = dst.svm_cid; -- 2.17.1
[PATCH] virtio_vsock: Fix race condition in virtio_transport_recv_pkt
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the server on the guest, there will be a panic on a ThunderX2 (armv8a server): [ 463.718844] Unable to handle kernel NULL pointer dereference at virtual address [ 463.718848] Mem abort info: [ 463.718849] ESR = 0x9644 [ 463.718852] EC = 0x25: DABT (current EL), IL = 32 bits [ 463.718853] SET = 0, FnV = 0 [ 463.718854] EA = 0, S1PTW = 0 [ 463.718855] Data abort info: [ 463.718856] ISV = 0, ISS = 0x0044 [ 463.718857] CM = 0, WnR = 1 [ 463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000 [ 463.718861] [] pgd= [ 463.718866] Internal error: Oops: 9644 [#1] SMP [...] [ 463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139 [ 463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018 [ 463.718982] pstate: 6049 (nZCv daif +PAN -UAO) [ 463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719000] sp : 80002dbe3c40 [...] [ 463.719025] Call trace: [ 463.719030] virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719034] vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock] [ 463.719041] vhost_worker+0x100/0x1a0 [vhost] [ 463.719048] kthread+0x128/0x130 [ 463.719052] ret_from_fork+0x10/0x18 The race condition is as follows: Task1Task2 == __sock_release virtio_transport_recv_pkt __vsock_release vsock_find_bound_socket (found sk) lock_sock_nested vsock_remove_sock sock_orphan sk_set_socket(sk, NULL) sk->sk_shutdown = SHUTDOWN_MASK ... release_sock lock_sock virtio_transport_recv_connecting sk->sk_socket->state (panic!) The root cause is that vsock_find_bound_socket can't hold the lock_sock, so there is a small race window between vsock_find_bound_socket() and lock_sock(). If __vsock_release() is running in another task, sk->sk_socket will be set to NULL inadvertently. 
Thus check the data structure member “sk_shutdown” (suggested by Stefano) after a call of the function “lock_sock” since this field is set to “SHUTDOWN_MASK” under the protection of “lock_sock_nested”. Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko") Signed-off-by: Jia He Cc: sta...@vger.kernel.org Cc: Asias He Reviewed-by: Stefano Garzarella --- v4: refine the commit msg (from Markus) net/vmw_vsock/virtio_transport_common.c | 8 1 file changed, 8 insertions(+) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 69efc891885f..0edda1edf988 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport *t, lock_sock(sk); + /* Check if sk has been released before lock_sock */ + if (sk->sk_shutdown == SHUTDOWN_MASK) { + (void)virtio_transport_reset_no_sock(t, pkt); + release_sock(sk); + sock_put(sk); + goto free_pkt; + } + /* Update CID in case it has changed after a transport reset event */ vsk->local_addr.svm_cid = dst.svm_cid; -- 2.17.1
[PATCH v3] virtio_vsock: Fix race condition in virtio_transport_recv_pkt
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the server on the guest, there will be a panic on a ThunderX2 (armv8a server): [ 463.718844] Unable to handle kernel NULL pointer dereference at virtual address [ 463.718848] Mem abort info: [ 463.718849] ESR = 0x9644 [ 463.718852] EC = 0x25: DABT (current EL), IL = 32 bits [ 463.718853] SET = 0, FnV = 0 [ 463.718854] EA = 0, S1PTW = 0 [ 463.718855] Data abort info: [ 463.718856] ISV = 0, ISS = 0x0044 [ 463.718857] CM = 0, WnR = 1 [ 463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000 [ 463.718861] [] pgd= [ 463.718866] Internal error: Oops: 9644 [#1] SMP [...] [ 463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139 [ 463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018 [ 463.718982] pstate: 6049 (nZCv daif +PAN -UAO) [ 463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719000] sp : 80002dbe3c40 [...] [ 463.719025] Call trace: [ 463.719030] virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common] [ 463.719034] vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock] [ 463.719041] vhost_worker+0x100/0x1a0 [vhost] [ 463.719048] kthread+0x128/0x130 [ 463.719052] ret_from_fork+0x10/0x18 The race condition is as follows: Task1Task2 == __sock_release virtio_transport_recv_pkt __vsock_release vsock_find_bound_socket (found sk) lock_sock_nested vsock_remove_sock sock_orphan sk_set_socket(sk, NULL) sk->sk_shutdown = SHUTDOWN_MASK ... release_sock lock_sock virtio_transport_recv_connecting sk->sk_socket->state (panic!) The root cause is that vsock_find_bound_socket can't hold the lock_sock, so there is a small race window between vsock_find_bound_socket() and lock_sock(). If __vsock_release() is running in another task, sk->sk_socket will be set to NULL inadvertently. 
This fixes it by checking sk->sk_shutdown(suggested by Stefano) after lock_sock since sk->sk_shutdown is set to SHUTDOWN_MASK under the protection of lock_sock_nested. Signed-off-by: Jia He Cc: sta...@vger.kernel.org Reviewed-by: Stefano Garzarella --- v3: - describe the fix of race condition more clearly - refine the commit log net/vmw_vsock/virtio_transport_common.c | 8 1 file changed, 8 insertions(+) diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 69efc891885f..0edda1edf988 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport *t, lock_sock(sk); + /* Check if sk has been released before lock_sock */ + if (sk->sk_shutdown == SHUTDOWN_MASK) { + (void)virtio_transport_reset_no_sock(t, pkt); + release_sock(sk); + sock_put(sk); + goto free_pkt; + } + /* Update CID in case it has changed after a transport reset event */ vsk->local_addr.svm_cid = dst.svm_cid; -- 2.17.1
[PATCH v2] virtio_vsock: Fix race condition in virtio_transport_recv_pkt
When a client tries to connect(SOCK_STREAM) to the server in the guest
with NONBLOCK mode, there will be a panic on a ThunderX2 (armv8a server):

[  463.718844][ T5040] Unable to handle kernel NULL pointer dereference at virtual address
[  463.718848][ T5040] Mem abort info:
[  463.718849][ T5040]   ESR = 0x9644
[  463.718852][ T5040]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853][ T5040]   SET = 0, FnV = 0
[  463.718854][ T5040]   EA = 0, S1PTW = 0
[  463.718855][ T5040] Data abort info:
[  463.718856][ T5040]   ISV = 0, ISS = 0x0044
[  463.718857][ T5040]   CM = 0, WnR = 1
[  463.718859][ T5040] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000
[  463.718861][ T5040] [] pgd=
[  463.718866][ T5040] Internal error: Oops: 9644 [#1] SMP
[...]
[  463.718977][ T5040] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139
[  463.718980][ T5040] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018
[  463.718982][ T5040] pstate: 6049 (nZCv daif +PAN -UAO)
[  463.718995][ T5040] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[  463.718999][ T5040] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common]
[  463.719000][ T5040] sp : 80002dbe3c40
[...]
[  463.719025][ T5040] Call trace:
[  463.719030][ T5040]  virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[  463.719034][ T5040]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041][ T5040]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048][ T5040]  kthread+0x128/0x130
[  463.719052][ T5040]  ret_from_fork+0x10/0x18

The race condition is as follows:

Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    ...
    release_sock                       lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic)

The root cause is that vsock_find_bound_socket() can't hold the
lock_sock, so there is a small race window between
vsock_find_bound_socket() and lock_sock(). If __vsock_release() runs in
another task in that window, sk->sk_socket will be set to NULL
inadvertently.

This fixes it by checking sk->sk_shutdown.

Signed-off-by: Jia He
Cc: sta...@vger.kernel.org
Cc: Stefano Garzarella
---
v2: use lightweight checking suggested by Stefano Garzarella

 net/vmw_vsock/virtio_transport_common.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 69efc891885f..0edda1edf988 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,

 	lock_sock(sk);

+	/* Check if sk has been released before lock_sock */
+	if (sk->sk_shutdown == SHUTDOWN_MASK) {
+		(void)virtio_transport_reset_no_sock(t, pkt);
+		release_sock(sk);
+		sock_put(sk);
+		goto free_pkt;
+	}
+
 	/* Update CID in case it has changed after a transport reset event */
 	vsk->local_addr.svm_cid = dst.svm_cid;

-- 
2.17.1
[PATCH] virtio_vsock: Fix race condition in virtio_transport_recv_pkt
When a client tries to connect(SOCK_STREAM) to the server in the guest
with NONBLOCK mode, there will be a panic on a ThunderX2 (armv8a server):

[  463.718844][ T5040] Unable to handle kernel NULL pointer dereference at virtual address
[  463.718848][ T5040] Mem abort info:
[  463.718849][ T5040]   ESR = 0x9644
[  463.718852][ T5040]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853][ T5040]   SET = 0, FnV = 0
[  463.718854][ T5040]   EA = 0, S1PTW = 0
[  463.718855][ T5040] Data abort info:
[  463.718856][ T5040]   ISV = 0, ISS = 0x0044
[  463.718857][ T5040]   CM = 0, WnR = 1
[  463.718859][ T5040] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000
[  463.718861][ T5040] [] pgd=
[  463.718866][ T5040] Internal error: Oops: 9644 [#1] SMP
[...]
[  463.718977][ T5040] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G O 5.7.0-rc7+ #139
[  463.718980][ T5040] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018
[  463.718982][ T5040] pstate: 6049 (nZCv daif +PAN -UAO)
[  463.718995][ T5040] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[  463.718999][ T5040] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common]
[  463.719000][ T5040] sp : 80002dbe3c40
[...]
[  463.719025][ T5040] Call trace:
[  463.719030][ T5040]  virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[  463.719034][ T5040]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041][ T5040]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048][ T5040]  kthread+0x128/0x130
[  463.719052][ T5040]  ret_from_fork+0x10/0x18

The race condition is as follows:

Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    ...
    release_sock                       lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic)

This fixes it by checking vsk again, i.e. whether it is still in the
bound/connected table.

Signed-off-by: Jia He
Cc: sta...@vger.kernel.org
---
 net/vmw_vsock/virtio_transport_common.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 69efc891885f..0dbd6a45f0ed 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1132,6 +1132,17 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,

 	lock_sock(sk);

+	/* Check it again if vsk is removed by vsock_remove_sock */
+	spin_lock_bh(&vsock_table_lock);
+	if (!__vsock_in_bound_table(vsk) && !__vsock_in_connected_table(vsk)) {
+		spin_unlock_bh(&vsock_table_lock);
+		(void)virtio_transport_reset_no_sock(t, pkt);
+		release_sock(sk);
+		sock_put(sk);
+		goto free_pkt;
+	}
+	spin_unlock_bh(&vsock_table_lock);
+
 	/* Update CID in case it has changed after a transport reset event */
 	vsk->local_addr.svm_cid = dst.svm_cid;

-- 
2.17.1
Re: [PATCH V3 0/3] arm64: Enable vmemmap mapping from device memory
Hi

On 2020/3/31 13:09, Anshuman Khandual wrote:
> This series enables vmemmap backing memory allocation from device memory
> ranges on arm64. But before that, it enables vmemmap_populate_basepages()
> and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
> allocation requests.

I verified no obvious regression after this patch series.

Host: ThunderX2 (armv8a server), kernel v5.4
qemu: v3.1
  -M virt \
  -object memory-backend-file,id=mem1,share=on,mem-path=/tmp2/nvdimm.img,size=4G,align=2M \
  -device nvdimm,id=nvdimm1,memdev=mem1,label-size=2M
Guest: kernel v5.7.0-rc5 with this patch series.

Tested cases:
- 4K PAGESIZE: boot, mount w/ -o dax, mount w/o -o dax, basic io
- 64K PAGESIZE: boot, mount w/ -o dax, mount w/o -o dax, basic io

Not tested:
- 16K page size, due to my hardware limitation (can't run a 16K pgsz kernel)
- hot-add/remove nvdimm device from qemu, due to no full support on arm64 qemu yet
- Host nvdimm device hotplug

Hence, from the above results:
Tested-by: Jia He

> This series applies after latest (v14) arm64 memory hot remove series
> (https://lkml.org/lkml/2020/3/3/1746) on Linux 5.6.
>
> Pending Question:
>
> altmap_alloc_block_buf() does not have any other remaining users in the
> tree after this change. Should it be converted into a static function and
> its declaration be dropped from the header (include/linux/mm.h). Avoided
> doing so because I was not sure if there are any off-tree users or not.
>
> Changes in V3:
>
> - Dropped comment from free_hotplug_page_range() per Robin
> - Modified comment in unmap_hotplug_range() per Robin
> - Enabled altmap support in vmemmap_alloc_block_buf() per Robin
>
> Changes in V2: (https://lkml.org/lkml/2020/3/4/475)
>
> - Rebased on latest hot-remove series (v14) adding P4D page table support
>
> Changes in V1: (https://lkml.org/lkml/2020/1/23/12)
>
> - Added an WARN_ON() in unmap_hotplug_range() when altmap is provided
>   without the page table backing memory being freed
>
> Changes in RFC V2: (https://lkml.org/lkml/2019/10/21/11)
>
> - Changed the commit message on 1/2 patch per Will
> - Changed the commit message on 2/2 patch as well
> - Rebased on arm64 memory hot remove series (v10)
>
> RFC V1: (https://lkml.org/lkml/2019/6/28/32)
>
> Cc: Catalin Marinas
> Cc: Will Deacon
> Cc: Mark Rutland
> Cc: Paul Walmsley
> Cc: Palmer Dabbelt
> Cc: Tony Luck
> Cc: Fenghua Yu
> Cc: Dave Hansen
> Cc: Andy Lutomirski
> Cc: Peter Zijlstra
> Cc: Thomas Gleixner
> Cc: Ingo Molnar
> Cc: David Hildenbrand
> Cc: Mike Rapoport
> Cc: Michal Hocko
> Cc: "Matthew Wilcox (Oracle)"
> Cc: "Kirill A. Shutemov"
> Cc: Andrew Morton
> Cc: Dan Williams
> Cc: Pavel Tatashin
> Cc: Benjamin Herrenschmidt
> Cc: Paul Mackerras
> Cc: Michael Ellerman
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-ri...@lists.infradead.org
> Cc: x...@kernel.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> Anshuman Khandual (3):
>   mm/sparsemem: Enable vmem_altmap support in vmemmap_populate_basepages()
>   mm/sparsemem: Enable vmem_altmap support in vmemmap_alloc_block_buf()
>   arm64/mm: Enable vmem_altmap support for vmemmap mappings
>
>  arch/arm64/mm/mmu.c       | 59 ++-
>  arch/ia64/mm/discontig.c  |  2 +-
>  arch/powerpc/mm/init_64.c | 10 +++
>  arch/riscv/mm/init.c      |  2 +-
>  arch/x86/mm/init_64.c     | 12 ++++
>  include/linux/mm.h        |  8 --
>  mm/sparse-vmemmap.c       | 38 -
>  7 files changed, 87 insertions(+), 44 deletions(-)
>

---
Cheers,
Justin (Jia He)
[PATCH v2] vhost: vsock: kick send_pkt worker once device is started
Ning Bo reported an abnormal 2-second gap when booting a Kata container
[1]. The unconditional timeout was caused by VSOCK_DEFAULT_CONNECT_TIMEOUT
of connecting from the client side. The vhost vsock client tries to
connect to an initializing virtio vsock server. The abnormal flow looks
like:

host-userspace          vhost vsock                      guest vsock
==============          ===========                      ===========
connect()     --------> vhost_transport_send_pkt_work()  initializing
   |                    vq->private_data==NULL
   |                    will not be queued
   V
schedule_timeout(2s)
                        vhost_vsock_start()  <---------- device ready
                        set vq->private_data
wait for 2s and failed
connect() again         vq->private_data!=NULL           recv connecting pkt

Details:
1. Host userspace sends a connect pkt; at that time guest vsock is
   still initializing, hence vhost_vsock_start() has not been called.
   So vq->private_data==NULL, and the pkt is not queued to be sent to
   the guest.
2. Then it sleeps for 2s.
3. After guest vsock finishes initializing, vq->private_data is set.
4. When host userspace wakes up after 2s and sends the connecting pkt
   again, everything is fine.

As suggested by Stefano Garzarella, this fixes it by additionally
kicking the send_pkt worker in vhost_vsock_start() once the virtio
device is started. This makes the pending pkt be sent again.

After this patch, kata-runtime (with vsock enabled) boot time is
reduced from 3s to 1s on a ThunderX2 arm64 server.

[1] https://github.com/kata-containers/runtime/issues/1917

Reported-by: Ning Bo
Suggested-by: Stefano Garzarella
Signed-off-by: Jia He
---
v2: new solution suggested by Stefano Garzarella

 drivers/vhost/vsock.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e36aaf9ba7bd..0716a9cdffee 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -543,6 +543,11 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
 		mutex_unlock(&vq->mutex);
 	}

+	/* Some packets may have been queued before the device was started,
+	 * let's kick the send worker to send them.
+	 */
+	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+
 	mutex_unlock(&vsock->dev.mutex);
 	return 0;

-- 
2.17.1
[PATCH] vhost: add mutex_lock/unlock for vhost_vq_reset
vq->mutex is to protect any vq accessing, hence adding mutex_lock/unlock
makes sense to avoid a potential race condition.

Signed-off-by: Jia He
---
 drivers/vhost/vhost.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d450e16c5c25..622bfba2e5ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -297,6 +297,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
 {
+	mutex_lock(&vq->mutex);
 	vq->num = 1;
 	vq->desc = NULL;
 	vq->avail = NULL;
@@ -323,6 +324,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->umem = NULL;
 	vq->iotlb = NULL;
 	__vhost_vq_meta_reset(vq);
+	mutex_unlock(&vq->mutex);
 }

 static int vhost_worker(void *data)
-- 
2.17.1
[PATCH] vhost: vsock: don't send pkt when vq is not started
Ning Bo reported an abnormal 2-second gap when booting a Kata container
[1]. The unconditional timeout is caused by VSOCK_DEFAULT_CONNECT_TIMEOUT
of connect() at the client side. The vhost vsock client tries to connect
to an initializing virtio vsock server. The abnormal flow looks like:

host-userspace          vhost vsock                      guest vsock
==============          ===========                      ===========
connect()     --------> vhost_transport_send_pkt_work()  initializing
   |                    vq->private_data==NULL
   |                    will not be queued
   V
schedule_timeout(2s)
                        vhost_vsock_start()  <---------- device ready
                        set vq->private_data
wait for 2s and failed
connect() again         vq->private_data!=NULL           recv connecting pkt

1. Host userspace sends a connect pkt; at that time guest vsock is
   still initializing, hence vhost_vsock_start() has not been called.
   So vq->private_data==NULL, and the pkt is not queued to be sent to
   the guest.
2. Then it sleeps for 2s.
3. After guest vsock finishes initializing, vq->private_data is set.
4. When host userspace wakes up after 2s and sends the connecting pkt
   again, everything is fine.

This fixes it by checking vq->private_data in vhost_transport_send_pkt,
and returning at once if !vq->private_data. This makes the user's
connect() return with ECONNREFUSED.

After this patch, kata-runtime (with vsock enabled) boot time is reduced
from 3s to 1s on a ThunderX2 arm64 server.

[1] https://github.com/kata-containers/runtime/issues/1917

Reported-by: Ning Bo
Signed-off-by: Jia He
---
 drivers/vhost/vsock.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e36aaf9ba7bd..67474334dd88 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -241,6 +241,7 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 {
 	struct vhost_vsock *vsock;
 	int len = pkt->len;
+	struct vhost_virtqueue *vq;

 	rcu_read_lock();

@@ -252,6 +253,13 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 		return -ENODEV;
 	}

+	vq = &vsock->vqs[VSOCK_VQ_RX];
+	if (!vq->private_data) {
+		rcu_read_unlock();
+		virtio_transport_free_pkt(pkt);
+		return -ECONNREFUSED;
+	}
+
 	if (pkt->reply)
 		atomic_inc(&vsock->queued_replies);
-- 
2.17.1
Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Palmer

On 2019/10/19 4:38, Palmer Dabbelt wrote:
> On Wed, 16 Oct 2019 16:46:08 PDT (-0700), w...@kernel.org wrote:
>> Hey Palmer,
>>
>> On Wed, Oct 16, 2019 at 04:21:59PM -0700, Palmer Dabbelt wrote:
>>> On Tue, 08 Oct 2019 05:39:44 PDT (-0700), w...@kernel.org wrote:
>>>> On Tue, Oct 08, 2019 at 02:19:05AM +0000, Justin He (Arm Technology China) wrote:
>>>>>> On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index b1ca51a079f2..1f56b0118ef5 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2;
>>>>>>>  #endif
>>>>>>>
>>>>>>> +#ifndef arch_faults_on_old_pte
>>>>>>> +static inline bool arch_faults_on_old_pte(void)
>>>>>>> +{
>>>>>>> +	return false;
>>>>>>> +}
>>>>>>> +#endif
>>>>>>
>>>>>> Kirill has acked this, so I'm happy to take the patch as-is, however
>>>>>> isn't it the case that /most/ architectures will want to return true
>>>>>> for arch_faults_on_old_pte()? In which case, wouldn't it make more
>>>>>> sense for that to be the default, and have x86 and arm64 provide an
>>>>>> override? For example, aren't most architectures still going to hit
>>>>>> the double fault scenario even with your patch applied?
>>>>>
>>>>> No, after applying my patch series, only those architectures which
>>>>> don't provide setting access flag by hardware AND don't implement
>>>>> their arch_faults_on_old_pte will hit the double page fault.
>>>>>
>>>>> The meaning of true for arch_faults_on_old_pte() is "this arch
>>>>> doesn't have the hardware setting access flag way, it might cause
>>>>> page fault on an old pte". I don't want to change other
>>>>> architectures' default behavior here. So by default,
>>>>> arch_faults_on_old_pte() is false.
>>>>
>>>> ...and my complaint is that this is the majority of supported
>>>> architectures, so you're fixing something for arm64 which also
>>>> affects arm, powerpc, alpha, mips, riscv, ...
>>>>
>>>> Chances are, they won't even realise they need to implement
>>>> arch_faults_on_old_pte() until somebody runs into the double fault
>>>> and wastes lots of time debugging it before they spot your patch.
>>>
>>> If I understand the semantics correctly, we should have this set to
>>> true. I don't have any context here, but we've got
>>>
>>>     /*
>>>      * The kernel assumes that TLBs don't cache invalid
>>>      * entries, but in RISC-V, SFENCE.VMA specifies an
>>>      * ordering constraint, not a cache flush; it is
>>>      * necessary even after writing invalid entries.
>>>      */
>>>     local_flush_tlb_page(addr);
>>>
>>> in do_page_fault().
>>
>> Ok, although I think this is really about whether or not your hardware
>> can make a pte young when accessed, or whether you take a fault and do
>> it by updating the pte explicitly. v12 of the patches did change the
>> default, so you should be "safe" with those either way:
>>
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2019-October/686030.html
>
> OK, that fence is because we allow invalid translations to be cached,
> which is a completely different issue. RISC-V implementations are
> allowed to have software managed accessed/dirty bits. For some reason
> I thought we were relying on the firmware to handle this, but I can't
> actually find the code so I might be crazy. Wherever it's done, there's
> no spec enforcing it so we should leave this true on RISC-V.

Thanks for the confirmation. So we can keep the default
arch_faults_on_old_pte() (return true) on RISC-V. Thanks.

---
Cheers,
Justin (Jia He)
[PATCH v12 3/4] x86/mm: implement arch_faults_on_old_pte() stub on x86
arch_faults_on_old_pte() is a helper to indicate that accessing an old
pte might cause a page fault. But x86 hardware is able to set the pte
access flag itself, hence implement an overriding stub which always
returns false.

Signed-off-by: Jia He
Suggested-by: Will Deacon
---
 arch/x86/include/asm/pgtable.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..ad97dc155195 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1463,6 +1463,12 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }

+#define arch_faults_on_old_pte arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+	return false;
+}
+
 #include

 #endif	/* __ASSEMBLY__ */
-- 
2.17.1
[PATCH v12 0/4] fix double page fault in cow_user_page for pfn mapping
When we tested pmdk unit test vmmalloc_fork TEST1 in an arm64 guest,
there was a double page fault in __copy_from_user_inatomic of
cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying
from user will fail because the pte is old and cannot be marked young.
So we always end up with zeroed page after fork() + CoW for pfn
mappings. we don't always have a hardware-managed access flag on
arm64."

Changes
v12: refine PATCH 01, remove the !! since C can convert unsigned to
     bool (Catalin)
v11: refine cpu_has_hw_af in PATCH 01 (Will Deacon, Suzuki)
     change the default return value to true in arch_faults_on_old_pte
     add PATCH 03 for overriding arch_faults_on_old_pte (false) on x86
v10: add r-b from Catalin and a-b from Kirill in PATCH 03
     remove Reported-by in PATCH 01
v9: refactor cow_user_page for indentation optimization (Catalin)
    hold the ptl longer (Catalin)
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
    move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
    fix kbuild test failure
v4: introduce cpu_has_hw_af (Suzuki)
    bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
    add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (4):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  x86/mm: implement arch_faults_on_old_pte() stub on x86
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h |  14 ++++
 arch/arm64/include/asm/pgtable.h    |  14 ++++
 arch/x86/include/asm/pgtable.h      |   6 ++
 mm/memory.c                         | 104 ++++++++++++++++++++++------
 4 files changed, 123 insertions(+), 15 deletions(-)

-- 
2.17.1
[PATCH v12 4/4] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on an arm64
guest, there was a double page fault in __copy_from_user_inatomic of
cow_user_page.

To reproduce the bug, the cmd is as follows after you deployed
everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

Below call trace is from arm64 do_page_fault for debugging purpose:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying
from user will fail because the pte is old and cannot be marked young.
So we always end up with zeroed page after fork() + CoW for pfn
mappings. we don't always have a hardware-managed access flag on
arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use-case (by Kirill).

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He
Reported-by: Yibo Cai
Reviewed-by: Catalin Marinas
Acked-by: Kirill A. Shutemov
---
 mm/memory.c | 104 ++++++++++++++++++++++------
 1 file changed, 89 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..b6a5d6a08438 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,18 @@ int randomize_va_space __read_mostly = 2;
 #endif

+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+	/*
+	 * Those arches which don't have hw access flag feature need to
+	 * implement their own helper. By default, "true" means pagefault
+	 * will be hit on old pte.
+	 */
+	return true;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -2145,32 +2157,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
 	return same;
 }

-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+				 struct vm_fault *vmf)
 {
+	bool ret;
+	void *kaddr;
+	void __user *uaddr;
+	bool force_mkyoung;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = vmf->address;
+
 	debug_dma_assert_idle(src);

+	if (likely(src)) {
+		copy_user_highpage(dst, src, addr, vma);
+		return true;
+	}
+
 	/*
 	 * If the source page was a PFN mapping, we don't have
 	 * a "struct page" for it. We do a best-effort copy by
 	 * just copying from the original user address. If that
 	 * fails, we just zero-fill it. Live with it.
 	 */
-	if (unlikely(!src)) {
-		void *kaddr = kmap_atomic(dst);
-		void __user *uaddr = (void __user *)(va & PAGE_MASK);
+	kaddr = kmap_atomic(dst);
+	uaddr = (void __user *)(addr & PAGE_MASK);
+
+	/*
+	 * On architectures with software "accessed" bits, we would
+	 * take a double page fault, so mark it accessed here.
+	 */
+	force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+	if (force_mkyoung) {
+		pte_t entry;
+
+		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+			/*
+			 * Other thread has already handled the fault
+			 * and we don't need to do anything. If it's
+			 * not the case, the fault will be triggered
+			 * again on the same address.
+			 */
+			ret = false;
+			goto pte_unlock;
+		}
+		entry = pte_mkyoung(vmf->orig_pte);
+		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+			update_mmu_cache(vma, addr, vmf->pte);
+	}
+
+	/*
+	 * This really shouldn't fail, because the page is there
+	 * in the page tables. But it might just be unr
[PATCH v12 2/4] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail
because the pte is old and cannot be marked young. So we always end up
with a zeroed page after fork() + CoW for pfn mappings. We don't always
have a hardware-managed Access Flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that
accessing an old pte might cause a page fault.

Signed-off-by: Jia He
Reviewed-by: Catalin Marinas
---
 arch/arm64/include/asm/pgtable.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7576df00eb50..e96fb82f62de 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
 #define phys_to_ttbr(addr)	(addr)
 #endif

+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+	WARN_ON(preemptible());
+
+	return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */

 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1
[PATCH v12 1/4] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this cpu has the capability of HW AF. So decouple AF from DBM
with a new helper cpu_has_hw_af().

If later we notice a potential performance issue on this path, we can
turn it into a static label as with other CPU features.

Signed-off-by: Jia He
Suggested-by: Suzuki Poulose
Reviewed-by: Catalin Marinas
---
 arch/arm64/include/asm/cpufeature.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 9cde5d2e768f..4261d55e8506 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -659,6 +659,20 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
 	default: return CONFIG_ARM64_PA_BITS;
 	}
 }
+
+/* Check whether hardware update of the Access flag is supported */
+static inline bool cpu_has_hw_af(void)
+{
+	u64 mmfr1;
+
+	if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+		return false;
+
+	mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
+	return cpuid_feature_extract_unsigned_field(mmfr1,
+						ID_AA64MMFR1_HADBS_SHIFT);
+}
+
 #endif /* __ASSEMBLY__ */

 #endif
-- 
2.17.1
[PATCH v11 3/4] x86/mm: implement arch_faults_on_old_pte() stub on x86
arch_faults_on_old_pte() is a helper to indicate that accessing an old
pte might cause a page fault. But x86 hardware is able to set the pte
access flag itself, hence implement an overriding stub which always
returns false.

Signed-off-by: Jia He
Suggested-by: Will Deacon
---
 arch/x86/include/asm/pgtable.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..ad97dc155195 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1463,6 +1463,12 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }

+#define arch_faults_on_old_pte arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+	return false;
+}
+
 #include

 #endif	/* __ASSEMBLY__ */
-- 
2.17.1
[PATCH v11 2/4] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail
because the pte is old and cannot be marked young. So we always end up
with a zeroed page after fork() + CoW for pfn mappings. We don't always
have a hardware-managed Access Flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that
accessing an old pte might cause a page fault.

Signed-off-by: Jia He
Reviewed-by: Catalin Marinas
---
 arch/arm64/include/asm/pgtable.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7576df00eb50..e96fb82f62de 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
 #define phys_to_ttbr(addr)	(addr)
 #endif

+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+	WARN_ON(preemptible());
+
+	return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */

 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1
[PATCH v11 4/4] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on an arm64
guest, there was a double page fault in __copy_from_user_inatomic of
cow_user_page.

To reproduce the bug, the cmd is as follows after you deployed
everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

Below call trace is from arm64 do_page_fault for debugging purpose:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying
from user will fail because the pte is old and cannot be marked young.
So we always end up with zeroed page after fork() + CoW for pfn
mappings. we don't always have a hardware-managed access flag on
arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use-case (by Kirill).

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He
Reported-by: Yibo Cai
Reviewed-by: Catalin Marinas
Acked-by: Kirill A. Shutemov
---
 mm/memory.c | 104 ++++++++++++++++++++++------
 1 file changed, 89 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..b6a5d6a08438 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,18 @@ int randomize_va_space __read_mostly = 2;
 #endif

+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+	/*
+	 * Those arches which don't have hw access flag feature need to
+	 * implement their own helper. By default, "true" means pagefault
+	 * will be hit on old pte.
+	 */
+	return true;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -2145,32 +2157,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
 	return same;
 }

-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+				 struct vm_fault *vmf)
 {
+	bool ret;
+	void *kaddr;
+	void __user *uaddr;
+	bool force_mkyoung;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = vmf->address;
+
 	debug_dma_assert_idle(src);

+	if (likely(src)) {
+		copy_user_highpage(dst, src, addr, vma);
+		return true;
+	}
+
 	/*
 	 * If the source page was a PFN mapping, we don't have
 	 * a "struct page" for it. We do a best-effort copy by
 	 * just copying from the original user address. If that
 	 * fails, we just zero-fill it. Live with it.
 	 */
-	if (unlikely(!src)) {
-		void *kaddr = kmap_atomic(dst);
-		void __user *uaddr = (void __user *)(va & PAGE_MASK);
+	kaddr = kmap_atomic(dst);
+	uaddr = (void __user *)(addr & PAGE_MASK);
+
+	/*
+	 * On architectures with software "accessed" bits, we would
+	 * take a double page fault, so mark it accessed here.
+	 */
+	force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+	if (force_mkyoung) {
+		pte_t entry;
+
+		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+			/*
+			 * Other thread has already handled the fault
+			 * and we don't need to do anything. If it's
+			 * not the case, the fault will be triggered
+			 * again on the same address.
+			 */
+			ret = false;
+			goto pte_unlock;
+		}
+		entry = pte_mkyoung(vmf->orig_pte);
+		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+			update_mmu_cache(vma, addr, vmf->pte);
+	}
+
+	/*
+	 * This really shouldn't fail, because the page is there
+	 * in the page tables. But it might just be unr
[PATCH v11 1/4] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this cpu has the capability of HW AF. So decouple AF from DBM by a new helper cpu_has_hw_af(). Signed-off-by: Jia He Suggested-by: Suzuki Poulose Reviewed-by: Catalin Marinas --- arch/arm64/include/asm/cpufeature.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 9cde5d2e768f..1a95396ea5c8 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -659,6 +659,20 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) default: return CONFIG_ARM64_PA_BITS; } } + +/* Check whether hardware update of the Access flag is supported */ +static inline bool cpu_has_hw_af(void) +{ + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) { + u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1); + + return !!cpuid_feature_extract_unsigned_field(mmfr1, + ID_AA64MMFR1_HADBS_SHIFT); + } + + return false; +} + #endif /* __ASSEMBLY__ */ #endif -- 2.17.1
[PATCH v11 0/4] fix double page fault in cow_user_page for pfn mapping
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64."

Changes
v11: refine cpu_has_hw_af in PATCH 01 (Will Deacon, Suzuki)
     change the default return value to true in arch_faults_on_old_pte
     add PATCH 03 for overriding arch_faults_on_old_pte (false) on x86
v10: add r-b from Catalin and a-b from Kirill in PATCH 03
     remove Reported-by in PATCH 01
v9: refactor cow_user_page for indentation optimization (Catalin)
    hold the ptl longer (Catalin)
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
    move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
    fix kbuild test failure
v4: introduce cpu_has_hw_af (Suzuki)
    bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
    add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (4):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  x86/mm: implement arch_faults_on_old_pte() stub on x86
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h | 14
 arch/arm64/include/asm/pgtable.h    | 14
 arch/x86/include/asm/pgtable.h      | 6 ++
 mm/memory.c                         | 104
 4 files changed, 123 insertions(+), 15 deletions(-)

-- 
2.17.1
Re: [PATCH v10 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Suzuki On 2019/10/8 23:32, Suzuki K Poulose wrote: On 08/10/2019 02:12, Justin He (Arm Technology China) wrote: Hi Will and Marc Sorry for the late response, just came back from a vacation. -Original Message- From: Marc Zyngier Sent: 2019年10月1日 21:19 To: Will Deacon Cc: Justin He (Arm Technology China) ; Catalin Marinas ; Mark Rutland ; James Morse ; Matthew Wilcox ; Kirill A. Shutemov ; linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal ; Thomas Gleixner ; Andrew Morton ; hejia...@gmail.com; Kaly Xin (Arm Technology China) Subject: Re: [PATCH v10 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af() On Tue, 1 Oct 2019 13:54:47 +0100 Will Deacon wrote: On Mon, Sep 30, 2019 at 09:57:38AM +0800, Jia He wrote: We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this cpu has the capability of HW AF. So decouple AF from DBM by new helper cpu_has_hw_af(). Signed-off-by: Jia He Suggested-by: Suzuki Poulose Reviewed-by: Catalin Marinas --- arch/arm64/include/asm/cpufeature.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 9cde5d2e768f..949bc7c85030 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -659,6 +659,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) default: return CONFIG_ARM64_PA_BITS; } } + +/* Check whether hardware update of the Access flag is supported */ +static inline bool cpu_has_hw_af(void) +{ + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) + return read_cpuid(ID_AA64MMFR1_EL1) & 0xf; 0xf? I think we should have a mask in sysreg.h for this constant. We don't have the mask, but we certainly have the shift. GENMASK(ID_AA64MMFR1_HADBS_SHIFT + 3, ID_AA64MMFR1_HADBS_SHIFT) is a bit of a mouthful though. Ideally, we'd have a helper for that. 
Ok, I will implement the helper if there isn't one already, and then replace the 0xf with it. Or could we simply reuse the existing cpuid_feature_extract_unsigned_field()? u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1); return cpuid_feature_extract_unsigned_field(mmfr1, ID_AA64MMFR1_HADBS_SHIFT); Yes, we can, I will send the new version --- Cheers, Justin (Jia He)
[PATCH v10 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this cpu has the capability of HW AF. So decouple AF from DBM by new helper cpu_has_hw_af(). Signed-off-by: Jia He Suggested-by: Suzuki Poulose Reviewed-by: Catalin Marinas --- arch/arm64/include/asm/cpufeature.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 9cde5d2e768f..949bc7c85030 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -659,6 +659,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) default: return CONFIG_ARM64_PA_BITS; } } + +/* Check whether hardware update of the Access flag is supported */ +static inline bool cpu_has_hw_af(void) +{ + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) + return read_cpuid(ID_AA64MMFR1_EL1) & 0xf; + + return false; +} + #endif /* __ASSEMBLY__ */ #endif -- 2.17.1
[PATCH v10 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte on arm64 to indicate that it might cause page fault when accessing old pte.

Signed-off-by: Jia He
Reviewed-by: Catalin Marinas
---
 arch/arm64/include/asm/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7576df00eb50..e96fb82f62de 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
 #define phys_to_ttbr(addr)	(addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+	WARN_ON(preemptible());
+
+	return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte		arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1
[PATCH v10 0/3] fix double page fault on arm64
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64."

Changes
v10: add r-b from Catalin and a-b from Kirill in PATCH 03
     remove Reported-by in PATCH 01
v9: refactor cow_user_page for indentation optimization (Catalin)
    hold the ptl longer (Catalin)
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
    move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
    fix kbuild test failure
v4: introduce cpu_has_hw_af (Suzuki)
    bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
    add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h | 10 +++
 arch/arm64/include/asm/pgtable.h    | 14
 mm/memory.c                         | 99 -
 3 files changed, 108 insertions(+), 15 deletions(-)

-- 
2.17.1
[PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameter is changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error, in case there is some obscure use-case. (by Kirill)

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He
Reported-by: Yibo Cai
Reviewed-by: Catalin Marinas
Acked-by: Kirill A. Shutemov
---
 mm/memory.c | 99 +
 1 file changed, 84 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..1f56b0118ef5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+	return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -2145,32 +2152,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
 	return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+				 struct vm_fault *vmf)
 {
+	bool ret;
+	void *kaddr;
+	void __user *uaddr;
+	bool force_mkyoung;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = vmf->address;
+
 	debug_dma_assert_idle(src);
 
+	if (likely(src)) {
+		copy_user_highpage(dst, src, addr, vma);
+		return true;
+	}
+
 	/*
 	 * If the source page was a PFN mapping, we don't have
 	 * a "struct page" for it. We do a best-effort copy by
 	 * just copying from the original user address. If that
 	 * fails, we just zero-fill it. Live with it.
 	 */
-	if (unlikely(!src)) {
-		void *kaddr = kmap_atomic(dst);
-		void __user *uaddr = (void __user *)(va & PAGE_MASK);
+	kaddr = kmap_atomic(dst);
+	uaddr = (void __user *)(addr & PAGE_MASK);
+
+	/*
+	 * On architectures with software "accessed" bits, we would
+	 * take a double page fault, so mark it accessed here.
+	 */
+	force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+	if (force_mkyoung) {
+		pte_t entry;
+
+		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+			/*
+			 * Other thread has already handled the fault
+			 * and we don't need to do anything. If it's
+			 * not the case, the fault will be triggered
+			 * again on the same address.
+			 */
+			ret = false;
+			goto pte_unlock;
+		}
+
+		entry = pte_mkyoung(vmf->orig_pte);
+		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+			update_mmu_cache(vma, addr, vmf->pte);
+	}
+
+	/*
+	 * This really shouldn't fail, because the page is there
+	 * in the page tables. But it might just be unreadable,
+	 * in which case we just give up and fill the result with
+	 * zeroes.
+	 */
+	if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
 		/*
-		 * This really shouldn't fail, because the page is there
-		 * in the page tables. But it might just be unreadable,
-		 * in which case we just give up and fill
[PATCH v9 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte on arm64 to indicate that it might cause page fault when accessing old pte.

Signed-off-by: Jia He
Reviewed-by: Catalin Marinas
---
 arch/arm64/include/asm/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..2b035befb66d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,20 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
 #define phys_to_ttbr(addr)	(addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+	WARN_ON(preemptible());
+
+	return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte		arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1
[PATCH v9 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page. Below call trace is from arm64 do_page_fault for debugging purpose [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fix it by calling pte_mkyoung. 
Also, the parameter is changed because vmf should be passed to cow_user_page() Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error in case there can be some obscure use-case.(by Kirill) [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Signed-off-by: Jia He Reported-by: Yibo Cai --- mm/memory.c | 99 + 1 file changed, 84 insertions(+), 15 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..a0a381b36ff2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,32 +2147,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline bool cow_user_page(struct page *dst, struct page *src, +struct vm_fault *vmf) { + bool ret; + void *kaddr; + void __user *uaddr; + bool force_mkyoung; + struct vm_area_struct *vma = vmf->vma; + struct mm_struct *mm = vma->vm_mm; + unsigned long addr = vmf->address; + debug_dma_assert_idle(src); + if (likely(src)) { + copy_user_highpage(dst, src, addr, vma); + return true; + } + /* * If the source page was a PFN mapping, we don't have * a "struct page" for it. We do a best-effort copy by * just copying from the original user address. If that * fails, we just zero-fill it. Live with it. */ - if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + kaddr = kmap_atomic(dst); + uaddr = (void __user *)(addr & PAGE_MASK); + + /* +* On architectures with software "accessed" bits, we would +* take a double page fault, so mark it accessed here. 
+*/ + force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte); + if (force_mkyoung) { + pte_t entry; + + vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, >ptl); + if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) { + /* +* Other thread has already handled the fault +* and we don't need to do anything. If it's +* not the case, the fault will be triggered +* again on the same address. +*/ + ret = false; + goto pte_unlock; + } + + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } + /* +* This really shouldn't fail, because the page is there +* in the page tables. But it might just be unreadable, +* in which case we just give up and fill the result with +* zeroes. +*/ + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { /* -* This really shouldn't fail, because the page is there -* in the page tables. But it might just be unreadable, -* in which case we just give up and fill the result with -* zeroes. +
[PATCH v9 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this cpu has the capability of HW AF. So decouple AF from DBM by new helper cpu_has_hw_af(). Signed-off-by: Jia He Suggested-by: Suzuki Poulose Reported-by: kbuild test robot Reviewed-by: Catalin Marinas --- arch/arm64/include/asm/cpufeature.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index c96ffa4722d3..c2e3abd39faa 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) default: return CONFIG_ARM64_PA_BITS; } } + +/* Check whether hardware update of the Access flag is supported */ +static inline bool cpu_has_hw_af(void) +{ + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) + return read_cpuid(ID_AA64MMFR1_EL1) & 0xf; + + return false; +} + #endif /* __ASSEMBLY__ */ #endif -- 2.17.1
[PATCH v9 0/3] fix double page fault on arm64
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page. As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." Changes v9: refactor cow_user_page for indention optimization (Catalin) hold the ptl longer (Catalin) v8: change cow_user_page's return type (Matthew) v7: s/pte_spinlock/pte_offset_map_lock (Kirill) v6: fix error case of returning with spinlock taken (Catalin) move kmap_atomic to avoid handling kunmap_atomic v5: handle the case correctly when !pte_same fix kbuild test failed v4: introduce cpu_has_hw_af (Suzuki) bail out if !pte_same (Kirill) v3: add vmf->ptl lock/unlock (Kirill A. Shutemov) add arch_faults_on_old_pte (Matthew, Catalin) v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin) Jia He (3): arm64: cpufeature: introduce helper cpu_has_hw_af() arm64: mm: implement arch_faults_on_old_pte() on arm64 mm: fix double page fault on arm64 if PTE_AF is cleared arch/arm64/include/asm/cpufeature.h | 10 +++ arch/arm64/include/asm/pgtable.h| 14 mm/memory.c | 99 - 3 files changed, 108 insertions(+), 15 deletions(-) -- 2.17.1
Re: [PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Catalin On 2019/9/24 18:33, Catalin Marinas wrote: On Tue, Sep 24, 2019 at 06:43:06AM +, Justin He (Arm Technology China) wrote: Catalin Marinas wrote: On Sat, Sep 21, 2019 at 09:50:54PM +0800, Jia He wrote: @@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo * fails, we just zero-fill it. Live with it. */ if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void *kaddr; + pte_t entry; + void __user *uaddr = (void __user *)(addr & PAGE_MASK); + /* On architectures with software "accessed" bits, we would +* take a double page fault, so mark it accessed here. +*/ [...] + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, + >ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, + vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } else { + /* Other thread has already handled the fault +* and we don't need to do anything. If it's +* not the case, the fault will be triggered +* again on the same address. +*/ + pte_unmap_unlock(vmf->pte, vmf->ptl); + return false; + } + pte_unmap_unlock(vmf->pte, vmf->ptl); + } [...] + + kaddr = kmap_atomic(dst); Since you moved the kmap_atomic() here, could the above arch_faults_on_old_pte() run in a preemptible context? I suggested to add a WARN_ON in patch 2 to be sure. Should I move kmap_atomic back to the original line? Thus, we can make sure that arch_faults_on_old_pte() is in the context of preempt_disabled? Otherwise, arch_faults_on_old_pte() may cause plenty of warning if I add a WARN_ON in arch_faults_on_old_pte. I tested it when I enable the PREEMPT=y on a ThunderX2 qemu guest. So we have two options here: 1. Change arch_faults_on_old_pte() scope to the whole system rather than just the current CPU. 
You'd have to wire up a new arm64 capability for the access flag but this way we don't care whether it's preemptible or not. 2. Keep the arch_faults_on_old_pte() per-CPU but make sure we are not preempted here. The kmap_atomic() move would do but you'd have to kunmap_atomic() before the return. I think the answer to my question below also has some implication on which option to pick: /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with * zeroes. */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { + /* Give a warn in case there can be some obscure +* use-case +*/ + WARN_ON_ONCE(1); That's more of a question for the mm guys: at this point we do the copying with the ptl released; is there anything else that could have made the pte old in the meantime? I think unuse_pte() is only called on anonymous vmas, so it shouldn't be the case here. If we need to hold the ptl here, you could as well have an enclosing kmap/kunmap_atomic (option 2) with some goto instead of "return false". I am not 100% sure that I understand your suggestion well, so I drafted the patch here: Changes: optimize the indentions hold the ptl longer -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline bool cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; + struct mm_struct *mm = vma->vm_mm; + unsigned long addr = vmf->address; + bool ret; + pte_t entry; + void *kaddr; + void __user *uaddr; + debug_dma_assert_idle(src); + if (likely(src)) { + copy_user_highpage(dst, src, addr, vma); + return true; + } + /* * If the source page was a PFN mapping, we don't have * a &quo
[PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page. Below call trace is from arm64 do_page_fault for debugging purpose [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fix it by calling pte_mkyoung. 
Also, the parameter is changed because vmf should be passed to cow_user_page() Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error in case there can be some obscure use-case.(by Kirill) [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 67 - 1 file changed, 61 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..ae09b070b04d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,8 +2147,13 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline bool cow_user_page(struct page *dst, struct page *src, +struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; + struct mm_struct *mm = vma->vm_mm; + unsigned long addr = vmf->address; + debug_dma_assert_idle(src); /* @@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo * fails, we just zero-fill it. Live with it. */ if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void *kaddr; + pte_t entry; + void __user *uaddr = (void __user *)(addr & PAGE_MASK); + /* On architectures with software "accessed" bits, we would +* take a double page fault, so mark it accessed here. 
+*/ + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, + >ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, + vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } else { + /* Other thread has already handled the fault +* and we don't need to do anything. If it's +* not the case, the fault will be triggered +* again on the same address. +*/ + pte_unmap_unlock(vmf->pte, vmf->ptl); + return false; + } + pte_unmap_unlock(vmf->pte, vmf->ptl); + } + + kaddr = kmap_atomic(dst); /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with * zeroes. */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { + /* Give a warn in case there can be some obscure +
[PATCH v8 0/3] fix double page fault on arm64
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page. As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." Changes v8: change cow_user_page's return type (Matthew) v7: s/pte_spinlock/pte_offset_map_lock (Kirill) v6: fix error case of returning with spinlock taken (Catalin) move kmap_atomic to avoid handling kunmap_atomic v5: handle the case correctly when !pte_same fix kbuild test failed v4: introduce cpu_has_hw_af (Suzuki) bail out if !pte_same (Kirill) v3: add vmf->ptl lock/unlock (Kirill A. Shutemov) add arch_faults_on_old_pte (Matthew, Catalin) v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin) Jia He (3): arm64: cpufeature: introduce helper cpu_has_hw_af() arm64: mm: implement arch_faults_on_old_pte() on arm64 mm: fix double page fault on arm64 if PTE_AF is cleared arch/arm64/include/asm/cpufeature.h | 10 + arch/arm64/include/asm/pgtable.h| 12 ++ mm/memory.c | 67 ++--- 3 files changed, 83 insertions(+), 6 deletions(-) -- 2.17.1
[PATCH v8 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte on arm64 to indicate that it might cause page fault when accessing old pte.

Signed-off-by: Jia He
---
 arch/arm64/include/asm/pgtable.h | 12
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..4a9939615e41 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
 #define phys_to_ttbr(addr)	(addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+	return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte		arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1
[PATCH v8 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this cpu has the capability of HW AF. So decouple AF from DBM by new helper cpu_has_hw_af(). Reported-by: kbuild test robot Suggested-by: Suzuki Poulose Signed-off-by: Jia He --- arch/arm64/include/asm/cpufeature.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index c96ffa4722d3..46caf934ba4e 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) default: return CONFIG_ARM64_PA_BITS; } } + +/* Decouple AF from AFDBM. */ +static inline bool cpu_has_hw_af(void) +{ + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) + return read_cpuid(ID_AA64MMFR1_EL1) & 0xf; + + return false; +} + #endif /* __ASSEMBLY__ */ #endif -- 2.17.1
Re: [PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
[On behalf of justin...@arm.com] Hi Matthew On 2019/9/20 23:53, Matthew Wilcox wrote: On Fri, Sep 20, 2019 at 09:54:37PM +0800, Jia He wrote: -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline int cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { Can we talk about the return type here? + } else { + /* Other thread has already handled the fault +* and we don't need to do anything. If it's +* not the case, the fault will be triggered +* again on the same address. +*/ + pte_unmap_unlock(vmf->pte, vmf->ptl); + return -1; ... + return 0; } So -1 for "try again" and 0 for "succeeded". + if (cow_user_page(new_page, old_page, vmf)) { Then we use it like a bool. But it's kind of backwards from a bool because false is success. + /* COW failed, if the fault was solved by other, +* it's fine. If not, userspace would re-fault on +* the same address and we will handle the fault +* from the second attempt. +*/ + put_page(new_page); + if (old_page) + put_page(old_page); + return 0; And we don't use the return value; in fact we invert it. Would this make more sense: static inline bool cow_user_page(struct page *dst, struct page *src, struct vm_fault *vmf) ... pte_unmap_unlock(vmf->pte, vmf->ptl); return false; ... return true; ... if (!cow_user_page(new_page, old_page, vmf)) { That reads more sensibly for me. We could also go with returning a vm_fault_t, but that would be more complex than needed today, I think. Ok, will change the return type to bool as you suggested. Thanks --- Cheers, Justin (Jia He)
[PATCH v7 0/3] fix double page fault on arm64
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." Changes v7: s/pte_spinlock/pte_offset_map_lock (Kirill) v6: fix error case of returning with spinlock taken (Catalin) move kmap_atomic to avoid handling kunmap_atomic v5: handle the case correctly when !pte_same fix kbuild test failure v4: introduce cpu_has_hw_af (Suzuki) bail out if !pte_same (Kirill) v3: add vmf->ptl lock/unlock (Kirill A. Shutemov) add arch_faults_on_old_pte (Matthew, Catalin) v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin) Jia He (3): arm64: cpufeature: introduce helper cpu_has_hw_af() arm64: mm: implement arch_faults_on_old_pte() on arm64 mm: fix double page fault on arm64 if PTE_AF is cleared arch/arm64/include/asm/cpufeature.h | 10 ++++++++++ arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ mm/memory.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------ 3 files changed, 83 insertions(+), 6 deletions(-) -- 2.17.1
[PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung().
Also, the parameters are changed because vmf should be passed to cow_user_page(). Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error, in case there is some obscure use case (suggested by Kirill). [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 61 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..3e39e40fee87 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,8 +2147,13 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline int cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; + struct mm_struct *mm = vma->vm_mm; + unsigned long addr = vmf->address; + debug_dma_assert_idle(src); /* @@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo * fails, we just zero-fill it. Live with it. */ if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void *kaddr; + pte_t entry; + void __user *uaddr = (void __user *)(addr & PAGE_MASK); + /* On architectures with software "accessed" bits, we would +* take a double page fault, so mark it accessed here. 
+*/ + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, + &vmf->ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, + vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } else { + /* Other thread has already handled the fault +* and we don't need to do anything. If it's +* not the case, the fault will be triggered +* again on the same address. +*/ + pte_unmap_unlock(vmf->pte, vmf->ptl); + return -1; + } + pte_unmap_unlock(vmf->pte, vmf->ptl); + } + + kaddr = kmap_atomic(dst); /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with * zeroes. */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { + /* Give a warn in case there can be some obscure +
[PATCH v7 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this CPU has the hardware AF capability. So decouple AF from DBM with a new helper, cpu_has_hw_af(). Reported-by: kbuild test robot Suggested-by: Suzuki Poulose Signed-off-by: Jia He --- arch/arm64/include/asm/cpufeature.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index c96ffa4722d3..46caf934ba4e 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) default: return CONFIG_ARM64_PA_BITS; } } + +/* Decouple AF from AFDBM. */ +static inline bool cpu_has_hw_af(void) +{ + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) + return read_cpuid(ID_AA64MMFR1_EL1) & 0xf; + + return false; +} + #endif /* __ASSEMBLY__ */ #endif -- 2.17.1
[PATCH v7 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with a zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64. Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing an old pte might cause a page fault. Signed-off-by: Jia He --- arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index e09760ece844..4a9939615e41 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, #define phys_to_ttbr(addr) (addr) #endif +/* + * On arm64 without hardware Access Flag, copying from user will fail because + * the pte is old and cannot be marked young. So we always end up with a zeroed + * page after fork() + CoW for pfn mappings. We don't always have a + * hardware-managed access flag on arm64. + */ +static inline bool arch_faults_on_old_pte(void) +{ + return !cpu_has_hw_af(); +} +#define arch_faults_on_old_pte arch_faults_on_old_pte + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ -- 2.17.1
[PATCH v6 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung().
Also, the parameters are changed because vmf should be passed to cow_user_page(). Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error, in case there is some obscure use case (suggested by Kirill). [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 59 insertions(+), 6 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..7c38c1ce5440 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline int cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; + unsigned long addr = vmf->address; + debug_dma_assert_idle(src); /* @@ -2151,21 +2162,52 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo * fails, we just zero-fill it. Live with it. */ if (unlikely(!src)) { - void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void *kaddr; + void __user *uaddr = (void __user *)(addr & PAGE_MASK); + pte_t entry; + + /* On architectures with software "accessed" bits, we would +* take a double page fault, so mark it accessed here. 
+*/ + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + spin_lock(vmf->ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, + vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } else { + /* Other thread has already handled the fault +* and we don't need to do anything. If it's +* not the case, the fault will be triggered +* again on the same address. +*/ + spin_unlock(vmf->ptl); + return -1; + } + spin_unlock(vmf->ptl); + } + kaddr = kmap_atomic(dst); /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with * zeroes. */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { + /* Give a warn in case there can be some obscure +* use-case +*/ + WARN_ON_ONCE(1); clear_page(kaddr); + } kunmap_
[PATCH v6 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this CPU has the hardware AF capability. So decouple AF from DBM with a new helper, cpu_has_hw_af(). Reported-by: kbuild test robot Suggested-by: Suzuki Poulose Signed-off-by: Jia He --- arch/arm64/include/asm/cpufeature.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index c96ffa4722d3..46caf934ba4e 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange) default: return CONFIG_ARM64_PA_BITS; } } + +/* Decouple AF from AFDBM. */ +static inline bool cpu_has_hw_af(void) +{ + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) + return read_cpuid(ID_AA64MMFR1_EL1) & 0xf; + + return false; +} + #endif /* __ASSEMBLY__ */ #endif -- 2.17.1
[PATCH v6 0/3] fix double page fault on arm64
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." Changes v6: fix error case of returning with spinlock taken (Catalin) move kmap_atomic to avoid handling kunmap_atomic v5: handle the case correctly when !pte_same fix kbuild test failure v4: introduce cpu_has_hw_af (Suzuki) bail out if !pte_same (Kirill) v3: add vmf->ptl lock/unlock (Kirill A. Shutemov) add arch_faults_on_old_pte (Matthew, Catalin) v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin) Jia He (3): arm64: cpufeature: introduce helper cpu_has_hw_af() arm64: mm: implement arch_faults_on_old_pte() on arm64 mm: fix double page fault on arm64 if PTE_AF is cleared arch/arm64/include/asm/cpufeature.h | 10 ++++++++++ arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ mm/memory.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------ 3 files changed, 81 insertions(+), 6 deletions(-) -- 2.17.1
[PATCH v6 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with a zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64. Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing an old pte might cause a page fault. Signed-off-by: Jia He --- arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index e09760ece844..4a9939615e41 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, #define phys_to_ttbr(addr) (addr) #endif +/* + * On arm64 without hardware Access Flag, copying from user will fail because + * the pte is old and cannot be marked young. So we always end up with a zeroed + * page after fork() + CoW for pfn mappings. We don't always have a + * hardware-managed access flag on arm64. + */ +static inline bool arch_faults_on_old_pte(void) +{ + return !cpu_has_hw_af(); +} +#define arch_faults_on_old_pte arch_faults_on_old_pte + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ -- 2.17.1
[PATCH v5 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with a zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64. Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing an old pte might cause a page fault. Signed-off-by: Jia He --- arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index e09760ece844..4a9939615e41 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, #define phys_to_ttbr(addr) (addr) #endif +/* + * On arm64 without hardware Access Flag, copying from user will fail because + * the pte is old and cannot be marked young. So we always end up with a zeroed + * page after fork() + CoW for pfn mappings. We don't always have a + * hardware-managed access flag on arm64. + */ +static inline bool arch_faults_on_old_pte(void) +{ + return !cpu_has_hw_af(); +} +#define arch_faults_on_old_pte arch_faults_on_old_pte + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ -- 2.17.1
[PATCH v5 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung().
Also, the parameters are changed because vmf should be passed to cow_user_page(). Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error, in case there is some obscure use case (suggested by Kirill). [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 54 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..cf681963b2f5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline int cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; + unsigned long addr = vmf->address; + debug_dma_assert_idle(src); /* @@ -2152,7 +2163,29 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo */ if (unlikely(!src)) { void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void __user *uaddr = (void __user *)(addr & PAGE_MASK); + pte_t entry; + + /* On architectures with software "accessed" bits, we would +* take a double page fault, so mark it accessed here. +*/ + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + spin_lock(vmf->ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, + vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } else { + /* Other thread has already handled the fault +* and we don't need to do anything. 
If it's +* not the case, the fault will be triggered +* again on the same address. +*/ + return -1; + } + spin_unlock(vmf->ptl); + } /* * This really shouldn't fail, because the page is there @@ -2160,12 +2193,17 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo * in which case we just give up and fill the result with * zeroes. */ - if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) { + /* In case there can be some obscure use-case */ + WARN_ON_ONCE(1); clear_page(kaddr); + } kunmap_atomic(kaddr); flush_dcache_page(dst); } else - copy_user_highpage(dst, src, va, vma); + copy_user_highpage(dst, src, addr, vma); + + return 0; } static gfp_t __
[PATCH v5 0/3] fix double page fault on arm64
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." Changes v5: handle the case correctly when !pte_same fix kbuild test failure v4: introduce cpu_has_hw_af (Suzuki) bail out if !pte_same (Kirill) v3: add vmf->ptl lock/unlock (by Kirill A. Shutemov) add arch_faults_on_old_pte (Matthew, Catalin) v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin) Jia He (3): arm64: cpufeature: introduce helper cpu_has_hw_af() arm64: mm: implement arch_faults_on_old_pte() on arm64 mm: fix double page fault on arm64 if PTE_AF is cleared arch/arm64/include/asm/cpufeature.h | 1 + arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ arch/arm64/kernel/cpufeature.c | 10 ++++++++++ mm/memory.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++------ 4 files changed, 77 insertions(+), 5 deletions(-) -- 2.17.1
[PATCH v5 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this CPU has the hardware AF capability. So decouple AF from DBM with a new helper, cpu_has_hw_af(). Reported-by: kbuild test robot Suggested-by: Suzuki Poulose Signed-off-by: Jia He --- arch/arm64/include/asm/cpufeature.h | 1 + arch/arm64/kernel/cpufeature.c | 10 ++++++++++ 2 files changed, 11 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index c96ffa4722d3..206b6e3954cf 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -390,6 +390,7 @@ extern DECLARE_BITMAP(boot_capabilities, ARM64_NPATCHABLE); for_each_set_bit(cap, cpu_hwcaps, ARM64_NCAPS) bool this_cpu_has_cap(unsigned int cap); +bool cpu_has_hw_af(void); void cpu_set_feature(unsigned int num); bool cpu_have_feature(unsigned int num); unsigned long cpu_get_elf_hwcap(void); diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index b1fdc486aed8..fb0e9425d286 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1141,6 +1141,16 @@ static bool has_hw_dbm(const struct arm64_cpu_capabilities *cap, return true; } +/* Decouple AF from AFDBM. */ +bool cpu_has_hw_af(void) +{ + return (read_cpuid(ID_AA64MMFR1_EL1) & 0xf); +} +#else /* CONFIG_ARM64_HW_AFDBM */ +bool cpu_has_hw_af(void) +{ + return false; +} #endif #ifdef CONFIG_ARM64_VHE -- 2.17.1
Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Kirill [On behalf of justin...@arm.com because some mails are filtered...] On 2019/9/18 22:00, Kirill A. Shutemov wrote: On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote: When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung().
Also, the parameters are changed because vmf should be passed to cow_user_page(). [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 35 ++++++++++++++++++++++++++++++----- 1 file changed, 30 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..d2c130a5883b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline void cow_user_page(struct page *dst, struct page *src, +struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; + unsigned long addr = vmf->address; + debug_dma_assert_idle(src); /* @@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo */ if (unlikely(!src)) { void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void __user *uaddr = (void __user *)(addr & PAGE_MASK); + pte_t entry; /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with -* zeroes. +* zeroes. On architectures with software "accessed" bits, +* we would take a double page fault here, so mark it +* accessed here. */ + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + spin_lock(vmf->ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, + vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } I don't follow. 
So if the pte has changed under you, you don't set the accessed bit, but nevertheless copy from the user. What makes you think it will not trigger the same problem? I think we need to make cow_user_page() fail in this case and the caller -- wp_page_copy() -- return zero. If the fault was solved by another thread, we are fine. If not, userspace would re-fault on the same address and we will handle the fault from the second attempt. Thanks for the pointer. How about making cow_user_page() return VM_FAULT_RETRY? Then in do_page_fault(), it can retry the page fault? --- Cheers, Justin (Jia He) + spin_unlock(vmf->ptl); + } + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) clear_page(kaddr); kunmap_atomic(kaddr); flush_dcache_page(dst); } else - copy_user_highpage(dst, src, va, vma); + copy_user_highpage(dst, src, addr, vma); } static
[PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, we hit a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung().
Also, the parameters are changed because vmf should be passed to cow_user_page(). [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 35 ++++++++++++++++++++++++++++++----- 1 file changed, 30 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..d2c130a5883b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline void cow_user_page(struct page *dst, struct page *src, +struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; + unsigned long addr = vmf->address; + debug_dma_assert_idle(src); /* @@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo */ if (unlikely(!src)) { void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void __user *uaddr = (void __user *)(addr & PAGE_MASK); + pte_t entry; /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with -* zeroes. +* zeroes. On architectures with software "accessed" bits, +* we would take a double page fault here, so mark it +* accessed here. 
*/ + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + spin_lock(vmf->ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vma, addr, + vmf->pte, entry, 0)) + update_mmu_cache(vma, addr, vmf->pte); + } + spin_unlock(vmf->ptl); + } + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) clear_page(kaddr); kunmap_atomic(kaddr); flush_dcache_page(dst); } else - copy_user_highpage(dst, src, va, vma); + copy_user_highpage(dst, src, addr, vma); } static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma) @@ -2318,7 +2343,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) vmf->address); if (!new_page) goto oom; - cow_user_page(new_page, old_page, vmf->address, vma); + cow_user_page(new_page, old_page, vmf); } if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false)) -- 2.17.1
[PATCH v4 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with a zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64. Hence implement arch_faults_on_old_pte on arm64 to indicate that accessing an old pte may cause a page fault. Signed-off-by: Jia He --- arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index e09760ece844..4a9939615e41 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, #define phys_to_ttbr(addr) (addr) #endif +/* + * On arm64 without hardware Access Flag, copying from user will fail because + * the pte is old and cannot be marked young. So we always end up with zeroed + * page after fork() + CoW for pfn mappings. We don't always have a + * hardware-managed access flag on arm64. + */ +static inline bool arch_faults_on_old_pte(void) +{ + return !cpu_has_hw_af(); +} +#define arch_faults_on_old_pte arch_faults_on_old_pte + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ -- 2.17.1
[PATCH v4 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this CPU has the HW AF capability. So decouple AF from DBM with a new helper cpu_has_hw_af(). Signed-off-by: Jia He Suggested-by: Suzuki Poulose --- arch/arm64/include/asm/cpufeature.h | 1 + arch/arm64/kernel/cpufeature.c | 6 ++++++ 2 files changed, 7 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index c96ffa4722d3..206b6e3954cf 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -390,6 +390,7 @@ extern DECLARE_BITMAP(boot_capabilities, ARM64_NPATCHABLE); for_each_set_bit(cap, cpu_hwcaps, ARM64_NCAPS) bool this_cpu_has_cap(unsigned int cap); +bool cpu_has_hw_af(void); void cpu_set_feature(unsigned int num); bool cpu_have_feature(unsigned int num); unsigned long cpu_get_elf_hwcap(void); diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index b1fdc486aed8..c5097f58649d 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1141,6 +1141,12 @@ static bool has_hw_dbm(const struct arm64_cpu_capabilities *cap, return true; } +/* Decouple AF from AFDBM. */ +bool cpu_has_hw_af(void) +{ + return (read_cpuid(ID_AA64MMFR1_EL1) & 0xf); +} + #endif #ifdef CONFIG_ARM64_VHE -- 2.17.1
[PATCH v4 0/3] fix double page fault on arm64
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page. As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." Changes v4: introduce cpu_has_hw_af (Suzuki) bail out if !pte_same (Kirill) v3: add vmf->ptl lock/unlock (by Kirill A. Shutemov) add arch_faults_on_old_pte (Matthew, Catalin) v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin) Jia He (3): arm64: cpufeature: introduce helper cpu_has_hw_af() arm64: mm: implement arch_faults_on_old_pte() on arm64 mm: fix double page fault on arm64 if PTE_AF is cleared arch/arm64/include/asm/cpufeature.h | 1 + arch/arm64/include/asm/pgtable.h| 12 ++ arch/arm64/kernel/cpufeature.c | 6 + mm/memory.c | 35 - 4 files changed, 49 insertions(+), 5 deletions(-) -- 2.17.1
[PATCH v3 1/2] arm64: mm: implement arch_faults_on_old_pte() on arm64
On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with a zeroed page after fork() + CoW for pfn mappings. We don't always have a hardware-managed access flag on arm64. Hence implement arch_faults_on_old_pte on arm64 to indicate that accessing an old pte may cause a page fault. Signed-off-by: Jia He --- arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index e09760ece844..b41399d758df 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct *vma, #define phys_to_ttbr(addr) (addr) #endif +/* + * On arm64 without hardware Access Flag, copying from user will fail because + * the pte is old and cannot be marked young. So we always end up with zeroed + * page after fork() + CoW for pfn mappings. We don't always have a + * hardware-managed access flag on arm64. + */ +static inline bool arch_faults_on_old_pte(void) +{ + return true; +} +#define arch_faults_on_old_pte arch_faults_on_old_pte + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ -- 2.17.1
[PATCH v3 2/2] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, there was a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung. 
Also, the parameter is changed because vmf should be passed to cow_user_page() [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 30 +- 1 file changed, 25 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..a64af6495f71 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = 2; #endif +#ifndef arch_faults_on_old_pte +static inline bool arch_faults_on_old_pte(void) +{ + return false; +} +#endif + static int __init disable_randmaps(char *s) { randomize_va_space = 0; @@ -2140,7 +2147,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline void cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { debug_dma_assert_idle(src); @@ -2152,20 +2160,32 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo */ if (unlikely(!src)) { void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK); + pte_t entry; /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with -* zeroes. +* zeroes. If PTE_AF is cleared on arm64, it might +* cause double page fault. 
So makes pte young here */ + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) { + spin_lock(vmf->ptl); + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vmf->vma, vmf->address, + vmf->pte, entry, 0)) + update_mmu_cache(vmf->vma, vmf->address, +vmf->pte); + spin_unlock(vmf->ptl); + } + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) clear_page(kaddr); kunmap_atomic(kaddr); flush_dcache_page(dst); } else - copy_user_highpage(dst, src, va, vma); + copy_user_highpage(dst, src, vmf->address, vmf->vma); } static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma) @@ -2318,7 +2338,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) vmf->address); if (!new_page) goto oom; - cow_user_page(new_page, old_page, vmf->address, vma); + cow_user_page(new_page, old_page, vmf); } if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, , false)) -- 2.17.1
[PATCH v3 0/2] fix double page fault on arm64
When we tested pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, there was a double page fault in __copy_from_user_inatomic of cow_user_page. As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." Changes v3: add vmf->ptl lock/unlock (by Kirill A. Shutemov) add arch_faults_on_old_pte (Matthew, Catalin) v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin) Jia He (2): arm64: mm: implement arch_faults_on_old_pte() on arm64 mm: fix double page fault on arm64 if PTE_AF is cleared arch/arm64/include/asm/pgtable.h | 11 +++ mm/memory.c | 29 - 2 files changed, 35 insertions(+), 5 deletions(-) -- 2.17.1
[PATCH v2] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, there was a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64." This patch fixes it by calling pte_mkyoung. 
Also, the parameter is changed because vmf should be passed to cow_user_page() [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- Changes v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin) mm/memory.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..63d4fd285e8e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2140,7 +2140,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline void cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { debug_dma_assert_idle(src); @@ -2152,20 +2153,30 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo */ if (unlikely(!src)) { void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK); + pte_t entry; /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with -* zeroes. +* zeroes. If PTE_AF is cleared on arm64, it might +* cause double page fault. 
So makes pte young here */ + if (!pte_young(vmf->orig_pte)) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vmf->vma, vmf->address, + vmf->pte, entry, 0)) + update_mmu_cache(vmf->vma, vmf->address, + vmf->pte); + } + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) clear_page(kaddr); kunmap_atomic(kaddr); flush_dcache_page(dst); } else - copy_user_highpage(dst, src, va, vma); + copy_user_highpage(dst, src, vmf->address, vmf->vma); } static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma) @@ -2318,7 +2329,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) vmf->address); if (!new_page) goto oom; - cow_user_page(new_page, old_page, vmf->address, vma); + cow_user_page(new_page, old_page, vmf); } if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, , false)) -- 2.17.1
[PATCH] mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, there was a double page fault in __copy_from_user_inatomic of cow_user_page. The call trace below is from arm64 do_page_fault, for debugging purposes: [ 110.016195] Call trace: [ 110.016826] do_page_fault+0x5a4/0x690 [ 110.017812] do_mem_abort+0x50/0xb0 [ 110.018726] el1_da+0x20/0xc4 [ 110.019492] __arch_copy_from_user+0x180/0x280 [ 110.020646] do_wp_page+0xb0/0x860 [ 110.021517] __handle_mm_fault+0x994/0x1338 [ 110.022606] handle_mm_fault+0xe8/0x180 [ 110.023584] do_page_fault+0x240/0x690 [ 110.024535] do_mem_abort+0x50/0xb0 [ 110.025423] el0_da+0x20/0x24 The pte info before __copy_from_user_inatomic (PTE_AF is cleared): [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3 The key point is: we don't always have a hardware-managed access flag on arm64. The root cause is in copy_one_pte: it clears PTE_AF for COW pages. Generally, when a COW page is accessed by the user, it will be marked accessed (the PTE_AF bit on arm64) by hardware if the hardware feature is supported. But on some arm64 platforms, PTE_AF needs to be set by software. This patch fixes it by calling pte_mkyoung. 
Also, the parameter is changed because vmf should be passed to cow_user_page() [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork Reported-by: Yibo Cai Signed-off-by: Jia He --- mm/memory.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e2bb51b6242e..b1f9ace2e943 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2140,7 +2140,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd, return same; } -static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma) +static inline void cow_user_page(struct page *dst, struct page *src, + struct vm_fault *vmf) { debug_dma_assert_idle(src); @@ -2152,20 +2153,30 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo */ if (unlikely(!src)) { void *kaddr = kmap_atomic(dst); - void __user *uaddr = (void __user *)(va & PAGE_MASK); + void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK); + pte_t entry; /* * This really shouldn't fail, because the page is there * in the page tables. But it might just be unreadable, * in which case we just give up and fill the result with -* zeroes. +* zeroes. If PTE_AF is cleared on arm64, it might +* cause double page fault here. 
so makes pte young here */ + if (!pte_young(vmf->orig_pte)) { + entry = pte_mkyoung(vmf->orig_pte); + if (ptep_set_access_flags(vmf->vma, vmf->address, + vmf->pte, entry, vmf->flags & FAULT_FLAG_WRITE)) + update_mmu_cache(vmf->vma, vmf->address, + vmf->pte); + } + if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) clear_page(kaddr); kunmap_atomic(kaddr); flush_dcache_page(dst); } else - copy_user_highpage(dst, src, va, vma); + copy_user_highpage(dst, src, vmf->address, vmf->vma); } static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma) @@ -2318,7 +2329,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) vmf->address); if (!new_page) goto oom; - cow_user_page(new_page, old_page, vmf->address, vma); + cow_user_page(new_page, old_page, vmf); } if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, , false)) -- 2.17.1
[PATCH 0/2] Fix and support dax kmem on arm64
This patch set fixes some dax kmem driver issues so that pmem can be used as normal RAM in an arm64 qemu guest. Jia He (2): drivers/dax/kmem: use default numa_mem_id if target_node is invalid drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is enabled drivers/dax/kmem.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) -- 2.17.1
[PATCH 2/2] drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is enabled
commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM") helps to add persistent memory as normal RAM blocks. But this driver doesn't work if CONFIG_DEV_DAX_PMEM_COMPAT is enabled. Here is the debugging call trace when CONFIG_DEV_DAX_PMEM_COMPAT is enabled: [4.443730] devm_memremap_pages+0x4b9/0x540 [4.443733] dev_dax_probe+0x112/0x220 [device_dax] [4.443735] dax_pmem_compat_probe+0x58/0x92 [dax_pmem_compat] [4.443737] nvdimm_bus_probe+0x6b/0x150 [4.443739] really_probe+0xf5/0x3d0 [4.443740] driver_probe_device+0x11b/0x130 [4.443741] device_driver_attach+0x58/0x60 [4.443742] __driver_attach+0xa3/0x140 Then the dax0.0 device will be registered on the "nd" bus instead of the "dax" bus. This causes the following error: root@ubuntu:~# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind -bash: echo: write error: No such device This patch adds a warning to notify the user. Signed-off-by: Jia He --- drivers/dax/kmem.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index ad62d551d94e..b77f0e880598 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -93,6 +93,11 @@ static struct dax_device_driver device_dax_kmem_driver = { static int __init dax_kmem_init(void) { + if (IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)) { + pr_warn("CONFIG_DEV_DAX_PMEM_COMPAT is not compatible\n"); + pr_warn("kmem dax driver might not work properly\n"); + } + return dax_driver_register(_dax_kmem_driver); } -- 2.17.1
[PATCH 1/2] drivers/dax/kmem: use default numa_mem_id if target_node is invalid
On some platforms (e.g. an arm64 guest), the NFIT info might not be ready, so target_node might be -1. But if there is a default numa_mem_id(), we can use it to avoid an unnecessary fatal EINVAL error. devm_memremap_pages() also uses this logic if nid is invalid, so we keep the same behavior here. Signed-off-by: Jia He --- drivers/dax/kmem.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a02318c6d28a..ad62d551d94e 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -33,9 +33,9 @@ int dev_dax_kmem_probe(struct device *dev) */ numa_node = dev_dax->target_node; if (numa_node < 0) { - dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n", -res, numa_node); - return -EINVAL; + dev_warn(dev, "DAX %pR with invalid node: %d, assume it as %d\n", + res, numa_node, numa_mem_id()); + numa_node = numa_mem_id(); } /* Hotplug starting at the beginning of the next block: */ -- 2.17.1
[PATCH 1/2] vsprintf: Prevent crash when dereferencing invalid pointers for %pD
Commit 3e5903eb9cff ("vsprintf: Prevent crash when dereferencing invalid pointers") prevents most crashes except for %pD: there is an additional pointer dereference before dentry_name. At least, vma->file can be NULL and be passed to printk %pD in print_bad_pte, which can cause a crash. This patch fixes it by introducing a new file_dentry_name. Signed-off-by: Jia He --- lib/vsprintf.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 63937044c57d..b4a119176fdb 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -869,6 +869,15 @@ char *dentry_name(char *buf, char *end, const struct dentry *d, struct printf_sp return widen_string(buf, n, end, spec); } +static noinline_for_stack +char *file_dentry_name(char *buf, char *end, const struct file *f, + struct printf_spec spec, const char *fmt) +{ + if (check_pointer(, end, f, spec)) + return buf; + + return dentry_name(buf, end, f->f_path.dentry, spec, fmt); +} #ifdef CONFIG_BLOCK static noinline_for_stack char *bdev_name(char *buf, char *end, struct block_device *bdev, @@ -2166,9 +2175,7 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr, case 'C': return clock(buf, end, ptr, spec, fmt); case 'D': - return dentry_name(buf, end, - ((const struct file *)ptr)->f_path.dentry, - spec, fmt); + return file_dentry_name(buf, end, ptr, spec, fmt); #ifdef CONFIG_BLOCK case 'g': return bdev_name(buf, end, ptr, spec, fmt); -- 2.17.1
[PATCH 2/2] lib/test_printf: add test of null/invalid pointer dereference for dentry
This add some additional test cases of null/invalid pointer dereference for dentry and file (%pd and %pD) Signed-off-by: Jia He --- lib/test_printf.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/lib/test_printf.c b/lib/test_printf.c index 944eb50f3862..befedffeb476 100644 --- a/lib/test_printf.c +++ b/lib/test_printf.c @@ -455,6 +455,13 @@ dentry(void) test("foo", "%pd", _dentry[0]); test("foo", "%pd2", _dentry[0]); + /* test the null/invalid pointer case for dentry */ + test("(null)", "%pd", NULL); + test("(efault)", "%pd", PTR_INVALID); + /* test the null/invalid pointer case for file */ + test("(null)", "%pD", NULL); + test("(efault)", "%pD", PTR_INVALID); + test("romeo", "%pd", _dentry[3]); test("alfa/romeo", "%pd2", _dentry[3]); test("bravo/alfa/romeo", "%pd3", _dentry[3]); -- 2.17.1
[PATCH] arm64: mm: add missing PTE_SPECIAL in pte_mkdevmap on arm64
Without this patch, the MAP_SYNC test case will cause a print_bad_pte warning on arm64 as follows: [ 25.542693] BUG: Bad page map in process mapdax333 pte:2e8000448800f53 pmd:41ff5f003 [ 25.546360] page:7e001022 refcount:1 mapcount:-1 mapping:8003e29c7440 index:0x0 [ 25.550281] ext4_dax_aops [ 25.550282] name:"__aaabbbcccddd__" [ 25.551553] flags: 0x3001002(referenced|reserved) [ 25.555802] raw: 03001002 8003dfffa908 8003e29c7440 [ 25.559446] raw: 0001fffe [ 25.563075] page dumped because: bad pte [ 25.564938] addr:be05b000 vm_flags:208000fb anon_vma: mapping:8003e29c7440 index:0 [ 25.574272] file:__aaabbbcccddd__ fault:ext4_dax_fault ap:ext4_file_mmap readpage:0x0 [ 25.578799] CPU: 1 PID: 1180 Comm: mapdax333 Not tainted 5.2.0+ #21 [ 25.581702] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 25.585624] Call trace: [ 25.587008] dump_backtrace+0x0/0x178 [ 25.588799] show_stack+0x24/0x30 [ 25.590328] dump_stack+0xa8/0xcc [ 25.591901] print_bad_pte+0x18c/0x218 [ 25.593628] unmap_page_range+0x778/0xc00 [ 25.595506] unmap_single_vma+0x94/0xe8 [ 25.597304] unmap_vmas+0x90/0x108 [ 25.598901] unmap_region+0xc0/0x128 [ 25.600566] __do_munmap+0x284/0x3f0 [ 25.602245] __vm_munmap+0x78/0xe0 [ 25.603820] __arm64_sys_munmap+0x34/0x48 [ 25.605709] el0_svc_common.constprop.0+0x78/0x168 [ 25.607956] el0_svc_handler+0x34/0x90 [ 25.609698] el0_svc+0x8/0xc [ 25.611103] Disabling lock debugging due to kernel taint [ 25.613573] BUG: Bad page state in process mapdax333 pfn:448800 [ 25.616359] page:7e001022 refcount:0 mapcount:-1 mapping:8003e29c7440 index:0x1 [ 25.620236] ext4_dax_aops [ 25.620237] name:"__aaabbbcccddd__" [ 25.621495] flags: 0x300() [ 25.624912] raw: 0300 dead0100 dead0200 8003e29c7440 [ 25.628502] raw: 0001 fffe [ 25.632097] page dumped because: non-NULL mapping [...] 
[ 25.656567] CPU: 1 PID: 1180 Comm: mapdax333 Tainted: GB 5.2.0+ #21 [ 25.660131] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 25.663324] Call trace: [ 25.664466] dump_backtrace+0x0/0x178 [ 25.666163] show_stack+0x24/0x30 [ 25.667721] dump_stack+0xa8/0xcc [ 25.669270] bad_page+0xf0/0x150 [ 25.670772] free_pages_check_bad+0x84/0xa0 [ 25.672724] free_pcppages_bulk+0x45c/0x708 [ 25.674675] free_unref_page_commit+0xcc/0x100 [ 25.676751] free_unref_page_list+0x13c/0x200 [ 25.678801] release_pages+0x350/0x420 [ 25.680539] free_pages_and_swap_cache+0xf8/0x128 [ 25.682738] tlb_flush_mmu+0x164/0x2b0 [ 25.684485] unmap_page_range+0x648/0xc00 [ 25.686349] unmap_single_vma+0x94/0xe8 [ 25.688131] unmap_vmas+0x90/0x108 [ 25.689739] unmap_region+0xc0/0x128 [ 25.691392] __do_munmap+0x284/0x3f0 [ 25.693079] __vm_munmap+0x78/0xe0 [ 25.694658] __arm64_sys_munmap+0x34/0x48 [ 25.696530] el0_svc_common.constprop.0+0x78/0x168 [ 25.698772] el0_svc_handler+0x34/0x90 [ 25.700512] el0_svc+0x8/0xc The root cause is in _vm_normal_page, without the PTE_SPECIAL bit, the return value will be incorrectly set to pfn_to_page(pfn) instead of NULL. 
Besides, this patch also rewrites pmd_mkdevmap to avoid setting PTE_SPECIAL for the pmd. The MAP_SYNC test case is as follows (provided by Yibo Cai): $#include $#include $#include $#include $#include $#ifndef MAP_SYNC $#define MAP_SYNC 0x8 $#endif /* mount -o dax /dev/pmem0 /mnt */ $#define F "/mnt/__aaabbbcccddd__" int main(void) { int fd; char buf[4096]; void *addr; if ((fd = open(F, O_CREAT|O_TRUNC|O_RDWR, 0644)) < 0) { perror("open1"); return 1; } if (write(fd, buf, 4096) != 4096) { perror("lseek"); return 1; } addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_SYNC, fd, 0); if (addr == MAP_FAILED) { perror("mmap"); printf("did you mount with '-o dax'?\n"); return 1; } memset(addr, 0x55, 4096); if (munmap(addr, 4096) == -1) { perror("munmap"); return 1; } close(fd); return 0; } Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support") Reported-by: Yibo Cai Signed-off-by: Jia He Acked-by: Robin Murphy --- arch/arm64/include/asm/pgtable.h | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 5fdcfe237338..e09760ece844 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -209,7 +209,7 @@ static inline pmd_t pmd_mkcont(pmd_t pmd) static inline pte_t pte_mkdevmap(pte_t pte) { - return set_pte_bit(
Re: [RFC PATCH v2 2/3] arm64: mark all the GICC nodes in MADT as possible cpu
On 2019/6/29 10:42, Xiongfeng Wang wrote: We set 'cpu_possible_mask' based on the enabled GICC node in MADT. If the GICC node is disabled, we will skip initializing the kernel data structure for that CPU. To support CPU hotplug, we need to initialize some CPU related data structure in advance. This patch mark all the GICC nodes as possible CPU and only these enabled GICC nodes as present CPU. Signed-off-by: Xiongfeng Wang --- arch/arm64/kernel/setup.c | 2 +- arch/arm64/kernel/smp.c | 11 +-- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index 7e541f9..7f4d12a 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -359,7 +359,7 @@ static int __init topology_init(void) for_each_online_node(i) register_one_node(i); - for_each_possible_cpu(i) { + for_each_online_cpu(i) { Have you considered the case in non-acpi mode? and setting "maxcpus=n" in host kernel boot parameters? --- Cheers, Justin (Jia He) struct cpu *cpu = _cpu(cpu_data.cpu, i); cpu->hotpluggable = 1; register_cpu(cpu, i); diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 6dcf960..6d9983c 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -525,16 +525,14 @@ struct acpi_madt_generic_interrupt *acpi_cpu_get_madt_gicc(int cpu) { u64 hwid = processor->arm_mpidr; - if (!(processor->flags & ACPI_MADT_ENABLED)) { - pr_debug("skipping disabled CPU entry with 0x%llx MPIDR\n", hwid); - return; - } - if (hwid & ~MPIDR_HWID_BITMASK || hwid == INVALID_HWID) { pr_err("skipping CPU entry with invalid MPIDR 0x%llx\n", hwid); return; } + if (!(processor->flags & ACPI_MADT_ENABLED)) + pr_debug("disabled CPU entry with 0x%llx MPIDR\n", hwid); + if (is_mpidr_duplicate(cpu_count, hwid)) { pr_err("duplicate CPU MPIDR 0x%llx in MADT\n", hwid); return; @@ -755,7 +753,8 @@ void __init smp_prepare_cpus(unsigned int max_cpus) if (err) continue; - set_cpu_present(cpu, true); + if ((cpu_madt_gicc[cpu].flags & 
ACPI_MADT_ENABLED)) + set_cpu_present(cpu, true); numa_store_cpu_info(cpu); } } --
Re: [PATCH RFC 0/3] Support CPU hotplug for ARM64
Hi Xiongfeng, Sorry, I missed your later mail; you used an emulated SCI interrupt. --- Cheers, Justin (Jia He) On 2019/7/4 11:26, Xiongfeng Wang wrote: Hi Justin, On 2019/7/4 11:00, Jia He wrote: Hi Xiongfeng It is a little bit awkward that I am also investigating an acpi based cpu hotplug issue similar to your idea. My question is: is your purpose to implement vcpu hotplug in arm64 qemu? Yes, my purpose is to implement the vcpu hotplug in arm64 qemu. So that I can add or remove vcpu without shutting down the Guest OS. Thanks, Xiongfeng Thanks for the elaboration --- Cheers, Justin (Jia He) On 2019/6/28 19:13, Xiongfeng Wang wrote: This patchset marks all the GICC nodes in MADT as possible CPUs even though they are disabled. But only the enabled GICC nodes are marked as present CPUs, so the kernel will initialize some CPU related data structures in advance before the CPU is actually hot added into the system. This patchset also implements 'acpi_(un)map_cpu()' and 'arch_(un)register_cpu()' for ARM64. These functions are needed to enable CPU hotplug. To support CPU hotplug, we need to add all the possible GICC nodes in MADT, including those CPUs that are not present but may be hot added later. Those CPUs are marked as disabled in GICC nodes. Xiongfeng Wang (3): ACPI / scan: evaluate _STA for processors declared via ASL Device statement arm64: mark all the GICC nodes in MADT as possible cpu arm64: Add CPU hotplug support arch/arm64/kernel/acpi.c | 22 ++ arch/arm64/kernel/setup.c | 19 ++- arch/arm64/kernel/smp.c | 11 +-- drivers/acpi/scan.c | 12 4 files changed, 57 insertions(+), 7 deletions(-) . --
Re: [PATCH RFC 0/3] Support CPU hotplug for ARM64
Hi Xiongfeng On 2019/7/4 11:26, Xiongfeng Wang wrote: Hi Justin, On 2019/7/4 11:00, Jia He wrote: Hi Xiongfeng It is a little bit awkward that I am also investigating an acpi based cpu hotplug issue similar to your idea. My question is: is your purpose to implement vcpu hotplug in arm64 qemu? Yes, my purpose is to implement the vcpu hotplug in arm64 qemu. So that I can add or remove vcpu without shutting down the Guest OS. Thanks, Xiongfeng Thanks for the info, I guess you used the GED device in qemu ;-)? --- Cheers, Justin (Jia He)
Re: [PATCH RFC 0/3] Support CPU hotplug for ARM64
Hi Xiongfeng It is a little bit awkward that I am also investigating an acpi based cpu hotplug issue similar to your idea. My question is: is your purpose to implement vcpu hotplug in arm64 qemu? Thanks for the elaboration --- Cheers, Justin (Jia He) On 2019/6/28 19:13, Xiongfeng Wang wrote: This patchset marks all the GICC nodes in MADT as possible CPUs even though they are disabled. But only the enabled GICC nodes are marked as present CPUs, so the kernel will initialize some CPU related data structures in advance before the CPU is actually hot added into the system. This patchset also implements 'acpi_(un)map_cpu()' and 'arch_(un)register_cpu()' for ARM64. These functions are needed to enable CPU hotplug. To support CPU hotplug, we need to add all the possible GICC nodes in MADT, including those CPUs that are not present but may be hot added later. Those CPUs are marked as disabled in GICC nodes. Xiongfeng Wang (3): ACPI / scan: evaluate _STA for processors declared via ASL Device statement arm64: mark all the GICC nodes in MADT as possible cpu arm64: Add CPU hotplug support arch/arm64/kernel/acpi.c | 22 ++ arch/arm64/kernel/setup.c | 19 ++- arch/arm64/kernel/smp.c | 11 +-- drivers/acpi/scan.c | 12 4 files changed, 57 insertions(+), 7 deletions(-)
Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64
Hi Hanjun

On 2019/6/11 23:18, Hanjun Guo wrote:
> Hello Ard,
>
> Thanks for the reply, please see my comments inline.
>
> On 2019/6/10 21:16, Ard Biesheuvel wrote:
>> On Sat, 8 Jun 2019 at 06:22, Hanjun Guo wrote:
>>> Hi Ard, Will,
>>>
>>> This week we were trying to debug an issue of time consumed in
>>> mem_init(), and it led us to a solution similar to this one from
>>> Jia He, so I would like to bring this thread back; please see my
>>> detailed test results below.
>>>
>>> On 2018/9/7 22:44, Will Deacon wrote:
>>>> On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
>>>>> On 22 August 2018 at 05:07, Jia He wrote:
>>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid
>>>>>> pfns where possible") optimized the loop in memmap_init_zone(), but
>>>>>> it caused a possible panic, so Daniel Vacek reverted it later.
>>>>>>
>>>>>> But as suggested by Daniel Vacek, it is fine to use memblock to
>>>>>> skip gaps and find the next valid frame with
>>>>>> CONFIG_HAVE_ARCH_PFN_VALID. More from what Daniel said:
>>>>>> "On arm and arm64, memblock is used by default. But the generic
>>>>>> version of pfn_valid() is based on mem sections and
>>>>>> memblock_next_valid_pfn() does not always return the next valid one
>>>>>> but skips more, resulting in some valid frames being skipped (as if
>>>>>> they were invalid). And that's why the kernel was eventually
>>>>>> crashing on some !arm machines."
>>>>>>
>>>>>> About the performance consideration: as said by James in b92df1de5,
>>>>>> "I have tested this patch on a virtual model of a Samurai CPU with
>>>>>> a sparse memory map. The kernel boot time drops from 109 to 62
>>>>>> seconds." Thus it would be better if we retain
>>>>>> memblock_next_valid_pfn on arm/arm64.
>>>>>>
>>>>>> Besides retaining memblock_next_valid_pfn, there is still some room
>>>>>> for improvement. After this set, I can see the time overhead of
>>>>>> memmap_init reduced from 27956us to 13537us on my armv8a server
>>>>>> (QDF2400 with 96G memory, pagesize 64k).
>>>>>> I believe arm servers will benefit more if memory is larger than
>>>>>> TBs.
>>>>>
>>>>> OK so we can summarize the benefits of this series as follows:
>>>>> - boot time on a virtual model of a Samurai CPU drops from 109 to
>>>>>   62 seconds
>>>>> - boot time on a QDF2400 arm64 server with 96 GB of RAM drops by
>>>>>   ~15 *milliseconds*
>>>>>
>>>>> Google was not very helpful in figuring out what a Samurai CPU is
>>>>> and why we should care about the boot time of Linux running on a
>>>>> virtual model of it, and the 15 ms speedup is not that compelling
>>>>> either.
>>>
>>> Testing this patch set on top of a Kunpeng 920 based ARM64 server,
>>> with 384G memory in total, we got the times below:
>>>
>>>               without this patch set    with this patch set
>>> mem_init()    13310ms                   1415ms
>>>
>>> So we got about 8x speedup on this machine, which is very impressive.
>>
>> Yes, this is impressive. But does it matter in the grand scheme of
>> things?
>
> It matters for this machine, because it's for storage; there is a
> watchdog, and the time consumed triggers the watchdog.
>
>> How much time does this system take to arrive at this point from
>> power on?
>
> Sorry, I don't have such data, as the arch timer is not initialized and
> I didn't see the time stamp at this point, but I read the cycles from
> the arch timer before and after the time consuming function to get how
> much time was consumed.
>
> The time consumed is related to the memory DIMM size and where the
> memory DIMMs are located in the slots. In the above case, we are using
> 16G memory DIMMs. We also tested 1T of memory with 64G per memory DIMM
> on another ARM64 machine; the time consumed dropped from 20s to 2s (I
> think it's related to firmware implementations).
>
>> I agree that this optimization looks good in isolation, but the fact
>> that you spotted a bug justifies my skepticism at the time. On the
>> other hand, now that we have several independent reports (from you,
>> but also from the Renesas folks) that the speedup is worthwhile for
>> real world use cases, I think it does make sense to revisit it.
> Thank you very much for taking care of this :)
>
>> So what I would like to see is the patch set being proposed again,
>> with the new data points added for documentation. Also, the commit
>> logs need to be crystal clear about how the meaning of PFN validity
>> differs between ARM and other architectures, and why the assumptions
>> that the optimization is based on are guaranteed to hold.
>
> I think Jia He no longer works for HXT; if you don't mind, I can repost
> this patch set with Jia He's authorship unchanged.

Ok, I don't mind that, thanks for your followup :)

---
Cheers,
Justin (Jia He)