[PATCH v2] device-dax: use fallback nid when numa node is invalid

2021-09-10 Thread Jia He
Previously, numa_off was set unconditionally in dummy_numa_init(),
even with a fake numa node. ACPI then sets the node id to NUMA_NO_NODE(-1)
in acpi_map_pxm_to_node() because it regards numa_off as turning
off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
arm64 in the fake numa case.

Without this patch, pmem can't be probed as a RAM device on arm64 if the
SRAT table isn't present:
  $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
  kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1
  kmem: probe of dax0.0 failed with error -22

This fixes it by using the fallback memory_add_physaddr_to_nid() as the nid.

Suggested-by: David Hildenbrand 
Signed-off-by: Jia He 
---
v2: - rebase it on David's "memory group" patch.
- drop the changes in dev_dax_kmem_remove() since the nid argument has been
  removed from remove_memory().
 drivers/dax/kmem.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..e4836eb7539e 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -47,20 +47,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
unsigned long total_len = 0;
struct dax_kmem_data *data;
int i, rc, mapped = 0;
-   int numa_node;
-
-   /*
-* Ensure good NUMA information for the persistent memory.
-* Without this check, there is a risk that slow memory
-* could be mixed in a node with faster memory, causing
-* unavoidable performance issues.
-*/
-   numa_node = dev_dax->target_node;
-   if (numa_node < 0) {
-   dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
-   numa_node);
-   return -EINVAL;
-   }
+   int numa_node = dev_dax->target_node;
 
for (i = 0; i < dev_dax->nr_range; i++) {
struct range range;
@@ -71,6 +58,22 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
i, range.start, range.end);
continue;
}
+
+   /*
+* Ensure good NUMA information for the persistent memory.
+* Without this check, there is a risk, though not a fatal one,
+* that slow memory could be mixed into a node with faster
+* memory, causing unavoidable performance issues. Warn about
+* this and use a fallback node id.
+*/
+   if (numa_node < 0) {
+   int new_node = memory_add_physaddr_to_nid(range.start);
+
+   dev_info(dev, "changing nid from %d to %d for DAX region [%#llx-%#llx]\n",
+numa_node, new_node, range.start, range.end);
+   numa_node = new_node;
+   }
+
total_len += range_len(&range);
}
 
-- 
2.17.1




[PATCH] KVM: arm64: Fix unaligned addr case in mmu walking

2021-03-03 Thread Jia He
If the start addr is not aligned with the granule size of that level, the
loop step size should be adjusted to the granule boundary instead of a
simple kvm_granule_size(level) increment. Otherwise, some mmu entries might
never be walked.
E.g. assume the unmap range [data->addr, data->end] is
[0xff00ab2000,0xff00cb2000] in a level 2 walk with no block mapping,
and the 1st part of that pmd entry is [0xff00ab2000,0xff00c0]. The
pmd value is 0x83fbd2c1002 (not a valid entry). In this case, data->addr
should be adjusted to 0xff00c0 instead of 0xff00cb2000.
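
For reference, a minimal userspace sketch of the step arithmetic (not the
kernel walker itself; the 2MiB granule assumes a level-2 entry with 4K
pages, and ALIGN_DOWN is redefined locally):

#include <stdint.h>
#include <stdio.h>

#define ALIGN_DOWN(x, a)	((x) & ~((uint64_t)(a) - 1))

int main(void)
{
	uint64_t granule = 1ULL << 21;		/* 2MiB level-2 granule, 4K pages */
	uint64_t addr    = 0xff00ab2000ULL;	/* unaligned walk address */

	/* old step: everything in [boundary, addr + granule) is skipped */
	uint64_t old_next = addr + granule;
	/* fixed step: snap back to the entry boundary, then advance */
	uint64_t new_next = ALIGN_DOWN(addr, granule) + granule;

	printf("old next 0x%llx, fixed next 0x%llx\n",
	       (unsigned long long)old_next, (unsigned long long)new_next);
	return 0;
}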

Without this fix, a userspace segmentation fault can easily be
triggered by running simple gVisor runsc cases on an Ampere Altra
server:
docker run --runtime=runsc -it --rm  ubuntu /bin/bash

In container:
for i in `seq 1 100`;do ls;done

Reported-by: Howard Zhang 
Signed-off-by: Jia He 
---
 arch/arm64/kvm/hyp/pgtable.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bdf8e55ed308..4d99d07c610c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -225,6 +225,7 @@ static inline int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
goto out;
 
if (!table) {
+   data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level));
data->addr += kvm_granule_size(level);
goto out;
}
-- 
2.17.1



[RFC PATCH 1/2] arm64/cpuinfo: Move init_cpu_features() ahead of setup.c::early_fixmap_init()

2021-01-12 Thread Jia He
Move init_cpu_features() ahead of setup_arch()->early_fixmap_init(), which
is the preparation work for checking the condition to assign
arm64_use_ng_mappings as cpus_have_const_cap(ARM64_UNMAP_KERNEL_AT_EL0).

Besides, jump_label_init() is also moved ahead because
cpus_have_const_cap() depends on the static key enable API.

Percpu helpers should be avoided in cpuinfo_store_boot_cpu() before the
percpu areas are initialized in main.c::setup_per_cpu_areas().

Signed-off-by: Jia He 
---
 arch/arm64/include/asm/cpu.h |  1 +
 arch/arm64/kernel/cpuinfo.c  | 13 ++---
 arch/arm64/kernel/setup.c| 14 +-
 arch/arm64/kernel/smp.c  |  3 +--
 4 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h
index 7faae6ff3ab4..59f36f5e3c04 100644
--- a/arch/arm64/include/asm/cpu.h
+++ b/arch/arm64/include/asm/cpu.h
@@ -63,6 +63,7 @@ DECLARE_PER_CPU(struct cpuinfo_arm64, cpu_data);
 
 void cpuinfo_store_cpu(void);
 void __init cpuinfo_store_boot_cpu(void);
+void __init save_boot_cpuinfo_data(void);
 
 void __init init_cpu_features(struct cpuinfo_arm64 *info);
 void update_cpu_features(int cpu, struct cpuinfo_arm64 *info,
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 77605aec25fe..f8de5b8bae20 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -413,9 +413,16 @@ void cpuinfo_store_cpu(void)
 
 void __init cpuinfo_store_boot_cpu(void)
 {
-   struct cpuinfo_arm64 *info = &per_cpu(cpu_data, 0);
-   __cpuinfo_store_cpu(info);
+   __cpuinfo_store_cpu(&boot_cpu_data);
 
-   boot_cpu_data = *info;
init_cpu_features(&boot_cpu_data);
 }
+
+void __init save_boot_cpuinfo_data(void)
+{
+   struct cpuinfo_arm64 *info;
+
+   set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
+   info = &per_cpu(cpu_data, 0);
+   *info = boot_cpu_data;
+}
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 1a57a76e1cc2..e078ab068f3b 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -297,16 +297,20 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 */
arm64_use_ng_mappings = kaslr_requires_kpti();
 
-   early_fixmap_init();
-   early_ioremap_init();
-
-   setup_machine_fdt(__fdt_pointer);
-
/*
 * Initialise the static keys early as they may be enabled by the
 * cpufeature code and early parameters.
 */
jump_label_init();
+
+   /* Init the cpu feature codes for boot cpu */
+   cpuinfo_store_boot_cpu();
+
+   early_fixmap_init();
+   early_ioremap_init();
+
+   setup_machine_fdt(__fdt_pointer);
+
parse_early_param();
 
/*
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 2499b895efea..3df1f5b1da0b 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -449,8 +449,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 void __init smp_prepare_boot_cpu(void)
 {
-   set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
-   cpuinfo_store_boot_cpu();
+   save_boot_cpuinfo_data();
 
/*
 * We now know enough about the boot CPU to apply the
-- 
2.17.1



[RFC PATCH 0/2] Avoid booting stall caused by idmap_kpti_install_ng_mappings

2021-01-12 Thread Jia He
There is a 10s stall in idmap_kpti_install_ng_mappings when the kernel
boots on an Ampere EMAG server.

Commit f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap
swapper using nG mappings") updates the nG bit runtime if kpti is
required.

But things get worse if rodata=full in map_mem(): NO_BLOCK_MAPPINGS |
NO_CONT_MAPPINGS is required when creating the pagetable mapping, hence all
ptes are fully mapped in this case. On an Ampere EMAG server with 256G of
memory (pagesize=4k), this causes the 10s stall.

After moving init_cpu_features() ahead of early_fixmap_init(), we can use
cpus_have_const_cap() earlier than before. Hence we can avoid this stall
by updating arm64_use_ng_mappings there.

This patch series reduces the kernel boot time from 14.7s to 4.1s:
Before:
[   14.757569] Freeing initrd memory: 60752K
After:
[4.138819] Freeing initrd memory: 60752K

Set it as RFC because I want to resolve any other points which I might
have misunderstood.

Jia He (2):
  arm64/cpuinfo: Move init_cpu_features() ahead of early_fixmap_init()
  arm64: kpti: Update arm64_use_ng_mappings before pagetable mapping

 arch/arm64/include/asm/cpu.h |  1 +
 arch/arm64/kernel/cpuinfo.c  | 13 ++---
 arch/arm64/kernel/setup.c| 18 +-
 arch/arm64/kernel/smp.c  |  3 +--
 4 files changed, 25 insertions(+), 10 deletions(-)

-- 
2.17.1



[RFC PATCH 2/2] arm64: kpti: Update arm64_use_ng_mappings before pagetable mapping

2021-01-12 Thread Jia He
There is a 10s stall in idmap_kpti_install_ng_mappings when the kernel
boots on an Ampere EMAG server.

Commit f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap
swapper using nG mappings") updates the nG bit runtime if kpti is required.
But things get worse if rodata=full in map_mem(): NO_BLOCK_MAPPINGS |
NO_CONT_MAPPINGS is required when creating the pagetable mapping, hence all
ptes are fully mapped in this case. On an Ampere EMAG server with 256G of
memory (pagesize=4k), this causes the 10s stall.

After the previous commit moved init_cpu_features() earlier, we can use
cpus_have_const_cap() earlier than before. Hence we can avoid this stall
by updating arm64_use_ng_mappings.

Signed-off-by: Jia He 
---
 arch/arm64/kernel/setup.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index e078ab068f3b..51098ceb7159 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -306,6 +306,10 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
/* Init the cpu feature codes for boot cpu */
cpuinfo_store_boot_cpu();
 
+   /* ARM64_UNMAP_KERNEL_AT_EL0 cap can be updated in cpuinfo_store_boot_cpu() */
+   if (!arm64_use_ng_mappings)
+   arm64_use_ng_mappings = cpus_have_const_cap(ARM64_UNMAP_KERNEL_AT_EL0);
+
early_fixmap_init();
early_ioremap_init();
 
-- 
2.17.1



[PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()

2020-11-19 Thread Jia He
The permission of the vfio iommu is different from, and incompatible with,
the vma permission. If the iotlb->perm is IOMMU_NONE (e.g. on the qemu
side), qemu will simply call the unmap ioctl() instead of mapping. Hence
vfio_dma_map() can't map a dma region with NONE permission.

This corner case will be exposed by the coming virtio_fs cache_size
commit [1]:
 - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
   in memory_region_init_ram_ptr()
 - re-mmap the above area with read/write permission.
 - vfio_dma_map() will be invoked when the vfio device is hotplug added.

qemu:
vfio_listener_region_add()
vfio_dma_map(..., readonly=false)
map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE
ioctl(VFIO_IOMMU_MAP_DMA)

kernel:
vfio_dma_do_map()
vfio_pin_map_dma()
vfio_pin_pages_remote()
vaddr_get_pfn()
...
check_vma_flags() fails because
vm_flags lacks VM_WRITE while
gup_flags has FOLL_WRITE

The error shows up in the qemu log when hotplug adding a (vfio) nvme disk
to a qemu guest on an Ampere EMAG server:
"VFIO_MAP_DMA failed: Bad address"

[1] 
https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs-dev/hw/virtio/vhost-user-fs.c#L502
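
For reference, a minimal userspace sketch of the mapping sequence above
(the size and the use of MAP_FIXED for the second mapping are assumptions
for illustration; qemu's memory_region_init_ram_ptr() path is not
reproduced here):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 16 * 4096;

	/* Step 1: reserve the area with no permissions, as qemu does. */
	void *base = mmap(NULL, size, PROT_NONE,
			  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (base == MAP_FAILED) {
		perror("mmap PROT_NONE");
		return 1;
	}

	/*
	 * Step 2: re-mmap the area read/write. Until this happens the vma
	 * has neither VM_READ nor VM_WRITE, which is what trips
	 * check_vma_flags() when vfio pins the pages with FOLL_WRITE.
	 */
	void *rw = mmap(base, size, PROT_READ | PROT_WRITE,
			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
	if (rw == MAP_FAILED) {
		perror("mmap rw");
		return 1;
	}

	memset(rw, 0, size);
	printf("reserved at %p, now read/write\n", rw);
	return 0;
}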

Signed-off-by: Jia He 
---
 drivers/vfio/vfio_iommu_type1.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 67e827638995..33faa6b7dbd4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
flags |= FOLL_WRITE;
 
mmap_read_lock(mm);
-   ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM,
+   ret = pin_user_pages_remote(mm, vaddr, 1,
+   flags | FOLL_LONGTERM | FOLL_FORCE,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-- 
2.17.1



[PATCH] gpio: dwapb: Fix missing conversion to GPIO-lib-based IRQ-chip

2020-10-16 Thread Jia He
Commit 0ea683931adb ("gpio: dwapb: Convert driver to using the
GPIO-lib-based IRQ-chip") missed the case in dwapb_irq_set_wake().

Without this fix, probing the dwapb gpio driver hits an error,
"address between user and kernel address ranges", on an Ampere armv8a
server and causes a panic.

Fixes: 0ea683931adb ("gpio: dwapb: Convert driver to using the GPIO-lib-based IRQ-chip")
Signed-off-by: Jia He 
---
 drivers/gpio/gpio-dwapb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpio/gpio-dwapb.c b/drivers/gpio/gpio-dwapb.c
index a5b326754124..2a9046c0fb16 100644
--- a/drivers/gpio/gpio-dwapb.c
+++ b/drivers/gpio/gpio-dwapb.c
@@ -343,8 +343,8 @@ static int dwapb_irq_set_type(struct irq_data *d, u32 type)
 #ifdef CONFIG_PM_SLEEP
 static int dwapb_irq_set_wake(struct irq_data *d, unsigned int enable)
 {
-   struct irq_chip_generic *igc = irq_data_get_irq_chip_data(d);
-   struct dwapb_gpio *gpio = igc->private;
+   struct gpio_chip *gc = irq_data_get_irq_chip_data(d);
+   struct dwapb_gpio *gpio = to_dwapb_gpio(gc);
struct dwapb_context *ctx = gpio->ports[0].ctx;
irq_hw_number_t bit = irqd_to_hwirq(d);
 
-- 
2.17.1



[RFC PATCH 1/6] mm/memory_hotplug: remove redundant memory block size alignment check

2020-07-28 Thread Jia He
The alignment check has already been done by check_hotplug_memory_range().
Hence the redundant one in create_memory_block_devices() can be removed.

The similar redundant check in remove_memory_block_devices() is removed as
well.

Signed-off-by: Jia He 
---
 drivers/base/memory.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b09b68b9f78..4a1691664c6c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -642,10 +642,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
unsigned long block_id;
int ret = 0;
 
-   if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
-!IS_ALIGNED(size, memory_block_size_bytes())))
-   return -EINVAL;
-
for (block_id = start_block_id; block_id != end_block_id; block_id++) {
ret = init_memory_block(&mem, block_id, MEM_OFFLINE);
if (ret)
@@ -678,10 +674,6 @@ void remove_memory_block_devices(unsigned long start, unsigned long size)
struct memory_block *mem;
unsigned long block_id;
 
-   if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
-!IS_ALIGNED(size, memory_block_size_bytes())))
-   return;
-
for (block_id = start_block_id; block_id != end_block_id; block_id++) {
mem = find_memory_block_by_id(block_id);
if (WARN_ON_ONCE(!mem))
-- 
2.17.1



[RFC PATCH 2/6] resource: export find_next_iomem_res() helper

2020-07-28 Thread Jia He
The helper is to find the lowest iomem resource that covers part of
[@start..@end]

It is useful when relaxing the alignment check for dax pmem kmem.

Signed-off-by: Jia He 
---
 include/linux/ioport.h | 3 +++
 kernel/resource.c  | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6c2b06fe8beb..203fd16c9f45 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -247,6 +247,9 @@ extern struct resource * __request_region(struct resource *,
 
 extern void __release_region(struct resource *, resource_size_t,
resource_size_t);
+extern int find_next_iomem_res(resource_size_t start, resource_size_t end,
+  unsigned long flags, unsigned long desc,
+  bool first_lvl, struct resource *res);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int release_mem_region_adjustable(struct resource *, resource_size_t,
resource_size_t);
diff --git a/kernel/resource.c b/kernel/resource.c
index 841737bbda9e..57e6a6802a3d 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -338,7 +338,7 @@ EXPORT_SYMBOL(release_resource);
  * @first_lvl: walk only the first level children, if set
  * @res:   return ptr, if resource found
  */
-static int find_next_iomem_res(resource_size_t start, resource_size_t end,
+int find_next_iomem_res(resource_size_t start, resource_size_t end,
   unsigned long flags, unsigned long desc,
   bool first_lvl, struct resource *res)
 {
@@ -391,6 +391,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
read_unlock(&resource_lock);
return p ? 0 : -ENODEV;
 }
+EXPORT_SYMBOL(find_next_iomem_res);
 
 static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
 unsigned long flags, unsigned long desc,
-- 
2.17.1



[RFC PATCH 3/6] mm/memory_hotplug: allow pmem kmem not to align with memory_block_size

2020-07-28 Thread Jia He
When dax pmem is probed as a RAM device on arm64, kmem_start in
dev_dax_kmem_probe() previously had to be aligned with the 1G memblock size
on arm64 due to SECTION_SIZE_BITS(30).

There is some metadata at the beginning/end of the iomem space, e.g.
namespace info and the nvdimm label:
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0
28000-2bfff : System RAM

Hence the whole kmem space is not aligned with memory_block_size for
either the start addr or the end addr, so there is a big gap when kmem is
added into a memory block, which wastes a lot of memory space.

This change relaxes the alignment check for dax pmem kmem in the path of
onlining/offlining memory blocks.

Signed-off-by: Jia He 
---
 drivers/base/memory.c | 16 
 mm/memory_hotplug.c   | 39 ++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4a1691664c6c..3d2a94f3b1d9 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -334,6 +334,22 @@ static ssize_t valid_zones_show(struct device *dev,
 * online nodes otherwise the page_zone is not reliable
 */
if (mem->state == MEM_ONLINE) {
+#ifdef CONFIG_ZONE_DEVICE
+   struct resource res;
+   int ret;
+
+   /* adjust start_pfn for dax pmem kmem */
+   ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+   ((start_pfn + nr_pages) << PAGE_SHIFT) - 1,
+   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+   IORES_DESC_PERSISTENT_MEMORY,
+   false, &res);
+   if (!ret && PFN_UP(res.start) > start_pfn) {
+   nr_pages -= PFN_UP(res.start) - start_pfn;
+   start_pfn = PFN_UP(res.start);
+   }
+#endif
+
/*
 * The block contains more than one zone can not be offlined.
 * This can happen e.g. for ZONE_DMA and ZONE_DMA32
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a53103dc292b..25745f67b680 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -999,6 +999,20 @@ int try_online_node(int nid)
 
 static int check_hotplug_memory_range(u64 start, u64 size)
 {
+#ifdef CONFIG_ZONE_DEVICE
+   struct resource res;
+   int ret;
+
+   /* Allow pmem kmem not to align with block size */
+   ret = find_next_iomem_res(start, start + size - 1,
+   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+   IORES_DESC_PERSISTENT_MEMORY,
+   false, &res);
+   if (!ret) {
+   return 0;
+   }
+#endif
+
/* memory range must be block size aligned */
if (!size || !IS_ALIGNED(start, memory_block_size_bytes()) ||
!IS_ALIGNED(size, memory_block_size_bytes())) {
@@ -1481,19 +1495,42 @@ static int __ref __offline_pages(unsigned long start_pfn,
mem_hotplug_begin();
 
/*
-* Don't allow to offline memory blocks that contain holes.
+* Don't allow to offline memory blocks that contain holes except
+* for pmem.
 * Consequently, memory blocks with holes can never get onlined
 * via the hotplug path - online_pages() - as hotplugged memory has
 * no holes. This way, we e.g., don't have to worry about marking
 * memory holes PG_reserved, don't need pfn_valid() checks, and can
 * avoid using walk_system_ram_range() later.
+* When dax pmem is used as RAM (kmem), holes at the beginning are
+* allowed.
 */
walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
  count_system_ram_pages_cb);
if (nr_pages != end_pfn - start_pfn) {
+#ifdef CONFIG_ZONE_DEVICE
+   struct resource res;
+
+   /* Allow pmem kmem not to align with block size */
+   ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+   (end_pfn << PAGE_SHIFT) - 1,
+   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+   IORES_DESC_PERSISTENT_MEMORY,
+   false, &res);
+   if (ret) {
+   ret = -EINVAL;
+   reason = "memory holes";
+   goto failed_removal;
+   }
+
+   /* adjust start_pfn for dax pmem kmem */
+   start_pfn = PFN_UP(res.start);
+   end_pfn = PFN_DOWN(res.end + 1);
+#else
ret = -EINVAL;
reason = "memory holes";
goto failed_removal;
+#endif
}
 
/* This makes hotplug much easier...and readable.
-- 
2.17.1



[RFC PATCH 6/6] arm64: fall back to vmemmap_populate_basepages if not aligned with PMD_SIZE

2020-07-28 Thread Jia He
In the dax pmem kmem case (dax pmem used as a RAM device), the start
address might not be aligned with PMD_SIZE, e.g.
24000-33fdf : Persistent Memory
  24000-2421f : namespace0.0
  24240-2bfff : dax0.0
24240-2bfff : System RAM (kmem)
pfn_to_page(0x24240) is fe0007e9.

Without this patch, vmemmap_populate(fe0007e9, ...) will incorrectly
create a pmd mapping [fe0007e0, fe000800] which contains
fe0007e9.

This adds the check and falls back to vmemmap_populate_basepages() for
spans smaller than PMD_SIZE.
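
For reference, a userspace sketch of the span check (PMD_SIZE and the
vmemmap addresses are illustrative values for arm64 with 4K pages; the
pmd_addr_end() below only mimics the kernel macro's contract):

#include <stdint.h>
#include <stdio.h>

#define PMD_SIZE	(1ULL << 21)		/* 2MiB */
#define PMD_MASK	(~(PMD_SIZE - 1))

/* next PMD boundary, capped at end, like the kernel's pmd_addr_end() */
static uint64_t pmd_addr_end(uint64_t addr, uint64_t end)
{
	uint64_t boundary = (addr + PMD_SIZE) & PMD_MASK;

	return boundary < end ? boundary : end;
}

int main(void)
{
	uint64_t start = 0xfffffe0007e93000ULL;	/* vmemmap addr, not PMD aligned */
	uint64_t end   = 0xfffffe0008400000ULL;
	uint64_t addr, next;

	for (addr = start; addr < end; addr = next) {
		next = pmd_addr_end(addr, end);
		if (next - addr < PMD_SIZE)
			printf("[%#llx-%#llx) partial span -> base pages\n",
			       (unsigned long long)addr, (unsigned long long)next);
		else
			printf("[%#llx-%#llx) full span -> PMD mapping ok\n",
			       (unsigned long long)addr, (unsigned long long)next);
	}
	return 0;
}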

Signed-off-by: Jia He 
---
 arch/arm64/mm/mmu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d69feb2cfb84..3b21bd47e801 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1102,6 +1102,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 
do {
next = pmd_addr_end(addr, end);
+   if (next - addr < PMD_SIZE) {
+   vmemmap_populate_basepages(start, next, node, altmap);
+   continue;
+   }
 
pgdp = vmemmap_pgd_populate(addr, node);
if (!pgdp)
-- 
2.17.1



[RFC PATCH 4/6] mm/page_alloc: adjust the start,end in dax pmem kmem case

2020-07-28 Thread Jia He
There are 3 cases when onlining pages:
 - normal RAM, which should be aligned with the memory block size
 - persistent memory with ZONE_DEVICE
 - persistent memory used as normal RAM (kmem) with ZONE_NORMAL; this patch
   tries to adjust the start_pfn/end_pfn after finding the corresponding
   resource range.

Without this patch, the check in __init_single_page() when onlining memory
will fail because those pages haven't been mapped in the mmu (not present
from the mmu's point of view).
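
For reference, a userspace sketch of the pfn clipping used in the hunk
below (PFN_UP/PFN_DOWN follow the kernel definitions for 4K pages; the
resource range is purely illustrative):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define PFN_UP(x)	(((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)

int main(void)
{
	/* pretend this came back from find_next_iomem_res() */
	uint64_t res_start = 0x240001200ULL;	/* not page aligned */
	uint64_t res_end   = 0x27fffffffULL;	/* inclusive end */

	uint64_t start_pfn = PFN_UP(res_start);		/* round start up */
	uint64_t end_pfn   = PFN_DOWN(res_end + 1);	/* round end down */

	printf("online pfns [%#llx, %#llx)\n",
	       (unsigned long long)start_pfn, (unsigned long long)end_pfn);
	return 0;
}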

Signed-off-by: Jia He 
---
 mm/page_alloc.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..13216ab3623f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5971,6 +5971,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (start_pfn == altmap->base_pfn)
start_pfn += altmap->reserve;
end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+   } else {
+   struct resource res;
+   int ret;
+
+   /* adjust the start,end in dax pmem kmem case */
+   ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+   (end_pfn << PAGE_SHIFT) - 1,
+   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+   IORES_DESC_PERSISTENT_MEMORY,
+   false, &res);
+   if (!ret) {
+   start_pfn = PFN_UP(res.start);
+   end_pfn = PFN_DOWN(res.end + 1);
+   }
}
 #endif
 
-- 
2.17.1



[RFC PATCH 5/6] device-dax: relax the memblock size alignment for kmem_start

2020-07-28 Thread Jia He
Previously, kmem_start in dev_dax_kmem_probe() had to be aligned with
SECTION_SIZE_BITS(30), i.e. the 1G memblock size on arm64. Even with Dan
Williams' sub-section patch series, that did not help when adding the
dax pmem kmem to memblock:
$echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$cat /proc/iomem
...
23c00-23fff : System RAM
  23dd4-23fec : reserved
  23fed-23fff : reserved
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0  <- boundary are aligned with 1G
28000-2bfff : System RAM (kmem)
$ lsmem
RANGE SIZE  STATE REMOVABLE BLOCK
0x4000-0x00023fff   8G online   yes   1-8
0x00028000-0x0002bfff   1G online   yes10

Memory block size: 1G
Total online memory:   9G
Total offline memory:  0B
...
Hence there is a big gap between 0x2403f and 0x28000 due to the 1G
alignment on arm64. More than that, only 1G of memory is usable while 2G
was requested.

On x86, the gap is relatively small due to SECTION_SIZE_BITS(27).

Besides decreasing SECTION_SIZE_BITS on arm64, we can relax the alignment
when adding the kmem.
After this patch:
24000-33fdf : Persistent Memory
  24000-2421f : namespace0.0
  24240-2bfff : dax0.0
24240-2bfff : System RAM (kmem)
$ lsmem
RANGE SIZE  STATE REMOVABLE BLOCK
0x4000-0x0002bfff  10G online   yes  1-10

Memory block size: 1G
Total online memory:  10G
Total offline memory:  0B

Note, blocks 9-10 are the newly hotplug-added ones.

This patch removes the tight alignment constraint of
memory_block_size_bytes(), but still keeps the constraint from
online_pages_range().
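
For reference, a userspace sketch of the order-based alignment used in the
hunk below (MAX_ORDER, PAGE_SHIFT and the resource range are illustrative;
get_order() is re-implemented locally just for the demo):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define MAX_ORDER	11
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* smallest order such that (1 << order) pages cover 'size' bytes */
static int get_order(uint64_t size)
{
	int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

int main(void)
{
	uint64_t res_start = 0x240220000ULL;	/* 2MiB-aligned, not 1GiB-aligned */
	uint64_t res_size  = 2ULL << 30;	/* 2GiB namespace */
	uint64_t align, kmem_start, kmem_size;
	int order;

	order = get_order(res_size);
	if (order > MAX_ORDER - 1)
		order = MAX_ORDER - 1;
	align = 1ULL << (order + PAGE_SHIFT);	/* 4MiB with the values above */

	kmem_start = ALIGN(res_start, align);
	kmem_size  = res_size - (kmem_start - res_start);
	kmem_size &= ~(align - 1);

	printf("align %#llx, kmem [%#llx, %#llx)\n",
	       (unsigned long long)align,
	       (unsigned long long)kmem_start,
	       (unsigned long long)(kmem_start + kmem_size));
	return 0;
}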

Signed-off-by: Jia He 
---
 drivers/dax/kmem.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index d77786dc0d92..849d0706dfe0 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -30,9 +30,20 @@ int dev_dax_kmem_probe(struct device *dev)
const char *new_res_name;
int numa_node;
int rc;
+   int order;
 
-   /* Hotplug starting at the beginning of the next block: */
-   kmem_start = ALIGN(res->start, memory_block_size_bytes());
+   /* kmem_start needn't be aligned with memory_block_size_bytes().
+* But given the constraint in online_pages_range(), adjust the
+* alignment of kmem_start and kmem_size
+*/
+   kmem_size = resource_size(res);
+   order = min_t(int, MAX_ORDER - 1, get_order(kmem_size));
+   kmem_start = ALIGN(res->start, 1ul << (order + PAGE_SHIFT));
+   /* Adjust the size down to compensate for moving up kmem_start: */
+   kmem_size -= kmem_start - res->start;
+   /* Align the size down to cover only complete blocks: */
+   kmem_size &= ~((1ul << (order + PAGE_SHIFT)) - 1);
+   kmem_end = kmem_start + kmem_size;
 
/*
 * Ensure good NUMA information for the persistent memory.
@@ -48,13 +59,6 @@ int dev_dax_kmem_probe(struct device *dev)
numa_node, res);
}
 
-   kmem_size = resource_size(res);
-   /* Adjust the size down to compensate for moving up kmem_start: */
-   kmem_size -= kmem_start - res->start;
-   /* Align the size down to cover only complete blocks: */
-   kmem_size &= ~(memory_block_size_bytes() - 1);
-   kmem_end = kmem_start + kmem_size;
-
new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
if (!new_res_name)
return -ENOMEM;
-- 
2.17.1



[RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

2020-07-28 Thread Jia He
When enabling dax pmem as a RAM device on arm64, I noticed that the
kmem_start addr in dev_dax_kmem_probe() has to be aligned with
SECTION_SIZE_BITS(30), i.e. the 1G memblock size. Even though Dan Williams'
sub-section patch series [1] has been merged upstream, it does not help due
to the hard limitation on kmem_start:
$ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$cat /proc/iomem
...
23c00-23fff : System RAM
  23dd4-23fec : reserved
  23fed-23fff : reserved
24000-33fdf : Persistent Memory
  24000-2403f : namespace0.0
  28000-2bfff : dax0.0  <- aligned with 1G boundary
28000-2bfff : System RAM
Hence there is a big gap between 0x2403f and 0x28000 due to the 1G
alignment.
 
Without this series, if qemu creates a 4G nvdimm device, we can only use
2G of it for dax pmem (kmem) in the worst case, e.g.
24000-33fdf : Persistent Memory
We can only use the memblock between [24000, 2] due to the hard
limitation. It wastes too much memory space.

Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but there
are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
SPARSEMEM_VMEMMAP, page bits in struct page ...

Besides decreasing SECTION_SIZE_BITS, we can also relax the kmem alignment
with memory_block_size_bytes().

Tested on an arm64 guest and an x86 guest; qemu creates a 4G pmem device.
dax pmem can be used as RAM with a much smaller gap. The kmem hotplug
add/remove paths were also tested on both arm64 and x86 guests.

This patch series (mainly patch 6/6) is based on the fixing patch,
~v5.8-rc5 [2].

[1] https://lkml.org/lkml/2019/6/19/67
[2] https://lkml.org/lkml/2020/7/8/1546
Jia He (6):
  mm/memory_hotplug: remove redundant memory block size alignment check
  resource: export find_next_iomem_res() helper
  mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
  mm/page_alloc: adjust the start,end in dax pmem kmem case
  device-dax: relax the memblock size alignment for kmem_start
  arm64: fall back to vmemmap_populate_basepages if not aligned  with
PMD_SIZE

 arch/arm64/mm/mmu.c|  4 
 drivers/base/memory.c  | 24 
 drivers/dax/kmem.c | 22 +-
 include/linux/ioport.h |  3 +++
 kernel/resource.c  |  3 ++-
 mm/memory_hotplug.c| 39 ++-
 mm/page_alloc.c| 14 ++
 7 files changed, 90 insertions(+), 19 deletions(-)

-- 
2.17.1



[PATCH 1/2] mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()

2020-07-09 Thread Jia He
This introduces a general dummy helper. memory_add_physaddr_to_nid()
is a fallback option to get the nid in case NUMA_NO_NODE is detected.

After this patch, arm64/sh/s390 can simply use the general dummy version.
PowerPC/x86/ia64 will still use their specific versions.

This is the preparation for setting a fallback value for dev_dax->target_node.
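
For reference, a small userspace sketch of the __weak fallback mechanism
this relies on (the weak attribute behaves the same way with gcc in
userspace; linking an object that provides a strong
memory_add_physaddr_to_nid() overrides the dummy, which is how the
arch-specific versions win):

#include <stdint.h>
#include <stdio.h>

/* generic dummy version: always node 0 */
__attribute__((weak)) int memory_add_physaddr_to_nid(uint64_t start)
{
	printf("Unknown target node for memory at 0x%llx, assuming node 0\n",
	       (unsigned long long)start);
	return 0;
}

int main(void)
{
	int nid = memory_add_physaddr_to_nid(0x240000000ULL);

	printf("fallback nid: %d\n", nid);
	return 0;
}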

Reviewed-by: David Hildenbrand 
Signed-off-by: Jia He 
---
 arch/arm64/mm/numa.c | 10 --
 arch/ia64/mm/numa.c  |  2 --
 arch/sh/mm/init.c|  9 -
 arch/x86/mm/numa.c   |  1 -
 mm/memory_hotplug.c  | 10 ++
 5 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..73f8b49d485c 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -461,13 +461,3 @@ void __init arm64_numa_init(void)
 
numa_init(dummy_numa_init);
 }
-
-/*
- * We hope that we will be hotplugging memory on nodes we already know about,
- * such that acpi_get_node() succeeds and we never fall back to this...
- */
-int memory_add_physaddr_to_nid(u64 addr)
-{
-   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
-   return 0;
-}
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index 5e1015eb6d0d..f34964271101 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -106,7 +106,5 @@ int memory_add_physaddr_to_nid(u64 addr)
return 0;
return nid;
 }
-
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 #endif
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index a70ba0fdd0b3..f75932ba87a6 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -430,15 +430,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
return ret;
 }
 
-#ifdef CONFIG_NUMA
-int memory_add_physaddr_to_nid(u64 addr)
-{
-   /* Node 0 for now.. */
-   return 0;
-}
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
-#endif
-
 void arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap)
 {
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8ee952038c80..2a6e62af4636 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -929,5 +929,4 @@ int memory_add_physaddr_to_nid(u64 start)
nid = numa_meminfo.blk[0].nid;
return nid;
 }
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..b49ab743d914 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -350,6 +350,16 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
return err;
 }
 
+#ifdef CONFIG_NUMA
+int __weak memory_add_physaddr_to_nid(u64 start)
+{
+   pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+   start);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
+
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 unsigned long start_pfn,
-- 
2.17.1



[PATCH 2/2] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

2020-07-09 Thread Jia He
When check_memblock_offlined_cb() returns a failed rc (e.g. the memblock
is still online at that time), mem_hotplug_begin/done is left unpaired.

Therefore a warning is triggered:
 Call Trace:
  percpu_up_write+0x33/0x40
  try_remove_memory+0x66/0x120
  ? _cond_resched+0x19/0x30
  remove_memory+0x2b/0x40
  dev_dax_kmem_remove+0x36/0x72 [kmem]
  device_release_driver_internal+0xf0/0x1c0
  device_release_driver+0x12/0x20
  bus_remove_device+0xe1/0x150
  device_del+0x17b/0x3e0
  unregister_dev_dax+0x29/0x60
  devm_action_release+0x15/0x20
  release_nodes+0x19a/0x1e0
  devres_release_all+0x3f/0x50
  device_release_driver_internal+0x100/0x1c0
  driver_detach+0x4c/0x8f
  bus_remove_driver+0x5c/0xd0
  driver_unregister+0x31/0x50
  dax_pmem_exit+0x10/0xfe0 [dax_pmem]
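
For reference, a condensed, runnable sketch of the imbalance (the functions
below are stand-ins; the real try_remove_memory() and the hotplug percpu
rwsem are not reproduced):

#include <stdio.h>

static int lock_depth;

static void mem_hotplug_begin(void) { lock_depth++; }
static void mem_hotplug_done(void)  { lock_depth--; }

/* stand-in for walk_memory_blocks(..., check_memblock_offlined_cb) */
static int blocks_still_online(void) { return 1; /* force the error path */ }

static int try_remove_memory_old(void)
{
	int rc = blocks_still_online() ? -16 : 0;	/* -EBUSY stand-in */

	if (rc)
		goto done;		/* jumps past mem_hotplug_begin()... */

	mem_hotplug_begin();
	/* ... actual removal ... */
done:
	mem_hotplug_done();		/* ...but still drops the lock: unpaired */
	return rc;
}

static int try_remove_memory_fixed(void)
{
	int rc = blocks_still_online() ? -16 : 0;

	if (rc)
		return rc;		/* bail out before taking the lock */

	mem_hotplug_begin();
	/* ... actual removal ... */
	mem_hotplug_done();
	return 0;
}

int main(void)
{
	try_remove_memory_old();
	printf("old path lock depth: %d (unbalanced)\n", lock_depth);

	lock_depth = 0;
	try_remove_memory_fixed();
	printf("fixed path lock depth: %d\n", lock_depth);
	return 0;
}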

Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Cc: sta...@vger.kernel.org # v5.6+
Signed-off-by: Jia He 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Acked-by: Dan Williams 
---
 mm/memory_hotplug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b49ab743d914..3e0645387daf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1752,7 +1752,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 */
rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
if (rc)
-   goto done;
+   return rc;
 
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1776,9 +1776,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
try_offline_node(nid);
 
-done:
mem_hotplug_done();
-   return rc;
+   return 0;
 }
 
 /**
-- 
2.17.1



[PATCH v4 0/2] Fix and enable pmem as RAM device on arm64

2020-07-09 Thread Jia He
This fixes a few issues I hit when trying to enable pmem as a RAM device
on arm64.

To use memory_add_physaddr_to_nid() as a fallback nid, it is better to
implement a general version (__weak) in mm/memory_hotplug. After that,
arm64/sh/s390 can simply use the general version, while PowerPC/ia64/x86
keep their arch specific versions.

Tested on a ThunderX2 host / qemu "-M virt" guest with an nvdimm device.
The memblocks from the dax pmem device can be either hot-added or
hot-removed on the arm64 guest. Also passed the compilation test on x86.

Changes:
v4: - remove "device-dax: use fallback nid when numa_node is invalid"; wait
  for Dan Williams' phys_addr_to_target_node() patch
- fold v3 patches 1-4 into a single one, no functional changes
v3: https://lkml.org/lkml/2020/7/8/1541
- introduce general version memory_add_physaddr_to_nid, refine the arch
  specific one
- fix an uninitialization bug in v2 device-dax patch
v2: https://lkml.org/lkml/2020/7/7/71
- Drop unnecessary patch to harden try_offline_node
- Use new solution(by David) to fix dev->target_node=-1 during probing
- Refine the mem_hotplug_begin/done patch

v1: https://lkml.org/lkml/2020/7/5/381

Jia He (2):
  mm/memory_hotplug: introduce default dummy
memory_add_physaddr_to_nid()
  mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

 arch/arm64/mm/numa.c | 10 --
 arch/ia64/mm/numa.c  |  2 --
 arch/sh/mm/init.c|  9 -
 arch/x86/mm/numa.c   |  1 -
 mm/memory_hotplug.c  | 15 ---
 5 files changed, 12 insertions(+), 25 deletions(-)

-- 
2.17.1



[PATCH v3 5/6] device-dax: use fallback nid when numa_node is invalid

2020-07-08 Thread Jia He
numa_off is set unconditionally at the end of dummy_numa_init(),
even with a fake numa node. ACPI then detects the node id as NUMA_NO_NODE(-1)
in acpi_map_pxm_to_node() because it regards numa_off as turning off the numa
node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa.

Without this patch, pmem can't be probed as a RAM device on arm64 if the SRAT
table isn't present:
$ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1
kmem: probe of dax0.0 failed with error -22

This fixes it by using the fallback memory_add_physaddr_to_nid() as the nid.

Suggested-by: David Hildenbrand 
Signed-off-by: Jia He 
---
 drivers/dax/kmem.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 275aa5f87399..218f66057994 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -31,22 +31,23 @@ int dev_dax_kmem_probe(struct device *dev)
int numa_node;
int rc;
 
+   /* Hotplug starting at the beginning of the next block: */
+   kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
/*
 * Ensure good NUMA information for the persistent memory.
 * Without this check, there is a risk that slow memory
 * could be mixed in a node with faster memory, causing
-* unavoidable performance issues.
+* unavoidable performance issues. Furthermore, fallback node
+* id can be used when numa_node is invalid.
 */
numa_node = dev_dax->target_node;
if (numa_node < 0) {
-   dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
-res, numa_node);
-   return -EINVAL;
+   numa_node = memory_add_physaddr_to_nid(kmem_start);
+   dev_info(dev, "using nid %d for DAX region with undefined nid %pR\n",
+   numa_node, res);
}
 
-   /* Hotplug starting at the beginning of the next block: */
-   kmem_start = ALIGN(res->start, memory_block_size_bytes());
-
kmem_size = resource_size(res);
/* Adjust the size down to compensate for moving up kmem_start: */
kmem_size -= kmem_start - res->start;
@@ -100,15 +101,19 @@ static int dev_dax_kmem_remove(struct device *dev)
resource_size_t kmem_start = res->start;
resource_size_t kmem_size = resource_size(res);
const char *res_name = res->name;
+   int numa_node = dev_dax->target_node;
int rc;
 
+   if (numa_node < 0)
+   numa_node = memory_add_physaddr_to_nid(kmem_start);
+
/*
 * We have one shot for removing memory, if some memory blocks were not
 * offline prior to calling this function remove_memory() will fail, and
 * there is no way to hotremove this memory until reboot because device
 * unbind will succeed even if we return failure.
 */
-   rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+   rc = remove_memory(numa_node, kmem_start, kmem_size);
if (rc) {
any_hotremove_failed = true;
dev_err(dev,
-- 
2.17.1



[PATCH v3 6/6] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

2020-07-08 Thread Jia He
When check_memblock_offlined_cb() returns a failed rc (e.g. the memblock
is still online at that time), mem_hotplug_begin/done is left unpaired.

Therefore a warning is triggered:
 Call Trace:
  percpu_up_write+0x33/0x40
  try_remove_memory+0x66/0x120
  ? _cond_resched+0x19/0x30
  remove_memory+0x2b/0x40
  dev_dax_kmem_remove+0x36/0x72 [kmem]
  device_release_driver_internal+0xf0/0x1c0
  device_release_driver+0x12/0x20
  bus_remove_device+0xe1/0x150
  device_del+0x17b/0x3e0
  unregister_dev_dax+0x29/0x60
  devm_action_release+0x15/0x20
  release_nodes+0x19a/0x1e0
  devres_release_all+0x3f/0x50
  device_release_driver_internal+0x100/0x1c0
  driver_detach+0x4c/0x8f
  bus_remove_driver+0x5c/0xd0
  driver_unregister+0x31/0x50
  dax_pmem_exit+0x10/0xfe0 [dax_pmem]

Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Cc: sta...@vger.kernel.org # v5.6+
Signed-off-by: Jia He 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Acked-by: Dan Williams 
---
 mm/memory_hotplug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b49ab743d914..3e0645387daf 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1752,7 +1752,7 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)
 */
rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
if (rc)
-   goto done;
+   return rc;
 
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1776,9 +1776,8 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)
 
try_offline_node(nid);
 
-done:
mem_hotplug_done();
-   return rc;
+   return 0;
 }
 
 /**
-- 
2.17.1



[PATCH v3 4/6] mm: don't export memory_add_physaddr_to_nid in arch specific directory

2020-07-08 Thread Jia He
After a general version of __weak memory_add_physaddr_to_nid() is
implemented and exported, there is no point exporting it twice in the arch
directories, even if e.g. ia64/x86 have their specific versions.

This suppresses the modpost warning:
WARNING: modpost: vmlinux: 'memory_add_physaddr_to_nid' exported twice.
Previous export was in vmlinux

Suggested-by: David Hildenbrand 
Signed-off-by: Jia He 
---
 arch/ia64/mm/numa.c | 2 --
 arch/x86/mm/numa.c  | 1 -
 2 files changed, 3 deletions(-)

diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index 5e1015eb6d0d..f34964271101 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -106,7 +106,5 @@ int memory_add_physaddr_to_nid(u64 addr)
return 0;
return nid;
 }
-
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8ee952038c80..2a6e62af4636 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -929,5 +929,4 @@ int memory_add_physaddr_to_nid(u64 start)
nid = numa_meminfo.blk[0].nid;
return nid;
 }
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
-- 
2.17.1



[PATCH v3 1/6] mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()

2020-07-08 Thread Jia He
This is to introduce a general dummy helper. memory_add_physaddr_to_nid()
is a fallback option to get the nid in case NUMA_NO_NODE is detected.

After this patch, arm64/sh/s390 can simply use the general dummy version.
PowerPC/x86/ia64 will still use their specific versions.

Signed-off-by: Jia He 
---
 mm/memory_hotplug.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..b49ab743d914 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -350,6 +350,16 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
return err;
 }
 
+#ifdef CONFIG_NUMA
+int __weak memory_add_physaddr_to_nid(u64 start)
+{
+   pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+   start);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
+
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 unsigned long start_pfn,
-- 
2.17.1



[PATCH v3 2/6] arm64/mm: use default dummy memory_add_physaddr_to_nid()

2020-07-08 Thread Jia He
After adding a default memory_add_physaddr_to_nid() in mm/memory_hotplug,
there is no need to define a similar one in the arch specific directory.

Signed-off-by: Jia He 
---
 arch/arm64/mm/numa.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..73f8b49d485c 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -461,13 +461,3 @@ void __init arm64_numa_init(void)
 
numa_init(dummy_numa_init);
 }
-
-/*
- * We hope that we will be hotplugging memory on nodes we already know about,
- * such that acpi_get_node() succeeds and we never fall back to this...
- */
-int memory_add_physaddr_to_nid(u64 addr)
-{
-   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
-   return 0;
-}
-- 
2.17.1



[PATCH v3 3/6] sh/mm: use default dummy memory_add_physaddr_to_nid()

2020-07-08 Thread Jia He
After adding a default memory_add_physaddr_to_nid() in mm/memory_hotplug,
there is no need to define a similar one in the arch specific directory.

Signed-off-by: Jia He 
---
 arch/sh/mm/init.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index a70ba0fdd0b3..f75932ba87a6 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -430,15 +430,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
return ret;
 }
 
-#ifdef CONFIG_NUMA
-int memory_add_physaddr_to_nid(u64 addr)
-{
-   /* Node 0 for now.. */
-   return 0;
-}
-EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
-#endif
-
 void arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap)
 {
-- 
2.17.1



[PATCH v3 0/6] Fix and enable pmem as RAM device on arm64

2020-07-08 Thread Jia He
This fixes a few issues I hit when trying to enable pmem as a RAM device
on arm64.

To use memory_add_physaddr_to_nid() as a fallback nid, it is better to
implement a general version (__weak) in mm/memory_hotplug. After that,
arm64/sh/s390 can simply use the general version, while PowerPC/ia64/x86
keep their arch specific versions.

Tested on a ThunderX2 host / qemu "-M virt" guest with an nvdimm device.
The memblocks from the dax pmem device can be either hot-added or
hot-removed on the arm64 guest. Also passed the compilation test on x86.

Changes:
v3: - introduce general version memory_add_physaddr_to_nid, refine the arch
  specific one
- fix an uninitialization bug in v2 device-dax patch
v2: https://lkml.org/lkml/2020/7/7/71
- Drop unnecessary patch to harden try_offline_node
- Use new solution(by David) to fix dev->target_node=-1 during probing
- Refine the mem_hotplug_begin/done patch

v1: https://lkml.org/lkml/2020/7/5/381


Jia He (6):
  mm/memory_hotplug: introduce default dummy
memory_add_physaddr_to_nid()
  arm64/mm: use default dummy memory_add_physaddr_to_nid()
  sh/mm: use default dummy memory_add_physaddr_to_nid()
  mm: don't export memory_add_physaddr_to_nid in arch specific directory
  device-dax: use fallback nid when numa_node is invalid
  mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

 arch/arm64/mm/numa.c | 10 --
 arch/ia64/mm/numa.c  |  2 --
 arch/sh/mm/init.c|  9 -
 arch/x86/mm/numa.c   |  1 -
 drivers/dax/kmem.c   | 21 +
 mm/memory_hotplug.c  | 15 ---
 6 files changed, 25 insertions(+), 33 deletions(-)

-- 
2.17.1



[PATCH v2 0/3] Fix and enable pmem as RAM device on arm64

2020-07-07 Thread Jia He
This fixes a few issues I hit when trying to enable pmem as a RAM device
on arm64.

Tested on a ThunderX2 host / qemu "-M virt" guest with an nvdimm device.
The memblocks from the dax pmem device can be either hot-added or
hot-removed on the arm64 guest.

Changes:
v2: - Drop unnecessary patch to harden try_offline_node
- Use new solution(by David) to fix dev->target_node=-1 during probing
- Refine the mem_hotplug_begin/done patch

v1: https://lkml.org/lkml/2020/7/5/381

Jia He (3):
  arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL
  device-dax: use fallback nid when numa_node is invalid
  mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

 arch/arm64/mm/numa.c |  5 +++--
 drivers/dax/kmem.c   | 22 ++
 mm/memory_hotplug.c  |  5 ++---
 3 files changed, 19 insertions(+), 13 deletions(-)

-- 
2.17.1



[RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid

2020-07-07 Thread Jia He
Previously, numa_off is set unconditionally at the end of dummy_numa_init(),
even with a fake numa node. ACPI then detects the node id as NUMA_NO_NODE(-1)
in acpi_map_pxm_to_node() because it regards numa_off as turning off the numa
node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa.

Without this patch, pmem can't be probed as a RAM device on arm64 if the SRAT
table isn't present:
$ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1
kmem: probe of dax0.0 failed with error -22

This fixes it by using the fallback memory_add_physaddr_to_nid() as the nid.

Suggested-by: David Hildenbrand 
Signed-off-by: Jia He 
---
I noticed that on powerpc memory_add_physaddr_to_nid is not exported for module
driver. Set it to RFC due to this concern.

 drivers/dax/kmem.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 275aa5f87399..68e693ca6d59 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev)
resource_size_t kmem_end;
struct resource *new_res;
const char *new_res_name;
-   int numa_node;
+   int numa_node, new_node;
int rc;
 
/*
 * Ensure good NUMA information for the persistent memory.
-* Without this check, there is a risk that slow memory
-* could be mixed in a node with faster memory, causing
-* unavoidable performance issues.
+* Without this check, there is a risk, though not a fatal one,
+* that slow memory could be mixed into a node with faster
+* memory, causing unavoidable performance issues. Furthermore, a
+* fallback node id can be used when numa_node is invalid.
 */
numa_node = dev_dax->target_node;
if (numa_node < 0) {
-   dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
-res, numa_node);
-   return -EINVAL;
+   new_node = memory_add_physaddr_to_nid(kmem_start);
+   dev_info(dev, "changing nid from %d to %d for DAX region %pR\n",
+   numa_node, new_node, res);
+   numa_node = new_node;
}
 
/* Hotplug starting at the beginning of the next block: */
@@ -100,6 +102,7 @@ static int dev_dax_kmem_remove(struct device *dev)
resource_size_t kmem_start = res->start;
resource_size_t kmem_size = resource_size(res);
const char *res_name = res->name;
+   int numa_node = dev_dax->target_node;
int rc;
 
/*
@@ -108,7 +111,10 @@ static int dev_dax_kmem_remove(struct device *dev)
 * there is no way to hotremove this memory until reboot because device
 * unbind will succeed even if we return failure.
 */
-   rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+   if (numa_node < 0)
+   numa_node = memory_add_physaddr_to_nid(kmem_start);
+
+   rc = remove_memory(numa_node, kmem_start, kmem_size);
if (rc) {
any_hotremove_failed = true;
dev_err(dev,
-- 
2.17.1



[PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

2020-07-07 Thread Jia He
This exports memory_add_physaddr_to_nid() for module drivers to use.

memory_add_physaddr_to_nid() is a fallback option to get the nid in case
NUMA_NO_NODE is detected.

Suggested-by: David Hildenbrand 
Signed-off-by: Jia He 
---
 arch/arm64/mm/numa.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..7eeb31740248 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
 
 /*
  * We hope that we will be hotplugging memory on nodes we already know about,
- * such that acpi_get_node() succeeds and we never fall back to this...
+ * such that acpi_get_node() succeeds. But when SRAT is not present, the node
+ * id may be detected as NUMA_NO_NODE by ACPI, so provide a fallback option here.
  */
 int memory_add_physaddr_to_nid(u64 addr)
 {
-   pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
return 0;
 }
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
-- 
2.17.1



[PATCH v2 3/3] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

2020-07-06 Thread Jia He
When check_memblock_offlined_cb() returns a failed rc (e.g. the memblock
is still online at that time), mem_hotplug_begin/done is left unpaired.

Therefore a warning is triggered:
 Call Trace:
  percpu_up_write+0x33/0x40
  try_remove_memory+0x66/0x120
  ? _cond_resched+0x19/0x30
  remove_memory+0x2b/0x40
  dev_dax_kmem_remove+0x36/0x72 [kmem]
  device_release_driver_internal+0xf0/0x1c0
  device_release_driver+0x12/0x20
  bus_remove_device+0xe1/0x150
  device_del+0x17b/0x3e0
  unregister_dev_dax+0x29/0x60
  devm_action_release+0x15/0x20
  release_nodes+0x19a/0x1e0
  devres_release_all+0x3f/0x50
  device_release_driver_internal+0x100/0x1c0
  driver_detach+0x4c/0x8f
  bus_remove_driver+0x5c/0xd0
  driver_unregister+0x31/0x50
  dax_pmem_exit+0x10/0xfe0 [dax_pmem]

Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Cc: sta...@vger.kernel.org # v5.6+
Signed-off-by: Jia He 
---
 mm/memory_hotplug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..76c75a599da3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1742,7 +1742,7 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)
 */
rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
if (rc)
-   goto done;
+   return rc;
 
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1766,9 +1766,8 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)
 
try_offline_node(nid);
 
-done:
mem_hotplug_done();
-   return rc;
+   return 0;
 }
 
 /**
-- 
2.17.1



[PATCH 3/3] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

2020-07-05 Thread Jia He
When check_memblock_offlined_cb() returns a failed rc (e.g. the memblock
is still online at that time), mem_hotplug_begin/done is left unpaired.

Therefore a warning is triggered:
 Call Trace:
  percpu_up_write+0x33/0x40
  try_remove_memory+0x66/0x120
  ? _cond_resched+0x19/0x30
  remove_memory+0x2b/0x40
  dev_dax_kmem_remove+0x36/0x72 [kmem]
  device_release_driver_internal+0xf0/0x1c0
  device_release_driver+0x12/0x20
  bus_remove_device+0xe1/0x150
  device_del+0x17b/0x3e0
  unregister_dev_dax+0x29/0x60
  devm_action_release+0x15/0x20
  release_nodes+0x19a/0x1e0
  devres_release_all+0x3f/0x50
  device_release_driver_internal+0x100/0x1c0
  driver_detach+0x4c/0x8f
  bus_remove_driver+0x5c/0xd0
  driver_unregister+0x31/0x50
  dax_pmem_exit+0x10/0xfe0 [dax_pmem]

This fixes it by moving mem_hotplug_done() ahead of the "done" label.

Signed-off-by: Jia He 
---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e1e290577b45..86b36714342b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1769,8 +1769,8 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)
 
try_offline_node(nid);
 
-done:
mem_hotplug_done();
+done:
return rc;
 }
 
-- 
2.17.1



[PATCH 0/3] Fix and enable pmem as RAM on arm64

2020-07-05 Thread Jia He
This fixes a few issues I hit when trying to enable pmem as a RAM device
on arm64.

Tested on a ThunderX2 host / qemu "-M virt" guest.

Jia He (3):
  arm64/numa: set numa_off to false when numa node is fake
  mm/memory_hotplug: harden try_offline_node against bogus nid
  mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

 arch/arm64/mm/numa.c | 3 ++-
 mm/memory_hotplug.c  | 5 -
 2 files changed, 6 insertions(+), 2 deletions(-)

-- 
2.17.1



[PATCH 1/3] arm64/numa: set numa_off to false when numa node is fake

2020-07-05 Thread Jia He
Previously, numa_off is set to true unconditionally in dummy_numa_init(),
even if there is a fake numa node.

But acpi will translate node id to NUMA_NO_NODE(-1) in acpi_map_pxm_to_node()
because it regards numa_off as turning off the numa node.

Without this patch, pmem can't be probed as a RAM device on arm64 if the
SRAT table isn't present.

$ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with invalid node: -1
kmem: probe of dax0.0 failed with error -22

This fixes it by setting numa_off to false.

Signed-off-by: Jia He 
---
 arch/arm64/mm/numa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..7689986020d9 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -440,7 +440,8 @@ static int __init dummy_numa_init(void)
return ret;
}
 
-   numa_off = true;
+   /* force numa_off to be false since we have a fake numa node here */
+   numa_off = false;
return 0;
 }
 
-- 
2.17.1



[PATCH 2/3] mm/memory_hotplug: harden try_offline_node against bogus nid

2020-07-05 Thread Jia He
When testing the remove_memory path of dax pmem, there is a panic with the
following call trace:
  try_remove_memory+0x84/0x170
  remove_memory+0x38/0x58
  dev_dax_kmem_remove+0x3c/0x84 [kmem]
  device_release_driver_internal+0xfc/0x1c8
  device_release_driver+0x28/0x38
  bus_remove_device+0xd4/0x158
  device_del+0x160/0x3a0
  unregister_dev_dax+0x30/0x68
  devm_action_release+0x20/0x30
  release_nodes+0x150/0x240
  devres_release_all+0x6c/0x1d0
  device_release_driver_internal+0x10c/0x1c8
  driver_detach+0xac/0x170
  bus_remove_driver+0x64/0x130
  driver_unregister+0x34/0x60
  dax_pmem_exit+0x14/0xffc4 [dax_pmem]
  __arm64_sys_delete_module+0x18c/0x2d0
  el0_svc_common.constprop.2+0x78/0x168
  do_el0_svc+0x34/0xa0
  el0_sync_handler+0xe0/0x188
  el0_sync+0x164/0x180

It is caused by the bogus nid (-1). Although the root cause is that pmem
dax translates from pxm to node_id incorrectly due to numa_off, it is worth
hardening the code in try_offline_node(), bailing out if !pgdat.

Signed-off-by: Jia He 
---
 mm/memory_hotplug.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..e1e290577b45 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1680,6 +1680,9 @@ void try_offline_node(int nid)
pg_data_t *pgdat = NODE_DATA(nid);
int rc;
 
+   if (WARN_ON(!pgdat))
+   return;
+
/*
 * If the node still spans pages (especially ZONE_DEVICE), don't
 * offline it. A node spans memory after move_pfn_range_to_zone(),
-- 
2.17.1



[PATCH v5] virtio_vsock: Fix race condition in virtio_transport_recv_pkt

2020-05-30 Thread Jia He
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the
server on the guest, there will be a panic on a ThunderX2 (armv8a server):

[  463.718844] Unable to handle kernel NULL pointer dereference at virtual 
address 
[  463.718848] Mem abort info:
[  463.718849]   ESR = 0x9644
[  463.718852]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853]   SET = 0, FnV = 0
[  463.718854]   EA = 0, S1PTW = 0
[  463.718855] Data abort info:
[  463.718856]   ISV = 0, ISS = 0x0044
[  463.718857]   CM = 0, WnR = 1
[  463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000
[  463.718861] [] pgd=
[  463.718866] Internal error: Oops: 9644 [#1] SMP
[...]
[  463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G   O  
5.7.0-rc7+ #139
[  463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 
09/25/2018
[  463.718982] pstate: 6049 (nZCv daif +PAN -UAO)
[  463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719000] sp : 80002dbe3c40
[...]
[  463.719025] Call trace:
[  463.719030]  virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719034]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048]  kthread+0x128/0x130
[  463.719052]  ret_from_fork+0x10/0x18

The race condition is as follows:
Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found sk)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    sk->sk_shutdown = SHUTDOWN_MASK
    ...
    release_sock
                                       lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic!)

The root cause is that vsock_find_bound_socket can't hold the lock_sock,
so there is a small race window between vsock_find_bound_socket() and
lock_sock(). If __vsock_release() is running in another task,
sk->sk_socket will be set to NULL inadvertently.

Thus check the data structure member "sk_shutdown" (suggested by Stefano)
after a call of the function "lock_sock" since this field is set to
"SHUTDOWN_MASK" under the protection of "lock_sock_nested".

Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
Signed-off-by: Jia He 
Cc: sta...@vger.kernel.org
Cc: Asias He 
Reviewed-by: Stefano Garzarella 
---
v5: sorry, MIME type in the previous commit message

 net/vmw_vsock/virtio_transport_common.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 69efc891885f..0edda1edf988 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport 
*t,
 
lock_sock(sk);
 
+   /* Check if sk has been released before lock_sock */
+   if (sk->sk_shutdown == SHUTDOWN_MASK) {
+   (void)virtio_transport_reset_no_sock(t, pkt);
+   release_sock(sk);
+   sock_put(sk);
+   goto free_pkt;
+   }
+
/* Update CID in case it has changed after a transport reset event */
vsk->local_addr.svm_cid = dst.svm_cid;
 
-- 
2.17.1



[PATCH] virtio_vsock: Fix race condition in virtio_transport_recv_pkt

2020-05-30 Thread Jia He
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the
server on the guest, there will be a panic on a ThunderX2 (armv8a server):

[  463.718844] Unable to handle kernel NULL pointer dereference at virtual 
address 
[  463.718848] Mem abort info:
[  463.718849]   ESR = 0x9644
[  463.718852]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853]   SET = 0, FnV = 0
[  463.718854]   EA = 0, S1PTW = 0
[  463.718855] Data abort info:
[  463.718856]   ISV = 0, ISS = 0x0044
[  463.718857]   CM = 0, WnR = 1
[  463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000
[  463.718861] [] pgd=
[  463.718866] Internal error: Oops: 9644 [#1] SMP
[...]
[  463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G   O  
5.7.0-rc7+ #139
[  463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 
09/25/2018
[  463.718982] pstate: 6049 (nZCv daif +PAN -UAO)
[  463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719000] sp : 80002dbe3c40
[...]
[  463.719025] Call trace:
[  463.719030]  virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719034]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048]  kthread+0x128/0x130
[  463.719052]  ret_from_fork+0x10/0x18

The race condition is as follows:
Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found sk)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    sk->sk_shutdown = SHUTDOWN_MASK
    ...
    release_sock
                                       lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic!)

The root cause is that vsock_find_bound_socket can't hold the lock_sock,
so there is a small race window between vsock_find_bound_socket() and
lock_sock(). If __vsock_release() is running in another task,
sk->sk_socket will be set to NULL inadvertently.

Thus check the data structure member "sk_shutdown" (suggested by Stefano)
after a call of the function "lock_sock" since this field is set to
"SHUTDOWN_MASK" under the protection of "lock_sock_nested".

Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
Signed-off-by: Jia He 
Cc: sta...@vger.kernel.org
Cc: Asias He 
Reviewed-by: Stefano Garzarella 
---
v4: refine the commit msg (from Markus)

 net/vmw_vsock/virtio_transport_common.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 69efc891885f..0edda1edf988 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport 
*t,
 
lock_sock(sk);
 
+   /* Check if sk has been released before lock_sock */
+   if (sk->sk_shutdown == SHUTDOWN_MASK) {
+   (void)virtio_transport_reset_no_sock(t, pkt);
+   release_sock(sk);
+   sock_put(sk);
+   goto free_pkt;
+   }
+
/* Update CID in case it has changed after a transport reset event */
vsk->local_addr.svm_cid = dst.svm_cid;
 
-- 
2.17.1



[PATCH v3] virtio_vsock: Fix race condition in virtio_transport_recv_pkt

2020-05-29 Thread Jia He
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the
server on the guest, there will be a panic on a ThunderX2 (armv8a server):

[  463.718844] Unable to handle kernel NULL pointer dereference at virtual 
address 
[  463.718848] Mem abort info:
[  463.718849]   ESR = 0x9644
[  463.718852]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853]   SET = 0, FnV = 0
[  463.718854]   EA = 0, S1PTW = 0
[  463.718855] Data abort info:
[  463.718856]   ISV = 0, ISS = 0x0044
[  463.718857]   CM = 0, WnR = 1
[  463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000
[  463.718861] [] pgd=
[  463.718866] Internal error: Oops: 9644 [#1] SMP
[...]
[  463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G   O  
5.7.0-rc7+ #139
[  463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 
09/25/2018
[  463.718982] pstate: 6049 (nZCv daif +PAN -UAO)
[  463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719000] sp : 80002dbe3c40
[...]
[  463.719025] Call trace:
[  463.719030]  virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719034]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048]  kthread+0x128/0x130
[  463.719052]  ret_from_fork+0x10/0x18

The race condition is as follows:
Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found sk)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    sk->sk_shutdown = SHUTDOWN_MASK
    ...
    release_sock
                                       lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic!)

The root cause is that vsock_find_bound_socket can't hold the lock_sock,
so there is a small race window between vsock_find_bound_socket() and
lock_sock(). If __vsock_release() is running in another task,
sk->sk_socket will be set to NULL inadvertently.

This fixes it by checking sk->sk_shutdown(suggested by Stefano) after
lock_sock since sk->sk_shutdown is set to SHUTDOWN_MASK under the
protection of lock_sock_nested.

Signed-off-by: Jia He 
Cc: sta...@vger.kernel.org
Reviewed-by: Stefano Garzarella 
---
v3: - describe the fix of race condition more clearly
- refine the commit log

 net/vmw_vsock/virtio_transport_common.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 69efc891885f..0edda1edf988 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport 
*t,
 
lock_sock(sk);
 
+   /* Check if sk has been released before lock_sock */
+   if (sk->sk_shutdown == SHUTDOWN_MASK) {
+   (void)virtio_transport_reset_no_sock(t, pkt);
+   release_sock(sk);
+   sock_put(sk);
+   goto free_pkt;
+   }
+
/* Update CID in case it has changed after a transport reset event */
vsk->local_addr.svm_cid = dst.svm_cid;
 
-- 
2.17.1



[PATCH v2] virtio_vsock: Fix race condition in virtio_transport_recv_pkt

2020-05-29 Thread Jia He
When a client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the
server in the guest, there will be a panic on a ThunderX2 (armv8a server):
[  463.718844][ T5040] Unable to handle kernel NULL pointer dereference at 
virtual address 
[  463.718848][ T5040] Mem abort info:
[  463.718849][ T5040]   ESR = 0x9644
[  463.718852][ T5040]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853][ T5040]   SET = 0, FnV = 0
[  463.718854][ T5040]   EA = 0, S1PTW = 0
[  463.718855][ T5040] Data abort info:
[  463.718856][ T5040]   ISV = 0, ISS = 0x0044
[  463.718857][ T5040]   CM = 0, WnR = 1
[  463.718859][ T5040] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000
[  463.718861][ T5040] [] pgd=
[  463.718866][ T5040] Internal error: Oops: 9644 [#1] SMP
[...]
[  463.718977][ T5040] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G   
O  5.7.0-rc7+ #139
[  463.718980][ T5040] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS 
F06 09/25/2018
[  463.718982][ T5040] pstate: 6049 (nZCv daif +PAN -UAO)
[  463.718995][ T5040] pc : virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.718999][ T5040] lr : virtio_transport_recv_pkt+0x1fc/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719000][ T5040] sp : 80002dbe3c40
[...]
[  463.719025][ T5040] Call trace:
[  463.719030][ T5040]  virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719034][ T5040]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041][ T5040]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048][ T5040]  kthread+0x128/0x130
[  463.719052][ T5040]  ret_from_fork+0x10/0x18

The race condition is as follows:
Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    ...
    release_sock
                                       lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic)

The root cause is that vsock_find_bound_socket can't hold the lock_sock,
so there is a small race window between vsock_find_bound_socket() and
lock_sock(). If there is __vsock_release() in another task, sk->sk_socket
will be set to NULL inadvertently.

This fixes it by checking sk->sk_shutdown.

Signed-off-by: Jia He 
Cc: sta...@vger.kernel.org
Cc: Stefano Garzarella 
---
v2: use lightweight checking suggested by Stefano Garzarella

 net/vmw_vsock/virtio_transport_common.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 69efc891885f..0edda1edf988 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct virtio_transport 
*t,
 
lock_sock(sk);
 
+   /* Check if sk has been released before lock_sock */
+   if (sk->sk_shutdown == SHUTDOWN_MASK) {
+   (void)virtio_transport_reset_no_sock(t, pkt);
+   release_sock(sk);
+   sock_put(sk);
+   goto free_pkt;
+   }
+
/* Update CID in case it has changed after a transport reset event */
vsk->local_addr.svm_cid = dst.svm_cid;
 
-- 
2.17.1



[PATCH] virtio_vsock: Fix race condition in virtio_transport_recv_pkt

2020-05-29 Thread Jia He
When a client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the
server in the guest, there will be a panic on a ThunderX2 (armv8a server):
[  463.718844][ T5040] Unable to handle kernel NULL pointer dereference at 
virtual address 
[  463.718848][ T5040] Mem abort info:
[  463.718849][ T5040]   ESR = 0x9644
[  463.718852][ T5040]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853][ T5040]   SET = 0, FnV = 0
[  463.718854][ T5040]   EA = 0, S1PTW = 0
[  463.718855][ T5040] Data abort info:
[  463.718856][ T5040]   ISV = 0, ISS = 0x0044
[  463.718857][ T5040]   CM = 0, WnR = 1
[  463.718859][ T5040] user pgtable: 4k pages, 48-bit VAs, pgdp=008f6f6e9000
[  463.718861][ T5040] [] pgd=
[  463.718866][ T5040] Internal error: Oops: 9644 [#1] SMP
[...]
[  463.718977][ T5040] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G   
O  5.7.0-rc7+ #139
[  463.718980][ T5040] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS 
F06 09/25/2018
[  463.718982][ T5040] pstate: 6049 (nZCv daif +PAN -UAO)
[  463.718995][ T5040] pc : virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.718999][ T5040] lr : virtio_transport_recv_pkt+0x1fc/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719000][ T5040] sp : 80002dbe3c40
[...]
[  463.719025][ T5040] Call trace:
[  463.719030][ T5040]  virtio_transport_recv_pkt+0x4c8/0xd40 
[vmw_vsock_virtio_transport_common]
[  463.719034][ T5040]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041][ T5040]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048][ T5040]  kthread+0x128/0x130
[  463.719052][ T5040]  ret_from_fork+0x10/0x18

The race condition is as follows:
Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    ...
    release_sock
                                       lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic)

This fixes it by checking again whether vsk is still in the bound/connected table.

Signed-off-by: Jia He 
Cc: sta...@vger.kernel.org
---
 net/vmw_vsock/virtio_transport_common.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 69efc891885f..0dbd6a45f0ed 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1132,6 +1132,17 @@ void virtio_transport_recv_pkt(struct virtio_transport 
*t,
 
lock_sock(sk);
 
+   /* Check it again if vsk is removed by vsock_remove_sock */
+   spin_lock_bh(&vsock_table_lock);
+   if (!__vsock_in_bound_table(vsk) && !__vsock_in_connected_table(vsk)) {
+   spin_unlock_bh(&vsock_table_lock);
+   (void)virtio_transport_reset_no_sock(t, pkt);
+   release_sock(sk);
+   sock_put(sk);
+   goto free_pkt;
+   }
+   spin_unlock_bh(&vsock_table_lock);
+
/* Update CID in case it has changed after a transport reset event */
vsk->local_addr.svm_cid = dst.svm_cid;
 
-- 
2.17.1



Re: [PATCH V3 0/3] arm64: Enable vmemmap mapping from device memory

2020-05-21 Thread Jia He

Hi

On 2020/3/31 13:09, Anshuman Khandual wrote:

This series enables vmemmap backing memory allocation from device memory
ranges on arm64. But before that, it enables vmemmap_populate_basepages()
and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
allocation requests.


I verified no obvious regression after this patch series.

Host: ThunderX2(armv8a server), kernel v5.4

qemu: v3.1, -M virt \
  -object memory-backend-file,id=mem1,share=on,mem-path=/tmp2/nvdimm.img,size=4G,align=2M \
  -device nvdimm,id=nvdimm1,memdev=mem1,label-size=2M

Guest: kernel v5.7.0-rc5 with this patch series.

Tested cases (the dax mount/io step is sketched after the lists below):

- 4K PAGESIZE: boot, mount w/ -o dax, mount w/o -o dax, basic io

- 64K PAGESIZE: boot, mount w/ -o dax, mount w/o -o dax, basic io

Not tested:

- 16K pagesize due to my hardware limitation (can't run a 16K pgsz kernel)

- hot-add/remove nvdimm device from qemu due to no full support on arm64 qemu
yet

- Host nvdimm device hotplug
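
A sketch of the dax mount/io step referenced in the tested cases above
(device and mount point names are assumptions):

  mkfs.xfs -f /dev/pmem0
  mount -o dax /dev/pmem0 /mnt/dax
  dd if=/dev/urandom of=/mnt/dax/testfile bs=1M count=16   # basic io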

Hence from above result,

Tested-by: Jia He 


This series applies after latest (v14) arm64 memory hot remove series
(https://lkml.org/lkml/2020/3/3/1746) on Linux 5.6.

Pending Question:

altmap_alloc_block_buf() does not have any other remaining users in the
tree after this change. Should it be converted into a static function and
its declaration be dropped from the header (include/linux/mm.h)? Avoided
doing so because I was not sure if there are any off-tree users or not.

Changes in V3:

- Dropped comment from free_hotplug_page_range() per Robin
- Modified comment in unmap_hotplug_range() per Robin
- Enabled altmap support in vmemmap_alloc_block_buf() per Robin

Changes in V2: (https://lkml.org/lkml/2020/3/4/475)

- Rebased on latest hot-remove series (v14) adding P4D page table support

Changes in V1: (https://lkml.org/lkml/2020/1/23/12)

- Added an WARN_ON() in unmap_hotplug_range() when altmap is
   provided without the page table backing memory being freed

Changes in RFC V2: (https://lkml.org/lkml/2019/10/21/11)

- Changed the commit message on 1/2 patch per Will
- Changed the commit message on 2/2 patch as well
- Rebased on arm64 memory hot remove series (v10)

RFC V1: (https://lkml.org/lkml/2019/6/28/32)

Cc: Catalin Marinas
Cc: Will Deacon
Cc: Mark Rutland
Cc: Paul Walmsley
Cc: Palmer Dabbelt
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: David Hildenbrand
Cc: Mike Rapoport
Cc: Michal Hocko
Cc: "Matthew Wilcox (Oracle)"
Cc: "Kirill A. Shutemov"
Cc: Andrew Morton
Cc: Dan Williams
Cc: Pavel Tatashin
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc:linux-arm-ker...@lists.infradead.org
Cc:linux-i...@vger.kernel.org
Cc:linux-ri...@lists.infradead.org
Cc:x...@kernel.org
Cc:linuxppc-...@lists.ozlabs.org
Cc:linux...@kvack.org
Cc:linux-kernel@vger.kernel.org

Anshuman Khandual (3):
   mm/sparsemem: Enable vmem_altmap support in vmemmap_populate_basepages()
   mm/sparsemem: Enable vmem_altmap support in vmemmap_alloc_block_buf()
   arm64/mm: Enable vmem_altmap support for vmemmap mappings

  arch/arm64/mm/mmu.c   | 59 ++-
  arch/ia64/mm/discontig.c  |  2 +-
  arch/powerpc/mm/init_64.c | 10 +++
  arch/riscv/mm/init.c  |  2 +-
  arch/x86/mm/init_64.c | 12 
  include/linux/mm.h|  8 --
  mm/sparse-vmemmap.c   | 38 -
  7 files changed, 87 insertions(+), 44 deletions(-)


--

---
Cheers,
Justin (Jia He)



[PATCH v2] vhost: vsock: kick send_pkt worker once device is started

2020-04-30 Thread Jia He
Ning Bo reported an abnormal 2-second gap when booting Kata container [1].
The unconditional timeout was caused by the client-side
VSOCK_DEFAULT_CONNECT_TIMEOUT: the vhost vsock client tries to connect to
a virtio vsock server that is still initializing.

The abnormal flow looks like:
host-userspace              vhost vsock                       guest vsock
==============              ===========                       ===========
connect()            --->   vhost_transport_send_pkt_work()   initializing
   |                          vq->private_data==NULL
   |                          will not be queued
   V
schedule_timeout(2s)
                             vhost_vsock_start()   <--------   device ready
                             set vq->private_data

wait for 2s and failed
connect() again              vq->private_data!=NULL            recv connecting pkt

Details:
1. Host userspace sends a connect pkt; at that time, guest vsock is still
   initializing, hence vhost_vsock_start() has not been called. So
   vq->private_data==NULL and the pkt is not queued to be sent to the guest.
2. Then it sleeps for 2s.
3. After guest vsock finishes initializing, vq->private_data is set.
4. When host userspace wakes up after 2s, it sends the connecting pkt again
   and everything is fine.

As suggested by Stefano Garzarella, this fixes it by additionally kicking
the send_pkt worker in vhost_vsock_start() once the virtio device is
started, so that the pending pkt is sent again.

After this patch, kata-runtime (with vsock enabled) boot time is reduced
from 3s to 1s on a ThunderX2 arm64 server.

[1] https://github.com/kata-containers/runtime/issues/1917

Reported-by: Ning Bo 
Suggested-by: Stefano Garzarella 
Signed-off-by: Jia He 
---
v2: new solution suggested by Stefano Garzarella

 drivers/vhost/vsock.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e36aaf9ba7bd..0716a9cdffee 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -543,6 +543,11 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
mutex_unlock(&vq->mutex);
}
 
+   /* Some packets may have been queued before the device was started,
+* let's kick the send worker to send them.
+*/
+   vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+
mutex_unlock(&vsock->dev.mutex);
return 0;
 
-- 
2.17.1



[PATCH] vhost: add mutex_lock/unlock for vhost_vq_reset

2020-04-29 Thread Jia He
vq->mutex is meant to protect any vq access, hence adding mutex_lock/unlock
makes sense to avoid a potential race condition.

Signed-off-by: Jia He 
---
 drivers/vhost/vhost.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d450e16c5c25..622bfba2e5ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -297,6 +297,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 static void vhost_vq_reset(struct vhost_dev *dev,
   struct vhost_virtqueue *vq)
 {
+   mutex_lock(&vq->mutex);
vq->num = 1;
vq->desc = NULL;
vq->avail = NULL;
@@ -323,6 +324,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->umem = NULL;
vq->iotlb = NULL;
__vhost_vq_meta_reset(vq);
+   mutex_unlock(&vq->mutex);
 }
 
 static int vhost_worker(void *data)
-- 
2.17.1



[PATCH] vhost: vsock: don't send pkt when vq is not started

2020-04-29 Thread Jia He
Ning Bo reported an abnormal 2-second gap when booting Kata container [1].
The unconditional timeout is caused by the client-side
VSOCK_DEFAULT_CONNECT_TIMEOUT: the vhost vsock client tries to connect to
a virtio vsock server that is still initializing.

The abnormal flow looks like:
host-userspace              vhost vsock                       guest vsock
==============              ===========                       ===========
connect()            --->   vhost_transport_send_pkt_work()   initializing
   |                          vq->private_data==NULL
   |                          will not be queued
   V
schedule_timeout(2s)
                             vhost_vsock_start()   <--------   device ready
                             set vq->private_data

wait for 2s and failed

connect() again              vq->private_data!=NULL            recv connecting pkt

1. Host userspace sends a connect pkt; at that time, guest vsock is still
initializing, hence vhost_vsock_start() has not been called. So
vq->private_data==NULL and the pkt is not queued to be sent to the guest.
2. Then it sleeps for 2s.
3. After guest vsock finishes initializing, vq->private_data is set.
4. When host userspace wakes up after 2s, it sends the connecting pkt again
and everything is fine.

This fixes it by checking vq->private_data in vhost_transport_send_pkt()
and returning at once if it is NULL. The user's connect() then returns
with ECONNREFUSED.

After this patch, kata-runtime (with vsock enabled) boot time is reduced
from 3s to 1s on a ThunderX2 arm64 server.

[1] https://github.com/kata-containers/runtime/issues/1917

Reported-by: Ning Bo 
Signed-off-by: Jia He 
---
 drivers/vhost/vsock.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e36aaf9ba7bd..67474334dd88 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -241,6 +241,7 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 {
struct vhost_vsock *vsock;
int len = pkt->len;
+   struct vhost_virtqueue *vq;
 
rcu_read_lock();
 
@@ -252,6 +253,13 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
return -ENODEV;
}
 
+   vq = &vsock->vqs[VSOCK_VQ_RX];
+   if (!vq->private_data) {
+   rcu_read_unlock();
+   virtio_transport_free_pkt(pkt);
+   return -ECONNREFUSED;
+   }
+
if (pkt->reply)
atomic_inc(&vsock->queued_replies);
 
-- 
2.17.1



Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-10-18 Thread Jia He

Hi Palmer

On 2019/10/19 4:38, Palmer Dabbelt wrote:

On Wed, 16 Oct 2019 16:46:08 PDT (-0700), w...@kernel.org wrote:

Hey Palmer,

On Wed, Oct 16, 2019 at 04:21:59PM -0700, Palmer Dabbelt wrote:

On Tue, 08 Oct 2019 05:39:44 PDT (-0700), w...@kernel.org wrote:
> On Tue, Oct 08, 2019 at 02:19:05AM +, Justin He (Arm Technology 
China) wrote:

> > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index b1ca51a079f2..1f56b0118ef5 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
> > > >  2;
> > > >  #endif
> > > >
> > > > +#ifndef arch_faults_on_old_pte
> > > > +static inline bool arch_faults_on_old_pte(void)
> > > > +{
> > > > +    return false;
> > > > +}
> > > > +#endif
> > >
> > > Kirill has acked this, so I'm happy to take the patch as-is, however 
isn't

> > > it the case that /most/ architectures will want to return true for
> > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense for
> > > that to be the default, and have x86 and arm64 provide an override? For
> > > example, aren't most architectures still going to hit the double fault
> > > scenario even with your patch applied?
> >
> > No, after applying my patch series, only those architectures which 
don't provide
> > setting access flag by hardware AND don't implement their 
arch_faults_on_old_pte

> > will hit the double page fault.
> >
> > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't 
have the hardware

> > setting access flag way, it might cause page fault on an old pte"
> > I don't want to change other architectures' default behavior here. So 
by default,

> > arch_faults_on_old_pte() is false.
>
> ...and my complaint is that this is the majority of supported architectures,
> so you're fixing something for arm64 which also affects arm, powerpc,
> alpha, mips, riscv, ...
>
> Chances are, they won't even realise they need to implement
> arch_faults_on_old_pte() until somebody runs into the double fault and
> wastes lots of time debugging it before they spot your patch.

If I understand the semantics correctly, we should have this set to true.  I
don't have any context here, but we've got

   /*
    * The kernel assumes that TLBs don't cache invalid
    * entries, but in RISC-V, SFENCE.VMA specifies an
    * ordering constraint, not a cache flush; it is
    * necessary even after writing invalid entries.
    */
   local_flush_tlb_page(addr);

in do_page_fault().


Ok, although I think this is really about whether or not your hardware can
make a pte young when accessed, or whether you take a fault and do it
by updating the pte explicitly.

v12 of the patches did change the default, so you should be "safe" with
those either way:

http://lists.infradead.org/pipermail/linux-arm-kernel/2019-October/686030.html


OK, that fence is because we allow invalid translations to be cached, which 
is a completely different issue.


RISC-V implementations are allowed to have software managed accessed/dirty 
bits.  For some reason I thought we were relying on the firmware to handle 
this, but I can't actually find the code so I might be crazy.  Wherever it's 
done, there's no spec enforcing it so we should leave this true on RISC-V.


Thanks for the confirmation. So we can keep the default arch_faults_on_old_pte 
(return true) on RISC-V.



Thanks.


---
Cheers,
Justin (Jia He)



[PATCH v12 3/4] x86/mm: implement arch_faults_on_old_pte() stub on x86

2019-10-11 Thread Jia He
arch_faults_on_old_pte() is a helper that indicates whether accessing an
old pte might cause a page fault. On x86, the pte access flag is set by
hardware, so implement an overriding stub which always returns false.

Signed-off-by: Jia He 
Suggested-by: Will Deacon 
---
 arch/x86/include/asm/pgtable.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..ad97dc155195 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1463,6 +1463,12 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+
 #include 
 #endif /* __ASSEMBLY__ */
 
-- 
2.17.1



[PATCH v12 0/4] fix double page fault in cow_user_page for pfn mapping

2019-10-11 Thread Jia He
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."
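
For illustration, the failing scenario boils down to fork() + CoW on a
private mapping of a DAX file (a pfn mapping without struct page). A
hedged, minimal sketch is below; it is not the pmdk vmmalloc_fork test
itself, and the path /mnt/dax/testfile is an assumption (it must live on
a filesystem mounted with -o dax and be at least one page long):

/* sketch: write a pattern, fork, then CoW-write from the child */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 4096;
	int fd = open("/mnt/dax/testfile", O_RDWR);	/* pre-existing DAX file */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* MAP_PRIVATE so a write after fork() goes through do_wp_page()/CoW */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0xab, len);			/* populate the page */

	pid_t pid = fork();
	if (pid == 0) {
		p[0] = 0xcd;			/* CoW copy of the pfn-mapped page */
		/* with the bug, the rest of the page could read back as zeroes */
		printf("child sees 0x%02x at offset 1\n", (unsigned char)p[1]);
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	munmap(p, len);
	close(fd);
	return 0;
}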

-Changes
v12:
refine PATCH 01, remove the !! since C languages can convert unsigned
to bool (Catalin)
v11:
refine cpu_has_hw_af in PATCH 01(Will Deacon, Suzuki)
change the default return value to true in arch_faults_on_old_pte
add PATCH 03 for overriding arch_faults_on_old_pte(false) on x86
v10:
add r-b from Catalin and a-b from Kirill in PATCH 03
remove Reported-by in PATCH 01
v9: refactor cow_user_page for indention optimization (Catalin)
hold the ptl longer (Catalin)
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (4):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  x86/mm: implement arch_faults_on_old_pte() stub on x86
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h |  14 
 arch/arm64/include/asm/pgtable.h|  14 
 arch/x86/include/asm/pgtable.h  |   6 ++
 mm/memory.c | 104 
 4 files changed, 123 insertions(+), 15 deletions(-)

-- 
2.17.1



[PATCH v12 4/4] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-10-11 Thread Jia He
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

To reproduce the bug, the cmd is as follows after you deployed everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

Below call trace is from arm64 do_page_fault for debugging purpose:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003,
   pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameter is
changed because vmf should be passed to cow_user_page()

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
in case there can be some obscure use-case (by Kirill).

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He 
Reported-by: Yibo Cai 
Reviewed-by: Catalin Marinas 
Acked-by: Kirill A. Shutemov 
---
 mm/memory.c | 104 
 1 file changed, 89 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..b6a5d6a08438 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,18 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   /*
+* Those arches which don't have hw access flag feature need to
+* implement their own helper. By default, "true" means pagefault
+* will be hit on old pte.
+*/
+   return true;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2145,32 +2157,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+struct vm_fault *vmf)
 {
+   bool ret;
+   void *kaddr;
+   void __user *uaddr;
+   bool force_mkyoung;
+   struct vm_area_struct *vma = vmf->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
+   if (likely(src)) {
+   copy_user_highpage(dst, src, addr, vma);
+   return true;
+   }
+
/*
 * If the source page was a PFN mapping, we don't have
 * a "struct page" for it. We do a best-effort copy by
 * just copying from the original user address. If that
 * fails, we just zero-fill it. Live with it.
 */
-   if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   kaddr = kmap_atomic(dst);
+   uaddr = (void __user *)(addr & PAGE_MASK);
+
+   /*
+* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+   if (force_mkyoung) {
+   pte_t entry;
+
+   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+   if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   /*
+* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   ret = false;
+   goto pte_unlock;
+   }
 
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   }
+
+   /*
+* This really shouldn't fail, because the page is there
+* in the page tables. But it might just be unr

[PATCH v12 2/4] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-10-11 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed Access Flag on arm64.

Hence implement arch_faults_on_old_pte on arm64 to indicate that it might
cause page fault when accessing old pte.

Signed-off-by: Jia He 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7576df00eb50..e96fb82f62de 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   WARN_ON(preemptible());
+
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v12 1/4] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-10-11 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this cpu has the capability of HW AF. So decouple AF from
DBM by a new helper cpu_has_hw_af().

If later we noticed a potential performance issue on this path, we can
turn it into a static label as with other CPU features.

Signed-off-by: Jia He 
Suggested-by: Suzuki Poulose 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/cpufeature.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index 9cde5d2e768f..4261d55e8506 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -659,6 +659,20 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int 
parange)
default: return CONFIG_ARM64_PA_BITS;
}
 }
+
+/* Check whether hardware update of the Access flag is supported */
+static inline bool cpu_has_hw_af(void)
+{
+   u64 mmfr1;
+
+   if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+   return false;
+
+   mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
+   return cpuid_feature_extract_unsigned_field(mmfr1,
+   ID_AA64MMFR1_HADBS_SHIFT);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
-- 
2.17.1



[PATCH v11 3/4] x86/mm: implement arch_faults_on_old_pte() stub on x86

2019-10-09 Thread Jia He
arch_faults_on_old_pte() is a helper that indicates whether accessing an
old pte might cause a page fault. On x86, the pte access flag is set by
hardware, so implement an overriding stub which always returns false.

Signed-off-by: Jia He 
Suggested-by: Will Deacon 
---
 arch/x86/include/asm/pgtable.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..ad97dc155195 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1463,6 +1463,12 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+
 #include 
 #endif /* __ASSEMBLY__ */
 
-- 
2.17.1



[PATCH v11 2/4] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-10-09 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed Access Flag on arm64.

Hence implement arch_faults_on_old_pte on arm64 to indicate that it might
cause page fault when accessing old pte.

Signed-off-by: Jia He 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7576df00eb50..e96fb82f62de 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   WARN_ON(preemptible());
+
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v11 4/4] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-10-09 Thread Jia He
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

To reproduce the bug, the cmd is as follows after you deployed everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

Below call trace is from arm64 do_page_fault for debugging purpose:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003,
   pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameter is
changed because vmf should be passed to cow_user_page()

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
in case there can be some obscure use-case (by Kirill).

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He 
Reported-by: Yibo Cai 
Reviewed-by: Catalin Marinas 
Acked-by: Kirill A. Shutemov 
---
 mm/memory.c | 104 
 1 file changed, 89 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..b6a5d6a08438 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,18 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   /*
+* Those arches which don't have hw access flag feature need to
+* implement their own helper. By default, "true" means pagefault
+* will be hit on old pte.
+*/
+   return true;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2145,32 +2157,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+struct vm_fault *vmf)
 {
+   bool ret;
+   void *kaddr;
+   void __user *uaddr;
+   bool force_mkyoung;
+   struct vm_area_struct *vma = vmf->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
+   if (likely(src)) {
+   copy_user_highpage(dst, src, addr, vma);
+   return true;
+   }
+
/*
 * If the source page was a PFN mapping, we don't have
 * a "struct page" for it. We do a best-effort copy by
 * just copying from the original user address. If that
 * fails, we just zero-fill it. Live with it.
 */
-   if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   kaddr = kmap_atomic(dst);
+   uaddr = (void __user *)(addr & PAGE_MASK);
+
+   /*
+* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+   if (force_mkyoung) {
+   pte_t entry;
+
+   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+   if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   /*
+* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   ret = false;
+   goto pte_unlock;
+   }
 
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   }
+
+   /*
+* This really shouldn't fail, because the page is there
+* in the page tables. But it might just be unr

[PATCH v11 1/4] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-10-09 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this cpu has the capability of HW AF. So decouple AF from
DBM by a new helper cpu_has_hw_af().

Signed-off-by: Jia He 
Suggested-by: Suzuki Poulose 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/cpufeature.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index 9cde5d2e768f..1a95396ea5c8 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -659,6 +659,20 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int 
parange)
default: return CONFIG_ARM64_PA_BITS;
}
 }
+
+/* Check whether hardware update of the Access flag is supported */
+static inline bool cpu_has_hw_af(void)
+{
+   if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) {
+   u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
+
+   return !!cpuid_feature_extract_unsigned_field(mmfr1,
+   ID_AA64MMFR1_HADBS_SHIFT);
+   }
+
+   return false;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
-- 
2.17.1



[PATCH v11 0/4] fix double page fault in cow_user_page for pfn mapping

2019-10-09 Thread Jia He
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v11:
refine cpu_has_hw_af in PATCH 01(Will Deacon, Suzuki)
change the default return value to true in arch_faults_on_old_pte
add PATCH 03 for overriding arch_faults_on_old_pte(false) on x86
v10:
add r-b from Catalin and a-b from Kirill in PATCH 03
remove Reported-by in PATCH 01
v9: refactor cow_user_page for indention optimization (Catalin)
hold the ptl longer (Catalin)
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (4):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  x86/mm: implement arch_faults_on_old_pte() stub on x86
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h |  14 
 arch/arm64/include/asm/pgtable.h|  14 
 arch/x86/include/asm/pgtable.h  |   6 ++
 mm/memory.c | 104 
 4 files changed, 123 insertions(+), 15 deletions(-)

-- 
2.17.1



Re: [PATCH v10 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-10-09 Thread Jia He

Hi Suzuki

On 2019/10/8 23:32, Suzuki K Poulose wrote:



On 08/10/2019 02:12, Justin He (Arm Technology China) wrote:

Hi Will and Marc
Sorry for the late response, just came back from a vacation.


-Original Message-
From: Marc Zyngier 
Sent: 1 October 2019 21:19
To: Will Deacon 
Cc: Justin He (Arm Technology China) ; Catalin
Marinas ; Mark Rutland
; James Morse ;
Matthew Wilcox ; Kirill A. Shutemov
; linux-arm-ker...@lists.infradead.org;
linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal
; Thomas Gleixner ;
Andrew Morton ; hejia...@gmail.com; Kaly
Xin (Arm Technology China) 
Subject: Re: [PATCH v10 1/3] arm64: cpufeature: introduce helper
cpu_has_hw_af()

On Tue, 1 Oct 2019 13:54:47 +0100
Will Deacon  wrote:


On Mon, Sep 30, 2019 at 09:57:38AM +0800, Jia He wrote:

We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this cpu has the capability of HW AF. So decouple AF from
DBM by new helper cpu_has_hw_af().

Signed-off-by: Jia He 
Suggested-by: Suzuki Poulose 
Reviewed-by: Catalin Marinas 
---
  arch/arm64/include/asm/cpufeature.h | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h

b/arch/arm64/include/asm/cpufeature.h

index 9cde5d2e768f..949bc7c85030 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -659,6 +659,16 @@ static inline u32

id_aa64mmfr0_parange_to_phys_shift(int parange)

   default: return CONFIG_ARM64_PA_BITS;
   }
  }
+
+/* Check whether hardware update of the Access flag is supported */
+static inline bool cpu_has_hw_af(void)
+{
+ if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+ return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;


0xf? I think we should have a mask in sysreg.h for this constant.


We don't have the mask, but we certainly have the shift.

GENMASK(ID_AA64MMFR1_HADBS_SHIFT + 3,
ID_AA64MMFR1_HADBS_SHIFT) is a bit
of a mouthful though. Ideally, we'd have a helper for that.


Ok, I will implement the helper if there isn't so far.
And then replace the 0xf with it.


Or could we simply reuse the existing cpuid_feature_extract_unsigned_field()?

u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);

return cpuid_feature_extract_unsigned_field(mmfr1, ID_AA64MMFR1_HADBS_SHIFT) ?


Yes, we can, I will send the new version

---
Cheers,
Justin (Jia He)



[PATCH v10 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-09-29 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this cpu has the capability of HW AF. So decouple AF from
DBM by a new helper cpu_has_hw_af().

Signed-off-by: Jia He 
Suggested-by: Suzuki Poulose 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/cpufeature.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index 9cde5d2e768f..949bc7c85030 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -659,6 +659,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int 
parange)
default: return CONFIG_ARM64_PA_BITS;
}
 }
+
+/* Check whether hardware update of the Access flag is supported */
+static inline bool cpu_has_hw_af(void)
+{
+   if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+   return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
+
+   return false;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
-- 
2.17.1



[PATCH v10 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-29 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte on arm64 to indicate that it might
cause page fault when accessing old pte.

Signed-off-by: Jia He 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7576df00eb50..e96fb82f62de 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   WARN_ON(preemptible());
+
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v10 0/3] fix double page fault on arm64

2019-09-29 Thread Jia He
When we tested pmdk unit test vmmalloc_fork TEST1 in arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v10:
add r-b from Catalin and a-b from Kirill in PATCH 03
remove Reported-by in PATCH 01
v9: refactor cow_user_page for indention optimization (Catalin)
hold the ptl longer (Catalin)
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h | 10 +++
 arch/arm64/include/asm/pgtable.h| 14 
 mm/memory.c | 99 -
 3 files changed, 108 insertions(+), 15 deletions(-)

-- 
2.17.1



[PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-29 Thread Jia He
When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameter is
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error
in case there can be some obscure use-case (by Kirill).

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He 
Reported-by: Yibo Cai 
Reviewed-by: Catalin Marinas 
Acked-by: Kirill A. Shutemov 
---
 mm/memory.c | 99 +
 1 file changed, 84 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b1ca51a079f2..1f56b0118ef5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2145,32 +2152,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+struct vm_fault *vmf)
 {
+   bool ret;
+   void *kaddr;
+   void __user *uaddr;
+   bool force_mkyoung;
+   struct vm_area_struct *vma = vmf->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
+   if (likely(src)) {
+   copy_user_highpage(dst, src, addr, vma);
+   return true;
+   }
+
/*
 * If the source page was a PFN mapping, we don't have
 * a "struct page" for it. We do a best-effort copy by
 * just copying from the original user address. If that
 * fails, we just zero-fill it. Live with it.
 */
-   if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   kaddr = kmap_atomic(dst);
+   uaddr = (void __user *)(addr & PAGE_MASK);
+
+   /*
+* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+   if (force_mkyoung) {
+   pte_t entry;
+
+   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+   if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   /*
+* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   ret = false;
+   goto pte_unlock;
+   }
+
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   }
 
+   /*
+* This really shouldn't fail, because the page is there
+* in the page tables. But it might just be unreadable,
+* in which case we just give up and fill the result with
+* zeroes.
+*/
+   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
/*
-* This really shouldn't fail, because the page is there
-* in the page tables. But it might just be unreadable,
-* in which case we just give up and fill

[PATCH v9 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-24 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing
an old pte might cause a page fault.

Signed-off-by: Jia He 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..2b035befb66d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,20 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. We don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   WARN_ON(preemptible());
+
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v9 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-24 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
we hit a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use-case. (Suggested by Kirill)

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He 
Reported-by: Yibo Cai 
---
 mm/memory.c | 99 +
 1 file changed, 84 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..a0a381b36ff2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2140,32 +2147,82 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+struct vm_fault *vmf)
 {
+   bool ret;
+   void *kaddr;
+   void __user *uaddr;
+   bool force_mkyoung;
+   struct vm_area_struct *vma = vmf->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
+   if (likely(src)) {
+   copy_user_highpage(dst, src, addr, vma);
+   return true;
+   }
+
/*
 * If the source page was a PFN mapping, we don't have
 * a "struct page" for it. We do a best-effort copy by
 * just copying from the original user address. If that
 * fails, we just zero-fill it. Live with it.
 */
-   if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   kaddr = kmap_atomic(dst);
+   uaddr = (void __user *)(addr & PAGE_MASK);
+
+   /*
+* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
+   if (force_mkyoung) {
+   pte_t entry;
+
+   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+   if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   /*
+* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   ret = false;
+   goto pte_unlock;
+   }
+
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   }
 
+   /*
+* This really shouldn't fail, because the page is there
+* in the page tables. But it might just be unreadable,
+* in which case we just give up and fill the result with
+* zeroes.
+*/
+   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
/*
-* This really shouldn't fail, because the page is there
-* in the page tables. But it might just be unreadable,
-* in which case we just give up and fill the result with
-* zeroes.
+ 

[PATCH v9 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-09-24 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this CPU has hardware AF support. So decouple AF from
DBM with a new helper, cpu_has_hw_af().
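
For context, a hedged sketch of what the check amounts to: per the Arm ARM,
ID_AA64MMFR1_EL1.HAFDBS lives in bits [3:0], and any non-zero value implies
hardware update of the Access flag. The variable names below are
illustrative, not the kernel's.

	/* Illustrative decoding of ID_AA64MMFR1_EL1.HAFDBS (bits [3:0]):
	 * 0b0000 = no hardware update, 0b0001 = Access flag only,
	 * 0b0010 = Access flag + dirty state. Hence the "& 0xf" test.
	 */
	u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
	bool hw_af = (mmfr1 & 0xf) != 0;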

Signed-off-by: Jia He 
Suggested-by: Suzuki Poulose 
Reported-by: kbuild test robot 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/include/asm/cpufeature.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index c96ffa4722d3..c2e3abd39faa 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int 
parange)
default: return CONFIG_ARM64_PA_BITS;
}
 }
+
+/* Check whether hardware update of the Access flag is supported */
+static inline bool cpu_has_hw_af(void)
+{
+   if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+   return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
+
+   return false;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
-- 
2.17.1



[PATCH v9 0/3] fix double page fault on arm64

2019-09-24 Thread Jia He
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we
hit a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v9: refactor cow_user_page for indentation optimization (Catalin)
hold the ptl longer (Catalin)
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h | 10 +++
 arch/arm64/include/asm/pgtable.h| 14 
 mm/memory.c | 99 -
 3 files changed, 108 insertions(+), 15 deletions(-)

-- 
2.17.1



Re: [PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-24 Thread Jia He

Hi Catalin

On 2019/9/24 18:33, Catalin Marinas wrote:

On Tue, Sep 24, 2019 at 06:43:06AM +, Justin He (Arm Technology China) 
wrote:

Catalin Marinas wrote:

On Sat, Sep 21, 2019 at 09:50:54PM +0800, Jia He wrote:

@@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 * fails, we just zero-fill it. Live with it.
 */
if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void *kaddr;
+   pte_t entry;
+   void __user *uaddr = (void __user *)(addr & PAGE_MASK);

+   /* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/

[...]

+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr,
+  &vmf->ptl);
+   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   } else {
+   /* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return false;
+   }
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   }

[...]

+
+   kaddr = kmap_atomic(dst);

Since you moved the kmap_atomic() here, could the above
arch_faults_on_old_pte() run in a preemptible context? I suggested to
add a WARN_ON in patch 2 to be sure.

Should I move kmap_atomic back to the original line? That way we can make sure
that arch_faults_on_old_pte() runs in a preempt-disabled context.
Otherwise, arch_faults_on_old_pte() may produce plenty of warnings if I add
a WARN_ON in arch_faults_on_old_pte. I tested it with PREEMPT=y enabled
on a ThunderX2 qemu guest.

So we have two options here:

1. Change arch_faults_on_old_pte() scope to the whole system rather than
just the current CPU. You'd have to wire up a new arm64 capability
for the access flag but this way we don't care whether it's
preemptible or not.

2. Keep the arch_faults_on_old_pte() per-CPU but make sure we are not
preempted here. The kmap_atomic() move would do but you'd have to
kunmap_atomic() before the return.

I think the answer to my question below also has some implication on
which option to pick:


/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
 * zeroes.
 */
-   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
+   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
+   /* Give a warn in case there can be some obscure
+* use-case
+*/
+   WARN_ON_ONCE(1);

That's more of a question for the mm guys: at this point we do the
copying with the ptl released; is there anything else that could have
made the pte old in the meantime? I think unuse_pte() is only called on
anonymous vmas, so it shouldn't be the case here.

If we need to hold the ptl here, you could as well have an enclosing
kmap/kunmap_atomic (option 2) with some goto instead of "return false".
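
For reference, a minimal sketch of the shape this suggestion leads to
(roughly what v9 ends up doing): kmap_atomic() keeps the section
non-preemptible, the ptl is held across the copy, and every exit funnels
through a single unlock/unmap path. Identifiers mirror the surrounding
code; this is an illustration, not the final patch.

	kaddr = kmap_atomic(dst);	/* also disables preemption */
	uaddr = (void __user *)(addr & PAGE_MASK);

	force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf->orig_pte);
	if (force_mkyoung) {
		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
		if (!pte_same(*vmf->pte, vmf->orig_pte)) {
			/* Another thread handled the fault; let the caller
			 * back off and re-fault if necessary.
			 */
			ret = false;
			goto pte_unlock;
		}
		entry = pte_mkyoung(vmf->orig_pte);
		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
			update_mmu_cache(vma, addr, vmf->pte);
	}

	if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
		WARN_ON_ONCE(1);	/* unexpected; give up and zero-fill */
		clear_page(kaddr);
	}
	ret = true;

pte_unlock:
	if (force_mkyoung)
		pte_unmap_unlock(vmf->pte, vmf->ptl);
	kunmap_atomic(kaddr);
	flush_dcache_page(dst);
	return ret;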


I am not 100% sure that I understand your suggestion well, so I drafted the
patch here:

Changes: optimize the indentation
         hold the ptl longer


-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)

+static inline bool cow_user_page(struct page *dst, struct page *src,
+                 struct vm_fault *vmf)
 {
+    struct vm_area_struct *vma = vmf->vma;
+    struct mm_struct *mm = vma->vm_mm;
+    unsigned long addr = vmf->address;
+    bool ret;
+    pte_t entry;
+    void *kaddr;
+    void __user *uaddr;
+
 debug_dma_assert_idle(src);

+    if (likely(src)) {
+        copy_user_highpage(dst, src, addr, vma);
+        return true;
+    }
+
 /*
  * If the source page was a PFN mapping, we don't have
  * a &quo

[PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-21 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
we hit a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use-case. (Suggested by Kirill)

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
 mm/memory.c | 67 -
 1 file changed, 61 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..ae09b070b04d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2140,8 +2147,13 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline bool cow_user_page(struct page *dst, struct page *src,
+struct vm_fault *vmf)
 {
+   struct vm_area_struct *vma = vmf->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
/*
@@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 * fails, we just zero-fill it. Live with it.
 */
if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void *kaddr;
+   pte_t entry;
+   void __user *uaddr = (void __user *)(addr & PAGE_MASK);
 
+   /* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr,
+  &vmf->ptl);
+   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   } else {
+   /* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return false;
+   }
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   }
+
+   kaddr = kmap_atomic(dst);
/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
 * zeroes.
 */
-   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
+   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
+   /* Give a warn in case there can be some obscure
+   

[PATCH v8 0/3] fix double page fault on arm64

2019-09-21 Thread Jia He
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we
hit a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v8: change cow_user_page's return type (Matthew)
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h | 10 +
 arch/arm64/include/asm/pgtable.h| 12 ++
 mm/memory.c | 67 ++---
 3 files changed, 83 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH v8 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-21 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing
an old pte might cause a page fault.

Signed-off-by: Jia He 
---
 arch/arm64/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..4a9939615e41 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. we don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v8 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-09-21 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this CPU has hardware AF support. So decouple AF from
DBM with a new helper, cpu_has_hw_af().

Reported-by: kbuild test robot 
Suggested-by: Suzuki Poulose 
Signed-off-by: Jia He 
---
 arch/arm64/include/asm/cpufeature.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index c96ffa4722d3..46caf934ba4e 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int 
parange)
default: return CONFIG_ARM64_PA_BITS;
}
 }
+
+/* Decouple AF from AFDBM. */
+static inline bool cpu_has_hw_af(void)
+{
+   if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+   return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
+
+   return false;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
-- 
2.17.1



Re: [PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-21 Thread Jia He

[On behalf of justin...@arm.com]

Hi Matthew

On 2019/9/20 23:53, Matthew Wilcox wrote:

On Fri, Sep 20, 2019 at 09:54:37PM +0800, Jia He wrote:

-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline int cow_user_page(struct page *dst, struct page *src,
+   struct vm_fault *vmf)
  {

Can we talk about the return type here?


+   } else {
+   /* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return -1;

...

+   return 0;
  }

So -1 for "try again" and 0 for "succeeded".


+   if (cow_user_page(new_page, old_page, vmf)) {

Then we use it like a bool.  But it's kind of backwards from a bool because
false is success.


+   /* COW failed, if the fault was solved by other,
+* it's fine. If not, userspace would re-fault on
+* the same address and we will handle the fault
+* from the second attempt.
+*/
+   put_page(new_page);
+   if (old_page)
+   put_page(old_page);
+   return 0;

And we don't use the return value; in fact we invert it.

Would this make more sense:

static inline bool cow_user_page(struct page *dst, struct page *src,
struct vm_fault *vmf)
...
pte_unmap_unlock(vmf->pte, vmf->ptl);
return false;
...
return true;
...
if (!cow_user_page(new_page, old_page, vmf)) {

That reads more sensibly for me.  We could also go with returning a
vm_fault_t, but that would be more complex than needed today, I think.


Ok, will change the return type to bool as you suggested.
Thanks

---
Cheers,
Justin (Jia He)



[PATCH v7 0/3] fix double page fault on arm64

2019-09-20 Thread Jia He
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we
hit a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v7: s/pte_spinlock/pte_offset_map_lock (Kirill)
v6: fix error case of returning with spinlock taken (Catalin)
move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h | 10 +
 arch/arm64/include/asm/pgtable.h| 12 ++
 mm/memory.c | 67 ++---
 3 files changed, 83 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-20 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
we hit a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use-case. (Suggested by Kirill)

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
 mm/memory.c | 67 -
 1 file changed, 61 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..3e39e40fee87 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2140,8 +2147,13 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline int cow_user_page(struct page *dst, struct page *src,
+   struct vm_fault *vmf)
 {
+   struct vm_area_struct *vma = vmf->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
/*
@@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 * fails, we just zero-fill it. Live with it.
 */
if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void *kaddr;
+   pte_t entry;
+   void __user *uaddr = (void __user *)(addr & PAGE_MASK);
 
+   /* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr,
+  &vmf->ptl);
+   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   } else {
+   /* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   return -1;
+   }
+   pte_unmap_unlock(vmf->pte, vmf->ptl);
+   }
+
+   kaddr = kmap_atomic(dst);
/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
 * zeroes.
 */
-   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
+   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
+   /* Give a warn in case there can be some obscure
+ 

[PATCH v7 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-09-20 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this CPU has hardware AF support. So decouple AF from
DBM with a new helper, cpu_has_hw_af().

Reported-by: kbuild test robot 
Suggested-by: Suzuki Poulose 
Signed-off-by: Jia He 
---
 arch/arm64/include/asm/cpufeature.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index c96ffa4722d3..46caf934ba4e 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int 
parange)
default: return CONFIG_ARM64_PA_BITS;
}
 }
+
+/* Decouple AF from AFDBM. */
+static inline bool cpu_has_hw_af(void)
+{
+   if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+   return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
+
+   return false;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
-- 
2.17.1



[PATCH v7 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-20 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing
an old pte might cause a page fault.

Signed-off-by: Jia He 
---
 arch/arm64/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..4a9939615e41 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. we don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v6 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-19 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
we hit a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use-case. (Suggested by Kirill)

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
 mm/memory.c | 65 -
 1 file changed, 59 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..7c38c1ce5440 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline int cow_user_page(struct page *dst, struct page *src,
+   struct vm_fault *vmf)
 {
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
/*
@@ -2151,21 +2162,52 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 * fails, we just zero-fill it. Live with it.
 */
if (unlikely(!src)) {
-   void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void *kaddr;
+   void __user *uaddr = (void __user *)(addr & PAGE_MASK);
+   pte_t entry;
+
+   /* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   spin_lock(vmf->ptl);
+   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   } else {
+   /* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   spin_unlock(vmf->ptl);
+   return -1;
+   }
+   spin_unlock(vmf->ptl);
+   }
 
+   kaddr = kmap_atomic(dst);
/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
 * zeroes.
 */
-   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
+   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
+   /* Give a warn in case there can be some obscure
+* use-case
+*/
+   WARN_ON_ONCE(1);
clear_page(kaddr);
+   }
kunmap_

[PATCH v6 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-09-19 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this CPU has hardware AF support. So decouple AF from
DBM with a new helper, cpu_has_hw_af().

Reported-by: kbuild test robot 
Suggested-by: Suzuki Poulose 
Signed-off-by: Jia He 
---
 arch/arm64/include/asm/cpufeature.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index c96ffa4722d3..46caf934ba4e 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int 
parange)
default: return CONFIG_ARM64_PA_BITS;
}
 }
+
+/* Decouple AF from AFDBM. */
+static inline bool cpu_has_hw_af(void)
+{
+   if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+   return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
+
+   return false;
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
-- 
2.17.1



[PATCH v6 0/3] fix double page fault on arm64

2019-09-19 Thread Jia He
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we
hit a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v6: fix error case of returning with spinlock taken (Catalin)
move kmap_atomic to avoid handling kunmap_atomic
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h | 10 +
 arch/arm64/include/asm/pgtable.h| 12 ++
 mm/memory.c | 65 ++---
 3 files changed, 81 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH v6 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-19 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing
an old pte might cause a page fault.

Signed-off-by: Jia He 
---
 arch/arm64/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..4a9939615e41 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. we don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v5 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-19 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing
an old pte might cause a page fault.

Signed-off-by: Jia He 
---
 arch/arm64/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..4a9939615e41 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying from user will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. we don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v5 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-19 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
we hit a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
in case there is some obscure use-case. (Suggested by Kirill)

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
 mm/memory.c | 59 -
 1 file changed, 54 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..cf681963b2f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline int cow_user_page(struct page *dst, struct page *src,
+   struct vm_fault *vmf)
 {
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
/*
@@ -2152,7 +2163,29 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 */
if (unlikely(!src)) {
void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void __user *uaddr = (void __user *)(addr & PAGE_MASK);
+   pte_t entry;
+
+   /* On architectures with software "accessed" bits, we would
+* take a double page fault, so mark it accessed here.
+*/
+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   spin_lock(vmf->ptl);
+   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   } else {
+   /* Other thread has already handled the fault
+* and we don't need to do anything. If it's
+* not the case, the fault will be triggered
+* again on the same address.
+*/
+   return -1;
+   }
+   spin_unlock(vmf->ptl);
+   }
 
/*
 * This really shouldn't fail, because the page is there
@@ -2160,12 +2193,17 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 * in which case we just give up and fill the result with
 * zeroes.
 */
-   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
+   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
+   /* In case there can be some obscure use-case */
+   WARN_ON_ONCE(1);
clear_page(kaddr);
+   }
kunmap_atomic(kaddr);
flush_dcache_page(dst);
} else
-   copy_user_highpage(dst, src, va, vma);
+   copy_user_highpage(dst, src, addr, vma);
+
+   return 0;
 }
 
 static gfp_t __

[PATCH v5 0/3] fix double page fault on arm64

2019-09-19 Thread Jia He
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest, we
hit a double page fault in __copy_from_user_inatomic of cow_user_page.

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v5: handle the case correctly when !pte_same
fix kbuild test failed
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (by Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h |  1 +
 arch/arm64/include/asm/pgtable.h| 12 ++
 arch/arm64/kernel/cpufeature.c  | 10 +
 mm/memory.c | 59 ++---
 4 files changed, 77 insertions(+), 5 deletions(-)

-- 
2.17.1



[PATCH v5 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-09-19 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this CPU has hardware AF support. So decouple AF from
DBM with a new helper, cpu_has_hw_af().

Reported-by: kbuild test robot 
Suggested-by: Suzuki Poulose 
Signed-off-by: Jia He 
---
 arch/arm64/include/asm/cpufeature.h |  1 +
 arch/arm64/kernel/cpufeature.c  | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index c96ffa4722d3..206b6e3954cf 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -390,6 +390,7 @@ extern DECLARE_BITMAP(boot_capabilities, ARM64_NPATCHABLE);
for_each_set_bit(cap, cpu_hwcaps, ARM64_NCAPS)
 
 bool this_cpu_has_cap(unsigned int cap);
+bool cpu_has_hw_af(void);
 void cpu_set_feature(unsigned int num);
 bool cpu_have_feature(unsigned int num);
 unsigned long cpu_get_elf_hwcap(void);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index b1fdc486aed8..fb0e9425d286 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1141,6 +1141,16 @@ static bool has_hw_dbm(const struct 
arm64_cpu_capabilities *cap,
return true;
 }
 
+/* Decouple AF from AFDBM. */
+bool cpu_has_hw_af(void)
+{
+   return (read_cpuid(ID_AA64MMFR1_EL1) & 0xf);
+}
+#else /* CONFIG_ARM64_HW_AFDBM */
+bool cpu_has_hw_af(void)
+{
+   return false;
+}
 #endif
 
 #ifdef CONFIG_ARM64_VHE
-- 
2.17.1



Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-18 Thread Jia He

Hi Kirill

[On behalf of justin...@arm.com because some mails are filtered...]

On 2019/9/18 22:00, Kirill A. Shutemov wrote:

On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote:

When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
we hit a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
  mm/memory.c | 35 ++-
  1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..d2c130a5883b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
  #endif
  
+#ifndef arch_faults_on_old_pte

+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
  static int __init disable_randmaps(char *s)
  {
randomize_va_space = 0;
@@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
  }
  
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)

+static inline void cow_user_page(struct page *dst, struct page *src,
+struct vm_fault *vmf)
  {
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
  
  	/*

@@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 */
if (unlikely(!src)) {
void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void __user *uaddr = (void __user *)(addr & PAGE_MASK);
+   pte_t entry;
  
  		/*

 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
-* zeroes.
+* zeroes. On architectures with software "accessed" bits,
+* we would take a double page fault here, so mark it
+* accessed here.
 */
+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   spin_lock(vmf->ptl);
+   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   }

I don't follow.

So if the pte has changed under you, you don't set the accessed bit, but
nevertheless copy from the user.

What makes you think it will not trigger the same problem?

I think we need to make cow_user_page() fail in this case and caller --
wp_page_copy() -- return zero. If the fault was solved by other thread, we
are fine. If not userspace would re-fault on the same address and we will
handle the fault from the second attempt.
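
Concretely, the caller-side handling Kirill describes would look roughly
like the following in wp_page_copy() (a sketch only; later revisions of the
series adopt essentially this shape, with cow_user_page() returning bool):

	if (!cow_user_page(new_page, old_page, vmf)) {
		/*
		 * CoW failed because the pte changed under us. If another
		 * thread already resolved the fault we are done; otherwise
		 * userspace re-faults on the same address and we handle it
		 * from the second attempt.
		 */
		put_page(new_page);
		if (old_page)
			put_page(old_page);
		return 0;	/* nothing more to do for this fault */
	}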


Thanks for pointing this out. How about making cow_user_page() return
VM_FAULT_RETRY? Then do_page_fault() could retry the page fault?

---
Cheers,
Justin (Jia He)




+   spin_unlock(vmf->ptl);
+   }
+
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
clear_page(kaddr);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
} else
-   copy_user_highpage(dst, src, va, vma);
+   copy_user_highpage(dst, src, addr, vma);
  }
  
  static 

[PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-18 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
we hit a double page fault in __copy_from_user_inatomic of cow_user_page.

Below call trace is from arm64 do_page_fault for debugging purpose
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameters are
changed because vmf should be passed to cow_user_page().

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
 mm/memory.c | 35 ++-
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..d2c130a5883b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline void cow_user_page(struct page *dst, struct page *src,
+struct vm_fault *vmf)
 {
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long addr = vmf->address;
+
debug_dma_assert_idle(src);
 
/*
@@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 */
if (unlikely(!src)) {
void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void __user *uaddr = (void __user *)(addr & PAGE_MASK);
+   pte_t entry;
 
/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
-* zeroes.
+* zeroes. On architectures with software "accessed" bits,
+* we would take a double page fault here, so mark it
+* accessed here.
 */
+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   spin_lock(vmf->ptl);
+   if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vma, addr,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vma, addr, vmf->pte);
+   }
+   spin_unlock(vmf->ptl);
+   }
+
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
clear_page(kaddr);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
} else
-   copy_user_highpage(dst, src, va, vma);
+   copy_user_highpage(dst, src, addr, vma);
 }
 
 static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
@@ -2318,7 +2343,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
vmf->address);
if (!new_page)
goto oom;
-   cow_user_page(new_page, old_page, vmf->address, vma);
+   cow_user_page(new_page, old_page, vmf);
}
 
	if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, 
false))
-- 
2.17.1



[PATCH v4 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-18 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a zeroed
page after fork() + CoW for pfn mappings. We don't always have a
hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that accessing
an old pte might cause a page fault.

Signed-off-by: Jia He 
---
 arch/arm64/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..4a9939615e41 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying fromuser will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. we don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   return !cpu_has_hw_af();
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v4 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()

2019-09-18 Thread Jia He
We unconditionally set the HW_AFDBM capability and only enable it on
CPUs which really have the feature. But sometimes we need to know
whether this CPU has hardware AF support at all. So decouple AF from
DBM with a new helper, cpu_has_hw_af().
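
For reference, bits [3:0] of ID_AA64MMFR1_EL1 hold the HAFDBS field
(0 = no hardware AF/DBM, 1 = hardware AF only, 2 = hardware AF and DBM),
which is why masking with 0xf in the hunk below is enough. An equivalent,
slightly more self-documenting way to read the field would be the sketch
here; it is only an illustration using the existing cpufeature helpers,
not part of the patch:

	/* Hardware AF is available when ID_AA64MMFR1_EL1.HAFDBS >= 1 */
	bool cpu_has_hw_af(void)
	{
		u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);

		return cpuid_feature_extract_unsigned_field(mmfr1,
						ID_AA64MMFR1_HADBS_SHIFT) > 0;
	}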

Signed-off-by: Jia He 
Suggested-by: Suzuki Poulose 
---
 arch/arm64/include/asm/cpufeature.h | 1 +
 arch/arm64/kernel/cpufeature.c  | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h 
b/arch/arm64/include/asm/cpufeature.h
index c96ffa4722d3..206b6e3954cf 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -390,6 +390,7 @@ extern DECLARE_BITMAP(boot_capabilities, ARM64_NPATCHABLE);
for_each_set_bit(cap, cpu_hwcaps, ARM64_NCAPS)
 
 bool this_cpu_has_cap(unsigned int cap);
+bool cpu_has_hw_af(void);
 void cpu_set_feature(unsigned int num);
 bool cpu_have_feature(unsigned int num);
 unsigned long cpu_get_elf_hwcap(void);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index b1fdc486aed8..c5097f58649d 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1141,6 +1141,12 @@ static bool has_hw_dbm(const struct 
arm64_cpu_capabilities *cap,
return true;
 }
 
+/* Decouple AF from AFDBM. */
+bool cpu_has_hw_af(void)
+{
+   return (read_cpuid(ID_AA64MMFR1_EL1) & 0xf);
+}
+
 #endif
 
 #ifdef CONFIG_ARM64_VHE
-- 
2.17.1



[PATCH v4 0/3] fix double page fault on arm64

2019-09-18 Thread Jia He
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest,
there was a double page fault in __copy_from_user_inatomic() of
cow_user_page().

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v4: introduce cpu_has_hw_af (Suzuki)
bail out if !pte_same (Kirill)
v3: add vmf->ptl lock/unlock (by Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin)

Jia He (3):
  arm64: cpufeature: introduce helper cpu_has_hw_af()
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/cpufeature.h |  1 +
 arch/arm64/include/asm/pgtable.h| 12 ++
 arch/arm64/kernel/cpufeature.c  |  6 +
 mm/memory.c | 35 -
 4 files changed, 49 insertions(+), 5 deletions(-)

-- 
2.17.1



[PATCH v3 1/2] arm64: mm: implement arch_faults_on_old_pte() on arm64

2019-09-13 Thread Jia He
On arm64 without hardware Access Flag, copying from user will fail because
the pte is old and cannot be marked young. So we always end up with a
zeroed page after fork() + CoW for pfn mappings, since we don't always
have a hardware-managed access flag on arm64.

Hence implement arch_faults_on_old_pte() on arm64 to indicate that
accessing an old pte might cause a page fault.

Signed-off-by: Jia He 
---
 arch/arm64/include/asm/pgtable.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e09760ece844..b41399d758df 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 #define phys_to_ttbr(addr) (addr)
 #endif
 
+/*
+ * On arm64 without hardware Access Flag, copying fromuser will fail because
+ * the pte is old and cannot be marked young. So we always end up with zeroed
+ * page after fork() + CoW for pfn mappings. we don't always have a
+ * hardware-managed access flag on arm64.
+ */
+static inline bool arch_faults_on_old_pte(void)
+{
+   return true;
+}
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
-- 
2.17.1



[PATCH v3 2/2] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-13 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64
guest, there was a double page fault in __copy_from_user_inatomic() of
cow_user_page().

The call trace below is from arm64 do_page_fault(), for debugging purposes:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung(). The parameters of
cow_user_page() are also changed so that vmf can be passed in.

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
 mm/memory.c | 30 +-
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..a64af6495f71 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -118,6 +118,13 @@ int randomize_va_space __read_mostly =
2;
 #endif
 
+#ifndef arch_faults_on_old_pte
+static inline bool arch_faults_on_old_pte(void)
+{
+   return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
randomize_va_space = 0;
@@ -2140,7 +2147,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline void cow_user_page(struct page *dst, struct page *src,
+   struct vm_fault *vmf)
 {
debug_dma_assert_idle(src);
 
@@ -2152,20 +2160,32 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 */
if (unlikely(!src)) {
void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK);
+   pte_t entry;
 
/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
-* zeroes.
+* zeroes. If PTE_AF is cleared on arm64, it might
+* cause double page fault. So makes pte young here
 */
+   if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+   spin_lock(vmf->ptl);
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vmf->vma, vmf->address,
+ vmf->pte, entry, 0))
+   update_mmu_cache(vmf->vma, vmf->address,
+vmf->pte);
+   spin_unlock(vmf->ptl);
+   }
+
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
clear_page(kaddr);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
} else
-   copy_user_highpage(dst, src, va, vma);
+   copy_user_highpage(dst, src, vmf->address, vmf->vma);
 }
 
 static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
@@ -2318,7 +2338,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
vmf->address);
if (!new_page)
goto oom;
-   cow_user_page(new_page, old_page, vmf->address, vma);
+   cow_user_page(new_page, old_page, vmf);
}
 
	if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
-- 
2.17.1



[PATCH v3 0/2] fix double page fault on arm64

2019-09-13 Thread Jia He
When we tested the pmdk unit test vmmalloc_fork TEST1 in an arm64 guest,
there was a double page fault in __copy_from_user_inatomic() of
cow_user_page().

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

Changes
v3: add vmf->ptl lock/unlock (by Kirill A. Shutemov)
add arch_faults_on_old_pte (Matthew, Catalin)
v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin)
Jia He (2):
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  mm: fix double page fault on arm64 if PTE_AF is cleared

 arch/arm64/include/asm/pgtable.h | 11 +++
 mm/memory.c  | 29 -
 2 files changed, 35 insertions(+), 5 deletions(-)

-- 
2.17.1



[PATCH v2] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-06 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64
guest, there was a double page fault in __copy_from_user_inatomic() of
cow_user_page().

The call trace below is from arm64 do_page_fault(), for debugging purposes:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung(). The parameters of
cow_user_page() are also changed so that vmf can be passed in.

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
Changes
v2: remove FAULT_FLAG_WRITE when setting pte access flag (by Catalin)

 mm/memory.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..63d4fd285e8e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2140,7 +2140,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline void cow_user_page(struct page *dst, struct page *src,
+   struct vm_fault *vmf)
 {
debug_dma_assert_idle(src);
 
@@ -2152,20 +2153,30 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 */
if (unlikely(!src)) {
void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK);
+   pte_t entry;
 
/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
-* zeroes.
+* zeroes. If PTE_AF is cleared on arm64, it might
+* cause double page fault. So makes pte young here
 */
+   if (!pte_young(vmf->orig_pte)) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vmf->vma, vmf->address,
+   vmf->pte, entry, 0))
+   update_mmu_cache(vmf->vma, vmf->address,
+   vmf->pte);
+   }
+
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
clear_page(kaddr);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
} else
-   copy_user_highpage(dst, src, va, vma);
+   copy_user_highpage(dst, src, vmf->address, vmf->vma);
 }
 
 static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
@@ -2318,7 +2329,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
vmf->address);
if (!new_page)
goto oom;
-   cow_user_page(new_page, old_page, vmf->address, vma);
+   cow_user_page(new_page, old_page, vmf);
}
 
	if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
-- 
2.17.1



[PATCH] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-03 Thread Jia He
When we tested the pmdk unit test [1] vmmalloc_fork TEST1 in an arm64
guest, there was a double page fault in __copy_from_user_inatomic() of
cow_user_page().

The call trace below is from arm64 do_page_fault(), for debugging purposes:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[9b007000] pgd=00023d4f8003, pud=00023da9b003, 
pmd=00023d4b3003, pte=36298607bd3

The key point is: we don't always have a hardware-managed access flag on
arm64.

The root cause is in copy_one_pte(): it clears PTE_AF for COW pages.
Normally, when a COW page is accessed by userspace, hardware sets the
accessed bit (PTE_AF on arm64) if the feature is supported. But on some
arm64 platforms, PTE_AF has to be set by software.
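
The fragment of copy_one_pte() being referred to looks roughly like this
(paraphrased from the mm/memory.c of that time, not quoted verbatim):

	/* the child's pte is write-protected for CoW and always marked
	 * old, i.e. PTE_AF is cleared on arm64 */
	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}
	...
	pte = pte_mkold(pte);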

This patch fixes it by calling pte_mkyoung(). The parameters of
cow_user_page() are also changed so that vmf can be passed in.

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Reported-by: Yibo Cai 
Signed-off-by: Jia He 
---
 mm/memory.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2bb51b6242e..b1f9ace2e943 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2140,7 +2140,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
+static inline void cow_user_page(struct page *dst, struct page *src,
+   struct vm_fault *vmf)
 {
debug_dma_assert_idle(src);
 
@@ -2152,20 +2153,30 @@ static inline void cow_user_page(struct page *dst, 
struct page *src, unsigned lo
 */
if (unlikely(!src)) {
void *kaddr = kmap_atomic(dst);
-   void __user *uaddr = (void __user *)(va & PAGE_MASK);
+   void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK);
+   pte_t entry;
 
/*
 * This really shouldn't fail, because the page is there
 * in the page tables. But it might just be unreadable,
 * in which case we just give up and fill the result with
-* zeroes.
+* zeroes. If PTE_AF is cleared on arm64, it might
+* cause double page fault here. so makes pte young here
 */
+   if (!pte_young(vmf->orig_pte)) {
+   entry = pte_mkyoung(vmf->orig_pte);
+   if (ptep_set_access_flags(vmf->vma, vmf->address,
+   vmf->pte, entry, vmf->flags & FAULT_FLAG_WRITE))
+   update_mmu_cache(vmf->vma, vmf->address,
+   vmf->pte);
+   }
+
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
clear_page(kaddr);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
} else
-   copy_user_highpage(dst, src, va, vma);
+   copy_user_highpage(dst, src, vmf->address, vmf->vma);
 }
 
 static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
@@ -2318,7 +2329,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
vmf->address);
if (!new_page)
goto oom;
-   cow_user_page(new_page, old_page, vmf->address, vma);
+   cow_user_page(new_page, old_page, vmf);
}
 
	if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
-- 
2.17.1



[PATCH 0/2] Fix and support dax kmem on arm64

2019-08-16 Thread Jia He
This patch set fixes some dax kmem driver issues so that pmem can be
used as normal RAM in an arm64 qemu guest.

Jia He (2):
  drivers/dax/kmem: use default numa_mem_id if target_node is invalid
  drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is
enabled

 drivers/dax/kmem.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

-- 
2.17.1



[PATCH 2/2] drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is enabled

2019-08-16 Thread Jia He
Commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for use
like normal RAM") allows adding persistent memory as normal RAM blocks.
But this driver doesn't work if CONFIG_DEV_DAX_PMEM_COMPAT is enabled.

Here is the debugging call trace when CONFIG_DEV_DAX_PMEM_COMPAT is
enabled.
[4.443730]  devm_memremap_pages+0x4b9/0x540
[4.443733]  dev_dax_probe+0x112/0x220 [device_dax]
[4.443735]  dax_pmem_compat_probe+0x58/0x92 [dax_pmem_compat]
[4.443737]  nvdimm_bus_probe+0x6b/0x150
[4.443739]  really_probe+0xf5/0x3d0
[4.443740]  driver_probe_device+0x11b/0x130
[4.443741]  device_driver_attach+0x58/0x60
[4.443742]  __driver_attach+0xa3/0x140

Then the dax0.0 device is registered on the "nd" bus instead of the
"dax" bus, which causes the following error:
root@ubuntu:~# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
-bash: echo: write error: No such device

Add a warning to notify the user of this configuration.

Signed-off-by: Jia He 
---
 drivers/dax/kmem.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index ad62d551d94e..b77f0e880598 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -93,6 +93,11 @@ static struct dax_device_driver device_dax_kmem_driver = {
 
 static int __init dax_kmem_init(void)
 {
+   if (IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)) {
+   pr_warn("CONFIG_DEV_DAX_PMEM_COMPAT is not compatible\n");
+   pr_warn("kmem dax driver might not be workable\n");
+   }
+
	return dax_driver_register(&device_dax_kmem_driver);
 }
 
-- 
2.17.1



[PATCH 1/2] drivers/dax/kmem: use default numa_mem_id if target_node is invalid

2019-08-16 Thread Jia He
On some platforms (e.g. an arm64 guest), the NFIT info might not be
ready, so target_node might be -1. But if there is a default
numa_mem_id(), we can use it instead of failing with a fatal EINVAL.

devm_memremap_pages() already falls back in the same way when its nid is
invalid, so this keeps kmem consistent with it.
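
The fallback in devm_memremap_pages() that this mirrors is essentially
(paraphrased, not the exact source):

	/* fall back to a sane node instead of failing when the caller
	 * passes an invalid nid */
	if (nid < 0)
		nid = numa_mem_id();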

Signed-off-by: Jia He 
---
 drivers/dax/kmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a02318c6d28a..ad62d551d94e 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -33,9 +33,9 @@ int dev_dax_kmem_probe(struct device *dev)
 */
numa_node = dev_dax->target_node;
if (numa_node < 0) {
-   dev_warn(dev, "rejecting DAX region %pR with invalid node: 
%d\n",
-res, numa_node);
-   return -EINVAL;
+   dev_warn(dev, "DAX %pR with invalid node, assume it as %d\n",
+   res, numa_node, numa_mem_id());
+   numa_node = numa_mem_id();
}
 
/* Hotplug starting at the beginning of the next block: */
-- 
2.17.1



[PATCH 1/2] vsprintf: Prevent crash when dereferencing invalid pointers for %pD

2019-08-08 Thread Jia He
Commit 3e5903eb9cff ("vsprintf: Prevent crash when dereferencing invalid
pointers") prevents most crashes, except for %pD, which performs an
additional pointer dereference before calling dentry_name().

For example, vma->vm_file can be NULL and passed to printk's %pD in
print_bad_pte(), which can cause a crash.

This patch fixes it by introducing a new helper, file_dentry_name().
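
A minimal example of the problem (a hypothetical caller, same shape as
the print_bad_pte() case): %pD dereferences the struct file pointer
before check_pointer() ever sees it, so a NULL file crashes inside
vsprintf instead of being printed as "(null)":

	struct file *f = NULL;

	/* pre-patch, pointer() does
	 * ((const struct file *)ptr)->f_path.dentry with no NULL check,
	 * so this oopses instead of printing "file:(null)" */
	printk("file:%pD\n", f);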

Signed-off-by: Jia He 
---
 lib/vsprintf.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 63937044c57d..b4a119176fdb 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -869,6 +869,15 @@ char *dentry_name(char *buf, char *end, const struct 
dentry *d, struct printf_sp
return widen_string(buf, n, end, spec);
 }
 
+static noinline_for_stack
+char *file_dentry_name(char *buf, char *end, const struct file *f,
+   struct printf_spec spec, const char *fmt)
+{
+   if (check_pointer(&buf, end, f, spec))
+   return buf;
+
+   return dentry_name(buf, end, f->f_path.dentry, spec, fmt);
+}
 #ifdef CONFIG_BLOCK
 static noinline_for_stack
 char *bdev_name(char *buf, char *end, struct block_device *bdev,
@@ -2166,9 +2175,7 @@ char *pointer(const char *fmt, char *buf, char *end, void 
*ptr,
case 'C':
return clock(buf, end, ptr, spec, fmt);
case 'D':
-   return dentry_name(buf, end,
-  ((const struct file *)ptr)->f_path.dentry,
-  spec, fmt);
+   return file_dentry_name(buf, end, ptr, spec, fmt);
 #ifdef CONFIG_BLOCK
case 'g':
return bdev_name(buf, end, ptr, spec, fmt);
-- 
2.17.1



[PATCH 2/2] lib/test_printf: add test of null/invalid pointer dereference for dentry

2019-08-08 Thread Jia He
This adds test cases for null/invalid pointer dereferences with dentry
and file pointers (%pd and %pD).
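
For context, the PTR_INVALID used by the new cases is already defined in
lib/test_printf.c as a pointer into the first (unmapped) page, which the
%p family must refuse to dereference and print as "(efault)"; roughly:

	/* an obviously bogus pointer that check_pointer() rejects */
	#define PTR_INVALID ((void *)0x000000ab)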

Signed-off-by: Jia He 
---
 lib/test_printf.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/lib/test_printf.c b/lib/test_printf.c
index 944eb50f3862..befedffeb476 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -455,6 +455,13 @@ dentry(void)
test("foo", "%pd", _dentry[0]);
test("foo", "%pd2", _dentry[0]);
 
+   /* test the null/invalid pointer case for dentry */
+   test("(null)", "%pd", NULL);
+   test("(efault)", "%pd", PTR_INVALID);
+   /* test the null/invalid pointer case for file */
+   test("(null)", "%pD", NULL);
+   test("(efault)", "%pD", PTR_INVALID);
+
test("romeo", "%pd", _dentry[3]);
test("alfa/romeo", "%pd2", _dentry[3]);
test("bravo/alfa/romeo", "%pd3", _dentry[3]);
-- 
2.17.1



[PATCH] arm64: mm: add missing PTE_SPECIAL in pte_mkdevmap on arm64

2019-08-06 Thread Jia He
Without this patch, the MAP_SYNC test case will cause a print_bad_pte
warning on arm64 as follows:
[   25.542693] BUG: Bad page map in process mapdax333
pte:2e8000448800f53 pmd:41ff5f003
[   25.546360] page:7e001022 refcount:1 mapcount:-1
mapping:8003e29c7440 index:0x0
[   25.550281] ext4_dax_aops
[   25.550282] name:"__aaabbbcccddd__"
[   25.551553] flags: 0x3001002(referenced|reserved)
[   25.555802] raw: 03001002 8003dfffa908 
8003e29c7440
[   25.559446] raw:   0001fffe

[   25.563075] page dumped because: bad pte
[   25.564938] addr:be05b000 vm_flags:208000fb
anon_vma: mapping:8003e29c7440 index:0
[   25.574272] file:__aaabbbcccddd__ fault:ext4_dax_fault
ap:ext4_file_mmap readpage:0x0
[   25.578799] CPU: 1 PID: 1180 Comm: mapdax333 Not tainted 5.2.0+ #21
[   25.581702] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0
02/06/2015
[   25.585624] Call trace:
[   25.587008]  dump_backtrace+0x0/0x178
[   25.588799]  show_stack+0x24/0x30
[   25.590328]  dump_stack+0xa8/0xcc
[   25.591901]  print_bad_pte+0x18c/0x218
[   25.593628]  unmap_page_range+0x778/0xc00
[   25.595506]  unmap_single_vma+0x94/0xe8
[   25.597304]  unmap_vmas+0x90/0x108
[   25.598901]  unmap_region+0xc0/0x128
[   25.600566]  __do_munmap+0x284/0x3f0
[   25.602245]  __vm_munmap+0x78/0xe0
[   25.603820]  __arm64_sys_munmap+0x34/0x48
[   25.605709]  el0_svc_common.constprop.0+0x78/0x168
[   25.607956]  el0_svc_handler+0x34/0x90
[   25.609698]  el0_svc+0x8/0xc
[   25.611103] Disabling lock debugging due to kernel taint
[   25.613573] BUG: Bad page state in process mapdax333  pfn:448800
[   25.616359] page:7e001022 refcount:0 mapcount:-1
mapping:8003e29c7440 index:0x1
[   25.620236] ext4_dax_aops
[   25.620237] name:"__aaabbbcccddd__"
[   25.621495] flags: 0x300()
[   25.624912] raw: 0300 dead0100 dead0200
8003e29c7440
[   25.628502] raw: 0001  fffe

[   25.632097] page dumped because: non-NULL mapping
[...]
[   25.656567] CPU: 1 PID: 1180 Comm: mapdax333 Tainted: GB
5.2.0+ #21
[   25.660131] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0
02/06/2015
[   25.663324] Call trace:
[   25.664466]  dump_backtrace+0x0/0x178
[   25.666163]  show_stack+0x24/0x30
[   25.667721]  dump_stack+0xa8/0xcc
[   25.669270]  bad_page+0xf0/0x150
[   25.670772]  free_pages_check_bad+0x84/0xa0
[   25.672724]  free_pcppages_bulk+0x45c/0x708
[   25.674675]  free_unref_page_commit+0xcc/0x100
[   25.676751]  free_unref_page_list+0x13c/0x200
[   25.678801]  release_pages+0x350/0x420
[   25.680539]  free_pages_and_swap_cache+0xf8/0x128
[   25.682738]  tlb_flush_mmu+0x164/0x2b0
[   25.684485]  unmap_page_range+0x648/0xc00
[   25.686349]  unmap_single_vma+0x94/0xe8
[   25.688131]  unmap_vmas+0x90/0x108
[   25.689739]  unmap_region+0xc0/0x128
[   25.691392]  __do_munmap+0x284/0x3f0
[   25.693079]  __vm_munmap+0x78/0xe0
[   25.694658]  __arm64_sys_munmap+0x34/0x48
[   25.696530]  el0_svc_common.constprop.0+0x78/0x168
[   25.698772]  el0_svc_handler+0x34/0x90
[   25.700512]  el0_svc+0x8/0xc

The root cause is in _vm_normal_page(): without the PTE_SPECIAL bit,
the return value is incorrectly set to pfn_to_page(pfn) instead of
NULL. Besides, this patch also rewrites pmd_mkdevmap() to avoid
setting PTE_SPECIAL for the pmd case.
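
The end result in arch/arm64/include/asm/pgtable.h is roughly the
following (a sketch of the change described above, not the literal diff):

	/* set PTE_SPECIAL for pte-level devmap entries so that
	 * vm_normal_page() returns NULL for them ... */
	static inline pte_t pte_mkdevmap(pte_t pte)
	{
		return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
	}

	/* ... but keep PTE_SPECIAL out of pmd_mkdevmap(), where the bit
	 * has no meaning */
	#define pmd_mkdevmap(pmd)	pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)))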

The MAP_SYNC test case is as follows (provided by Yibo Cai):
$#include <stdio.h>
$#include <string.h>
$#include <fcntl.h>
$#include <sys/mman.h>
$#include <unistd.h>

$#ifndef MAP_SYNC
$#define MAP_SYNC 0x80000
$#endif

/* mount -o dax /dev/pmem0 /mnt */
$#define F "/mnt/__aaabbbcccddd__"

int main(void)
{
int fd;
char buf[4096];
void *addr;

if ((fd = open(F, O_CREAT|O_TRUNC|O_RDWR, 0644)) < 0) {
perror("open1");
return 1;
}

if (write(fd, buf, 4096) != 4096) {
perror("lseek");
return 1;
}

addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_SYNC,
fd, 0);
if (addr == MAP_FAILED) {
perror("mmap");
printf("did you mount with '-o dax'?\n");
return 1;
}

memset(addr, 0x55, 4096);

if (munmap(addr, 4096) == -1) {
perror("munmap");
return 1;
}

close(fd);

return 0;
}

Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
Reported-by: Yibo Cai 
Signed-off-by: Jia He 
Acked-by: Robin Murphy 
---
 arch/arm64/include/asm/pgtable.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5fdcfe237338..e09760ece844 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -209,7 +209,7 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
 
 static inline pte_t pte_mkdevmap(pte_t pte)
 {
-   return set_pte_bit(

Re: [RFC PATCH v2 2/3] arm64: mark all the GICC nodes in MADT as possible cpu

2019-07-04 Thread Jia He



On 2019/6/29 10:42, Xiongfeng Wang wrote:

We set 'cpu_possible_mask' based on the enabled GICC node in MADT. If
the GICC node is disabled, we will skip initializing the kernel data
structure for that CPU.

To support CPU hotplug, we need to initialize some CPU related data
structure in advance. This patch mark all the GICC nodes as possible CPU
and only these enabled GICC nodes as present CPU.

Signed-off-by: Xiongfeng Wang 
---
  arch/arm64/kernel/setup.c |  2 +-
  arch/arm64/kernel/smp.c   | 11 +--
  2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 7e541f9..7f4d12a 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -359,7 +359,7 @@ static int __init topology_init(void)
for_each_online_node(i)
register_one_node(i);
  
-	for_each_possible_cpu(i) {

+   for_each_online_cpu(i) {


Have you considered the case in non-acpi mode, and setting "maxcpus=n"
in the host kernel boot parameters?

---
Cheers,
Justin (Jia He)



struct cpu *cpu = &per_cpu(cpu_data.cpu, i);
cpu->hotpluggable = 1;
register_cpu(cpu, i);
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 6dcf960..6d9983c 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -525,16 +525,14 @@ struct acpi_madt_generic_interrupt 
*acpi_cpu_get_madt_gicc(int cpu)
  {
u64 hwid = processor->arm_mpidr;
  
-	if (!(processor->flags & ACPI_MADT_ENABLED)) {

-   pr_debug("skipping disabled CPU entry with 0x%llx MPIDR\n", 
hwid);
-   return;
-   }
-
if (hwid & ~MPIDR_HWID_BITMASK || hwid == INVALID_HWID) {
pr_err("skipping CPU entry with invalid MPIDR 0x%llx\n", hwid);
return;
}
  
+	if (!(processor->flags & ACPI_MADT_ENABLED))

+   pr_debug("disabled CPU entry with 0x%llx MPIDR\n", hwid);
+
if (is_mpidr_duplicate(cpu_count, hwid)) {
pr_err("duplicate CPU MPIDR 0x%llx in MADT\n", hwid);
return;
@@ -755,7 +753,8 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
if (err)
continue;
  
-		set_cpu_present(cpu, true);

+   if ((cpu_madt_gicc[cpu].flags & ACPI_MADT_ENABLED))
+   set_cpu_present(cpu, true);
numa_store_cpu_info(cpu);
}
  }


--



Re: [PATCH RFC 0/3] Support CPU hotplug for ARM64

2019-07-04 Thread Jia He

Hi Xiongfeng

Sorry, I missed your later mail; you used an emulated SCI interrupt.

---
Cheers,
Justin (Jia He)

On 2019/7/4 11:26, Xiongfeng Wang wrote:

Hi Justin,

On 2019/7/4 11:00, Jia He wrote:

Hi Xiongfeng

It is a little bit awkward that I am also investigating an ACPI-based
cpu hotplug issue similar to your idea. My question is: is your purpose
to implement vcpu hotplug in arm64 qemu?

Yes, my purpose is to implement the vcpu hotplug in arm64 qemu. So that I can 
add or remove vcpu
without shutting down the Guest OS.

Thanks,
Xiongfeng


Thanks for the elaboration

---
Cheers,
Justin (Jia He)

On 2019/6/28 19:13, Xiongfeng Wang wrote:

This patchset mark all the GICC node in MADT as possible CPUs even though it
is disabled. But only those enabled GICC node are marked as present CPUs.
So that kernel will initialize some CPU related data structure in advance before
the CPU is actually hot added into the system. This patchset also implement
'acpi_(un)map_cpu()' and 'arch_(un)register_cpu()' for ARM64. These functions 
are
needed to enable CPU hotplug.

To support CPU hotplug, we need to add all the possible GICC node in MADT
including those CPUs that are not present but may be hot added later. Those
CPUs are marked as disabled in GICC nodes.

Xiongfeng Wang (3):
    ACPI / scan: evaluate _STA for processors declared via ASL Device
  statement
    arm64: mark all the GICC nodes in MADT as possible cpu
    arm64: Add CPU hotplug support

   arch/arm64/kernel/acpi.c  | 22 ++
   arch/arm64/kernel/setup.c | 19 ++-
   arch/arm64/kernel/smp.c   | 11 +--
   drivers/acpi/scan.c   | 12 
   4 files changed, 57 insertions(+), 7 deletions(-)


.


--



Re: [PATCH RFC 0/3] Support CPU hotplug for ARM64

2019-07-03 Thread Jia He

Hi Xiongfeng

On 2019/7/4 11:26, Xiongfeng Wang wrote:

Hi Justin,

On 2019/7/4 11:00, Jia He wrote:

Hi Xiongfeng

It is a little bit awkward that I am also investigating an ACPI-based
cpu hotplug issue similar to your idea. My question is: is your purpose
to implement vcpu hotplug in arm64 qemu?

Yes, my purpose is to implement the vcpu hotplug in arm64 qemu. So that I can 
add or remove vcpu
without shutting down the Guest OS.


Thanks for the info, I guess you used the GED device in qemu ;-)?

---
Cheers,
Justin (Jia He)



Re: [PATCH RFC 0/3] Support CPU hotplug for ARM64

2019-07-03 Thread Jia He

Hi Xiongfeng

It is a little bit awkward that I am also investigating an ACPI-based
cpu hotplug issue similar to your idea. My question is: is your purpose
to implement vcpu hotplug in arm64 qemu?

Thanks for the elaboration

---
Cheers,
Justin (Jia He)

On 2019/6/28 19:13, Xiongfeng Wang wrote:

This patchset mark all the GICC node in MADT as possible CPUs even though it
is disabled. But only those enabled GICC node are marked as present CPUs.
So that kernel will initialize some CPU related data structure in advance before
the CPU is actually hot added into the system. This patchset also implement
'acpi_(un)map_cpu()' and 'arch_(un)register_cpu()' for ARM64. These functions 
are
needed to enable CPU hotplug.

To support CPU hotplug, we need to add all the possible GICC node in MADT
including those CPUs that are not present but may be hot added later. Those
CPUs are marked as disabled in GICC nodes.

Xiongfeng Wang (3):
   ACPI / scan: evaluate _STA for processors declared via ASL Device
 statement
   arm64: mark all the GICC nodes in MADT as possible cpu
   arm64: Add CPU hotplug support

  arch/arm64/kernel/acpi.c  | 22 ++
  arch/arm64/kernel/setup.c | 19 ++-
  arch/arm64/kernel/smp.c   | 11 +--
  drivers/acpi/scan.c   | 12 
  4 files changed, 57 insertions(+), 7 deletions(-)



Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64

2019-06-11 Thread Jia He

Hi Hanjun

On 2019/6/11 23:18, Hanjun Guo wrote:

Hello Ard,

Thanks for the reply, please see my comments inline.

On 2019/6/10 21:16, Ard Biesheuvel wrote:

On Sat, 8 Jun 2019 at 06:22, Hanjun Guo  wrote:

Hi Ard, Will,

This week we were trying to debug an issue of time consuming in mem_init(),
and leading to this similar solution form Jia He, so I would like to bring this
thread back, please see my detail test result below.

On 2018/9/7 22:44, Will Deacon wrote:

On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:

On 22 August 2018 at 05:07, Jia He  wrote:

Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") optimized the loop in memmap_init_zone(). But it causes
possible panic bug. So Daniel Vacek reverted it later.

But as suggested by Daniel Vacek, it is fine to using memblock to skip
gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.

More from what Daniel said:
"On arm and arm64, memblock is used by default. But generic version of
pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
not always return the next valid one but skips more resulting in some
valid frames to be skipped (as if they were invalid). And that's why
kernel was eventually crashing on some !arm machines."

About the performance consideration:
As said by James in b92df1de5,
"I have tested this patch on a virtual model of a Samurai CPU with a
sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.

Besides we can remain memblock_next_valid_pfn, there is still some room
for improvement. After this set, I can see the time overhead of memmap_init
is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
memory, pagesize 64k). I believe arm server will benefit more if memory is
larger than TBs


OK so we can summarize the benefits of this series as follows:
- boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
- boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
*milliseconds*

Google was not very helpful in figuring out what a Samurai CPU is and
why we should care about the boot time of Linux running on a virtual
model of it, and the 15 ms speedup is not that compelling either.

Testing this patch set on top of Kunpeng 920 based ARM64 server, with
384G memory in total, we got the time consuming below

  without this patch set  with this patch set
mem_init()13310ms  1415ms

So we got about 8x speedup on this machine, which is very impressive.


Yes, this is impressive. But does it matter in the grand scheme of
things?

It matters for this machine, because it's for storage and there is
a watchdog and the time consuming triggers the watchdog.


How much time does this system take to arrive at this point
from power on?

Sorry, I don't have such data, as the arch timer is not initialized
and I didn't see the time stamp at this point, but I read the cycles
from arch timer before and after the time consuming function to get
how much time consumed.


The time consuming is related the memory DIMM size and where to locate those
memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
We also tested 1T memory with 64G size for each memory DIMM on another ARM64
machine, the time consuming reduced from 20s to 2s (I think it's related to
firmware implementations).


I agree that this optimization looks good in isolation, but the fact
that you spotted a bug justifies my skepticism at the time. On the
other hand, now that we have several independent reports (from you,
but also from the Renesas folks) that the speedup is worthwhile for
real world use cases, I think it does make sense to revisit it.

Thank you very much for taking care of this :)


So what I would like to see is the patch set being proposed again,
with the new data points added for documentation. Also, the commit
logs need to crystal clear about how the meaning of PFN validity
differs between ARM and other architectures, and why the assumptions
that the optimization is based on are guaranteed to hold.

I think Jia He no longer works for HXT, if don't mind, I can repost
this patch set with Jia He's authority unchanged.

Ok, I don't mind that, thanks for your followup :)

---
Cheers,
Justin (Jia He)


