[PATCH v2 1/1] irqchip/gicv3: iterate over possible CPUs by for_each_possible_cpu()
From: zijun_hu <zijun...@htc.com>

get_cpu_number() doesn't use the existing helper to iterate over possible
CPUs, so it gives a wrong result in case of a discontinuous
@cpu_possible_mask such as 0b0001. Such a discontinuous @cpu_possible_mask
most likely results from one core having failed to come up on an SMP
machine.

Fix it by using the existing helper for_each_possible_cpu().

Signed-off-by: zijun_hu <zijun...@htc.com>
---
Changes in v2:
 - more detailed commit message offered

 drivers/irqchip/irq-gic-v3.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 519149ec9053..b5df99c6f680 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1042,7 +1042,7 @@ static int get_cpu_number(struct device_node *dn)
 {
 	const __be32 *cell;
 	u64 hwid;
-	int i;
+	int cpu;
 
 	cell = of_get_property(dn, "reg", NULL);
 	if (!cell)
@@ -1056,9 +1056,9 @@ static int get_cpu_number(struct device_node *dn)
 	if (hwid & ~MPIDR_HWID_BITMASK)
 		return -1;
 
-	for (i = 0; i < num_possible_cpus(); i++)
-		if (cpu_logical_map(i) == hwid)
-			return i;
+	for_each_possible_cpu(cpu)
+		if (cpu_logical_map(cpu) == hwid)
+			return cpu;
 
 	return -1;
 }
-- 
1.9.1
Re: [PATCH 1/1] irqchip/gicv3: iterate over possible CPUs by for_each_possible_cpu()
On 09/15/2017 03:20 AM, Marc Zyngier wrote:
> On Thu, Sep 14 2017 at 1:15:14 pm BST, zijun_hu <zijun...@zoho.com> wrote:
>> From: zijun_hu <zijun...@htc.com>
>>
>> get_cpu_number() doesn't use existing helper to iterate over possible
>> CPUs, so error happens in case of discontinuous @cpu_possible_mask
>> such as 0b0001.
>
> Do you have an example of such a situation? Your patch is definitely an
> improvement, but I'd like to understand how you get there...
>
> Thanks,
>
> 	M.
>

A few conditions that may result in a discontiguous @cpu_possible_mask
are noticed and handled by the ARM64 init code, as indicated by the code
segments below.

In arch/arm64/kernel/smp.c:

void __init smp_init_cpus(void)
{
	...
	/*
	 * We need to set the cpu_logical_map entries before enabling
	 * the cpus so that cpu processor description entries (DT cpu nodes
	 * and ACPI MADT entries) can be retrieved by matching the cpu hwid
	 * with entries in cpu_logical_map while initializing the cpus.
	 * If the cpu set-up fails, invalidate the cpu_logical_map entry.
	 */
	for (i = 1; i < nr_cpu_ids; i++) {
		if (cpu_logical_map(i) != INVALID_HWID) {
			if (smp_cpu_setup(i))
				cpu_logical_map(i) = INVALID_HWID;
		}
	}
	...
}

/*
 * Initialize cpu operations for a logical cpu and
 * set it in the possible mask on success
 */
static int __init smp_cpu_setup(int cpu)
{
	if (cpu_read_ops(cpu))
		return -ENODEV;

	if (cpu_ops[cpu]->cpu_init(cpu))
		return -ENODEV;

	set_cpu_possible(cpu, true);
	return 0;
}

I browsed the GICv3 driver code and noticed these slightly weird code
segments.
[PATCH 1/1] irqchip/gicv3: iterate over possible CPUs by for_each_possible_cpu()
From: zijun_hu <zijun...@htc.com>

get_cpu_number() doesn't use the existing helper to iterate over possible
CPUs, so an error happens in case of a discontinuous @cpu_possible_mask
such as 0b0001.

Fix it by using the existing helper for_each_possible_cpu().

Signed-off-by: zijun_hu <zijun...@htc.com>
---
 drivers/irqchip/irq-gic-v3.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 19d642eae096..2a0427c2971e 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -990,7 +990,7 @@ static int get_cpu_number(struct device_node *dn)
 {
 	const __be32 *cell;
 	u64 hwid;
-	int i;
+	int cpu;
 
 	cell = of_get_property(dn, "reg", NULL);
 	if (!cell)
@@ -1004,9 +1004,9 @@ static int get_cpu_number(struct device_node *dn)
 	if (hwid & ~MPIDR_HWID_BITMASK)
 		return -1;
 
-	for (i = 0; i < num_possible_cpus(); i++)
-		if (cpu_logical_map(i) == hwid)
-			return i;
+	for_each_possible_cpu(cpu)
+		if (cpu_logical_map(cpu) == hwid)
+			return cpu;
 
 	return -1;
 }
-- 
1.9.1
Re: [PATCH 1/1] workqueue: use type int instead of bool to index array
On 2017/9/7 0:40, Tejun Heo wrote:
> On Thu, Sep 07, 2017 at 12:04:59AM +0800, zijun_hu wrote:
>> On 2017/9/6 22:33, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Wed, Sep 06, 2017 at 11:34:14AM +0800, zijun_hu wrote:
>>>> From: zijun_hu <zijun...@htc.com>
>>>>
>>>> type bool is used to index three arrays in alloc_and_link_pwqs()
>>>> it doesn't look like conventional.
>>>>
>>>> it is fixed by using type int to index the relevant arrays.
>>>
>>> bool is a uint type which can be either 0 or 1.  I don't see what the
>>> benefit of this patch is.
>>>
>> bool is NOT a uint type now, it is a new type introduced by gcc, it is
>> rather different with "typedef int bool" historically
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n815.htm
>
> Because C has existed for so long without a Boolean type, however, the
> new standard must coexist with the old remedies. Therefore, the type
> name is taken from the reserved identifier space. To maintain
> orthogonal promotion rules, the Boolean type is defined as an unsigned
> integer type capable of representing the values 0 and 1. The more
> conventional names for the type and its values are then made available
> only with the inclusion of the <stdbool.h> header. In addition, the
> <stdbool.h> header defines a feature test macro to aid in integrating
> new code with old code that defines its own Boolean type.
>

In this case, I think type int is more suitable than bool in terms of
extensibility, programming convention, and consistency.
Re: [PATCH 1/1] workqueue: use type int instead of bool to index array
On 2017/9/6 22:33, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 06, 2017 at 11:34:14AM +0800, zijun_hu wrote:
>> From: zijun_hu <zijun...@htc.com>
>>
>> type bool is used to index three arrays in alloc_and_link_pwqs()
>> it doesn't look like conventional.
>>
>> it is fixed by using type int to index the relevant arrays.
>
> bool is a uint type which can be either 0 or 1.  I don't see what the
> benefit of this patch is.
>

bool is NOT a plain uint type now; it is a distinct type introduced by gcc,
and it behaves rather differently from the historical "typedef int bool".
See the following code segment for more info about type bool:

	bool v = 0x10;
	printf("v = %d\n", v);

The output is "v = 1".

An invalid array index could result if bool were represented as a plain
uint:

	bool highpri = wq->flags & WQ_HIGHPRI;

WQ_HIGHPRI = 1 << 4, so @highpri could be 16, but the number of array
elements is 2.

bool is a logic value whose valid values are true or false; indexing an
array by type bool is not a good programming convention.

It is also more extensible to use type int: type bool would become improper
if the number of array elements were extended beyond 2 in the future.

Besides, the relevant arrays are indexed by type int in many other places
in the same source file, so this patch keeps consistency.
[PATCH 1/1] workqueue: use type int instead of bool to index array
From: zijun_hu <zijun...@htc.com>

type bool is used to index three arrays in alloc_and_link_pwqs();
it doesn't look conventional.

It is fixed by using type int to index the relevant arrays.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
 kernel/workqueue.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ab3c0dc8c7ed..0c7a16792949 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3917,7 +3917,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
 
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 {
-	bool highpri = wq->flags & WQ_HIGHPRI;
+	int pool_idx = (wq->flags & WQ_HIGHPRI) ? 1 : 0;
 	int cpu, ret;
 
 	if (!(wq->flags & WQ_UNBOUND)) {
@@ -3931,7 +3931,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 			struct worker_pool *cpu_pools =
 				per_cpu(cpu_worker_pools, cpu);
 
-			init_pwq(pwq, wq, &cpu_pools[highpri]);
+			init_pwq(pwq, wq, &cpu_pools[pool_idx]);
 
 			mutex_lock(&wq->mutex);
 			link_pwq(pwq);
@@ -3939,14 +3939,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 		}
 		return 0;
 	} else if (wq->flags & __WQ_ORDERED) {
-		ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
+		ret = apply_workqueue_attrs(wq, ordered_wq_attrs[pool_idx]);
 		/* there should only be single pwq for ordering guarantee */
 		WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
 			      wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
 		     "ordering guarantee broken for workqueue %s\n", wq->name);
 		return ret;
 	} else {
-		return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
+		return apply_workqueue_attrs(wq,
					     unbound_std_wq_attrs[pool_idx]);
 	}
 }
-- 
1.9.1
Re: [PATCH] mm/vmalloc: add vm_struct for vm_map_ram area
On 07/19/2017 06:44 PM, Zhaoyang Huang wrote:
> /proc/vmallocinfo will not show the area allocated by vm_map_ram, which
> will make confusion when debug. Add vm_struct for them and show them in
> proc.
>
> Signed-off-by: Zhaoyang Huang
> ---

Another patch titled "vmalloc: show lazy-purged vma info in vmallocinfo"
has already been merged into the linux-next branch to resolve this problem.

>  mm/vmalloc.c | 27 ++++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 34a1c3e..4a2e93c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -46,6 +46,9 @@ struct vfree_deferred {
>
>  static void __vunmap(const void *, int);
>
> +static void setup_vmap_ram_vm(struct vm_struct *vm, struct vmap_area *va,
> +			unsigned long flags, const void *caller);
> +
>  static void free_work(struct work_struct *w)
>  {
>  	struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
> @@ -315,6 +318,7 @@ unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
>  /*** Global kva allocator ***/
>
>  #define VM_VM_AREA	0x04
> +#define VM_VM_RAM	0x08
>
>  static DEFINE_SPINLOCK(vmap_area_lock);
>  /* Export for kexec only */
> @@ -1141,6 +1145,7 @@ void vm_unmap_ram(const void *mem, unsigned int count)
>
>  	va = find_vmap_area(addr);
>  	BUG_ON(!va);
> +	kfree(va->vm);
>  	free_unmap_vmap_area(va);
>  }
>  EXPORT_SYMBOL(vm_unmap_ram);
> @@ -1173,6 +1178,12 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
>  		addr = (unsigned long)mem;
>  	} else {
>  		struct vmap_area *va;
> +		struct vm_struct *area;
> +
> +		area = kzalloc_node(sizeof(*area), GFP_KERNEL, node);
> +		if (unlikely(!area))
> +			return NULL;
> +
>  		va = alloc_vmap_area(size, PAGE_SIZE,
>  				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
>  		if (IS_ERR(va))
> @@ -1180,6 +1191,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot)
>
>  		addr = va->va_start;
>  		mem = (void *)addr;
> +		setup_vmap_ram_vm(area, va, 0, __builtin_return_address(0));
>  	}
>  	if (vmap_page_range(addr, addr + size, prot, pages) < 0) {
>  		vm_unmap_ram(mem, count);
> @@ -1362,6 +1374,19 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
>  	spin_unlock(&vmap_area_lock);
>  }
>
> +static void setup_vmap_ram_vm(struct vm_struct *vm, struct vmap_area *va,
> +			unsigned long flags, const void *caller)
> +{
> +	spin_lock(&vmap_area_lock);
> +	vm->flags = flags;
> +	vm->addr = (void *)va->va_start;
> +	vm->size = va->va_end - va->va_start;
> +	vm->caller = caller;
> +	va->vm = vm;
> +	va->flags |= VM_VM_RAM;
> +	spin_unlock(&vmap_area_lock);
> +}
> +
>  static void clear_vm_uninitialized_flag(struct vm_struct *vm)
>  {
>  	/*
> @@ -2698,7 +2723,7 @@ static int s_show(struct seq_file *m, void *p)
>  	 * s_show can encounter race with remove_vm_area, !VM_VM_AREA on
>  	 * behalf of vmap area is being tear down or vm_map_ram allocation.
>  	 */
> -	if (!(va->flags & VM_VM_AREA))
> +	if (!(va->flags & (VM_VM_AREA | VM_VM_RAM)))
>  		return 0;
>
>  	v = va->vm;
>
Re: [PATCH v3] mm/vmalloc: terminate searching since one node found
On 07/18/2017 04:31 PM, Zhaoyang Huang (黄朝阳) wrote:
>
> It is no need to find the very beginning of the area within
> alloc_vmap_area, which can be done by judging each node during the process
>

It seems the original code is written to achieve the following two purposes:
A, the resulting vmap_area has the lowest available address in the required
   range [vstart, vend);
B, it may update the cached vmap_area node info, which can speed up other
   related allocations.
It looks redundant but is conventional and necessary; this approach may
destroy both of the original purposes.

> For current approach, the worst case is that the starting node which be found
> for searching the 'vmap_area_list' is close to the 'vstart', while the final
> available one is round to the tail(especially for the left branch).
> This commit have the list searching start at the first available node, which
> will save the time of walking the rb tree'(1)' and walking the list'(2)'.
>
>            vmap_area_root
>            /            \
>       tmp_next           U
>         /
>      tmp
>      /
>    ...   (1)
>    /
>  first (current approach)
>
> @tmp_next is the next node of @tmp in the ordered list_head, not in the rbtree
>
> vmap_area_list->...->first->...->tmp->tmp_next
>                                  (2)
>
> Signed-off-by: Zhaoyang Huang
> ---
>  mm/vmalloc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 34a1c3e..9a5c177 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -459,9 +459,18 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>
>  	while (n) {
>  		struct vmap_area *tmp;
> +		struct vmap_area *tmp_next;
>  		tmp = rb_entry(n, struct vmap_area, rb_node);
> +		tmp_next = list_next_entry(tmp, list);
>  		if (tmp->va_end >= addr) {
>  			first = tmp;
> +			if (ALIGN(tmp->va_end, align) + size
> +				< tmp_next->va_start) {

If the @tmp node does not locate within the required range [vstart, vend)
but to the right of it, it may still satisfy this condition; and even if it
locates within the range, it may not have the lowest free address.
If @tmp doesn't have a next node, tmp_next->va_start will cause a NULL
dereference.

> +				addr = ALIGN(tmp->va_end, align);
> +				if (cached_hole_size >= size)
> +					cached_hole_size = 0;

It seems a little rough to reset @cached_hole_size this way; it will cause
the cached info to be updated in the next allocation regardless of the
allocation arguments.

> +				goto found;
> +			}
>  			if (tmp->va_start <= addr)
>  				break;
>  			n = n->rb_left;
> --
> 1.9.1
>
Re: [PATCH v2] mm/vmalloc: terminate searching since one node found
On 07/17/2017 04:45 PM, zijun_hu wrote:
> On 07/17/2017 04:07 PM, Zhaoyang Huang wrote:
>> It is no need to find the very beginning of the area within
>> alloc_vmap_area, which can be done by judging each node during the process
>>
>> For current approach, the worst case is that the starting node which be found
>> for searching the 'vmap_area_list' is close to the 'vstart', while the final
>> available one is round to the tail(especially for the left branch).
>> This commit have the list searching start at the first available node, which
>> will save the time of walking the rb tree'(1)' and walking the list(2).
>>
>>            vmap_area_root
>>            /            \
>>       tmp_next           U
>>         /        (1)
>>      tmp
>>      /
>>    ...
>>    /
>>  first (current approach)
>>
>> vmap_area_list->...->first->...->tmp->tmp_next
>>                                  (2)
>
> The original code can ensure the following two points:
> A, the resulting vmap_area has the lowest available address in the range
>    [vstart, vend);
> B, it can maintain the cached vmap_area node correctly, which can speed up
>    related allocations.
> I suspect this patch may destroy both of the above points.
>
>> Signed-off-by: Zhaoyang Huang <zhaoyang.hu...@spreadtrum.com>
>> ---
>>  mm/vmalloc.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 34a1c3e..f833e07 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -459,9 +459,16 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>>
>>  	while (n) {
>>  		struct vmap_area *tmp;
>> +		struct vmap_area *tmp_next;
>>  		tmp = rb_entry(n, struct vmap_area, rb_node);
>> +		tmp_next = list_next_entry(tmp, list);
>>  		if (tmp->va_end >= addr) {
>>  			first = tmp;
>> +			if (ALIGN(tmp->va_end, align) + size
>> +				< tmp_next->va_start) {
>> +				addr = ALIGN(tmp->va_end, align);
>> +				goto found;
>> +			}
> Is the aim vmap_area the lowest available one if the goto occurs?
> It will possibly bypass the later update of the cached vmap_area info
> @cached_hole_size if the goto occurs.

I think the aim area possibly does not locate within the required range
[vstart, vend) either.

>>  			if (tmp->va_start <= addr)
>>  				break;
>>  			n = n->rb_left;
>> --
>> 1.9.1
>>
>
Re: [PATCH v2] mm/vmalloc: terminate searching since one node found
On 07/17/2017 04:07 PM, Zhaoyang Huang wrote: > There is no need to find the very beginning of the area within > alloc_vmap_area, which can be done by judging each node during the process > > For the current approach, the worst case is that the starting node found > for searching the 'vmap_area_list' is close to the 'vstart', while the final > available one is round to the tail (especially for the left branch). > This commit has the list searching start at the first available node, which > will save the time of walking the rb tree '(1)' and walking the list (2). > > vmap_area_root > / \ > tmp_next U > / (1) > tmp >/ > ... > / > first(current approach) > > vmap_area_list->...->first->...->tmp->tmp_next the original code can ensure the following two points: A, the result vmap_area has the lowest available address in the range [vstart, vend) B, it can maintain the cached vmap_area node rightly, which can speed up related allocation i suspect this patch may destroy the above two points > (2) > > Signed-off-by: Zhaoyang Huang > --- > mm/vmalloc.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 34a1c3e..f833e07 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -459,9 +459,16 @@ static struct vmap_area *alloc_vmap_area(unsigned > long size, > > while (n) { > struct vmap_area *tmp; > + struct vmap_area *tmp_next; > tmp = rb_entry(n, struct vmap_area, rb_node); > + tmp_next = list_next_entry(tmp, list); > if (tmp->va_end >= addr) { > first = tmp; > + if (ALIGN(tmp->va_end, align) + size > + < tmp_next->va_start) { > + addr = ALIGN(tmp->va_end, align); > + goto found; > + } is the aim vmap_area the lowest available one if the goto occurs ? it will possibly bypass the later cached vmap_area info (cached_hole_size) update if the goto occurs > if (tmp->va_start <= addr) > break; > n = n->rb_left; > -- > 1.9.1 >
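Guarantee A in the reply, that alloc_vmap_area() returns the lowest available address in [vstart, vend), is exactly what an early exit at an arbitrary rbtree node cannot promise. Below is a minimal user-space sketch of that guarantee: scan the sorted busy list from vstart and take the first hole that fits. Plain arrays stand in for the kernel's vmap_area_list; `struct area`, `lowest_fit`, and all names here are illustrative, not kernel API, and `align` is assumed to be a power of 2.

```c
#include <assert.h>

/* Occupied range [start, end), kept sorted by address, standing in
 * for the kernel's vmap_area_list. Illustrative only. */
struct area { unsigned long start, end; };

/* Return the lowest aligned address in [vstart, vend) where a hole of
 * @size bytes fits, or 0 if none does (a bounds check against vend on
 * the early return is omitted for brevity). */
static unsigned long lowest_fit(const struct area *a, int n,
                                unsigned long size, unsigned long align,
                                unsigned long vstart, unsigned long vend)
{
    unsigned long addr = (vstart + align - 1) & ~(align - 1);
    int i;

    for (i = 0; i < n; i++) {
        if (addr + size <= a[i].start)   /* hole before a[i] fits */
            return addr;
        if (a[i].end > addr)             /* bump past this busy area */
            addr = (a[i].end + align - 1) & ~(align - 1);
    }
    return addr + size <= vend ? addr : 0;
}
```

Because the scan proceeds from vstart in address order, the first fitting hole is necessarily the lowest one; the patched loop's `goto found` instead fires at whichever node the rbtree descent happens to visit, which is what the reply questions.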
Re: [RFC PATCH 1/1] mm/vmalloc.c: correct logic errors when insert vmap_area
On 10/13/2016 02:39 PM, zijun_hu wrote: Hi Nicholas, could you give some comments for this patch? thanks a lot > Hi Nicholas, > > i find __insert_vmap_area() was introduced by you > could you offer comments for this patch related to that function > > thanks > > On 10/12/2016 10:46 PM, Michal Hocko wrote: >> [Let's CC Nick who has written this code] >> >> On Wed 12-10-16 22:30:13, zijun_hu wrote: >>> From: zijun_hu <zijun...@htc.com> >>> >>> the KVA allocator organizes allocated vmap_areas in an rbtree. in order to >>> insert a new vmap_area @i_va into the rbtree, walk the rbtree from the >>> root and compare the vmap_area @t_va met on the rbtree against @i_va; walk >>> toward the left branch of @t_va if @i_va is lower than @t_va, and the right >>> branch if higher, otherwise handle this error case since @i_va overlaps >>> with @t_va; however, __insert_vmap_area() doesn't follow the desired >>> procedure correctly; moreover, it includes a meaningless else if condition >>> and a redundant else branch as shown by the comments in the below code segments: >>> static void __insert_vmap_area(struct vmap_area *va) >>> { >>> as an internal interface parameter, we assume vmap_area @va has nonzero size >>> ... >>> if (va->va_start < tmp->va_end) >>> p = &(*p)->rb_left; >>> else if (va->va_end > tmp->va_start) >>> p = &(*p)->rb_right; >>> this else if condition is always true and meaningless due to >>> va->va_end > va->va_start >= tmp_va->va_end > tmp_va->va_start normally >>> else >>> BUG(); >>> this BUG() is meaningless too since it is never reached normally >>> ... >>> } >>> >>> it looks like the else if condition and else branch are cancelled. no errors >>> are caused since the vmap_area @va to insert, as an internal interface >>> parameter, doesn't overlap with any one on the rbtree normally. however >>> __insert_vmap_area() looks weird and really has several logic errors as >>> pointed out above when it is viewed as a separate function. 
>> >> I have tried to read this several times but I am completely lost to >> understand what the actual bug is and how it causes vmap_area sorting to >> misbehave. So is this a correctness issue, performance improvement or >> theoretical fix for an incorrect input? >> >>> fix by walking the vmap_area rbtree as described above to insert >>> a vmap_area. >>> >>> BTW, (va->va_end == tmp_va->va_start) is considered a legal case since it >>> indicates vmap_area @va left-neighbors @tmp_va tightly. >>> >>> Fixes: db64fe02258f ("mm: rewrite vmap layer") >>> Signed-off-by: zijun_hu <zijun...@htc.com> >>> --- >>> mm/vmalloc.c | 8 >>> 1 file changed, 4 insertions(+), 4 deletions(-) >>> >>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c >>> index 5daf3211b84f..8b80931654b7 100644 >>> --- a/mm/vmalloc.c >>> +++ b/mm/vmalloc.c >>> @@ -321,10 +321,10 @@ static void __insert_vmap_area(struct vmap_area *va) >>> >>> parent = *p; >>> tmp_va = rb_entry(parent, struct vmap_area, rb_node); >>> - if (va->va_start < tmp_va->va_end) >>> - p = &(*p)->rb_left; >>> - else if (va->va_end > tmp_va->va_start) >>> - p = &(*p)->rb_right; >>> + if (va->va_end <= tmp_va->va_start) >>> + p = &(*p)->rb_left; >>> + else if (va->va_start >= tmp_va->va_end) >>> + p = &(*p)->rb_right; >>> else >>> BUG(); >>> } >>> -- >>> 1.9.1 >> >
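The corrected descent order in the patch can be sketched in user space with a plain binary search tree standing in for the kernel rbtree. `struct node` and `insert()` below are illustrative, not kernel API; the comparison mirrors the patched hunk.

```c
#include <assert.h>
#include <stdlib.h>

/* A vmap_area stand-in covering [va_start, va_end). */
struct node {
    unsigned long va_start, va_end;
    struct node *left, *right;
};

/* Go left when @va ends at or before the current node starts, right
 * when it begins at or after the current node ends; touching end
 * points are legal, anything else is an overlap (BUG() in the patch). */
static struct node *insert(struct node *root, struct node *va)
{
    if (!root)
        return va;
    if (va->va_end <= root->va_start)
        root->left = insert(root->left, va);
    else if (va->va_start >= root->va_end)
        root->right = insert(root->right, va);
    else
        abort();
    return root;
}
```

With this ordering the overlap branch is genuinely reachable on bad input, which is the point of the patch: the original pair of conditions made the else if always true and the BUG() dead.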
[PATCH 1/1] mm/percpu.c: append alignment sanity checkup to avoid memory leakage
From: zijun_hu <zijun...@htc.com> the percpu allocator currently only works well when allocating a power of 2 aligned area, but there aren't any hints about this alignment requirement, so memory leakage may be caused when allocating areas with other alignments. the alignment must be at least an even number since the LSB of a chunk->map element is used as the free/in-use flag of an area; besides, the alignment must also be a power of 2 since ALIGN() doesn't always work well for other alignments but is adopted by pcpu_fit_in_area(). IOW, the current allocator only works well for a power of 2 aligned area allocation. see the counterexample below for why an odd alignment doesn't work let's assume area [16, 36) is free but its previous one is in-use, and we want to allocate a @size == 8 and @align == 7 area. the larger area [16, 36) is split into three areas [16, 21), [21, 29), [29, 36) eventually. however, due to the usage of a chunk->map element, the actual offset of the aim area [21, 29) is 21 but is recorded in the relevant element as 20; moreover the residual tail free area [29, 36) is mistaken as in-use and is lost silently unlike macro roundup(), ALIGN(x, a) doesn't work if @a isn't a power of 2 for example, roundup(10, 6) == 12 but ALIGN(10, 6) == 10, and the latter result obviously isn't desired. 
fix it by appending a sanity check for the alignment requirement Signed-off-by: zijun_hu <zijun...@htc.com> Suggested-by: Tejun Heo <t...@kernel.org> --- include/linux/kernel.h | 1 + mm/percpu.c | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index bc6ed52a39b9..0dc0b21bd164 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -45,6 +45,7 @@ #define REPEAT_BYTE(x) ((~0ul / 0xff) * (x)) +/* @a is a power of 2 value */ #define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) #define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask)) #define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a))) diff --git a/mm/percpu.c b/mm/percpu.c index 255714302394..10ba3f9a3826 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -886,7 +886,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, size = ALIGN(size, 2); - if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) { + if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE + || !is_power_of_2(align))) { WARN(true, "illegal size (%zu) or align (%zu) for percpu allocation\n", size, align); return NULL; -- 1.9.1
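The roundup()/ALIGN() contrast in the commit message can be checked directly. The macro bodies below are simplified stand-alone copies of the kernel definitions, reproduced here so the claim is testable outside the kernel tree:

```c
#include <assert.h>

/* ALIGN() masks low bits and is therefore only correct when @a is a
 * power of 2; roundup() divides and works for any non-zero @a. */
#define ALIGN(x, a)    (((x) + ((a) - 1)) & ~((a) - 1))
#define roundup(x, y)  ((((x) + ((y) - 1)) / (y)) * (y))
```

For the odd alignment from the example, ALIGN(10, 6) yields 10 while roundup(10, 6) yields the correct 12; for a power of 2 the two agree, which is why the bug only bites non-power-of-2 alignments.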
Re: [RFC v2 PATCH] mm/percpu.c: simplify grouping CPU algorithm
On 2016/10/14 8:33, Tejun Heo wrote: > Hello, > > On Fri, Oct 14, 2016 at 07:49:44AM +0800, zijun_hu wrote: >> the main intent of this change is making the CPU grouping algorithm easier >> to understand, especially for newcomers to memory management >> take me as an example, it really took me a long time to understand it > > If the new code is easier to understand, it's only so marginally. It > just isn't worth the effort or risk. > > Thanks. > Okay, I agree with your opinion, but I am sure these changes don't carry any risk after tests and theoretical analysis.
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 2016/10/14 8:28, Tejun Heo wrote: > Hello, > > On Fri, Oct 14, 2016 at 08:23:06AM +0800, zijun_hu wrote: >> for the current code, only a power of 2 alignment value works well >> >> is it acceptable to perform a power of 2 check and return an error code >> if it fails? > > Yeah, just add is_power_of_2() test to the existing sanity check. > > Thanks. > Okay, I will do that.
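The agreed-on check is the classic single-bit trick behind the kernel's is_power_of_2() (include/linux/log2.h); reproduced here as a sketch, not the kernel source:

```c
#include <assert.h>

/* A power of 2 has exactly one bit set, so n & (n - 1) clears it to
 * zero; the n != 0 guard rejects zero, which the mask test alone
 * would wrongly accept. */
static int is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}
```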
Re: [RFC PATCH 1/1] mm/percpu.c: fix several trivial issues
On 2016/10/14 8:34, Tejun Heo wrote: > On Tue, Oct 11, 2016 at 09:29:27PM +0800, zijun_hu wrote: >> From: zijun_hu <zijun...@htc.com> >> >> as shown by pcpu_setup_first_chunk(), the first chunk is the same as the >> reserved chunk if the reserved size is nonzero but the dynamic one is zero >> this special scenario is referred to as the special case in the content below >> >> fix several trivial issues: >> >> 1) correct or fix several comments >> the LSB of a chunk->map element is used as the free/in-use flag and is cleared >> for a free area and set for an in-use one, rather than using a positive/negative number >> to mark the area state. >> >> 2) change atomic size to PAGE_SIZE for consistency when CONFIG_SMP == n >> both the default setup_per_cpu_areas() and pcpu_page_first_chunk() >> use PAGE_SIZE as the atomic size when CONFIG_SMP == y; however >> setup_per_cpu_areas() allocates memory for the only unit with alignment >> PAGE_SIZE but assigns the unit size to the atomic size when CONFIG_SMP == n, so the >> atomic size isn't consistent with either the alignment or the SMP ones. >> fix it by changing the atomic size to PAGE_SIZE when CONFIG_SMP == n >> >> 3) correct an empty and populated pages statistics error >> in order to service dynamic atomic memory allocation, the number of empty >> and populated pages of chunks is counted to stay above a low threshold. >> however, for the special case, the first chunk is taken into account by >> pcpu_setup_first_chunk(); this is meaningless since the chunk doesn't include >> any dynamic areas. >> fix it by excluding the reserved chunk before the statistics as the other >> contexts do. 
>> >> 4) fix potential memory leakage in percpu_init_late() >> in order to manage chunk->map memory uniformly, for the first and reserved >> chunks, percpu_init_late() will allocate memory to replace the static >> chunk->map array within section .init.data after slab is brought up >> however, for the special case, memory is allocated for the same chunk->map >> twice since the first chunk reference is the same as the reserved one, so the >> memory allocated the first time is obviously leaked. >> fix it by eliminating the second memory allocation under the special case >> >> Signed-off-by: zijun_hu <zijun...@htc.com> > > Can you please break the changes into separate patches? Yes, I can. Could you give comments on these trivial issues first? I will then separate out and formally produce the ones you think are reasonable. thanks > > Thanks. >
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 2016/10/14 7:31, Tejun Heo wrote: > On Tue, Oct 11, 2016 at 09:24:50PM +0800, zijun_hu wrote: >> From: zijun_hu <zijun...@htc.com> >> >> the LSB of a chunk->map element is used as the free/in-use flag of an area >> and the other bits for the offset; the sufficient and necessary condition of >> this usage is that both the size and alignment of an area must be even numbers >> however, pcpu_alloc() doesn't explicitly force its @align parameter to an even >> number, so an odd @align may cause a series of errors, see the below >> example for concrete descriptions. >> >> let's assume area [16, 36) is free but its previous one is in-use, and we want >> to allocate a @size == 8 and @align == 7 area. the larger area [16, 36) is >> split into three areas [16, 21), [21, 29), [29, 36) eventually. however, due >> to the usage of a chunk->map element, the actual offset of the aim area >> [21, 29) is 21 but is recorded in the relevant element as 20; moreover the >> residual tail free area [29, 36) is mistaken as in-use and is lost silently >> >> as explained above, an inaccurate offset or free/in-use state of >> an area is recorded into the relevant chunk->map element if an odd >> alignment area is requested, and so causes a memory leakage issue >> >> fix it by forcing the @align of an area to allocate to an even number >> as is done for @size. >> >> BTW, macro ALIGN() within pcpu_fit_in_area() is replaced by roundup() too >> for the above reason. in order to align a value @v up to an @a boundary, macro >> roundup(v, a) is more generic than ALIGN(x, a); the latter doesn't work >> well when @a isn't a power of 2 value. for example, roundup(10, 6) == 12 >> but ALIGN(10, 6) == 10, and the former result is obviously the desired one >> >> Signed-off-by: zijun_hu <zijun...@htc.com> > > Nacked-by: Tejun Heo <t...@kernel.org> > > This is a fix for an imaginary problem. The most we should do about > odd alignment is triggering a WARN_ON. 
> For the current code, only a power of 2 alignment value works well. Is it acceptable to perform a power of 2 check and return an error code on failure?
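The encoding that makes odd offsets dangerous can be modeled in a few lines: a chunk->map element keeps the offset in its high bits and the free/in-use flag in bit 0, so only even offsets survive a round trip. The helpers below are an illustrative toy model, not the kernel's representation verbatim:

```c
#include <assert.h>

/* Bit 0 = in-use flag, remaining bits = offset. */
static int map_entry(int off, int in_use) { return off | in_use; }
static int map_offset(int e)              { return e & ~1; }
static int map_in_use(int e)              { return e & 1; }
```

With the odd offset 21 from the commit message, the stored entry decodes back as offset 20, and a free area at offset 21 even reads back as in-use, matching the silently-lost-area symptom described above.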
Re: [RFC v2 PATCH] mm/percpu.c: fix panic triggered by BUG_ON() falsely
On 2016/10/14 7:29, Tejun Heo wrote: > On Tue, Oct 11, 2016 at 10:00:28PM +0800, zijun_hu wrote: >> From: zijun_hu <zijun...@htc.com> >> >> as shown by pcpu_build_alloc_info(), the number of units within a percpu >> group is deduced by rounding up the number of CPUs within the group to the >> @upa boundary; therefore, the number of CPUs isn't equal to the number of units >> if it isn't aligned to @upa normally. however, pcpu_page_first_chunk() >> uses BUG_ON() to roughly assert that one number is equal to the other, so a panic >> may be triggered falsely by the BUG_ON(). >> >> in order to fix this issue, the number of CPUs is rounded up then compared >> with the number of units, and the BUG_ON() is replaced by a warning and returning an error code >> as well to keep the system alive as much as possible. > > I really can't decode what the actual issue is here. Can you please > give an example of a concrete case? > the right relationship between the number of CPUs @nr_cpus within a percpu group and the number of units @nr_units within the same group is that @nr_units == roundup(@nr_cpus, @upa); the process of consideration is shown as follows: 1) current code segments: BUG_ON(ai->nr_groups != 1); BUG_ON(ai->groups[0].nr_units != num_possible_cpus()); 2) changes considering the right relationship between the number of CPUs and units: BUG_ON(ai->nr_groups != 1); BUG_ON(ai->groups[0].nr_units != roundup(num_possible_cpus(), @upa)); 3) replace BUG_ON() with a warning and returning an error code, since it seems BUG_ON() isn't nice as shown by Linus' recent LKML mail: BUG_ON(ai->nr_groups != 1); if (ai->groups[0].nr_units != roundup(num_possible_cpus(), @upa)) return -EINVAL; so 3) is my final change; for the relationship of both numbers, see the reply to Andrew >> @@ -2113,21 +2120,22 @@ int __init pcpu_page_first_chunk(size_t >> reserved_size, >> >> /* allocate pages */ >> j = 0; >> -for (unit = 0; unit < num_possible_cpus(); unit++) >> +for (unit = 0; unit < num_possible_cpus(); unit++) { >> +unsigned int cpu = 
ai->groups[0].cpu_map[unit]; >> for (i = 0; i < unit_pages; i++) { >> -unsigned int cpu = ai->groups[0].cpu_map[unit]; >> void *ptr; >> >> ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE); >> if (!ptr) { >> pr_warn("failed to allocate %s page for >> cpu%u\n", >> -psize_str, cpu); >> +psize_str, cpu); > > And stop making gratuitous changes? > This change is just to look nicer; @cpu can be determined once in the outer loop. > Thanks. >
Re: [RFC v2 PATCH] mm/percpu.c: simplify grouping CPU algorithm
On 2016/10/14 7:37, Tejun Heo wrote:
> Hello, Zijun.
>
> On Tue, Oct 11, 2016 at 08:48:45PM +0800, zijun_hu wrote:
>> Compared with the original algorithm, both theoretically and in
>> practice, the new one produces the same grouping results; besides, it
>> is more efficient, simpler, and easier to understand.
>
> If the original code wasn't broken and the new code produces the same
> output, I'd really not mess with this code. There simply is no upside
> to messing with this code. It's run once during boot and never a
> noticeable contributor of boot overhead. Maybe the new code is a bit
> simpler and more efficient but the actual benefit is so small that any
> risk would outweigh it.
>
> Thanks.

The main intent of this change is to make the CPU grouping algorithm
easier to understand, especially for newcomers to memory management.
Taking myself as an example, it really took me a long time to
understand it.
Re: [RFC PATCH 1/1] mm/vmalloc.c: correct logic errors when insert vmap_area
Hi Nicholas,

I find that __insert_vmap_area() was introduced by you. Could you offer
comments on this patch, which is related to that function?

Thanks.

On 10/12/2016 10:46 PM, Michal Hocko wrote:
> [Let's CC Nick who has written this code]
>
> On Wed 12-10-16 22:30:13, zijun_hu wrote:
>> From: zijun_hu <zijun...@htc.com>
>>
>> The KVA allocator organizes the allocated vmap_areas in an rbtree. To
>> insert a new vmap_area @i_va into the rbtree, walk the rbtree from the
>> root and compare each vmap_area @t_va met on the tree against @i_va:
>> walk toward the left branch of @t_va if @i_va is lower than @t_va, and
>> toward the right branch if higher; otherwise, handle the error case,
>> since @i_va overlaps @t_va. However, __insert_vmap_area() doesn't
>> follow this procedure correctly; moreover, it includes a meaningless
>> else-if condition and a redundant else branch, as shown by the
>> comments in the code segment below:
>>
>> static void __insert_vmap_area(struct vmap_area *va)
>> {
>> 	/* as an internal interface parameter, vmap_area @va is assumed
>> 	 * to have nonzero size */
>> 	...
>> 	if (va->va_start < tmp->va_end)
>> 		p = &(*p)->rb_left;
>> 	else if (va->va_end > tmp->va_start)
>> 		p = &(*p)->rb_right;
>> 	/* this else-if condition is always true and meaningless, because
>> 	 * va->va_end > va->va_start >= tmp->va_end > tmp->va_start
>> 	 * holds normally */
>> 	else
>> 		BUG();
>> 	/* this BUG() is meaningless too, because it is never reached
>> 	 * normally */
>> 	...
>> }
>>
>> It looks as if the else-if condition and the else branch cancel out.
>> No errors are caused, because a vmap_area @va passed to this internal
>> interface normally doesn't overlap any area already on the rbtree.
>> However, __insert_vmap_area() looks weird and really has several
>> logic errors, as pointed out above, when viewed as a standalone
>> function.
>
> I have tried to read this several times but I am completely lost to
> understand what the actual bug is and how it causes vmap_area sorting
> to misbehave. So is this a correctness issue, performance improvement
> or theoretical fix for an incorrect input?
>
>> Fix by walking the vmap_area rbtree as described above to insert a
>> vmap_area.
>>
>> BTW, (va->va_end == tmp_va->va_start) is considered a legal case,
>> since it indicates that vmap_area @va is a tight left neighbor of
>> @tmp_va.
>>
>> Fixes: db64fe02258f ("mm: rewrite vmap layer")
>> Signed-off-by: zijun_hu <zijun...@htc.com>
>> ---
>>  mm/vmalloc.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 5daf3211b84f..8b80931654b7 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -321,10 +321,10 @@ static void __insert_vmap_area(struct vmap_area *va)
>>
>>  		parent = *p;
>>  		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
>> -		if (va->va_start < tmp_va->va_end)
>> -			p = &(*p)->rb_left;
>> -		else if (va->va_end > tmp_va->va_start)
>> -			p = &(*p)->rb_right;
>> +		if (va->va_end <= tmp_va->va_start)
>> +			p = &(*p)->rb_left;
>> +		else if (va->va_start >= tmp_va->va_end)
>> +			p = &(*p)->rb_right;
>>  		else
>>  			BUG();
>>  	}
>> --
>> 1.9.1
Re: [RFC v2 PATCH] mm/percpu.c: fix panic triggered by BUG_ON() falsely
On 10/13/2016 05:41 AM, Andrew Morton wrote:
> On Tue, 11 Oct 2016 22:00:28 +0800 zijun_hu <zijun...@zoho.com> wrote:
>
>> As shown by pcpu_build_alloc_info(), the number of units within a percpu
>> group is derived by rounding up the number of CPUs within the group to
>> the @upa boundary; therefore, the number of CPUs is not equal to the
>> number of units if it is not aligned to @upa. However,
>> pcpu_page_first_chunk() uses BUG_ON() to assert that the two numbers
>> are equal, so the BUG_ON() may be triggered falsely.
>>
>> To fix this issue, round up the number of CPUs before comparing it with
>> the number of units, and replace the BUG_ON() with a warning plus an
>> error return code, to keep the system alive as much as possible.
>
> Under what circumstances is this triggered? In other words, what are
> the end-user visible effects of the fix?

The BUG_ON() fires when the number of CPUs is not aligned to @upa, even
though that is a normal circumstance in which it should not be
triggered. The aim of this fix is to prevent the BUG_ON() from firing
in that case. See the original code segments below for the reason.

pcpu_build_alloc_info()
{
	...
	for_each_possible_cpu(cpu)
		if (group_map[cpu] == group)
			gi->cpu_map[gi->nr_units++] = cpu;
	gi->nr_units = roundup(gi->nr_units, upa);
	/* the number of CPUs belonging to a group is accumulated into
	 * the relevant @gi->nr_units, which is then rounded up to @upa */
	unit += gi->nr_units;
	...
}

pcpu_page_first_chunk()
{
	...
	ai = pcpu_build_alloc_info(reserved_size, 0, PAGE_SIZE, NULL);
	if (IS_ERR(ai))
		return PTR_ERR(ai);
	BUG_ON(ai->nr_groups != 1);
	BUG_ON(ai->groups[0].nr_units != num_possible_cpus());
	/* there is only one group and all CPUs belong to it, but the
	 * number of CPUs is compared with the number of units directly */
	...
}

As shown by the comments above, ai->groups[0].nr_units should equal
roundup(num_possible_cpus(), upa), not num_possible_cpus() directly.

> I mean, this is pretty old code (isn't it?) so what are you doing that
> triggers this?

I am studying the memory management source; I found the inconsistency
and think the BUG_ON() may be triggered in this unusual but possible
case. It may be a logic error.
Re: [RFC PATCH 1/1] mm/vmalloc.c: correct logic errors when insert vmap_area
On 2016/10/12 22:46, Michal Hocko wrote:
> [Let's CC Nick who has written this code]
>
> On Wed 12-10-16 22:30:13, zijun_hu wrote:
>> From: zijun_hu <zijun...@htc.com>
>>
>> The KVA allocator organizes the allocated vmap_areas in an rbtree. To
>> insert a new vmap_area @i_va into the rbtree, walk the rbtree from the
>> root and compare each vmap_area @t_va met on the tree against @i_va:
>> walk toward the left branch of @t_va if @i_va is lower than @t_va, and
>> toward the right branch if higher; otherwise, handle the error case,
>> since @i_va overlaps @t_va. However, __insert_vmap_area() doesn't
>> follow this procedure correctly; moreover, it includes a meaningless
>> else-if condition and a redundant else branch. It looks as if the
>> else-if condition and the else branch cancel out. No errors are
>> caused, because a vmap_area @va passed to this internal interface
>> normally doesn't overlap any area already on the rbtree. However,
>> __insert_vmap_area() looks weird and really has several logic errors
>> when viewed as a standalone function.
>
> I have tried to read this several times but I am completely lost to
> understand what the actual bug is and how it causes vmap_area sorting
> to misbehave. So is this a correctness issue, performance improvement
> or theoretical fix for an incorrect input?

There are several logic errors in this function. The current code is:

static void __insert_vmap_area(struct vmap_area *va)
{
	...
	if (va->va_start < tmp->va_end)
		p = &(*p)->rb_left;
	else if (va->va_end > tmp->va_start)
		p = &(*p)->rb_right;
	else
		BUG();
	...
}

which is equivalent to the following code:

static void __insert_vmap_area(struct vmap_area *va)
{
	...
	if (va->va_start < tmp->va_end)
		p = &(*p)->rb_left;
	else
		p = &(*p)->rb_right;
	...
}

As shown above, in the current code the condition

	else if (va->va_end > tmp->va_start)

is meaningless, since it is always true, and the else branch

	BUG();

is meaningless too, since it is never reached. It seems there are logic
errors in the function. The code we expect is:

static void __insert_vmap_area(struct vmap_area *va)
{
	...
	if (va->va_end <= tmp_va->va_start)
		p = &(*p)->rb_left;
	else if (va->va_start >= tmp_va->va_end)
		p = &(*p)->rb_right;
	else
		BUG();
	...
}

>> Fix by walking the vmap_area rbtree as described above to insert a
>> vmap_area.
>>
>> BTW, (va->va_end == tmp_va->va_start) is considered a legal case,
>> since it indicates that vmap_area @va is a tight left neighbor of
>> @tmp_va.
>>
>> Fixes: db64fe02258f ("mm: rewrite vmap layer")
>> Signed-off-by: zijun_hu <zijun...@htc.com>
>> ---
>>  mm/vmalloc.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 5daf3211b84f..8b80931654b7 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -321,10 +321,10 @@ static void __insert_vmap_area(struct vmap_area *va)
>>
>>  		parent = *p;
>>  		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
>> -		if (va->va_start < tmp_va->va_end)
>> -			p = &(*p)->rb_left;
>> -		else if (va->va_end > tmp_va->va_start)
>> -			p = &(*p)->rb_right;
>> +		if (va->va_end <= tmp_va->va_start)
>> +			p = &(*p)->rb_left;
>> +		else if (va->va_start >= tmp_va->va_end)
>> +			p = &(*p)->rb_right;
>>  		else
>>  			BUG();
>>  	}
>> --
>> 1.9.1
[RFC PATCH 1/1] mm/vmalloc.c: correct logic errors when insert vmap_area
From: zijun_hu <zijun...@htc.com>

the KVA allocator organizes allocated vmap_areas in an rbtree. in order to
insert a new vmap_area @i_va into the rbtree, walk the rbtree from the
root and compare each vmap_area @t_va met on the rbtree against @i_va;
walk toward the left branch of @t_va if @i_va is lower than @t_va, and the
right branch if higher, otherwise handle it as an error case since @i_va
overlaps @t_va; however, __insert_vmap_area() doesn't follow this
procedure; moreover, it includes a meaningless else if condition and an
unreachable else branch, as shown by the comments in the code segment
below:

static void __insert_vmap_area(struct vmap_area *va)
{
	/* as an internal interface parameter, vmap_area @va is assumed
	 * to have nonzero size */
	...
	if (va->va_start < tmp->va_end)
		p = &(*p)->rb_left;
	else if (va->va_end > tmp->va_start)
		p = &(*p)->rb_right;
	/* this else if condition is always true and meaningless due to
	 * va->va_end > va->va_start >= tmp_va->va_end > tmp_va->va_start
	 * normally */
	else
		BUG();
	/* this BUG() is meaningless too since it is never reached
	 * normally */
	...
}

it looks like the else if condition and else branch cancel each other. no
errors are caused since the vmap_area @va to insert, as an internal
interface parameter, doesn't normally overlap any area already on the
rbtree. however, __insert_vmap_area() looks weird and really has several
logic errors as pointed out above when it is viewed as a separate
function.

fix by walking the vmap_area rbtree as described above to insert a
vmap_area.

BTW, (va->va_end == tmp_va->va_start) is considered a legal case since it
indicates vmap_area @va neighbors @tmp_va tightly on the left.
Fixes: db64fe02258f ("mm: rewrite vmap layer")
Signed-off-by: zijun_hu <zijun...@htc.com>
---
 mm/vmalloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5daf3211b84f..8b80931654b7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -321,10 +321,10 @@ static void __insert_vmap_area(struct vmap_area *va)
 
 		parent = *p;
 		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
-		if (va->va_start < tmp_va->va_end)
-			p = &(*p)->rb_left;
-		else if (va->va_end > tmp_va->va_start)
-			p = &(*p)->rb_right;
+		if (va->va_end <= tmp_va->va_start)
+			p = &(*p)->rb_left;
+		else if (va->va_start >= tmp_va->va_end)
+			p = &(*p)->rb_right;
 		else
 			BUG();
 	}
-- 
1.9.1
[RESEND RFC PATCH v2 1/1] mm/vmalloc.c: simplify /proc/vmallocinfo implementation
From: zijun_hu <zijun...@htc.com>

many seq_file helpers exist for simplifying the implementation of virtual
files, especially for /proc nodes. however, the helpers for iteration
over a list_head are available but aren't adopted to implement
/proc/vmallocinfo currently.

simplify the /proc/vmallocinfo implementation by using the existing
seq_file helpers.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
Changes in v2:
 - the redundant type cast is removed as advised by rient...@google.com
 - commit messages are updated

 mm/vmalloc.c | 27 +++++----------------------
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2481cb4e6b2..e73948afac70 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2574,32 +2574,13 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
 static void *s_start(struct seq_file *m, loff_t *pos)
 	__acquires(&vmap_area_lock)
 {
-	loff_t n = *pos;
-	struct vmap_area *va;
-
 	spin_lock(&vmap_area_lock);
-	va = list_first_entry(&vmap_area_list, typeof(*va), list);
-	while (n > 0 && &va->list != &vmap_area_list) {
-		n--;
-		va = list_next_entry(va, list);
-	}
-	if (!n && &va->list != &vmap_area_list)
-		return va;
-
-	return NULL;
-
+	return seq_list_start(&vmap_area_list, *pos);
 }
 
 static void *s_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	struct vmap_area *va = p, *next;
-
-	++*pos;
-	next = list_next_entry(va, list);
-	if (&next->list != &vmap_area_list)
-		return next;
-
-	return NULL;
+	return seq_list_next(p, &vmap_area_list, pos);
 }
 
 static void s_stop(struct seq_file *m, void *p)
@@ -2634,9 +2615,11 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)
 static int s_show(struct seq_file *m, void *p)
 {
-	struct vmap_area *va = p;
+	struct vmap_area *va;
 	struct vm_struct *v;
 
+	va = list_entry(p, struct vmap_area, list);
+
 	/*
 	 * s_show can encounter race with remove_vm_area, !VM_VM_AREA on
 	 * behalf of vmap area is being tear down or vm_map_ram allocation.
-- 
1.9.1
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 10/12/2016 05:54 PM, Michal Hocko wrote:
> On Wed 12-10-16 16:44:31, zijun_hu wrote:
>> On 10/12/2016 04:25 PM, Michal Hocko wrote:
>>> On Wed 12-10-16 15:24:33, zijun_hu wrote:
> [...]
>>>> i found the following code segments in mm/vmalloc.c
>>>> static struct vmap_area *alloc_vmap_area(unsigned long size,
>>>> 				unsigned long align,
>>>> 				unsigned long vstart, unsigned long vend,
>>>> 				int node, gfp_t gfp_mask)
>>>> {
>>>> 	...
>>>>
>>>> 	BUG_ON(!size);
>>>> 	BUG_ON(offset_in_page(size));
>>>> 	BUG_ON(!is_power_of_2(align));
>>>
>>> See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and
>>> from a quick look they are even unnecessary. So rather than adding more
>>> of those, I think removing those that are not needed is much more
>>> preferred.
>>>
>> i notice that, and the above code segment is used to illustrate that
>> input parameter checking is necessary sometimes
>
> Why do you think it is necessary here?
>
i am sorry for replying late
i don't know whether it is necessary
i just find there are so many sanity checks in the current internal
interfaces
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 10/12/2016 04:25 PM, Michal Hocko wrote:
> On Wed 12-10-16 15:24:33, zijun_hu wrote:
>> On 10/12/2016 02:53 PM, Michal Hocko wrote:
>>> On Wed 12-10-16 08:28:17, zijun_hu wrote:
>>>> On 2016/10/12 1:22, Michal Hocko wrote:
>>>>> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>>>>>> From: zijun_hu <zijun...@htc.com>
>>>>>>
>> should we have a generic discussion whether such patches which consider
>> many boundary or rare conditions are necessary?
>
> In general, I believe that kernel internal interfaces which have no
> userspace exposure shouldn't be cluttered with sanity checks.
>
you are right and i agree with you. but there are many internal
interfaces that perform sanity checks in the current linux sources

>> i found the following code segments in mm/vmalloc.c
>> static struct vmap_area *alloc_vmap_area(unsigned long size,
>> 				unsigned long align,
>> 				unsigned long vstart, unsigned long vend,
>> 				int node, gfp_t gfp_mask)
>> {
>> 	...
>>
>> 	BUG_ON(!size);
>> 	BUG_ON(offset_in_page(size));
>> 	BUG_ON(!is_power_of_2(align));
>
> See a recent Linus rant about BUG_ONs. These BUG_ONs are quite old and
> from a quick look they are even unnecessary. So rather than adding more
> of those, I think removing those that are not needed is much more
> preferred.
>
i notice that, and the above code segment is used to illustrate that
input parameter checking is necessary sometimes

>> should we make the below declarations conventions?
>> 1) when we say 'alignment', it means aligning to a power of 2 value
>>    for example, aligning value @v to @b implies @b is a power of 2;
>>    aligning 10 to 4 is 12
>
> alignment other than power-of-two makes only very limited sense to me.
>
you are right and i agree with you.

>> 2) when we say 'round value @v up/down to boundary @b', it means the
>>    result is a multiple of @b; it doesn't require @b to be a power of 2
>
i will write to linus to ask for opinions on whether we should declare
the meaning of 'align' and 'round up/down' formally and whether such
patches are necessary
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 10/12/2016 02:53 PM, Michal Hocko wrote:
> On Wed 12-10-16 08:28:17, zijun_hu wrote:
>> On 2016/10/12 1:22, Michal Hocko wrote:
>>> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>>>> From: zijun_hu <zijun...@htc.com>
>>>>
>>>> the LSB of a chunk->map element is used for the free/in-use flag of
>>>> an area and the other bits for the offset; the sufficient and
>>>> necessary condition of this usage is that both the size and alignment
>>>> of an area must be even numbers. however, pcpu_alloc() doesn't force
>>>> its @align parameter to be an even number explicitly, so an odd
>>>> @align may cause a series of errors; see the example below for
>>>> concrete descriptions.
>>>
>>> Is or was there any user who would use a different than even (or power
>>> of 2) alignment? If not is this really worth handling?
>>>
>> it seems only a power of 2 alignment except 1 can make sure it works
>> very well; that is a strict limit, maybe this more strict limit should
>> be checked
>
> I fail to see how any other alignment would actually make any sense
> what so ever. Look, I am not a maintainer of this code but adding new
> code to catch something that doesn't make any sense sounds dubious at
> best to me.
>
> I could understand this patch if you see a problem and want to prevent
> it from repeating, but doing these kind of changes just in case sounds
> like a bad idea.
>
thanks for your reply
should we have a generic discussion whether such patches which consider
many boundary or rare conditions are necessary?

i found the following code segments in mm/vmalloc.c
static struct vmap_area *alloc_vmap_area(unsigned long size,
				unsigned long align,
				unsigned long vstart, unsigned long vend,
				int node, gfp_t gfp_mask)
{
	...

	BUG_ON(!size);
	BUG_ON(offset_in_page(size));
	BUG_ON(!is_power_of_2(align));

should we make the below declarations conventions?
1) when we say 'alignment', it means aligning to a power of 2 value
   for example, aligning value @v to @b implies @b is a power of 2;
   aligning 10 to 4 is 12
2) when we say 'round value @v up/down to boundary @b', it means the
   result is a multiple of @b; it doesn't require @b to be a power of 2
Re: [RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocate a odd alignment area
On 2016/10/12 1:22, Michal Hocko wrote:
> On Tue 11-10-16 21:24:50, zijun_hu wrote:
>> From: zijun_hu <zijun...@htc.com>
>>
>> the LSB of a chunk->map element is used for the free/in-use flag of an
>> area and the other bits for the offset; the sufficient and necessary
>> condition of this usage is that both the size and alignment of an area
>> must be even numbers. however, pcpu_alloc() doesn't force its @align
>> parameter to be an even number explicitly, so an odd @align may cause a
>> series of errors; see the example below for concrete descriptions.
>
> Is or was there any user who would use a different than even (or power
> of 2) alignment? If not is this really worth handling?
>
it seems only a power of 2 alignment except 1 can make sure it works very
well; that is a strict limit, maybe this more strict limit should be
checked

i don't know, since there are too many sources and too many users and too
many use cases. even if nobody does, i can't be sure it won't happen in
the future

it is worth it for the below reasons
1) if it is used in right ways, this patch has no impact; otherwise, it
   can alert the user with a warning message and correct the behavior.
   isn't a warning plus correction better than silently resulting in many
   terrible errors under a special case, by chance? it can make the
   program stronger.
2) does any alignment but 1 mean a power of 2 alignment conventionally
   and implicitly? if not, isn't it better to adjust both @align and
   @size uniformly based on the sufficient and necessary condition than
   to mix supposing one part is right and correcting the other?
   i find that there is a BUG_ON(!is_power_of_2(align)) statement in
   mm/vmalloc.c
3) this simple fix can make the function applicable in a wider range; it
   hints to the reader that the lowest requirement for alignment is an
   even number
4) for
   char a[10][10];
   char (*p)[10];
   if a user wants to allocate a @size = 10 and @align = 10 memory block,
   should we reject the user's request?
Re: [RFC PATCH] mm/percpu.c: fix panic triggered by BUG_ON() falsely
Hi all,
please ignore this patch since it includes a build error
i resend the fixed patch in v2 version
i am sorry for my incaution

On 2016/10/11 21:03, zijun_hu wrote:
> From: zijun_hu <zijun...@htc.com>
>
> as shown by pcpu_build_alloc_info(), the number of units within a percpu
> group is deduced by rounding up the number of CPUs within the group to
> the @upa boundary; therefore, the number of CPUs isn't equal to the
> units' if it isn't aligned to @upa normally. however,
> pcpu_page_first_chunk() uses BUG_ON() to assert that one number equals
> the other roughly, so a panic may be triggered by the BUG_ON() falsely.
>
> in order to fix this issue, the number of CPUs is rounded up then
> compared with the units'; the BUG_ON() is replaced by warning and
> returning an error code as well, to keep the system alive as much as
> possible.
>
> Signed-off-by: zijun_hu <zijun...@htc.com>
> ---
> mm/percpu.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 32e2d8d128c1..c2f0d9734d8c 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2095,6 +2095,8 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
>  	size_t pages_size;
>  	struct page **pages;
>  	int unit, i, j, rc;
> +	int upa;
> +	int nr_g0_units;
>
>  	snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);
>
> @@ -2102,7 +2104,12 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
>  	if (IS_ERR(ai))
>  		return PTR_ERR(ai);
>  	BUG_ON(ai->nr_groups != 1);
> -	BUG_ON(ai->groups[0].nr_units != num_possible_cpus());
> +	upa = ai->alloc_size/ai->unit_size;
> +	g0_nr_units = roundup(num_possible_cpus(), upa);
> +	if (unlikely(WARN_ON(ai->groups[0].nr_units != nr_g0_units))) {
> +		pcpu_free_alloc_info(ai);
> +		return -EINVAL;
> +	}
>
>  	unit_pages = ai->unit_size >> PAGE_SHIFT;
>
> @@ -2113,21 +2120,22 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
>
>  	/* allocate pages */
>  	j = 0;
> -	for (unit = 0; unit < num_possible_cpus(); unit++)
> +	for (unit = 0; unit < num_possible_cpus(); unit++) {
> +		unsigned int cpu = ai->groups[0].cpu_map[unit];
>  		for (i = 0; i < unit_pages; i++) {
> -			unsigned int cpu = ai->groups[0].cpu_map[unit];
>  			void *ptr;
>
>  			ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE);
>  			if (!ptr) {
>  				pr_warn("failed to allocate %s page for cpu%u\n",
> -					psize_str, cpu);
> +						psize_str, cpu);
>  				goto enomem;
>  			}
>  			/* kmemleak tracks the percpu allocations separately */
>  			kmemleak_free(ptr);
>  			pages[j++] = virt_to_page(ptr);
>  		}
> +	}
>
>  	/* allocate vm area, map the pages and copy static data */
>  	vm.flags = VM_ALLOC;
>
[RFC v2 PATCH] mm/percpu.c: fix panic triggered by BUG_ON() falsely
From: zijun_hu <zijun...@htc.com>

as shown by pcpu_build_alloc_info(), the number of units within a percpu
group is deduced by rounding up the number of CPUs within the group to
the @upa boundary; therefore, the number of CPUs isn't equal to the number
of units if it isn't aligned to @upa. however, pcpu_page_first_chunk()
uses BUG_ON() to assert that one number equals the other, so a panic may
be triggered by the BUG_ON() falsely.

in order to fix this issue, the number of CPUs is rounded up then compared
with the number of units, and the BUG_ON() is replaced by a warning and an
error return as well, to keep the system alive as much as possible.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
Changes in v2:
 - fix build error

 mm/percpu.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 32e2d8d128c1..ab1186c68ab6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2095,6 +2095,8 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
 	size_t pages_size;
 	struct page **pages;
 	int unit, i, j, rc;
+	int upa;
+	int nr_g0_units;

 	snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);

@@ -2102,7 +2104,12 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
 	if (IS_ERR(ai))
 		return PTR_ERR(ai);
 	BUG_ON(ai->nr_groups != 1);
-	BUG_ON(ai->groups[0].nr_units != num_possible_cpus());
+	upa = ai->alloc_size/ai->unit_size;
+	nr_g0_units = roundup(num_possible_cpus(), upa);
+	if (unlikely(WARN_ON(ai->groups[0].nr_units != nr_g0_units))) {
+		pcpu_free_alloc_info(ai);
+		return -EINVAL;
+	}

 	unit_pages = ai->unit_size >> PAGE_SHIFT;

@@ -2113,21 +2120,22 @@ int __init pcpu_page_first_chunk(size_t reserved_size,

 	/* allocate pages */
 	j = 0;
-	for (unit = 0; unit < num_possible_cpus(); unit++)
+	for (unit = 0; unit < num_possible_cpus(); unit++) {
+		unsigned int cpu = ai->groups[0].cpu_map[unit];
 		for (i = 0; i < unit_pages; i++) {
-			unsigned int cpu = ai->groups[0].cpu_map[unit];
 			void *ptr;

 			ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE);
 			if (!ptr) {
 				pr_warn("failed to allocate %s page for cpu%u\n",
-					psize_str, cpu);
+						psize_str, cpu);
 				goto enomem;
 			}
 			/* kmemleak tracks the percpu allocations separately */
 			kmemleak_free(ptr);
 			pages[j++] = virt_to_page(ptr);
 		}
+	}

 	/* allocate vm area, map the pages and copy static data */
 	vm.flags = VM_ALLOC;
--
1.9.1
[RFC PATCH 1/1] mm/percpu.c: fix several trivial issues
From: zijun_hu <zijun...@htc.com>

as shown by pcpu_setup_first_chunk(), the first chunk is the same as the
reserved chunk if the reserved size is nonzero but the dynamic size is
zero; this special scenario is referred to as the special case below.

fix several trivial issues:

1) correct or fix several comments
   the LSB of a chunk->map element is used as the free/in-use flag: it is
   cleared for a free area and set for an in-use one, rather than using a
   positive/negative number to mark the area state.

2) change the atomic size to PAGE_SIZE for consistency when CONFIG_SMP == n
   both the default setup_per_cpu_areas() and pcpu_page_first_chunk() use
   PAGE_SIZE as the atomic size when CONFIG_SMP == y; however, when
   CONFIG_SMP == n, setup_per_cpu_areas() allocates memory for the only
   unit with PAGE_SIZE alignment but assigns the unit size to the atomic
   size, so the atomic size is consistent with neither the alignment nor
   the SMP case. fix it by changing the atomic size to PAGE_SIZE when
   CONFIG_SMP == n.

3) correct an empty and populated pages statistics error
   in order to service dynamic atomic memory allocation, the number of
   empty and populated pages of chunks is counted to keep it above a low
   threshold. however, for the special case, the first chunk is taken into
   account by pcpu_setup_first_chunk(), which is meaningless since that
   chunk doesn't include any dynamic areas. fix it by excluding the
   reserved chunk before counting, as the other contexts do.

4) fix potential memory leakage in percpu_init_late()
   in order to manage chunk->map memory uniformly, for the first and
   reserved chunks, percpu_init_late() allocates memory to replace the
   static chunk->map arrays within section .init.data after slab is
   brought up. however, for the special case, memory is allocated for the
   same chunk->map twice since the first chunk reference is the same as
   the reserved one, so the memory allocated the first time is obviously
   leaked. fix it by eliminating the second memory allocation under the
   special case.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
 mm/percpu.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 26d1c73bd9e2..760a4730f6fd 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -36,8 +36,8 @@
  * chunk maps unnecessarily.
  *
  * Allocation state in each chunk is kept using an array of integers
- * on chunk->map.  A positive value in the map represents a free
- * region and negative allocated.  Allocation inside a chunk is done
+ * on chunk->map.  A LSB cleared value in the map represents a free
+ * region and set allocated.  Allocation inside a chunk is done
  * by scanning this map sequentially and serving the first matching
  * entry.  This is mostly copied from the percpu_modalloc() allocator.
  * Chunks can be determined from the address using the index field
@@ -93,9 +93,9 @@
 #endif
 #ifndef __pcpu_ptr_to_addr
 #define __pcpu_ptr_to_addr(ptr)					\
-	(void __force *)((unsigned long)(ptr) +			\
-			 (unsigned long)pcpu_base_addr -	\
-			 (unsigned long)__per_cpu_start)
+	(void __force *)((unsigned long)(ptr) -			\
+			 (unsigned long)__per_cpu_start +	\
+			 (unsigned long)pcpu_base_addr)
 #endif
 #else	/* CONFIG_SMP */
 /* on UP, it's always identity mapped */
@@ -583,6 +583,11 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align,
 		 * merge'em.  Note that 'small' is defined as smaller
 		 * than sizeof(int), which is very small but isn't too
 		 * uncommon for percpu allocations.
+		 *
+		 * it is unnecessary to append i > 0 to make sure p[-1]
+		 * available: the offset of the first area is 0 and is aligned
+		 * very well, no head free area is left if segments one piece
+		 * from it
 		 */
 		if (head && (head < sizeof(int) || !(p[-1] & 1))) {
 			*p = off += head;
@@ -1707,8 +1712,9 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,

 	/* link the first chunk in */
 	pcpu_first_chunk = dchunk ?: schunk;
-	pcpu_nr_empty_pop_pages +=
-		pcpu_count_occupied_pages(pcpu_first_chunk, 1);
+	if (pcpu_first_chunk != pcpu_reserved_chunk)
+		pcpu_nr_empty_pop_pages +=
+			pcpu_count_occupied_pages(pcpu_first_chunk, 1);
 	pcpu_chunk_relocate(pcpu_first_chunk, -1);

 	/* we're done */
@@ -2266,7 +2272,7 @@ void __init setup_per_cpu_areas(void)

 	ai->dyn_size = unit_size;
 	ai->unit_size = unit_size;
-	ai->atom_size = unit_size;
+	ai->atom_size = PAGE_SIZE;
 	ai->alloc_size = unit_size;
 	ai->groups[0].nr_units = 1;
 	ai->groups[0].cpu_map[0] = 0
[RFC PATCH 1/1] mm/percpu.c: fix memory leakage issue when allocating an odd alignment area
From: zijun_hu <zijun...@htc.com>

the LSB of a chunk->map element is used as the free/in-use flag of an area
and the other bits for the offset; the sufficient and necessary condition
for this usage is that both the size and the alignment of an area must be
even numbers.

however, pcpu_alloc() doesn't force its @align parameter to an even number
explicitly, so an odd @align may cause a series of errors; see the below
example for a concrete description.

let's assume area [16, 36) is free but its previous one is in-use, and we
want to allocate an area with @size == 8 and @align == 7. the larger area
[16, 36) is split into three areas [16, 21), [21, 29), [29, 36) eventually.
however, due to the usage of a chunk->map element, the actual offset of the
target area [21, 29) is 21 but it is recorded in the relevant element as
20; moreover, the residual tail free area [29, 36) is mistaken as in-use
and is lost silently.

as explained above, either an inaccurate offset or a wrong free/in-use
state of an area is recorded into the relevant chunk->map element when an
odd-alignment area is requested, which causes a memory leakage issue.

fix it by forcing the @align of an area to allocate to an even number, as
is done for @size.

BTW, the macro ALIGN() within pcpu_fit_in_area() is replaced by roundup()
too, for the same reason: in order to align a value @v up to an @a
boundary, the macro roundup(v, a) is more generic than ALIGN(x, a); the
latter doesn't work well when @a isn't a power of 2 value. for example,
roundup(10, 6) == 12 but ALIGN(10, 6) == 10; the former result is
obviously the desired one.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
 include/linux/kernel.h | 1 +
 mm/percpu.c            | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 74fd6f05bc5b..ddf46638ef21 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -45,6 +45,7 @@

 #define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))

+/* @a is a power of 2 value */
 #define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
 #define __ALIGN_MASK(x, mask)	__ALIGN_KERNEL_MASK((x), (mask))
 #define PTR_ALIGN(p, a)		((typeof(p))ALIGN((unsigned long)(p), (a)))
diff --git a/mm/percpu.c b/mm/percpu.c
index c2f0d9734d8c..26d1c73bd9e2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -502,7 +502,7 @@ static int pcpu_fit_in_area(struct pcpu_chunk *chunk, int off, int this_size,
 	int cand_off = off;

 	while (true) {
-		int head = ALIGN(cand_off, align) - off;
+		int head = roundup(cand_off, align) - off;
 		int page_start, page_end, rs, re;

 		if (this_size < head + size)
@@ -879,11 +879,13 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,

 	/*
 	 * We want the lowest bit of offset available for in-use/free
-	 * indicator, so force >= 16bit alignment and make size even.
+	 * indicator, so force alignment >= 2 even and make size even.
 	 */
 	if (unlikely(align < 2))
 		align = 2;

+	if (WARN_ON_ONCE(!IS_ALIGNED(align, 2)))
+		align = ALIGN(align, 2);
 	size = ALIGN(size, 2);

 	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE)) {
--
1.9.1
[RFC PATCH] mm/percpu.c: fix panic triggered by BUG_ON() falsely
From: zijun_hu <zijun...@htc.com>

as shown by pcpu_build_alloc_info(), the number of units within a percpu
group is deduced by rounding up the number of CPUs within the group to
the @upa boundary; therefore, the number of CPUs isn't equal to the number
of units if it isn't aligned to @upa. however, pcpu_page_first_chunk()
uses BUG_ON() to assert that one number equals the other, so a panic may
be triggered by the BUG_ON() falsely.

in order to fix this issue, the number of CPUs is rounded up then compared
with the number of units, and the BUG_ON() is replaced by a warning and an
error return as well, to keep the system alive as much as possible.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
 mm/percpu.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 32e2d8d128c1..c2f0d9734d8c 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2095,6 +2095,8 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
 	size_t pages_size;
 	struct page **pages;
 	int unit, i, j, rc;
+	int upa;
+	int nr_g0_units;

 	snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);

@@ -2102,7 +2104,12 @@ int __init pcpu_page_first_chunk(size_t reserved_size,
 	if (IS_ERR(ai))
 		return PTR_ERR(ai);
 	BUG_ON(ai->nr_groups != 1);
-	BUG_ON(ai->groups[0].nr_units != num_possible_cpus());
+	upa = ai->alloc_size/ai->unit_size;
+	g0_nr_units = roundup(num_possible_cpus(), upa);
+	if (unlikely(WARN_ON(ai->groups[0].nr_units != nr_g0_units))) {
+		pcpu_free_alloc_info(ai);
+		return -EINVAL;
+	}

 	unit_pages = ai->unit_size >> PAGE_SHIFT;

@@ -2113,21 +2120,22 @@ int __init pcpu_page_first_chunk(size_t reserved_size,

 	/* allocate pages */
 	j = 0;
-	for (unit = 0; unit < num_possible_cpus(); unit++)
+	for (unit = 0; unit < num_possible_cpus(); unit++) {
+		unsigned int cpu = ai->groups[0].cpu_map[unit];
 		for (i = 0; i < unit_pages; i++) {
-			unsigned int cpu = ai->groups[0].cpu_map[unit];
 			void *ptr;

 			ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE);
 			if (!ptr) {
 				pr_warn("failed to allocate %s page for cpu%u\n",
-					psize_str, cpu);
+						psize_str, cpu);
 				goto enomem;
 			}
 			/* kmemleak tracks the percpu allocations separately */
 			kmemleak_free(ptr);
 			pages[j++] = virt_to_page(ptr);
 		}
+	}

 	/* allocate vm area, map the pages and copy static data */
 	vm.flags = VM_ALLOC;
--
1.9.1
Re: [RFC v2 PATCH] mm/percpu.c: simplify grouping CPU algorithm
On 2016/10/11 20:48, zijun_hu wrote:
> From: zijun_hu <zijun...@htc.com>
> in order to verify the new algorithm, we enumerate many pairs of a
> @pcpu_fc_cpu_distance_fn_t function and the relevant CPU IDs array such
> as the below sample, then apply both algorithms to the same pair and
> print the grouping results separately; the new algorithm is okay after
> checking that the result printed by the new one is the same as the
> original.
> a sample pair of function and array is shown as follows:
> /* group CPUs by even/odd number */
> static int cpu_distance_fn0(int from, int to)
> {
> 	if (from % 2 ^ to % 2)
> 		return REMOTE_DISTANCE;
> 	else
> 		return LOCAL_DISTANCE;
> }
> /* end with -1 */
> int cpu_ids_0[] = {0, 1, 2, 3, 7, 8, 9, 11, 14, 17, 19, 20, 22, 24, -1};
>
> Signed-off-by: zijun_hu <zijun...@htc.com>
> Tested-by: zijun_hu <zijun...@htc.com>
> ---

how to test the new grouping CPU algorithm?
1) copy the test source code to a file named percpu_group_test.c
2) compile the test program with the below command line
   gcc -Wall -o percpu_group_test percpu_group_test.c
3) get usage info about percpu_group_test
   ./percpu_group_test -h
4) produce the grouping result by the new algorithm
   ./percpu_group_test new > percpu.new
5) produce the grouping result by the original algorithm
   ./percpu_group_test orig > percpu.orig
6) examine the test result; okay if the results are the same, otherwise
   failed
   diff -u percpu.new percpu.orig

the test program source is shown as follows:

#include #include #include #include #include #include #include

#define LOCAL_DISTANCE	10
#define REMOTE_DISTANCE	20
#define NR_CPUS		96

static int nr_groups;
static int group_map[NR_CPUS];
static int group_cnt[NR_CPUS];
static int *cpu_ids_ptr;
static int (*cpu_distance_fn)(int from, int to);

static int cpu_distance_fn0(int from, int to);
extern int cpu_ids_0[];
static int cpu_distance_fn1(int from, int to);
extern int cpu_ids_1[];
static int cpu_distance_fn2(int from, int to);
extern int cpu_ids_2[];

int (*cpu_distance_funcx[])(int, int) = {
	cpu_distance_fn0, cpu_distance_fn1, cpu_distance_fn2, NULL };
int *cpu_ids_x[] = { cpu_ids_0, cpu_ids_1, cpu_ids_2, NULL };

static void percpu_test_prepare(int test_type)
{
	nr_groups = 0;
	memset(group_map, 0xff, sizeof(group_map));
	memset(group_cnt, 0xff, sizeof(group_cnt));
	cpu_ids_ptr = cpu_ids_x[test_type];
	cpu_distance_fn = cpu_distance_funcx[test_type];
}

static int next_cpu(int cpu)
{
	int i = 0;

	while (cpu_ids_ptr[i] != cpu)
		i++;
	return cpu_ids_ptr[i + 1];
}

#define for_each_possible_cpu(cpu)	\
	for ((cpu) = cpu_ids_ptr[0];	\
	     (cpu) != -1;		\
	     (cpu) = next_cpu((cpu)))

#define max(v0, v1) ((v0) > (v1) ? (v0) : (v1))

static void percpu_result_printf(void)
{
	int g;
	int c;

	printf("nr_groups = %d\n", nr_groups);
#if 0
	for_each_possible_cpu(c)
		if (group_map[c] != -1)
			printf("group_map[%d] = %d\n", c, group_map[c]);
	for (g = 0; g < nr_groups; g++)
		printf("group_cnt[%d] = %d\n", g, group_cnt[g]);
#else
	for (g = 0; g < nr_groups; g++) {
		printf("group id %d : ", g);
		for_each_possible_cpu(c)
			if (group_map[c] == g)
				printf("%d ", c);
		printf("\n");
	}
	printf("\n");
#endif
}

/*
 * group cpus by even or odd ids
 */
static int cpu_distance_fn0(int from, int to)
{
	if (from % 2 ^ to % 2)
		return REMOTE_DISTANCE;
	else
		return LOCAL_DISTANCE;
}
/* end with -1 */
int cpu_ids_0[] = { 0, 1, 2, 3, 7, 8, 9, 11, 14, 17, 19, 20, 22, 24, 31, -1 };

/*
 * group cpus by 3x of cpu ids
 */
int cpu_distance_fn1(int from, int to)
{
	if (from % 3 == 0 && to % 3 == 0)
		return LOCAL_DISTANCE;
	else if (from % 3 && to % 3)
		return LOCAL_DISTANCE;
	else
		return REMOTE_DISTANCE;
}
int cpu_ids_1[] = { 0, 3, 5, 6, 8, 9, 10, 11, 12, 14, 17, 18, 21, 24, 25, -1 };

/*
 * group cpus by range, [..., 10), [10, 20), [20, ...)
 */
int cpu_distance_fn2(int from, int to)
{
	if (from < 10 && to < 10)
		return LOCAL_DISTANCE;
	else if (from >= 20 && to >= 20)
		return LOCAL_DISTANCE;
	else if ((from >= 10 && from < 20) && (to >= 10 && to < 20))
		return LOCAL_DISTANCE;
	else
		return REMOTE_DISTANCE;
}
int cpu_ids_2[] = { 0, 1, 2, 4, 6, 8, 9, 10, 12, 15, 16, 18, 19, 20, 22, 25, 27, -1 };

void orig_group_cpus(
[RFC v2 PATCH] mm/percpu.c: simplify grouping CPU algorithm
From: zijun_hu <zijun...@htc.com>

pcpu_build_alloc_info() groups CPUs with relevant proximity together and allocates memory for each percpu unit based on those groups. However, the grouping algorithm consists of three loops plus a goto statement, which is inefficient and difficult to understand.

Simplify the algorithm to two loops with no goto statement: in order to assign a group number to a target CPU, check whether it can share a group with any lower-index CPU; if so, reuse that group number; otherwise, assign it a new one.

Both theoretically and practically, the new algorithm produces the same grouping results as the original; besides, it is more efficient, simpler, and easier to understand.

in order to verify the new algorithm, we enumerate many pairs of type @pcpu_fc_cpu_distance_fn_t function and the relevant CPU IDs array such below sample, then apply both algorithms to the same pair and print the grouping results separately, the new algorithm is okay after checking whether the result printed from the new one is same with the original.

a sample pair of function and array format is shown as follows:

/* group CPUs by even/odd number */
static int cpu_distance_fn0(int from, int to)
{
        if (from % 2 ^ to % 2)
                return REMOTE_DISTANCE;
        else
                return LOCAL_DISTANCE;
}

/* end with -1 */
int cpu_ids_0[] = {0, 1, 2, 3, 7, 8, 9, 11, 14, 17, 19, 20, 22, 24, -1};

Signed-off-by: zijun_hu <zijun...@htc.com>
Tested-by: zijun_hu <zijun...@htc.com>
---
Changes in v2:
  - update commit messages

 mm/percpu.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 255714302394..32e2d8d128c1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1824,23 +1824,25 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
         max_upa = upa;
 
         /* group cpus according to their proximity */
-        for_each_possible_cpu(cpu) {
-                group = 0;
-        next_group:
+        group = 0;
+        for_each_possible_cpu(cpu)
                 for_each_possible_cpu(tcpu) {
-                        if (cpu == tcpu)
-                                break;
-                        if (group_map[tcpu] == group && cpu_distance_fn &&
-                            (cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
-                             cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
+                        if (tcpu == cpu) {
+                                group_map[cpu] = group;
+                                group_cnt[group] = 1;
                                 group++;
-                                nr_groups = max(nr_groups, group + 1);
-                                goto next_group;
+                                break;
+                        }
+
+                        if (!cpu_distance_fn ||
+                            (cpu_distance_fn(cpu, tcpu) == LOCAL_DISTANCE &&
+                             cpu_distance_fn(tcpu, cpu) == LOCAL_DISTANCE)) {
+                                group_map[cpu] = group_map[tcpu];
+                                group_cnt[group_map[cpu]]++;
+                                break;
                         }
                 }
-                group_map[cpu] = group;
-                group_cnt[group]++;
-        }
+        nr_groups = group;
 
         /*
          * Expand unit size until address space usage goes over 75%
-- 
1.9.1
[RFC PATCH v3 2/2] mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
From: zijun_hu <zijun...@htc.com>

To ensure that the percpu group areas within a chunk aren't distributed too sparsely, pcpu_embed_first_chunk() takes the error handling path when a chunk spans more than 3/4 of the VMALLOC area. However, during that error handling it forgets to free the memory allocated for the percpu groups, because it jumps to label @out_free rather than @out_free_areas. This leaks memory if this rare scenario ever happens.

To fix the issue, check the chunk's spanned area immediately after completing memory allocation for all percpu groups, and jump to label @out_free_areas to free the memory before returning if the check fails.

To verify the approach, we dump all memory allocated, force the jump, then dump all memory freed; the result is okay after checking that everything allocated in this function is freed.

BTW, this approach was chosen after considering the following:
 - we don't jump to label @out_free directly to fix this issue, since
   that could free several allocated memory blocks twice
 - the jump taken after pcpu_setup_first_chunk() is meant to skip
   freeing usable memory rather than to handle an error; moreover, that
   function never returns an error code: it either panics due to
   BUG_ON() or returns 0

Signed-off-by: zijun_hu <zijun...@htc.com>
Tested-by: zijun_hu <zijun...@htc.com>
---
Changes in v3:
  - commit message is updated
  - more descriptive local variable name highest_group is used

Changes in v2:
  - more detailed commit message is provided as discussed
    with t...@kernel.org

 mm/percpu.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index e2737e56b017..255714302394 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1963,7 +1963,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
         struct pcpu_alloc_info *ai;
         size_t size_sum, areas_size;
         unsigned long max_distance;
-        int group, i, rc;
+        int group, i, highest_group, rc;
 
         ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
                                    cpu_distance_fn);
@@ -1979,7 +1979,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                 goto out_free;
         }
 
-        /* allocate, copy and determine base address */
+        /* allocate, copy and determine base address & max_distance */
+        highest_group = 0;
         for (group = 0; group < ai->nr_groups; group++) {
                 struct pcpu_group_info *gi = &ai->groups[group];
                 unsigned int cpu = NR_CPUS;
@@ -2000,6 +2001,21 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                 areas[group] = ptr;
 
                 base = min(ptr, base);
+                if (ptr > areas[highest_group])
+                        highest_group = group;
+        }
+        max_distance = areas[highest_group] - base;
+        max_distance += ai->unit_size * ai->groups[highest_group].nr_units;
+
+        /* warn if maximum distance is further than 75% of vmalloc space */
+        if (max_distance > VMALLOC_TOTAL * 3 / 4) {
+                pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
+                        max_distance, VMALLOC_TOTAL);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+                /* and fail if we have fallback */
+                rc = -EINVAL;
+                goto out_free_areas;
+#endif
         }
 
         /*
@@ -2024,24 +2040,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
         }
 
         /* base address is now known, determine group base offsets */
-        i = 0;
         for (group = 0; group < ai->nr_groups; group++) {
                 ai->groups[group].base_offset = areas[group] - base;
-                if (areas[group] > areas[i])
-                        i = group;
-        }
-        max_distance = ai->groups[i].base_offset +
-                ai->unit_size * ai->groups[i].nr_units;
-
-        /* warn if maximum distance is further than 75% of vmalloc space */
-        if (max_distance > VMALLOC_TOTAL * 3 / 4) {
-                pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
-                        max_distance, VMALLOC_TOTAL);
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-                /* and fail if we have fallback */
-                rc = -EINVAL;
-                goto out_free;
-#endif
         }
 
         pr_info("Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
-- 
1.9.1
[RFC PATCH v2 1/2] mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()
From: zijun_hu <zijun...@htc.com>

pcpu_embed_first_chunk() calculates the range a percpu chunk spans into @max_distance and uses it to ensure that a chunk is not too big compared to the total vmalloc area. However, during the calculation, it uses an incorrect top address by adding a unit size to the highest group's base address. This can make the calculated max_distance slightly smaller than the actual distance, although given the scale of the values involved the error is very unlikely to have an actual impact.

Fix this issue by adding the group's size instead of a unit size.

BTW, the type of variable max_distance is changed from size_t to unsigned long too, based on the following considerations:
 - unsigned long usually has the same width as the IP core registers
   and can be applied here very well
 - it makes the type of @max_distance consistent with the operands it
   is calculated against, such as @ai->groups[i].base_offset and the
   macro VMALLOC_TOTAL
 - unsigned long is more universal than size_t; size_t is usually
   typedef'd to unsigned int or unsigned long, varying among ARCHs

Signed-off-by: zijun_hu <zijun...@htc.com>
---
Changes in v2:
  - commit messages offered by Tejun are applied
  - redundant type cast is removed

 mm/percpu.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 9903830aaebb..e2737e56b017 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1961,7 +1961,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
         void *base = (void *)ULONG_MAX;
         void **areas = NULL;
         struct pcpu_alloc_info *ai;
-        size_t size_sum, areas_size, max_distance;
+        size_t size_sum, areas_size;
+        unsigned long max_distance;
         int group, i, rc;
 
         ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
@@ -2023,17 +2024,18 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
         }
 
         /* base address is now known, determine group base offsets */
-        max_distance = 0;
+        i = 0;
         for (group = 0; group < ai->nr_groups; group++) {
                 ai->groups[group].base_offset = areas[group] - base;
-                max_distance = max_t(size_t, max_distance,
-                                     ai->groups[group].base_offset);
+                if (areas[group] > areas[i])
+                        i = group;
         }
-        max_distance += ai->unit_size;
+        max_distance = ai->groups[i].base_offset +
+                ai->unit_size * ai->groups[i].nr_units;
 
         /* warn if maximum distance is further than 75% of vmalloc space */
         if (max_distance > VMALLOC_TOTAL * 3 / 4) {
-                pr_warn("max_distance=0x%zx too large for vmalloc space 0x%lx\n",
+                pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
                         max_distance, VMALLOC_TOTAL);
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
                 /* and fail if we have fallback */
-- 
1.9.1
Re: [PATCH v2 1/1] mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
Hi Tejun,

As we discussed, I have included some of the discussion content in the commit message. Could you give some new comments or an acknowledgment for this patch?

On 2016/9/30 19:30, zijun_hu wrote:
> From: zijun_hu <zijun...@htc.com>
>
> it will cause memory leakage for pcpu_embed_first_chunk() to go to
> label @out_free if the chunk spans over 3/4 VMALLOC area. all memory
> are allocated and recorded into array @areas for each CPU group, but
> the memory allocated aren't be freed before returning after going to
> label @out_free.
>
> in order to fix this bug, we check chunk spanned area immediately
> after completing memory allocation for all CPU group, we go to label
> @out_free_areas other than @out_free to free all memory allocated if
> the checking is failed.
>
> in order to verify the approach, we dump all memory allocated then
> enforce the jump then dump all memory freed, the result is okay after
> checking whether we free all memory we allocate in this function.
>
> BTW, The approach is chosen after thinking over the below scenes
> - we don't go to label @out_free directly to fix this issue since we
>   maybe free several allocated memory blocks twice
> - the aim of jumping after pcpu_setup_first_chunk() is bypassing free
>   usable memory other than handling error, moreover, the function does
>   not return error code in any case, it either panics due to BUG_ON()
>   or return 0.
>
> Signed-off-by: zijun_hu <zijun...@htc.com>
> Tested-by: zijun_hu <zijun...@htc.com>
> ---
> this patch is based on mmotm/linux-next branch so can be
> applied to them directly
>
> Changes in v2:
> - more detailed commit message is provided as discussed
>   with t...@kernel.org
>
>  mm/percpu.c | 36 ++++++++++++++++++------------------
>  1 file changed, 18 insertions(+), 18 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 41d9d0b35801..7a5dae185ce1 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1963,7 +1963,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>          struct pcpu_alloc_info *ai;
>          size_t size_sum, areas_size;
>          unsigned long max_distance;
> -        int group, i, rc;
> +        int group, i, j, rc;
>  
>          ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
>                                     cpu_distance_fn);
> @@ -1979,7 +1979,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>                  goto out_free;
>          }
>  
> -        /* allocate, copy and determine base address */
> +        /* allocate, copy and determine base address & max_distance */
> +        j = 0;
>          for (group = 0; group < ai->nr_groups; group++) {
>                  struct pcpu_group_info *gi = &ai->groups[group];
>                  unsigned int cpu = NR_CPUS;
> @@ -2000,6 +2001,21 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>                  areas[group] = ptr;
>  
>                  base = min(ptr, base);
> +                if (ptr > areas[j])
> +                        j = group;
> +        }
> +        max_distance = areas[j] - base;
> +        max_distance += ai->unit_size * ai->groups[j].nr_units;
> +
> +        /* warn if maximum distance is further than 75% of vmalloc space */
> +        if (max_distance > VMALLOC_TOTAL * 3 / 4) {
> +                pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
> +                        max_distance, VMALLOC_TOTAL);
> +#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
> +                /* and fail if we have fallback */
> +                rc = -EINVAL;
> +                goto out_free_areas;
> +#endif
>          }
>  
>          /*
> @@ -2024,24 +2040,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>          }
>  
>          /* base address is now known, determine group base offsets */
> -        i = 0;
>          for (group = 0; group < ai->nr_groups; group++) {
>                  ai->groups[group].base_offset = areas[group] - base;
> -                if (areas[group] > areas[i])
> -                        i = group;
> -        }
> -        max_distance = ai->groups[i].base_offset +
> -                (unsigned long)ai->unit_size * ai->groups[i].nr_units;
> -
> -        /* warn if maximum distance is further than 75% of vmalloc space */
> -        if (max_distance > VMALLOC_TOTAL * 3 / 4) {
> -                pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
> -                        max_distance, VMALLOC_TOTAL);
> -#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
> -                /* and fail if we have fallback */
> -                rc = -EINVAL;
> -                goto out_free;
> -#endif
>          }
>  
>          pr_info("Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
[PATCH v2 1/1] mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
From: zijun_hu <zijun...@htc.com>

pcpu_embed_first_chunk() leaks memory when it goes to label @out_free because the chunk spans more than 3/4 of the VMALLOC area: memory has been allocated and recorded into array @areas for each CPU group, but none of it is freed before returning via @out_free.

In order to fix this bug, check the chunk's spanned area immediately after completing memory allocation for all CPU groups, and go to label @out_free_areas rather than @out_free to free all allocated memory if the check fails.

In order to verify the approach, we dump all memory allocated, force the jump, then dump all memory freed; the result is okay after checking that everything allocated in this function is freed.

BTW, this approach was chosen after considering the following:
 - we don't go to label @out_free directly to fix this issue, since
   that could free several allocated memory blocks twice
 - the jump taken after pcpu_setup_first_chunk() is meant to skip
   freeing usable memory rather than to handle an error; moreover, that
   function never returns an error code: it either panics due to
   BUG_ON() or returns 0

Signed-off-by: zijun_hu <zijun...@htc.com>
Tested-by: zijun_hu <zijun...@htc.com>
---
this patch is based on the mmotm/linux-next branch so it can be
applied to them directly

Changes in v2:
  - more detailed commit message is provided as discussed
    with t...@kernel.org

 mm/percpu.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 41d9d0b35801..7a5dae185ce1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1963,7 +1963,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
         struct pcpu_alloc_info *ai;
         size_t size_sum, areas_size;
         unsigned long max_distance;
-        int group, i, rc;
+        int group, i, j, rc;
 
         ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
                                    cpu_distance_fn);
@@ -1979,7 +1979,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                 goto out_free;
         }
 
-        /* allocate, copy and determine base address */
+        /* allocate, copy and determine base address & max_distance */
+        j = 0;
         for (group = 0; group < ai->nr_groups; group++) {
                 struct pcpu_group_info *gi = &ai->groups[group];
                 unsigned int cpu = NR_CPUS;
@@ -2000,6 +2001,21 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                 areas[group] = ptr;
 
                 base = min(ptr, base);
+                if (ptr > areas[j])
+                        j = group;
+        }
+        max_distance = areas[j] - base;
+        max_distance += ai->unit_size * ai->groups[j].nr_units;
+
+        /* warn if maximum distance is further than 75% of vmalloc space */
+        if (max_distance > VMALLOC_TOTAL * 3 / 4) {
+                pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
+                        max_distance, VMALLOC_TOTAL);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+                /* and fail if we have fallback */
+                rc = -EINVAL;
+                goto out_free_areas;
+#endif
         }
 
         /*
@@ -2024,24 +2040,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
         }
 
         /* base address is now known, determine group base offsets */
-        i = 0;
         for (group = 0; group < ai->nr_groups; group++) {
                 ai->groups[group].base_offset = areas[group] - base;
-                if (areas[group] > areas[i])
-                        i = group;
-        }
-        max_distance = ai->groups[i].base_offset +
-                (unsigned long)ai->unit_size * ai->groups[i].nr_units;
-
-        /* warn if maximum distance is further than 75% of vmalloc space */
-        if (max_distance > VMALLOC_TOTAL * 3 / 4) {
-                pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
-                        max_distance, VMALLOC_TOTAL);
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-                /* and fail if we have fallback */
-                rc = -EINVAL;
-                goto out_free;
-#endif
         }
 
         pr_info("Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
-- 
1.9.1
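The reason the commit message insists on @out_free_areas rather than @out_free is that only the former walks @areas and returns each group's memory. The error-path pattern can be sketched in user space; the allocator, sizes, and group count below are invented for the demo:

```c
#include <stdlib.h>

#define NR_GROUPS 4

/*
 * Error-path pattern from the patch: if validation fails after the
 * allocation loop, jump to the label that frees every recorded area.
 * Jumping past it (the equivalent of @out_free) would leak them all.
 * Returns 0 on success, -1 on failure.
 */
int setup_areas(int fail_after_alloc)
{
    void *areas[NR_GROUPS] = { NULL };
    int group, rc = 0;

    for (group = 0; group < NR_GROUPS; group++) {
        areas[group] = malloc(64);
        if (!areas[group]) {
            rc = -1;
            goto out_free_areas;
        }
    }

    /* validation runs while areas[] still records every allocation,
     * e.g. the chunk-spans-too-much-address-space check */
    if (fail_after_alloc) {
        rc = -1;
        goto out_free_areas;   /* not the later label: that would leak */
    }

    /* a real caller would keep the areas; this demo frees them below
     * so it doesn't leak on the success path either */

out_free_areas:
    for (group = 0; group < NR_GROUPS; group++)
        free(areas[group]);    /* free(NULL) is a no-op */
    return rc;
}
```

Initializing @areas to NULL and relying on free(NULL) being a no-op is what makes the single cleanup loop safe for partial allocation failures as well, avoiding the double-free risk the commit message mentions.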
[PATCH v2 1/1] mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
From: zijun_hu it will cause memory leakage for pcpu_embed_first_chunk() to go to label @out_free if the chunk spans over 3/4 VMALLOC area. all memory are allocated and recorded into array @areas for each CPU group, but the memory allocated aren't be freed before returning after going to label @out_free. in order to fix this bug, we check chunk spanned area immediately after completing memory allocation for all CPU group, we go to label @out_free_areas other than @out_free to free all memory allocated if the checking is failed. in order to verify the approach, we dump all memory allocated then enforce the jump then dump all memory freed, the result is okay after checking whether we free all memory we allocate in this function. BTW, The approach is chosen after thinking over the below scenes - we don't go to label @out_free directly to fix this issue since we maybe free several allocated memory blocks twice - the aim of jumping after pcpu_setup_first_chunk() is bypassing free usable memory other than handling error, moreover, the function does not return error code in any case, it either panics due to BUG_ON() or return 0. 
Signed-off-by: zijun_hu
Tested-by: zijun_hu
---
this patch is based on the mmotm/linux-next branch so can be applied to them directly

Changes in v2:
 - more detailed commit message is provided as discussed with t...@kernel.org

 mm/percpu.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 41d9d0b35801..7a5dae185ce1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1963,7 +1963,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 	struct pcpu_alloc_info *ai;
 	size_t size_sum, areas_size;
 	unsigned long max_distance;
-	int group, i, rc;
+	int group, i, j, rc;
 
 	ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
 				   cpu_distance_fn);
@@ -1979,7 +1979,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 		goto out_free;
 	}
 
-	/* allocate, copy and determine base address */
+	/* allocate, copy and determine base address & max_distance */
+	j = 0;
 	for (group = 0; group < ai->nr_groups; group++) {
 		struct pcpu_group_info *gi = &ai->groups[group];
 		unsigned int cpu = NR_CPUS;
@@ -2000,6 +2001,21 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 		areas[group] = ptr;
 
 		base = min(ptr, base);
+		if (ptr > areas[j])
+			j = group;
+	}
+	max_distance = areas[j] - base;
+	max_distance += ai->unit_size * ai->groups[j].nr_units;
+
+	/* warn if maximum distance is further than 75% of vmalloc space */
+	if (max_distance > VMALLOC_TOTAL * 3 / 4) {
+		pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
+			max_distance, VMALLOC_TOTAL);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+		/* and fail if we have fallback */
+		rc = -EINVAL;
+		goto out_free_areas;
+#endif
 	}
 
 	/*
@@ -2024,24 +2040,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 	}
 
 	/* base address is now known, determine group base offsets */
-	i = 0;
 	for (group = 0; group < ai->nr_groups; group++) {
 		ai->groups[group].base_offset = areas[group] - base;
-		if (areas[group] > areas[i])
-			i = group;
-	}
-	max_distance = ai->groups[i].base_offset +
-		(unsigned long)ai->unit_size * ai->groups[i].nr_units;
-
-	/* warn if maximum distance is further than 75% of vmalloc space */
-	if (max_distance > VMALLOC_TOTAL * 3 / 4) {
-		pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
-			max_distance, VMALLOC_TOTAL);
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-		/* and fail if we have fallback */
-		rc = -EINVAL;
-		goto out_free;
-#endif
 	}
 
 	pr_info("Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
-- 
1.9.1
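[Editor's note] The fix relies on the classic kernel goto-cleanup idiom: each failure point jumps to a label that releases exactly what has been set up so far. A minimal userspace sketch of that idiom follows; the function name, sizes, and the injected failure are invented for illustration and are not kernel code.

```c
#include <stdlib.h>

#define NR_GROUPS 4

/*
 * Allocate one block per group; if an allocation or a later validity
 * check fails, jump to a label that frees every block recorded in
 * areas[] so far -- the analogue of @out_free_areas -- instead of a
 * label that only releases bookkeeping.  Returns 0 on success, or -1
 * on failure with no memory left allocated.
 */
int alloc_groups(int fail_check, void *areas[NR_GROUPS])
{
	int group;

	for (group = 0; group < NR_GROUPS; group++) {
		areas[group] = malloc(64);
		if (!areas[group])
			goto out_free_areas;
	}

	/* post-allocation sanity check, like the 3/4-of-vmalloc test */
	if (fail_check)
		goto out_free_areas;

	return 0;

out_free_areas:
	while (--group >= 0) {		/* free only what was recorded */
		free(areas[group]);
		areas[group] = NULL;
	}
	return -1;
}
```

Because `group` counts exactly how many entries of `areas[]` were filled, the cleanup label can never free an unrecorded block or leak a recorded one, which is the property the patch restores for the 3/4-of-vmalloc failure path.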
Re: [RFC PATCH 1/1] mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
On 2016/9/30 0:44, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 30, 2016 at 12:03:20AM +0800, zijun_hu wrote:
>> From: zijun_hu <zijun...@htc.com>
>>
>> it will cause memory leakage for pcpu_embed_first_chunk() to go to
>> label @out_free if the chunk spans over 3/4 VMALLOC area. all memory
>> are allocated and recorded into array @areas for each CPU group, but
>> the memory allocated aren't be freed before returning after going to
>> label @out_free
>>
>> in order to fix this bug, we check chunk spanned area immediately
>> after completing memory allocation for all CPU group, we go to label
>> @out_free_areas other than @out_free to free all memory allocated if
>> the checking is failed.
>>
>> Signed-off-by: zijun_hu <zijun...@htc.com>
> ...
>> @@ -2000,6 +2001,21 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>>  		areas[group] = ptr;
>>
>>  		base = min(ptr, base);
>> +		if (ptr > areas[j])
>> +			j = group;
>> +	}
>> +	max_distance = areas[j] - base;
>> +	max_distance += ai->unit_size * ai->groups[j].nr_units;
>> +
>> +	/* warn if maximum distance is further than 75% of vmalloc space */
>> +	if (max_distance > VMALLOC_TOTAL * 3 / 4) {
>> +		pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
>> +			max_distance, VMALLOC_TOTAL);
>> +#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
>> +		/* and fail if we have fallback */
>> +		rc = -EINVAL;
>> +		goto out_free_areas;
>> +#endif
>
> Isn't it way simpler to make the error path jump to out_free_areas?
> There's another similar case after pcpu_setup_first_chunk() failure
> too. Also, can you please explain how you tested the changes?
>
> Thanks.
>

1) the simpler way doesn't work because it may free many memory blocks twice.
Let us take a CPU group as an example: after we have allocated all memory
needed by a CPU group, we may free a unit memory block which doesn't map to
an available CPU, and we may also free the unused part of a used unit; you
can refer to the following code segment for details.

	for (group = 0; group < ai->nr_groups; group++) {
		struct pcpu_group_info *gi = &ai->groups[group];
		void *ptr = areas[group];

		for (i = 0; i < gi->nr_units; i++, ptr += ai->unit_size) {
			if (gi->cpu_map[i] == NR_CPUS) {
				/* unused unit, free whole */
				free_fn(ptr, ai->unit_size);
				continue;
			}
			/* copy and return the unused part */
			memcpy(ptr, __per_cpu_load, ai->static_size);
			free_fn(ptr + size_sum, ai->unit_size - size_sum);
		}
	}

2) as we have seen, pcpu_setup_first_chunk() never fails: it either returns
0 or panics via BUG_ON(). Even if it could fail, we could reconstruct what
was allocated from the information it records, such as pcpu_base_addr and
many static variables, and complete the free operations; we cannot do that
if we fail in the case addressed by this patch.

3) my test is simple: I forced "if (max_distance > VMALLOC_TOTAL * 3 / 4)"
to "if (1)", printed which memory was allocated before the jump, then
printed which memory was freed after the jump and before returning, and
checked that everything allocated in this function was freed; the result
was okay.
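[Editor's note] The double-free hazard described in point 1) can be illustrated outside the kernel with a small simulation. The unit sizes, the cpu_map, and the per-byte free counter below are made-up stand-ins rather than the kernel's real allocator; the point is only that once parts of a group's area have been returned, a blanket free of the whole area frees some bytes a second time.

```c
#include <assert.h>

#define UNIT_SIZE 16
#define NR_UNITS   4
#define NO_CPU    -1	/* stand-in for the kernel's NR_CPUS sentinel */

/*
 * Simulate one group's area as a per-byte free counter.  The copy loop
 * frees unused units whole and trims the unused tail of used units; a
 * later blanket free of the whole area then hits some bytes again.
 * Returns the number of double-freed bytes.
 */
int count_double_frees(int size_sum)
{
	int freed[NR_UNITS * UNIT_SIZE] = { 0 };
	int cpu_map[NR_UNITS] = { 0, NO_CPU, 1, 2 };	/* unit 1 maps to no CPU */
	int i, b, dbl = 0;

	for (i = 0; i < NR_UNITS; i++) {
		int off = i * UNIT_SIZE;

		if (cpu_map[i] == NO_CPU) {
			for (b = off; b < off + UNIT_SIZE; b++)	/* unused unit, free whole */
				freed[b]++;
			continue;
		}
		for (b = off + size_sum; b < off + UNIT_SIZE; b++)	/* return unused tail */
			freed[b]++;
	}

	/* blanket free of the whole area, as a plain free-everything label would do */
	for (b = 0; b < NR_UNITS * UNIT_SIZE; b++)
		freed[b]++;

	for (b = 0; b < NR_UNITS * UNIT_SIZE; b++)
		if (freed[b] > 1)
			dbl++;
	return dbl;
}
```

With a used size of 10 bytes per 16-byte unit, the unused unit (16 bytes) plus three 6-byte tails are freed twice, which is why the error path must free whole areas only before the copy loop has started returning pieces of them.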
[RFC PATCH 1/1] mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
From: zijun_hu <zijun...@htc.com>

pcpu_embed_first_chunk() leaks memory when it reaches label @out_free after the chunk is found to span more than 3/4 of the VMALLOC area: at that point memory has already been allocated and recorded into array @areas for every CPU group, but none of it is freed before returning via @out_free.

To fix this bug, check the spanned area immediately after completing the memory allocation for all CPU groups, and jump to label @out_free_areas rather than @out_free if the check fails, so that all allocated memory is freed.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
Hi Andrew,
I am sorry I forgot to prefix the title with the "PATCH" keyword in the previous mail, so I resend it with that correction

this patch is based on the mmotm/linux-next branch so can be applied directly

 mm/percpu.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 41d9d0b35801..7a5dae185ce1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1963,7 +1963,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 	struct pcpu_alloc_info *ai;
 	size_t size_sum, areas_size;
 	unsigned long max_distance;
-	int group, i, rc;
+	int group, i, j, rc;
 
 	ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
 				   cpu_distance_fn);
@@ -1979,7 +1979,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 		goto out_free;
 	}
 
-	/* allocate, copy and determine base address */
+	/* allocate, copy and determine base address & max_distance */
+	j = 0;
 	for (group = 0; group < ai->nr_groups; group++) {
 		struct pcpu_group_info *gi = &ai->groups[group];
 		unsigned int cpu = NR_CPUS;
@@ -2000,6 +2001,21 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 		areas[group] = ptr;
 
 		base = min(ptr, base);
+		if (ptr > areas[j])
+			j = group;
+	}
+	max_distance = areas[j] - base;
+	max_distance += ai->unit_size * ai->groups[j].nr_units;
+
+	/* warn if maximum distance is further than 75% of vmalloc space */
+	if (max_distance > VMALLOC_TOTAL * 3 / 4) {
+		pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
+			max_distance, VMALLOC_TOTAL);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+		/* and fail if we have fallback */
+		rc = -EINVAL;
+		goto out_free_areas;
+#endif
 	}
 
 	/*
@@ -2024,24 +2040,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 	}
 
 	/* base address is now known, determine group base offsets */
-	i = 0;
 	for (group = 0; group < ai->nr_groups; group++) {
 		ai->groups[group].base_offset = areas[group] - base;
-		if (areas[group] > areas[i])
-			i = group;
-	}
-	max_distance = ai->groups[i].base_offset +
-		(unsigned long)ai->unit_size * ai->groups[i].nr_units;
-
-	/* warn if maximum distance is further than 75% of vmalloc space */
-	if (max_distance > VMALLOC_TOTAL * 3 / 4) {
-		pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
-			max_distance, VMALLOC_TOTAL);
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-		/* and fail if we have fallback */
-		rc = -EINVAL;
-		goto out_free;
-#endif
 	}
 
 	pr_info("Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
-- 
1.9.1
mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
From: zijun_hu <zijun...@htc.com>

pcpu_embed_first_chunk() leaks memory when it reaches label @out_free after the chunk is found to span more than 3/4 of the VMALLOC area: at that point memory has already been allocated and recorded into array @areas for every CPU group, but none of it is freed before returning via @out_free.

To fix this bug, check the spanned area immediately after completing the memory allocation for all CPU groups, and jump to label @out_free_areas rather than @out_free if the check fails, so that all allocated memory is freed.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
Hi Andrew,
this patch is based on the mmotm/linux-next branch so can be applied directly

 mm/percpu.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 41d9d0b35801..7a5dae185ce1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1963,7 +1963,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 	struct pcpu_alloc_info *ai;
 	size_t size_sum, areas_size;
 	unsigned long max_distance;
-	int group, i, rc;
+	int group, i, j, rc;
 
 	ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
 				   cpu_distance_fn);
@@ -1979,7 +1979,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 		goto out_free;
 	}
 
-	/* allocate, copy and determine base address */
+	/* allocate, copy and determine base address & max_distance */
+	j = 0;
 	for (group = 0; group < ai->nr_groups; group++) {
 		struct pcpu_group_info *gi = &ai->groups[group];
 		unsigned int cpu = NR_CPUS;
@@ -2000,6 +2001,21 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 		areas[group] = ptr;
 
 		base = min(ptr, base);
+		if (ptr > areas[j])
+			j = group;
+	}
+	max_distance = areas[j] - base;
+	max_distance += ai->unit_size * ai->groups[j].nr_units;
+
+	/* warn if maximum distance is further than 75% of vmalloc space */
+	if (max_distance > VMALLOC_TOTAL * 3 / 4) {
+		pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
+			max_distance, VMALLOC_TOTAL);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+		/* and fail if we have fallback */
+		rc = -EINVAL;
+		goto out_free_areas;
+#endif
 	}
 
 	/*
@@ -2024,24 +2040,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 	}
 
 	/* base address is now known, determine group base offsets */
-	i = 0;
 	for (group = 0; group < ai->nr_groups; group++) {
 		ai->groups[group].base_offset = areas[group] - base;
-		if (areas[group] > areas[i])
-			i = group;
-	}
-	max_distance = ai->groups[i].base_offset +
-		(unsigned long)ai->unit_size * ai->groups[i].nr_units;
-
-	/* warn if maximum distance is further than 75% of vmalloc space */
-	if (max_distance > VMALLOC_TOTAL * 3 / 4) {
-		pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
-			max_distance, VMALLOC_TOTAL);
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-		/* and fail if we have fallback */
-		rc = -EINVAL;
-		goto out_free;
-#endif
 	}
 
 	pr_info("Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
-- 
1.9.1
Re: [RESEND PATCH 1/1] mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()
On 2016/9/29 18:35, Tejun Heo wrote:
> Hello,
>
> On Sat, Sep 24, 2016 at 07:20:49AM +0800, zijun_hu wrote:
>> it is error to represent the max range max_distance spanned by all the
>> group areas as the offset of the highest group area plus unit size in
>> pcpu_embed_first_chunk(), it should equal to the offset plus the size
>> of the highest group area
>>
>> in order to fix this issue,let us find the highest group area who has the
>> biggest base address among all the ones, then max_distance is formed by
>> add it's offset and size value
>
> [PATCH] percpu: fix max_distance calculation in pcpu_embed_first_chunk()
>
> pcpu_embed_first_chunk() calculates the range a percpu chunk spans
> into max_distance and uses it to ensure that a chunk is not too big
> compared to the total vmalloc area. However, during calculation, it
> used an incorrect top address by adding a unit size to the highest
> group's base address.
>
> This can make the calculated max_distance slightly smaller than the
> actual distance although given the scale of values involved the error
> is very unlikely to have an actual impact.
>
> Fix this issue by adding the group's size instead of a unit size.
>
>> the type of variant max_distance is changed from size_t to unsigned long
>> to prevent potential overflow
>
> This doesn't make any sense. All the values involved are valid
> addresses (or +1 of it), they can't overflow and size_t is the same
> size as ulong.
>
>> @@ -2025,17 +2026,18 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>>  	}
>>
>>  	/* base address is now known, determine group base offsets */
>> -	max_distance = 0;
>> +	i = 0;
>>  	for (group = 0; group < ai->nr_groups; group++) {
>>  		ai->groups[group].base_offset = areas[group] - base;
>> -		max_distance = max_t(size_t, max_distance,
>> -				     ai->groups[group].base_offset);
>> +		if (areas[group] > areas[i])
>> +			i = group;
>>  	}
>> -	max_distance += ai->unit_size;
>> +	max_distance = ai->groups[i].base_offset +
>> +		(unsigned long)ai->unit_size * ai->groups[i].nr_units;
>
> I don't think you need the ulong cast here.
>
> Thanks.
>

okay, thanks for your reply; I will correct this in another patch
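[Editor's note] The difference between the two formulas is easy to check with concrete numbers. The layout below is invented purely for illustration (one highest group holding four units); the macros mirror the quantities in the patch, not any kernel API.

```c
/* made-up layout: the highest group starts 0x100000 above base and holds 4 units */
#define HIGHEST_OFFSET	0x100000UL
#define UNIT_SIZE	0x8000UL
#define NR_UNITS	4

/* old formula: offset of the highest group plus a single unit size */
unsigned long max_distance_old(void)
{
	return HIGHEST_OFFSET + UNIT_SIZE;
}

/* fixed formula: offset plus the whole size of the highest group */
unsigned long max_distance_new(void)
{
	return HIGHEST_OFFSET + UNIT_SIZE * NR_UNITS;
}
```

The old formula undercounts the span by `(NR_UNITS - 1) * UNIT_SIZE` bytes whenever the highest group holds more than one unit, which is exactly the error the patch corrects.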
[RFC PATCH v2 1/1] mm/vmalloc.c: simplify /proc/vmallocinfo implementation
From: zijun_hu <zijun...@htc.com>

simplify the /proc/vmallocinfo implementation via the existing seq_file helpers for list_head

Signed-off-by: zijun_hu <zijun...@htc.com>
---
Changes in v2:
 - more detailed commit message is provided
 - the redundant type cast for list_entry() is removed as advised by rient...@google.com

 mm/vmalloc.c | 27 +--
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index cc6ecd60cc0e..8b80931654b7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2576,32 +2576,13 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
 static void *s_start(struct seq_file *m, loff_t *pos)
 	__acquires(&vmap_area_lock)
 {
-	loff_t n = *pos;
-	struct vmap_area *va;
-
 	spin_lock(&vmap_area_lock);
-	va = list_first_entry(&vmap_area_list, typeof(*va), list);
-	while (n > 0 && &va->list != &vmap_area_list) {
-		n--;
-		va = list_next_entry(va, list);
-	}
-	if (!n && &va->list != &vmap_area_list)
-		return va;
-
-	return NULL;
-
+	return seq_list_start(&vmap_area_list, *pos);
 }
 
 static void *s_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	struct vmap_area *va = p, *next;
-
-	++*pos;
-	next = list_next_entry(va, list);
-	if (&next->list != &vmap_area_list)
-		return next;
-
-	return NULL;
+	return seq_list_next(p, &vmap_area_list, pos);
 }
 
 static void s_stop(struct seq_file *m, void *p)
@@ -2636,9 +2617,11 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)
 static int s_show(struct seq_file *m, void *p)
 {
-	struct vmap_area *va = p;
+	struct vmap_area *va;
 	struct vm_struct *v;
 
+	va = list_entry(p, struct vmap_area, list);
+
 	/*
 	 * s_show can encounter race with remove_vm_area, !VM_VM_AREA on
 	 * behalf of vmap area is being tear down or vm_map_ram allocation.
-- 
1.9.1
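[Editor's note] seq_list_start()/seq_list_next() encapsulate exactly the positional walk that the removed code open-coded. A rough userspace sketch of that contract follows, simplified to a plain singly linked list instead of the kernel's list_head; the function names here are invented and only approximate the kernel helpers' semantics.

```c
#include <stddef.h>

struct node {
	int value;
	struct node *next;
};

/*
 * Return the pos'th node of the list, or NULL when pos is past the
 * end -- the contract the open-coded loop in the old s_start()
 * implemented by hand.
 */
struct node *seq_start(struct node *head, long pos)
{
	struct node *n = head;

	while (n && pos > 0) {
		n = n->next;
		pos--;
	}
	return n;
}

/* advance to the next element and bump *pos, like seq_list_next() */
struct node *seq_next(struct node *n, long *pos)
{
	++*pos;
	return n ? n->next : NULL;
}
```

Folding the walk into shared helpers removes three hand-written loops and the easy-to-miss off-by-one and end-of-list checks they each had to repeat.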
[RFC PATCH v2 1/1] mm/vmalloc.c: correct a few logic error for __insert_vmap_area()
From: zijun_hu <zijun...@htc.com>

__insert_vmap_area() has a few obvious logic errors, shown by the comments within the code segment below

static void __insert_vmap_area(struct vmap_area *va)
{
	/* as an internal function parameter, vmap_area @va is assumed to have nonzero size */
	...
	if (va->va_start < tmp->va_end)
		p = &(*p)->rb_left;
	else if (va->va_end > tmp->va_start)
		p = &(*p)->rb_right;
	/* this else-if condition is always true and meaningless, because when it is
	   reached va->va_end > va->va_start >= tmp_va->va_end > tmp_va->va_start
	   holds normally */
	else
		BUG();
	/* this BUG() is meaningless too, since it is never reached normally */
	...
}

the function doesn't implement the behavior its context requires: if the
vmap_area @va to be inserted is lower than the one at the current rbtree
node, walk the left branch; if higher, walk the right branch; otherwise
the two vmap_areas overlap and the existing BUG() should be triggered

it is fixed by correcting the vmap_area rbtree walk as described above

BTW, we consider (va->va_end == tmp_va->va_start) a legal case, since it
indicates that vmap_area @va tightly neighbors @tmp_va

Fixes: db64fe02258f ("mm: rewrite vmap layer")
Signed-off-by: zijun_hu <zijun...@htc.com>
---
Hi npiggin,
could you offer some comments for this patch since __insert_vmap_area()
was introduced by you? thanks a lot

Changes in v2:
 - more detailed commit message is provided

 mm/vmalloc.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 91f44e78c516..cc6ecd60cc0e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -321,10 +321,10 @@ static void __insert_vmap_area(struct vmap_area *va)
 		parent = *p;
 		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
 
-		if (va->va_start < tmp_va->va_end)
-			p = &(*p)->rb_left;
-		else if (va->va_end > tmp_va->va_start)
-			p = &(*p)->rb_right;
+		if (va->va_end <= tmp_va->va_start)
+			p = &(*p)->rb_left;
+		else if (va->va_start >= tmp_va->va_end)
+			p = &(*p)->rb_right;
 		else
 			BUG();
 	}
-- 
1.9.1
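[Editor's note] The corrected walk reduces to a three-way comparison of half-open ranges: strictly below, strictly above, or overlapping. A standalone sketch of that predicate (the enum and function are invented for illustration, with the same endpoint conventions as vmap_area):

```c
/* three-way comparison of half-open ranges [a_start, a_end) and [b_start, b_end) */
enum range_cmp { RANGE_LEFT, RANGE_RIGHT, RANGE_OVERLAP };

enum range_cmp range_compare(unsigned long a_start, unsigned long a_end,
			     unsigned long b_start, unsigned long b_end)
{
	if (a_end <= b_start)		/* a lies entirely below b: walk left  */
		return RANGE_LEFT;
	if (a_start >= b_end)		/* a lies entirely above b: walk right */
		return RANGE_RIGHT;
	return RANGE_OVERLAP;		/* the insert-time BUG() case */
}
```

With the pre-patch conditions (`a_start < b_end` goes left, else `a_end > b_start` goes right), the overlap branch is unreachable for any nonzero-sized ranges, because whenever the first test fails the second is automatically true; that unreachability is the bug the patch fixes.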
[RESEND RFC PATCH 1/1] linux/mm.h: canonicalize macro PAGE_ALIGNED() definition
From: zijun_hu <zijun...@htc.com>

The macro PAGE_ALIGNED() is prone to error because it does not follow
the convention of parenthesizing the parameter @addr within the macro
body. For example:

	unsigned long *ptr = kmalloc(...);
	PAGE_ALIGNED(ptr + 16);

For the left parameter of macro IS_ALIGNED(), (unsigned long)(ptr + 16)
is desired, but the actual expansion is (unsigned long)ptr + 16.

Fix this by simply canonicalizing the PAGE_ALIGNED() definition.

Signed-off-by: zijun_hu <zijun...@htc.com>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef815b9cd426..ec6818631635 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -126,7 +126,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
 
 /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
-#define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)addr, PAGE_SIZE)
+#define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
 /*
  * Linux kernel virtual memory manager primitives.
-- 
1.9.1
Re: [PATCH 1/5] mm/vmalloc.c: correct a few logic errors in __insert_vmap_area()
On 09/22/2016 07:15 AM, David Rientjes wrote:
> On Thu, 22 Sep 2016, zijun_hu wrote:
>
>>> We don't support inserting when va->va_start == tmp_va->va_end, plain and
>>> simple. There's no reason to do so. NACK to the patch.
>>>
>> I am sorry, I disagree with you, because:
>> 1) in almost all contexts of vmalloc, the original logic treats the
>>    special case as normal, for example __find_vmap_area() or
>>    alloc_vmap_area()
>
> The ranges are [start, end) like everywhere else. __find_vmap_area() is
> implemented as such for the passed address. The address is aligned in
> alloc_vmap_area(), there's no surprise here. The logic is correct in
> __insert_vmap_area().
>
I am sorry to disagree with you; I will resend this patch with a more
detailed illustration.