Re: [PATCH] cpufreq: powernv: Use node ID in init_chip_info
+ few IBM guys who have been working on this.

On 19-10-16, 15:02, Emily Shaffer wrote:
> Fixed assumption that node_id==chip_id in powernv-cpufreq.c:init_chip_info;
> explicitly use node ID where necessary.
>
> Tested: All CPUs report in /sys/devices/system/cpu*/cpufreq/throttle_stats
>
> Effort: platforms/arch-powerpc
> Google-Bug-Id: 26979978

Is this relevant upstream?

> Signed-off-by: Emily Shaffer
> Change-Id: I22eb626b32fbb8053b3bbb9c75e677c700d0c2fb

Gerrit id isn't required for upstream..

> ---
>  drivers/cpufreq/powernv-cpufreq.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index d3ffde8..3750b58 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -911,32 +911,47 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>  
>  static int init_chip_info(void)
>  {
> -	unsigned int chip[256];
> +	int rc = 0;
>  	unsigned int cpu, i;
>  	unsigned int prev_chip_id = UINT_MAX;
> +	unsigned int *chip, *node;
> +
> +	chip = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
> +	node = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
> +	if (!chip || !node) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
>  
>  	for_each_possible_cpu(cpu) {
>  		unsigned int id = cpu_to_chip_id(cpu);
>  
>  		if (prev_chip_id != id) {
>  			prev_chip_id = id;
> -			chip[nr_chips++] = id;
> +			node[nr_chips] = cpu_to_node(cpu);
> +			chip[nr_chips] = id;
> +			nr_chips++;
>  		}
>  	}
>  
>  	chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
> -	if (!chips)
> -		return -ENOMEM;
> +	if (!chips) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
>  
>  	for (i = 0; i < nr_chips; i++) {
>  		chips[i].id = chip[i];
> -		cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
> +		cpumask_copy(&chips[i].mask, cpumask_of_node(node[i]));
>  		INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
>  		for_each_cpu(cpu, &chips[i].mask)
>  			per_cpu(chip_info, cpu) = &chips[i];
>  	}
>  
> -	return 0;
> +out:
> +	kfree(node);
> +	kfree(chip);
> +	return rc;
>  }
>  
>  static inline void clean_chip_info(void)
> -- 
> 2.8.0.rc3.226.g39d4020

-- 
viresh
[PATCH] KVM: PPC: Book3S HV: Fix build error when SMP=n
Commit 5d375199ea96 ("KVM: PPC: Book3S HV: Set server for passed-through
interrupts") broke the SMP=n build:

  arch/powerpc/kvm/book3s_hv_rm_xics.c:758:2: error: implicit declaration of function 'get_hard_smp_processor_id'

That is because we lost the implicit include of asm/smp.h, so include it
explicitly to get the definition for get_hard_smp_processor_id().

Fixes: 5d375199ea96 ("KVM: PPC: Book3S HV: Set server for passed-through interrupts")
Signed-off-by: Michael Ellerman
---
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 82ff5de8b1e7..a0ea63ac2b52 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -23,6 +23,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <asm/smp.h>
 
 #include "book3s_xics.h"
-- 
2.7.4
Re: [PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set
On Wed, 19 Oct 2016 22:00:06 +1100
Michael Ellerman wrote:

> Nicholas Piggin writes:
>
>> Enable thin archives build for powerpc when COMPILE_TEST is set.
>> Thin archives are explained in this commit:
>>
>>   a5967db9af51a84f5e181600954714a9e4c69f1f
>>
>> This is a gradual way to introduce the option to testers.
>
> I think I'd rather this was actually a user-selectable option, eg.
>
>   config USE_THIN_ARCHIVES
>           bool "Build the kernel using thin archives"
>           default n
>           select THIN_ARCHIVES
>           help
>             Build the kernel using thin archives.
>             If you're unsure say N.

Okay, that should be easy. I'll respin the series.

Thanks,
Nick
Re: [PATCH v4 3/5] powerpc/mm: allow memory hotplug into a memoryless node
On 07/10/16 05:36, Reza Arbab wrote:
> Remove the check which prevents us from hotplugging into an empty node.
>
> This limitation has been questioned before [1], and judging by the
> response, there doesn't seem to be a reason we can't remove it. No issues
> have been found in light testing.
>
> [1] http://lkml.kernel.org/r/cagzkibrmksa1yyhbf5hwgxubcjse5smksmy4tpanerme2ug...@mail.gmail.com
>     http://lkml.kernel.org/r/20160511215051.gf22...@arbab-laptop.austin.ibm.com
>
> Signed-off-by: Reza Arbab
> Reviewed-by: Aneesh Kumar K.V
> Acked-by: Balbir Singh
> Cc: Nathan Fontenot
> Cc: Bharata B Rao
> ---
>  arch/powerpc/mm/numa.c | 13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 75b9cd6..d7ac419 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1121,7 +1121,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
>  int hot_add_scn_to_nid(unsigned long scn_addr)
>  {
>  	struct device_node *memory = NULL;
> -	int nid, found = 0;
> +	int nid;
>  
>  	if (!numa_enabled || (min_common_depth < 0))
>  		return first_online_node;
> @@ -1137,17 +1137,6 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
>  	if (nid < 0 || !node_online(nid))
>  		nid = first_online_node;
>  
> -	if (NODE_DATA(nid)->node_spanned_pages)
> -		return nid;
> -
> -	for_each_online_node(nid) {
> -		if (NODE_DATA(nid)->node_spanned_pages) {
> -			found = 1;
> -			break;
> -		}
> -	}
> -
> -	BUG_ON(!found);
>  	return nid;

FYI, these checks were temporary to begin with. I found this in the git
history:

b226e462124522f2f23153daff31c311729dfa2f (powerpc: don't add memory to empty node/zone)

Balbir Singh.
[PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container
In some situations the userspace memory context may live longer than
the userspace process itself, so if we need to do proper memory context
cleanup, we better cache @mm and use it later when the process is gone
(@current or @current->mm is NULL).

This references mm and stores the pointer in the container; this is done
when a container is just created so checking for !current->mm in other
places becomes pointless.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from registering memory in other
processes.

Signed-off-by: Alexey Kardashevskiy
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 127 ++++++++++++++++++-----------------
 1 file changed, 71 insertions(+), 56 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index d0c38b2..6b0b121 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -31,49 +31,46 @@ static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group);
 
-static long try_increment_locked_vm(long npages)
+static long try_increment_locked_vm(struct mm_struct *mm, long npages)
 {
 	long ret = 0, locked, lock_limit;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	if (!npages)
 		return 0;
 
-	down_write(&current->mm->mmap_sem);
-	locked = current->mm->locked_vm + npages;
+	down_write(&mm->mmap_sem);
+	locked = mm->locked_vm + npages;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
 		ret = -ENOMEM;
 	else
-		current->mm->locked_vm += npages;
+		mm->locked_vm += npages;
 
 	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
 			npages << PAGE_SHIFT,
-			current->mm->locked_vm << PAGE_SHIFT,
+			mm->locked_vm << PAGE_SHIFT,
 			rlimit(RLIMIT_MEMLOCK),
 			ret ? " - exceeded" : "");
 
-	up_write(&current->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 
 	return ret;
 }
 
-static void decrement_locked_vm(long npages)
+static void decrement_locked_vm(struct mm_struct *mm, long npages)
 {
-	if (!current || !current->mm || !npages)
+	if (!mm || !npages)
 		return; /* process exited */
 
-	down_write(&current->mm->mmap_sem);
-	if (WARN_ON_ONCE(npages > current->mm->locked_vm))
-		npages = current->mm->locked_vm;
-	current->mm->locked_vm -= npages;
+	down_write(&mm->mmap_sem);
+	if (WARN_ON_ONCE(npages > mm->locked_vm))
+		npages = mm->locked_vm;
+	mm->locked_vm -= npages;
 	pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
 			npages << PAGE_SHIFT,
-			current->mm->locked_vm << PAGE_SHIFT,
+			mm->locked_vm << PAGE_SHIFT,
 			rlimit(RLIMIT_MEMLOCK));
-	up_write(&current->mm->mmap_sem);
+	up_write(&mm->mmap_sem);
 }
 
 /*
@@ -98,6 +95,7 @@ struct tce_container {
 	bool enabled;
 	bool v2;
 	unsigned long locked_pages;
+	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
 };
@@ -113,11 +111,11 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 	if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
 		return -EINVAL;
 
-	mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
+	mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
 	if (!mem)
 		return -ENOENT;
 
-	return mm_iommu_put(current->mm, mem);
+	return mm_iommu_put(container->mm, mem);
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -134,7 +132,7 @@ static long tce_iommu_register_pages(struct tce_container *container,
 			((vaddr + size) < vaddr))
 		return -EINVAL;
 
-	ret = mm_iommu_get(current->mm, vaddr, entries, &mem);
+	ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
 	if (ret)
 		return ret;
 
@@ -143,7 +141,8 @@ static long tce_iommu_register_pages(struct tce_container *container,
 	return 0;
 }
 
-static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
+static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
+		struct mm_struct *mm)
 {
 	unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
 			tbl->it_size, PAGE_SIZE);
@@ -152,13 +151,13 @@ static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
 
 	BUG_ON(tbl->it_userspace);
 
-	ret = try_increment_locked_vm(cb >> PAGE_SHIFT);
+	ret =
[PATCH kernel v3 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current, which in some situations may not
have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/mmu_context.h | 16 ++++++------
 arch/powerpc/mm/mmu_context_iommu.c    | 46 +++++++++++++---------------------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 14 ++++++-----
 3 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 424844b..b9e3f0a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -19,16 +19,18 @@ extern void destroy_context(struct mm_struct *mm);
 struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);	/* from internal.h */
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern long mm_iommu_put(struct mm_struct *mm,
+		struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-		unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-		unsigned long entries);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+		unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 		unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index ad2e575..4c6db09 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -56,7 +56,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	}
 
 	pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-			current->pid,
+			current ? current->pid : 0,
 			incr ? '+' : '-',
 			npages << PAGE_SHIFT,
 			mm->locked_vm << PAGE_SHIFT,
@@ -66,12 +66,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
 	return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-	if (!current || !current->mm)
-		return false;
-
-	return !list_empty(&current->mm->context.iommu_group_mem_list);
+	return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
@@ -124,19 +121,16 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 	return 0;
 }
 
-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
 	struct page *page = NULL;
 
-	if (!current || !current->mm)
-		return -ESRCH; /* process exited */
-
 	mutex_lock(&mem_list_mutex);
 
-	list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+	list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
 			next) {
 		if ((mem->ua == ua) && (mem->entries == entries)) {
 			++mem->used;
@@ -154,7 +148,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
 	}
 
-	ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+	ret = mm_iommu_adjust_locked_vm(mm, entries, true);
 	if (ret)
 		goto unlock_exit;
 
@@ -215,11 +209,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 	mem->entries = entries;
 	*pmem = mem;
 
-	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
 unlock_exit:
 	if (locked_entries
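With the helpers converted, the calling convention becomes explicit: the
caller owns the process checks and passes the mm down. A sketch of a
post-patch call site (the function name here is illustrative; the real
callers are the tce_iommu_* routines updated in the
vfio_iommu_spapr_tce.c part of this patch):

	static long example_register(unsigned long vaddr, unsigned long entries)
	{
		struct mm_iommu_table_group_mem_t *mem;

		/* the guard now lives in the caller... */
		if (!current || !current->mm)
			return -ESRCH;	/* process exited */

		/* ...and the helper takes the mm explicitly */
		return mm_iommu_get(current->mm, vaddr, entries, &mem);
	}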
[PATCH kernel v3 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers
We are going to get rid of @current references in mmu_context_book3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct which does have
the reference counter.

This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/mmu_context.h | 4 ++--
 arch/powerpc/kernel/setup-common.c     | 2 +-
 arch/powerpc/mm/mmu_context_book3s64.c | 4 ++--
 arch/powerpc/mm/mmu_context_iommu.c    | 9 +++++----
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 5c45114..424844b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -23,8 +23,8 @@ extern bool mm_iommu_preregistered(void);
 extern long mm_iommu_get(unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
-extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern void mm_iommu_init(struct mm_struct *mm);
+extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
 		unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 270ee30..f516ac5 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -915,7 +915,7 @@ void __init setup_arch(char **cmdline_p)
 	init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_init(&init_mm.context);
+	mm_iommu_init(&init_mm);
 #endif
 	irqstack_early_init();
 	exc_lvl_early_init();
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index b114f8b..ad82735 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	mm->context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_init(&mm->context);
+	mm_iommu_init(mm);
 #endif
 	return 0;
 }
@@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(&mm->context);
+	mm_iommu_cleanup(mm);
 #endif
 
 #ifdef CONFIG_PPC_ICSWX
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index e0f1c33..ad2e575 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -373,16 +373,17 @@ void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
 
-void mm_iommu_init(mm_context_t *ctx)
+void mm_iommu_init(struct mm_struct *mm)
 {
-	INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
+	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
 
-void mm_iommu_cleanup(mm_context_t *ctx)
+void mm_iommu_cleanup(struct mm_struct *mm)
 {
 	struct mm_iommu_table_group_mem_t *mem, *tmp;
 
-	list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
+	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
+			next) {
 		list_del_rcu(&mem->next);
 		mm_iommu_do_free(mem);
 	}
-- 
2.5.0.rc3
[PATCH kernel v3 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when the VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself, as kernel threads
use userspace process MMs which were running on a CPU where the kernel
thread was scheduled. If this happens, the MM remains referenced until
this exact kernel thread wakes up again and releases the very last
reference to the MM; on an idle system this can take even hours.

This moves preregistered regions tracking from MM to VFIO; instead of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
registers a memory region once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: Nicholas Piggin
---
Changes:
v3:
* moved tce_iommu_prereg_free() call out of list_for_each_entry()

v2:
* updated commit log
---
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
 arch/powerpc/mm/mmu_context_iommu.c    | 11 -----------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 49 ++++++++++++++++++++++++++++++++-
 3 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index ad82735..1a07969 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct *mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-	mm_iommu_cleanup(mm);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
 	drop_cop(mm->context.acop, mm);
 	kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 4c6db09..104bad0 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
 {
 	INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(struct mm_struct *mm)
-{
-	struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-	list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
-			next) {
-		list_del_rcu(&mem->next);
-		mm_iommu_do_free(mem);
-	}
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 6b0b121..3e2f757 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -86,6 +86,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered region it has
+ * referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+	struct list_head next;
+	struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -98,12 +107,27 @@ struct tce_container {
 	struct mm_struct *mm;
 	struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 	struct list_head group_list;
+	struct list_head prereg_list;
 };
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+		struct tce_iommu_prereg *tcemem)
+{
+	long ret;
+
+	list_del(&tcemem->next);
+	ret = mm_iommu_put(container->mm, tcemem->mem);
+	kfree(tcemem);
+
+	return ret;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
 		__u64 vaddr, __u64 size)
 {
 	struct mm_iommu_table_group_mem_t *mem;
+	struct tce_iommu_prereg *tcemem;
+	bool found = false;
 
 	if (!current || !current->mm)
 		return -ESRCH; /* process exited */
@@ -115,7 +139,17 @@ static long tce_iommu_unregister_pages(struct tce_container *container,
 	if (!mem)
 		return -ENOENT;
 
-	return mm_iommu_put(container->mm, mem);
+	list_for_each_entry(tcemem, &container->prereg_list, next) {
+		if (tcemem->mem == mem) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return -ENOENT;
+
+	return tce_iommu_prereg_free(container, tcemem);
 }
 
 static long
[PATCH kernel v3 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown
These patches are to fix a bug when pages stay pinned hours after
QEMU which requested pinning exited.

This time 4 patches for easier reviewing.

Please comment. Thanks.

Alexey Kardashevskiy (4):
  powerpc/iommu: Pass mm_struct to init/cleanup helpers
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  vfio/spapr: Cache mm in tce_container
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

 arch/powerpc/include/asm/mmu_context.h |  20 ++--
 arch/powerpc/kernel/setup-common.c     |   2 +-
 arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
 arch/powerpc/mm/mmu_context_iommu.c    |  60 +++++-----
 drivers/vfio/vfio_iommu_spapr_tce.c    | 180 ++++++++++++++++---------
 5 files changed, 156 insertions(+), 112 deletions(-)

-- 
2.5.0.rc3
[PATCH] cpufreq: powernv: Use node ID in init_chip_info
Fixed assumption that node_id==chip_id in powernv-cpufreq.c:init_chip_info;
explicitly use node ID where necessary.

Tested: All CPUs report in /sys/devices/system/cpu*/cpufreq/throttle_stats

Effort: platforms/arch-powerpc
Google-Bug-Id: 26979978
Signed-off-by: Emily Shaffer
Change-Id: I22eb626b32fbb8053b3bbb9c75e677c700d0c2fb
---
 drivers/cpufreq/powernv-cpufreq.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index d3ffde8..3750b58 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -911,32 +911,47 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
 
 static int init_chip_info(void)
 {
-	unsigned int chip[256];
+	int rc = 0;
 	unsigned int cpu, i;
 	unsigned int prev_chip_id = UINT_MAX;
+	unsigned int *chip, *node;
+
+	chip = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
+	node = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
+	if (!chip || !node) {
+		rc = -ENOMEM;
+		goto out;
+	}
 
 	for_each_possible_cpu(cpu) {
 		unsigned int id = cpu_to_chip_id(cpu);
 
 		if (prev_chip_id != id) {
			prev_chip_id = id;
-			chip[nr_chips++] = id;
+			node[nr_chips] = cpu_to_node(cpu);
+			chip[nr_chips] = id;
+			nr_chips++;
 		}
 	}
 
 	chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
-	if (!chips)
-		return -ENOMEM;
+	if (!chips) {
+		rc = -ENOMEM;
+		goto out;
+	}
 
 	for (i = 0; i < nr_chips; i++) {
 		chips[i].id = chip[i];
-		cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
+		cpumask_copy(&chips[i].mask, cpumask_of_node(node[i]));
 		INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
 		for_each_cpu(cpu, &chips[i].mask)
 			per_cpu(chip_info, cpu) = &chips[i];
 	}
 
-	return 0;
+out:
+	kfree(node);
+	kfree(chip);
+	return rc;
 }
 
 static inline void clean_chip_info(void)
-- 
2.8.0.rc3.226.g39d4020
Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check
On 2016/10/20 01:24, Radim Krčmář wrote:
> 2016-10-19 06:20-0400, Pan Xinhui:
>> This is to fix some lock holder preemption issues. Some other locks
>> implementation do a spin loop before acquiring the lock itself.
>> Currently kernel has an interface of bool vcpu_is_preempted(int cpu). It
>> takes the cpu as parameter and returns true if the cpu is preempted. Then
>> kernel can break the spin loops upon the retval of vcpu_is_preempted.
>>
>> As the kernel has used this interface, so let's support it.
>>
>> We use one field of struct kvm_steal_time to indicate whether one vcpu
>> is running or not.
>>
>> unix benchmark result:
>> host:  kernel 4.8.1, i5-4570, 4 cpus
>> guest: kernel 4.8.1, 8 vcpus
>>
>> test-case                               after-patch       before-patch
>> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
>> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
>> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
>> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
>> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
>> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
>> Process Creation                       |    29881.2 lps  |    28572.8 lps
>> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
>> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
>> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>>
>> Signed-off-by: Pan Xinhui
>> ---
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> @@ -98,6 +98,10 @@ struct pv_time_ops {
>>  	unsigned long long (*steal_clock)(int cpu);
>>  };
>>  
>> +struct pv_vcpu_ops {
>> +	bool (*vcpu_is_preempted)(int cpu);
>> +};
>> +
>
> (I would put it into pv_lock_ops to save the plumbing.)

hi, Radim
	thanks for your reply.
	yes, a new struct leads the patch into unnecessary lines changed. I
did that just because I am not sure which existing xxx_ops I should place
vcpu_is_preempted in.

>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -45,7 +45,8 @@ struct kvm_steal_time {
>>  	__u64 steal;
>>  	__u32 version;
>>  	__u32 flags;
>> -	__u32 pad[12];
>> +	__u32 preempted;
>
> Why __u32 instead of __u8?

I thought it is 32-bit aligned... yes, u8 is good enough to store the
preempt status.

>> +	__u32 pad[11];
>>  };
>
> Please document the change in Documentation/virtual/kvm/msr.txt, section
> MSR_KVM_STEAL_TIME.

okay, I totally forgot to do that. thanks!

>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
>> +static bool kvm_vcpu_is_preempted(int cpu)
>> +{
>> +	struct kvm_steal_time *src;
>> +
>> +	src = &per_cpu(steal_time, cpu);
>> +
>> +	return !!src->preempted;
>> +}
>> +
>>  #ifdef CONFIG_SMP
>>  static void __init kvm_smp_prepare_boot_cpu(void)
>>  {
>> @@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
>>  	kvm_guest_cpu_init();
>>  #endif
>>  
>> +	pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
>
> Would be nicer to assign conditionally in the KVM_FEATURE_STEAL_TIME
> block.  The steal_time structure has to be zeroed, so this code would
> work, but the native function (return false) is better if we know that
> kvm_vcpu_is_preempted() would always return false anyway.

yes, agree. Will do that.

I once thought we could patch the code at runtime: replace the binary
code "call 0x... # pv_vcpu_ops.vcpu_is_preempted" with "xor eax, eax".
However it is not worth doing that; the performance improvement might
be very small.

> Old KVMs won't have the feature, so we could also assign only when KVM
> reports it, but that requires extra definitions and the performance gain
> is fairly small, so I'm ok with this.

>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> @@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>>  			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))
>>  		return;
>>  
>> +	vcpu->arch.st.steal.preempted = 0;
>> +
>>  	if (vcpu->arch.st.steal.version & 1)
>>  		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
>>  
>> @@ -2812,6 +2814,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  
>>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>  {
>> +	if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
>> +		if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> +			&vcpu->arch.st.steal,
>> +			sizeof(struct kvm_steal_time)) == 0) {
>> +			vcpu->arch.st.steal.preempted = 1;
>> +			kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> +				&vcpu->arch.st.steal,
>> +				sizeof(struct kvm_steal_time));
>> +		}
>
> Please name this block of code.  Something like
>
>   kvm_steal_time_set_preempted(vcpu);

yep, my code style is ugly.
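For reference, a sketch of the conditional assignment being agreed on
here, placed in kvm_guest_init() (the KVM_FEATURE_STEAL_TIME block and
its two existing assignments are from arch/x86/kernel/kvm.c; only the
last line is the new hook, and its exact placement is an assumption):

	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
		has_steal_clock = 1;
		pv_time_ops.steal_clock = kvm_steal_clock;
		/* steal time is available, so the PV check can be trusted */
		pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
	}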
Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check
2016-10-19 06:20-0400, Pan Xinhui:
> This is to fix some lock holder preemption issues. Some other locks
> implementation do a spin loop before acquiring the lock itself.
> Currently kernel has an interface of bool vcpu_is_preempted(int cpu). It
> takes the cpu as parameter and returns true if the cpu is preempted. Then
> kernel can break the spin loops upon the retval of vcpu_is_preempted.
>
> As the kernel has used this interface, so let's support it.
>
> We use one field of struct kvm_steal_time to indicate whether one vcpu
> is running or not.
>
> unix benchmark result:
> host:  kernel 4.8.1, i5-4570, 4 cpus
> guest: kernel 4.8.1, 8 vcpus
>
> test-case                               after-patch       before-patch
> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
> Process Creation                       |    29881.2 lps  |    28572.8 lps
> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>
> Signed-off-by: Pan Xinhui
> ---
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> @@ -98,6 +98,10 @@ struct pv_time_ops {
>  	unsigned long long (*steal_clock)(int cpu);
>  };
>  
> +struct pv_vcpu_ops {
> +	bool (*vcpu_is_preempted)(int cpu);
> +};
> +

(I would put it into pv_lock_ops to save the plumbing.)

> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -45,7 +45,8 @@ struct kvm_steal_time {
>  	__u64 steal;
>  	__u32 version;
>  	__u32 flags;
> -	__u32 pad[12];
> +	__u32 preempted;

Why __u32 instead of __u8?

> +	__u32 pad[11];
>  };

Please document the change in Documentation/virtual/kvm/msr.txt, section
MSR_KVM_STEAL_TIME.

> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
> +static bool kvm_vcpu_is_preempted(int cpu)
> +{
> +	struct kvm_steal_time *src;
> +
> +	src = &per_cpu(steal_time, cpu);
> +
> +	return !!src->preempted;
> +}
> +
>  #ifdef CONFIG_SMP
>  static void __init kvm_smp_prepare_boot_cpu(void)
>  {
> @@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
>  	kvm_guest_cpu_init();
>  #endif
>  
> +	pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;

Would be nicer to assign conditionally in the KVM_FEATURE_STEAL_TIME
block.  The steal_time structure has to be zeroed, so this code would
work, but the native function (return false) is better if we know that
kvm_vcpu_is_preempted() would always return false anyway.

Old KVMs won't have the feature, so we could also assign only when KVM
reports it, but that requires extra definitions and the performance gain
is fairly small, so I'm ok with this.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>  			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))
>  		return;
>  
> +	vcpu->arch.st.steal.preempted = 0;
> +
>  	if (vcpu->arch.st.steal.version & 1)
>  		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
>  
> @@ -2812,6 +2814,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> +	if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
> +		if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> +			&vcpu->arch.st.steal,
> +			sizeof(struct kvm_steal_time)) == 0) {
> +			vcpu->arch.st.steal.preempted = 1;
> +			kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> +				&vcpu->arch.st.steal,
> +				sizeof(struct kvm_steal_time));
> +		}

Please name this block of code.  Something like

  kvm_steal_time_set_preempted(vcpu);

Thanks.
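Factored out as suggested, the block above would read roughly as follows
(a sketch assembled from the quoted diff; only the helper's name comes
from the review comment):

	static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
	{
		if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
			return;

		/* only mark the flag if the steal time area is reachable */
		if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
					  &vcpu->arch.st.steal,
					  sizeof(struct kvm_steal_time)))
			return;

		vcpu->arch.st.steal.preempted = 1;
		kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
				       &vcpu->arch.st.steal,
				       sizeof(struct kvm_steal_time));
	}

kvm_arch_vcpu_put() then shrinks to a single
kvm_steal_time_set_preempted(vcpu) call.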
Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags
On 10/19/2016 10:01 AM, Michal Hocko wrote:
> The question I had earlier was whether this has to be an explicit FOLL
> flag used by g-u-p users or we can just use it internally when mm !=
> current->mm

The reason I chose not to do that was that deferred work gets run under
a basically random 'current'. If we just use 'mm != current->mm', then
the deferred work will sometimes have pkeys enforced and sometimes not,
basically randomly. We want to be consistent with whether they are
enforced or not, so we explicitly indicate that by calling the remote
variant vs. plain.
Re: [PATCH v4 0/5] implement vcpu preempted check
On 2016/10/19 23:58, Juergen Gross wrote:
> On 19/10/16 12:20, Pan Xinhui wrote:
>> change from v3:
>> 	add x86 vcpu preempted check patch
>> change from v2:
>> 	no code change, fix typos, update some comments
>> change from v1:
>> 	a simpler definition of default vcpu_is_preempted
>> 	skip machine type check on ppc, and add config; remove dedicated macro.
>> 	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks to boqun and Peter's suggestions.
>>
>> This patch set aims to fix lock holder preemption issues.
>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some
>> spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_owner variants also caused rcu stalls before we applied
>> this patch set.
>>
>> We also have observed some performance improvements.
>>
>> PPC test result:
>> 1 copy  - 0.94%
>> 2 copy  - 7.17%
>> 4 copy  - 11.9%
>> 8 copy  - 3.04%
>> 16 copy - 15.11%
>>
>> details below:
>> Without patch:
>> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 samples)
>> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 samples)
>> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 samples)
>> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 1 samples)
>>
>> With patch:
>> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 samples)
>> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 samples)
>> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 samples)
>> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 1 samples)
>>
>> X86 test result:
>> test-case                               after-patch       before-patch
>> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
>> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
>> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
>> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
>> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
>> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
>> Process Creation                       |    29881.2 lps  |    28572.8 lps
>> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
>> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
>> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>>
>> Pan Xinhui (5):
>>   kernel/sched: introduce vcpu preempted check interface
>>   locking/osq: Drop the overload of osq_lock()
>>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>>   powerpc/spinlock: support vcpu preempted check
>>   x86, kvm: support vcpu preempted check
>
> The attached patch adds Xen support for x86. Please tell me whether you
> want to add this patch to your series or if I should post it when your
> series has been accepted.

hi, Juergen
	Your patch is pretty small and nice :) thanks!
	I can include your patch into my next patchset after this patchset
is reviewed. :)

> You can add my
> Tested-by: Juergen Gross
> for patches 1-3 and 5 (paravirt parts only).

Thanks a lot!

xinhui

> Juergen

>>  arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
>>  arch/x86/include/asm/paravirt_types.h |  6 ++++++
>>  arch/x86/include/asm/spinlock.h       |  8 ++++++++
>>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>>  arch/x86/kernel/kvm.c                 | 11 +++++++++++
>>  arch/x86/kernel/paravirt.c            | 11 +++++++++++
>>  arch/x86/kvm/x86.c                    | 12 ++++++++++++
>>  include/linux/sched.h                 | 12 ++++++++++++
>>  kernel/locking/mutex.c                | 15 +++++++++++++--
>>  kernel/locking/osq_lock.c             | 10 +++++++++-
>>  kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
>>  11 files changed, 105 insertions(+), 7 deletions(-)
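The change the series makes to those spin loops is small; here is a
minimal sketch modeled on mutex_spin_on_owner() as of 4.8 (the loop is
the existing upstream one, the vcpu_is_preempted() test is the only
addition, and minor details may differ from the actual patch):

	static noinline
	bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
	{
		bool ret = true;

		rcu_read_lock();
		while (lock->owner == owner) {
			barrier();
			/*
			 * Stop spinning once the owner is off the cpu, we
			 * need to reschedule, or the owner's vcpu has been
			 * preempted by the hypervisor -- spinning then just
			 * burns cycles.
			 */
			if (!owner->on_cpu || need_resched() ||
			    vcpu_is_preempted(task_cpu(owner))) {
				ret = false;
				break;
			}
			cpu_relax_lowlatency();
		}
		rcu_read_unlock();

		return ret;
	}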
Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags
On Wed 19-10-16 09:49:43, Dave Hansen wrote:
> On 10/19/2016 02:07 AM, Michal Hocko wrote:
>> On Wed 19-10-16 09:58:15, Lorenzo Stoakes wrote:
>>> On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
>>>> I am wondering whether we can go further. E.g. it is not really clear to
>>>> me whether we need an explicit FOLL_REMOTE when we can in fact check
>>>> mm != current->mm and imply that. Maybe there are some contexts which
>>>> wouldn't work, I haven't checked.
>>>
>>> This flag is set even when /proc/self/mem is used. I've not looked deeply
>>> into this flag but perhaps accessing your own memory this way can be
>>> considered 'remote' since you're not accessing it directly. On the other
>>> hand, perhaps this is just mistaken in this case?
>>
>> My understanding of the flag is quite limited as well. All I know it is
>> related to protection keys and it is needed to bypass protection check.
>> See arch_vma_access_permitted. See also 1b2ee1266ea6 ("mm/core: Do not
>> enforce PKEY permissions on remote mm access").
>
> Yeah, we need the flag to tell us when PKEYs should be applied or not.
> The current task's PKRU (pkey rights register) should really only be
> used to impact access to the task's memory, but has no bearing on how a
> given task should access remote memory.

The question I had earlier was whether this has to be an explicit FOLL
flag used by g-u-p users or we can just use it internally when mm !=
current->mm

-- 
Michal Hocko
SUSE Labs
Re: [PATCH v4 0/5] implement vcpu preempted check
On 2016/10/19 14:47, Christian Borntraeger wrote:
> On 10/19/2016 12:20 PM, Pan Xinhui wrote:
>> change from v3:
>> 	add x86 vcpu preempted check patch
>
> If you want you could add the s390 patch that I provided for your last
> version. I also gave my Acked-by for all previous patches.

hi, Christian
	Thanks a lot!
	I can include your new s390 patch into my next patchset (if v5 is
needed).

xinhui

>> change from v2:
>> 	no code change, fix typos, update some comments
>> change from v1:
>> 	a simpler definition of default vcpu_is_preempted
>> 	skip machine type check on ppc, and add config; remove dedicated macro.
>> 	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks to boqun and Peter's suggestions.
>>
>> This patch set aims to fix lock holder preemption issues.
>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some
>> spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_owner variants also caused rcu stalls before we applied
>> this patch set.
>>
>> We also have observed some performance improvements.
>>
>> PPC test result:
>> 1 copy  - 0.94%
>> 2 copy  - 7.17%
>> 4 copy  - 11.9%
>> 8 copy  - 3.04%
>> 16 copy - 15.11%
>>
>> details below:
>> Without patch:
>> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 samples)
>> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 samples)
>> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 samples)
>> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 1 samples)
>>
>> With patch:
>> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 samples)
>> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 samples)
>> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 samples)
>> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 1 samples)
>>
>> X86 test result:
>> test-case                               after-patch       before-patch
>> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
>> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
>> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
>> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
>> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
>> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
>> Process Creation                       |    29881.2 lps  |    28572.8 lps
>> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
>> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
>> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>>
>> Pan Xinhui (5):
>>   kernel/sched: introduce vcpu preempted check interface
>>   locking/osq: Drop the overload of osq_lock()
>>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>>   powerpc/spinlock: support vcpu preempted check
>>   x86, kvm: support vcpu preempted check
>>
>>  arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
>>  arch/x86/include/asm/paravirt_types.h |  6 ++++++
>>  arch/x86/include/asm/spinlock.h       |  8 ++++++++
>>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>>  arch/x86/kernel/kvm.c                 | 11 +++++++++++
>>  arch/x86/kernel/paravirt.c            | 11 +++++++++++
>>  arch/x86/kvm/x86.c                    | 12 ++++++++++++
>>  include/linux/sched.h                 | 12 ++++++++++++
>>  kernel/locking/mutex.c                | 15 +++++++++++++--
>>  kernel/locking/osq_lock.c             | 10 +++++++++-
>>  kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
>>  11 files changed, 105 insertions(+), 7 deletions(-)
Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags
On 10/19/2016 02:07 AM, Michal Hocko wrote:
> On Wed 19-10-16 09:58:15, Lorenzo Stoakes wrote:
>> On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
>>> I am wondering whether we can go further. E.g. it is not really clear to
>>> me whether we need an explicit FOLL_REMOTE when we can in fact check
>>> mm != current->mm and imply that. Maybe there are some contexts which
>>> wouldn't work, I haven't checked.
>>
>> This flag is set even when /proc/self/mem is used. I've not looked deeply
>> into this flag but perhaps accessing your own memory this way can be
>> considered 'remote' since you're not accessing it directly. On the other
>> hand, perhaps this is just mistaken in this case?
>
> My understanding of the flag is quite limited as well. All I know it is
> related to protection keys and it is needed to bypass protection check.
> See arch_vma_access_permitted. See also 1b2ee1266ea6 ("mm/core: Do not
> enforce PKEY permissions on remote mm access").

Yeah, we need the flag to tell us when PKEYs should be applied or not.
The current task's PKRU (pkey rights register) should really only be
used to impact access to the task's memory, but has no bearing on how a
given task should access remote memory.
[PATCH wq/for-4.10] workqueue: move wq_numa_init() to workqueue_init()
Hello, Michael.

I simplified the patch a bit and applied the following to for-4.10 and
updated for-next accordingly.  The coming next snapshot should be fine.

Thanks!

-- 8< --
From 2186d9f940b6a04f263a3bacd48f2a7ba96df4cf Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Wed, 19 Oct 2016 12:01:27 -0400
Subject: [PATCH wq/for-4.10] workqueue: move wq_numa_init() to workqueue_init()

While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early().  Unfortunately, on some
archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called, leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.

  Unable to handle kernel paging request for data at address 0x0038
  Faulting instruction address: 0xc00fc0cc
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
  task: c007f540 task.stack: c01ffc084000
  NIP: c00fc0cc LR: c00ed928 CTR: c00fbfd0
  REGS: c01ffc087780 TRAP: 0300  Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
  MSR: 92009033  CR: 48000424  XER:
  CFAR: c00089dc DAR: 0038 DSISR: 4000 SOFTE: 0
  GPR00: c00ed928 c01ffc087a00 c0e63200 c00010d6d600
  GPR04: c007f5409200 0021 0748e08c 001f
  GPR08: 0021 0748f1f8
  GPR12: 28000422 cfb8 c000e0c8
  GPR16: 0021 0001
  GPR20: afb50401 c00010d6d600 ba7e
  GPR24: ba7e c0d8bc58 afb504000afb5041 0001
  GPR28: 0004 c007f5409280
  NIP [c00fc0cc] enqueue_task_fair+0xfc/0x18b0
  LR [c00ed928] activate_task+0x78/0xe0
  Call Trace:
  [c01ffc087a00] [c007f5409200] 0xc007f5409200 (unreliable)
  [c01ffc087b10] [c00ed928] activate_task+0x78/0xe0
  [c01ffc087b50] [c00ede58] ttwu_do_activate+0x68/0xc0
  [c01ffc087b90] [c00ef1b8] try_to_wake_up+0x208/0x4f0
  [c01ffc087c10] [c00d3484] create_worker+0x144/0x250
  [c01ffc087cb0] [c0cd72d0] workqueue_init+0x124/0x150
  [c01ffc087d00] [c0cc0e74] kernel_init_freeable+0x158/0x360
  [c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160
  [c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc
  Instruction dump:
  62940401 3b80 3aa0 7f17c378 3a61 3b61 6000 6000
  6042 72490021 ebfe0150 2f890001 419e0de0 7fbee840 419e0e58
  ---[ end trace ]---

Fix it by moving wq_numa_init() to workqueue_init().  As this means
that the early initialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().

Signed-off-by: Tejun Heo
Reported-by: Michael Ellerman
Link: http://lkml.kernel.org/r/87twck5wqo@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo
---
 kernel/workqueue.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ad0cd43..c56479b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5495,8 +5495,6 @@ int __init workqueue_init_early(void)
 
 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-	wq_numa_init();
-
 	/* initialize CPU pools */
 	for_each_possible_cpu(cpu) {
 		struct worker_pool *pool;
@@ -5566,9 +5564,32 @@ int __init workqueue_init_early(void)
  */
 int __init workqueue_init(void)
 {
+	struct workqueue_struct *wq;
 	struct worker_pool *pool;
 	int cpu, bkt;
 
+	/*
+	 * It'd be simpler to initialize NUMA in workqueue_init_early() but
+	 * CPU to node mapping may not be available that early on some
+	 * archs such as power and arm64.  As per-cpu pools created
+	 * previously could be missing node hint and unbound pools NUMA
+	 * affinity, fix them up.
+	 */
+	wq_numa_init();
+
+	mutex_lock(&wq_pool_mutex);
+
+	for_each_possible_cpu(cpu) {
+		for_each_cpu_worker_pool(pool, cpu) {
+			pool->node = cpu_to_node(cpu);
+		}
+	}
+
+	list_for_each_entry(wq, &workqueues, list)
+		wq_update_unbound_numa(wq, smp_processor_id(), true);
+
+	mutex_unlock(&wq_pool_mutex);
+
 	/* create the initial
Re: [PATCH v4 0/5] implement vcpu preempted check
On 19/10/16 12:20, Pan Xinhui wrote:
> change from v3:
> 	add x86 vcpu preempted check patch
> change from v2:
> 	no code change, fix typos, update some comments
> change from v1:
> 	a simpler definition of default vcpu_is_preempted
> 	skip machine type check on ppc, and add config; remove dedicated macro.
> 	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
> 	add more comments
> 	thanks to boqun and Peter's suggestions.
>
> This patch set aims to fix lock holder preemption issues.
>
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
>
> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>
> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some
> spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused rcu stalls before we applied
> this patch set.
>
> We also have observed some performance improvements.
>
> PPC test result:
>
> 1 copy  - 0.94%
> 2 copy  - 7.17%
> 4 copy  - 11.9%
> 8 copy  - 3.04%
> 16 copy - 15.11%
>
> details below:
> Without patch:
>
> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 samples)
> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 samples)
> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 samples)
> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 1 samples)
>
> With patch:
>
> 1 copy  - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 samples)
> 2 copy  - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 samples)
> 4 copy  - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 samples)
> 8 copy  - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 1 samples)
>
> X86 test result:
> test-case                               after-patch       before-patch
> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
> Process Creation                       |    29881.2 lps  |    28572.8 lps
> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>
> Pan Xinhui (5):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, kvm: support vcpu preempted check

The attached patch adds Xen support for x86. Please tell me whether you
want to add this patch to your series or if I should post it when your
series has been accepted.

You can add my
Tested-by: Juergen Gross
for patches 1-3 and 5 (paravirt parts only).

Juergen

>  arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
>  arch/x86/include/asm/paravirt_types.h |  6 ++++++
>  arch/x86/include/asm/spinlock.h       |  8 ++++++++
>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>  arch/x86/kernel/kvm.c                 | 11 +++++++++++
>  arch/x86/kernel/paravirt.c            | 11 +++++++++++
>  arch/x86/kvm/x86.c                    | 12 ++++++++++++
>  include/linux/sched.h                 | 12 ++++++++++++
>  kernel/locking/mutex.c                | 15 +++++++++++++--
>  kernel/locking/osq_lock.c             | 10 +++++++++-
>  kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
>  11 files changed, 105 insertions(+), 7 deletions(-)

From c79b86d00a812d6207ef788d453e2d0289ef22a0 Mon Sep 17 00:00:00 2001
From: Juergen Gross
Date: Wed, 19 Oct 2016 15:30:59 +0200
Subject: [PATCH] x86, xen: support vcpu preempted check

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical
RE: [RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals
From: Balbir Singh
> Sent: 19 October 2016 15:00
...
> Here is an example
>
> -	*slot = retbuf[0];
> +	*slot = retvals.v[0];
>
> Could we hide retvals.v[0] under a macro like
>
> 	*slot = hcalls_ret_val(retvals, 0);

Ugg..

> Since we could end up with similar issues if
> someone dereferenced retvals.v[4]

The compiler will detect that these days.

	David
Re: [PATCH] net: fs_enet: Use net_device_stats from struct net_device
From: Tobias Klauser
Date: Wed, 19 Oct 2016 11:24:57 +0200

> Instead of using a private copy of struct net_device_stats in struct
> fs_enet_private, use stats from struct net_device. Also remove the now
> unnecessary .ndo_get_stats function.
>
> Signed-off-by: Tobias Klauser

Applied to net-next, thanks.
Re: [RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals
On 19/10/16 22:47, Michael Ellerman wrote:
> Balbir Singh writes:
>
>> On 18/10/16 19:40, Michael Ellerman wrote:
>>> We have now had two nasty stack corruption bugs caused by incorrect
>>> sizing of the return buffer for plpar_hcall()/plpar_hcall9().
>>>
>>> To avoid any more such bugs, define a type which encodes the size of the
>>> return buffer, and change the argument of plpar_hcall() to be of that
>>> type, meaning the compiler will check for us that we passed the right
>>> size buffer.
>>>
>>> There isn't an easy way to do this incrementally, without introducing a
>>> new function name, eg. plpar_hcall_with_struct(), which is ugly as hell.
>>> So just do it in one tree-wide change.
>>
>> Conceptually looks good, but I think we need to abstract the return values
>> as well. I'll test and see if I can send you something on top of this
>
> Not sure I know what you mean.

Here is an example:

-	*slot = retbuf[0];
+	*slot = retvals.v[0];

Could we hide retvals.v[0] under a macro like

	*slot = hcalls_ret_val(retvals, 0);

since we could end up with similar issues if someone dereferenced
retvals.v[4]? Since we are abstracting under retvals, I was wondering if
we want to further abstract the return values as well and make retvals
opaque to the user.

Balbir Singh.
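A compile-time-checked version of that accessor could look like this (a
sketch of the idea only; the struct name here is hypothetical and stands
in for the buffer type introduced in patch 2/3):

	#include <linux/bug.h>
	#include <linux/kernel.h>

	/* Hypothetical stand-in for the type from patch 2/3. */
	struct plpar_hcall_rets {
		unsigned long v[4];
	};

	/*
	 * Accessor that turns retvals.v[4]-style overruns into a build
	 * error: BUILD_BUG_ON_ZERO() fails to compile when the (constant)
	 * index is out of range, and evaluates to 0 otherwise.
	 */
	#define hcalls_ret_val(r, i) \
		((r).v[(i) + BUILD_BUG_ON_ZERO((i) >= ARRAY_SIZE((r).v))])

This keeps call sites as cheap as a plain array access while making the
compiler enforce the bound.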
RE: [PATCH] powerpc: link error on orphan sections
From: Michael Ellerman [mailto:m...@ellerman.id.au]
>> In the past I've caused orphan sections to error by assigning them
>> to the same address as something that exists.
>> Works with all linkers, even if the error message isn't as useful.
>
> How do you assign them an address without knowing what they are? With a
> wild card rule?

Right at the end of the ldscript I have:

	/* Anything from any unexpected section ends up here and
	 * generates an error because this overlaps another section */
	unwanted _gp : { *(*) }
}

	David
Re: [PATCH v2] powerpc/hash64: Be more careful when generating tlbiel
Michael Ellerman writes:
> From: Balbir Singh
...
> Although there may be very old toolchains which don't understand tlbiel,
> we have other code in the tree which has been using tlbiel for over five
> years, and no one has reported any build failures, so just let the
> assembler generate the instructions.

Turns out this part is not true, it depends on the -mcpu you pass. So
we'll have to go back to using the macros for generating the two
argument form.

v3 to come.

cheers
RE: [PATCH] powerpc: link error on orphan sections
David Laight writes:
> From: Michael Ellerman
>> Sent: 14 October 2016 01:46
> ...
>>> +LDFLAGS_vmlinux	:= $(LDFLAGS_vmlinux-y) --orphan-handling=error
>>
>> At least some old(er) toolchains don't support that:
>>
>>   /opt/cross/kisskb/gcc-4.6.3-nolibc/powerpc-linux/bin/powerpc-linux-ld: unrecognized option '--orphan-handling=error'
>>
>> So we'll need an ld-option wrapper around it.
>
> In the past I've caused orphan sections to error by assigning them
> to the same address as something that exists.
> Works with all linkers, even if the error message isn't as useful.

How do you assign them an address without knowing what they are? With a
wild card rule?

Anyway I think I'm happy for this to only work on newer toolchains.

cheers
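The wrapper already exists in scripts/Kbuild.include, so guarding the
flag is a one-liner along these lines (a sketch; the exact spot in the
powerpc Makefile may differ):

	# Only pass --orphan-handling=error when $(LD) accepts it;
	# ld-option expands to the flag iff the linker understands it,
	# so old toolchains simply lose the orphan check.
	LDFLAGS_vmlinux += $(call ld-option, --orphan-handling=error)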
Re: [Patch v5 06/12] powerpc/virtex: Use generic xilinx irqchip driver
Zubair Lutfullah Kakakhel writes:
> The Xilinx interrupt controller driver is now available in drivers/irqchip.
> Switch to using that driver.
>
> Signed-off-by: Zubair Lutfullah Kakakhel
>
> ---
> V5 New patch
>
> Tested on virtex440-ml507 using qemu

I don't have one of these to test on, and the patch looks sane, so that's
good enough for me.

Acked-by: Michael Ellerman (powerpc)

cheers
Re: [RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals
Balbir Singh writes:
> On 18/10/16 19:40, Michael Ellerman wrote:
>> We have now had two nasty stack corruption bugs caused by incorrect
>> sizing of the return buffer for plpar_hcall()/plpar_hcall9().
>>
>> To avoid any more such bugs, define a type which encodes the size of the
>> return buffer, and change the argument of plpar_hcall() to be of that
>> type, meaning the compiler will check for us that we passed the right
>> size buffer.
>>
>> There isn't an easy way to do this incrementally, without introducing a
>> new function name, eg. plpar_hcall_with_struct(), which is ugly as hell.
>> So just do it in one tree-wide change.
>
> Conceptually looks good, but I think we need to abstract the return values
> as well. I'll test and see if I can send you something on top of this.

Not sure I know what you mean.

cheers
Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)
Tejun Heo writes:
> Hello, Michael.
>
> On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote:
>> That doesn't compile, wq doesn't exist.
>>
>> I guessed that you meant:
>>
>> +	wq_numa_init();
>> +	list_for_each_entry(wq, &workqueues, list)
>> +		wq_update_unbound_numa(wq, smp_processor_id(), true);
>
> Yeap, sorry about that.

No worries.

>> And that does boot.
>>
>> The sysrq-t output is below, it's rather large.
>
> Didn't expect that many cpus but it looks good.

I have another system here with twice as many CPUs if that would help ;)

> I'll post proper patches soon.

Thanks. Can you pull the current series out of linux-next for now until
you merge the new series?

cheers
Re: [PATCH 10/10] mm: replace access_process_vm() write parameter with gup_flags
Lorenzo Stoakes writes:
> diff --git a/arch/powerpc/kernel/ptrace32.c b/arch/powerpc/kernel/ptrace32.c
> index f52b7db3..010b7b3 100644
> --- a/arch/powerpc/kernel/ptrace32.c
> +++ b/arch/powerpc/kernel/ptrace32.c
> @@ -74,7 +74,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
> 			break;
>
> 		copied = access_process_vm(child, (u64)addrOthers, &tmp,
> -				sizeof(tmp), 0);
> +				sizeof(tmp), FOLL_FORCE);
> 		if (copied != sizeof(tmp))
> 			break;
> 		ret = put_user(tmp, (u32 __user *)data);

LGTM.

> @@ -179,7 +179,8 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
> 			break;
> 		ret = 0;
> 		if (access_process_vm(child, (u64)addrOthers, &tmp,
> -				sizeof(tmp), 1) == sizeof(tmp))
> +				sizeof(tmp),
> +				FOLL_FORCE | FOLL_WRITE) == sizeof(tmp))
> 			break;
> 		ret = -EIO;
> 		break;

If you're respinning this anyway, can you format that as:

	if (access_process_vm(child, (u64)addrOthers, &tmp, sizeof(tmp),
			      FOLL_FORCE | FOLL_WRITE) == sizeof(tmp))
		break;

I realise you probably deliberately didn't do that to make the diff
clearer. Either way:

Acked-by: Michael Ellerman (powerpc)

cheers
Re: [PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set
Nicholas Piggin writes:
> Enable thin archives build for powerpc when COMPILE_TEST is set.
> Thin archives are explained in this commit:
>
> a5967db9af51a84f5e181600954714a9e4c69f1f
>
> This is a gradual way to introduce the option to testers.

I think I'd rather this was actually a user-selectable option, eg.

	config USE_THIN_ARCHIVES
		bool "Build the kernel using thin archives"
		default n
		select THIN_ARCHIVES
		help
		  Build the kernel using thin archives.
		  If you're unsure say N.

cheers
Re: [PATCH 3/7] powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors
Nicholas Piggin writes:
> Direct branches from code below __end_interrupts to code above
> __end_interrupts when built with CONFIG_RELOCATABLE are disallowed
> because they will break when the kernel is not located at 0.
>
> Sample output:
>
> WARNING: Unrelocated relative branches
> c118 bl-> 0xc0038fb8
> c13c b-> 0xc01068a4
> c148 b-> 0xc003919c
> c14c b-> 0xc003923c
> c5a4 b-> 0xc0106ffc
> c0001af0 b-> 0xc0106ffc
> c0001b24 b-> 0xc0106ffc
> c0001b58 b-> 0xc0106ffc
>
> Signed-off-by: Nicholas Piggin
> ---
>  arch/powerpc/Makefile.postlink           |  9 -
>  arch/powerpc/tools/unrel_branch_check.sh | 56

arch/powerpc/scripts is meant for this kind of script (there's only one
there ATM, and yes we should move others in there).

cheers
[PATCH] net: fs_enet: Use net_device_stats from struct net_device
Instead of using a private copy of struct net_device_stats in struct
fs_enet_private, use stats from struct net_device. Also remove the now
unnecessary .ndo_get_stats function.

Signed-off-by: Tobias Klauser
---
 .../net/ethernet/freescale/fs_enet/fs_enet-main.c | 43 +-
 drivers/net/ethernet/freescale/fs_enet/fs_enet.h  |  1 -
 2 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
index 925d1bca6a26..44f50e168703 100644
--- a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
+++ b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
@@ -118,22 +118,22 @@ static int fs_enet_napi(struct napi_struct *napi, int budget)
 			  BD_ENET_TX_RL | BD_ENET_TX_UN | BD_ENET_TX_CSL)) {
 			if (sc & BD_ENET_TX_HB)	/* No heartbeat */
-				fep->stats.tx_heartbeat_errors++;
+				dev->stats.tx_heartbeat_errors++;
 			if (sc & BD_ENET_TX_LC)	/* Late collision */
-				fep->stats.tx_window_errors++;
+				dev->stats.tx_window_errors++;
 			if (sc & BD_ENET_TX_RL)	/* Retrans limit */
-				fep->stats.tx_aborted_errors++;
+				dev->stats.tx_aborted_errors++;
 			if (sc & BD_ENET_TX_UN)	/* Underrun */
-				fep->stats.tx_fifo_errors++;
+				dev->stats.tx_fifo_errors++;
 			if (sc & BD_ENET_TX_CSL)	/* Carrier lost */
-				fep->stats.tx_carrier_errors++;
+				dev->stats.tx_carrier_errors++;

 			if (sc & (BD_ENET_TX_LC | BD_ENET_TX_RL | BD_ENET_TX_UN)) {
-				fep->stats.tx_errors++;
+				dev->stats.tx_errors++;
 				do_restart = 1;
 			}
 		} else
-			fep->stats.tx_packets++;
+			dev->stats.tx_packets++;

 		if (sc & BD_ENET_TX_READY) {
 			dev_warn(fep->dev,
@@ -145,7 +145,7 @@ static int fs_enet_napi(struct napi_struct *napi, int budget)
 		 * but we eventually sent the packet OK.
 		 */
 		if (sc & BD_ENET_TX_DEF)
-			fep->stats.collisions++;
+			dev->stats.collisions++;

 		/* unmap */
 		if (fep->mapped_as_page[dirtyidx])
@@ -212,19 +212,19 @@ static int fs_enet_napi(struct napi_struct *napi, int budget)
 		 */
 		if (sc & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_CL |
 			  BD_ENET_RX_NO | BD_ENET_RX_CR | BD_ENET_RX_OV)) {
-			fep->stats.rx_errors++;
+			dev->stats.rx_errors++;
 			/* Frame too long or too short. */
 			if (sc & (BD_ENET_RX_LG | BD_ENET_RX_SH))
-				fep->stats.rx_length_errors++;
+				dev->stats.rx_length_errors++;
 			/* Frame alignment */
 			if (sc & (BD_ENET_RX_NO | BD_ENET_RX_CL))
-				fep->stats.rx_frame_errors++;
+				dev->stats.rx_frame_errors++;
 			/* CRC Error */
 			if (sc & BD_ENET_RX_CR)
-				fep->stats.rx_crc_errors++;
+				dev->stats.rx_crc_errors++;
 			/* FIFO overrun */
 			if (sc & BD_ENET_RX_OV)
-				fep->stats.rx_crc_errors++;
+				dev->stats.rx_crc_errors++;

 			skbn = fep->rx_skbuff[curidx];
 		} else {
@@ -233,9 +233,9 @@ static int fs_enet_napi(struct napi_struct *napi, int budget)
 			/*
 			 * Process the incoming frame.
 			 */
-			fep->stats.rx_packets++;
+			dev->stats.rx_packets++;
 			pkt_len = CBDR_DATLEN(bdp) - 4;	/* remove CRC */
-			fep->stats.rx_bytes += pkt_len + 4;
+			dev->stats.rx_bytes += pkt_len + 4;

 			if (pkt_len <= fpi->rx_copybreak) {
 				/* +2 to make IP header L1 cache aligned */
@@ -277,7 +277,7 @@ static int fs_enet_napi(struct napi_struct *napi, int budget)
 				received++;
 				netif_receive_skb(skb);
 			} else {
-				fep->stats.rx_dropped++;
+
Re: [PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set
On Wed, 19 Oct 2016 16:57:23 +1100 Stephen Rothwell wrote:
> Hi Nick,
>
> On Wed, 19 Oct 2016 14:15:59 +1100 Nicholas Piggin wrote:
>>
>> Enable thin archives build for powerpc when COMPILE_TEST is set.
>> Thin archives are explained in this commit:
>>
>> a5967db9af51a84f5e181600954714a9e4c69f1f
>
> This reference should be like this:
>
> a5967db9af51 ("kbuild: allow architectures to use thin archives instead of
> ld -r")

Hi Stephen,

Good point, I'll change it.

Thanks,
Nick
Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags
On Wed 19-10-16 10:06:46, Lorenzo Stoakes wrote:
> On Wed, Oct 19, 2016 at 10:52:05AM +0200, Michal Hocko wrote:
> > yes this is the desirable and expected behavior.
> >
> > > wonder if this is desirable behaviour or whether this ought to be
> > > limited to ptrace system calls. Regardless, by making the flag more
> > > visible it makes it easier to see that this is happening.
> >
> > mem_open already enforces PTRACE_MODE_ATTACH
>
> Ah I missed this, that makes a lot of sense, thanks!
>
> I still wonder whether other invocations of access_remote_vm() in
> fs/proc/base.c (the principal caller of this function) need FOLL_FORCE,
> for example the various calls that simply read data from other processes,
> so I think the point stands about keeping this explicit.

I do agree. Making them explicit will help to clean them up later, should
there be a need.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags
On Wed 19-10-16 09:58:15, Lorenzo Stoakes wrote:
> On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
> > I am wondering whether we can go further. E.g. it is not really clear to
> > me whether we need an explicit FOLL_REMOTE when we can in fact check
> > mm != current->mm and imply that. Maybe there are some contexts which
> > wouldn't work, I haven't checked.
>
> This flag is set even when /proc/self/mem is used. I've not looked deeply
> into this flag but perhaps accessing your own memory this way can be
> considered 'remote' since you're not accessing it directly. On the other
> hand, perhaps this is just mistaken in this case?

My understanding of the flag is quite limited as well. All I know is that
it is related to protection keys and that it is needed to bypass the
protection check. See arch_vma_access_permitted. See also 1b2ee1266ea6
("mm/core: Do not enforce PKEY permissions on remote mm access").

-- 
Michal Hocko
SUSE Labs
Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags
On Wed, Oct 19, 2016 at 10:52:05AM +0200, Michal Hocko wrote:
> yes this is the desirable and expected behavior.
>
> > wonder if this is desirable behaviour or whether this ought to be limited
> > to ptrace system calls. Regardless, by making the flag more visible it
> > makes it easier to see that this is happening.
>
> mem_open already enforces PTRACE_MODE_ATTACH

Ah I missed this, that makes a lot of sense, thanks!

I still wonder whether other invocations of access_remote_vm() in
fs/proc/base.c (the principal caller of this function) need FOLL_FORCE,
for example the various calls that simply read data from other processes,
so I think the point stands about keeping this explicit.
Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags
On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
> I am wondering whether we can go further. E.g. it is not really clear to
> me whether we need an explicit FOLL_REMOTE when we can in fact check
> mm != current->mm and imply that. Maybe there are some contexts which
> wouldn't work, I haven't checked.

This flag is set even when /proc/self/mem is used. I've not looked deeply
into this flag but perhaps accessing your own memory this way can be
considered 'remote' since you're not accessing it directly. On the other
hand, perhaps this is just mistaken in this case?

> I guess there is more work in that area and I do not want to impose all
> that work on you, but I couldn't resist once I saw you playing in that
> area ;)

Definitely a good start! Thanks, I am more than happy to go as far down
this rabbit hole as is helpful, no imposition at all :)
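A minimal sketch of the simplification Michal floats here, assuming a helper
at a point in gup where both the target mm and the flags word are in scope;
the helper name is hypothetical and the series as posted does not do this:

	#include <linux/mm.h>
	#include <linux/sched.h>

	/* Infer "remote" rather than requiring every caller to pass
	 * FOLL_REMOTE explicitly: an access is remote whenever the target
	 * mm is not our own. */
	static unsigned int gup_infer_remote(struct mm_struct *mm,
					     unsigned int gup_flags)
	{
		if (mm != current->mm)
			gup_flags |= FOLL_REMOTE;
		return gup_flags;
	}

As the /proc/self/mem case above shows, the subtlety is that the two notions
do not quite coincide today, which is presumably why the flag stays explicit.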
Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags
On Wed 19-10-16 09:40:45, Lorenzo Stoakes wrote:
> On Wed, Oct 19, 2016 at 10:13:52AM +0200, Michal Hocko wrote:
> > On Wed 19-10-16 09:59:03, Jan Kara wrote:
> > > On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> > > > This patch removes the write parameter from __access_remote_vm() and
> > > > replaces it with a gup_flags parameter as use of this function
> > > > previously _implied_ FOLL_FORCE, whereas after this patch callers
> > > > explicitly pass this flag.
> > > >
> > > > We make this explicit as use of FOLL_FORCE can result in surprising
> > > > behaviour (and hence bugs) within the mm subsystem.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes
> > >
> > > So I'm not convinced this (and the following two patches) is actually
> > > helping much. By grepping for FOLL_FORCE we will easily see that any
> > > caller of access_remote_vm() gets that semantics and can thus continue
> > > search
> >
> > I am really wondering. Is there anything inherent that would require
> > FOLL_FORCE for access_remote_vm? I mean FOLL_FORCE is a really
> > non-trivial thing. It doesn't obey vma permissions so we should really
> > minimize its usage. Do all of those users really need FOLL_FORCE?
>
> I wonder about this also, for example by accessing /proc/self/mem you
> trigger access_remote_vm() and consequently get_user_pages_remote()
> meaning FOLL_FORCE is implied and you can use /proc/self/mem to override
> any VMA permissions. I

yes this is the desirable and expected behavior.

> wonder if this is desirable behaviour or whether this ought to be limited
> to ptrace system calls. Regardless, by making the flag more visible it
> makes it easier to see that this is happening.

mem_open already enforces PTRACE_MODE_ATTACH

-- 
Michal Hocko
SUSE Labs
Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags
On Wed, Oct 19, 2016 at 10:13:52AM +0200, Michal Hocko wrote:
> On Wed 19-10-16 09:59:03, Jan Kara wrote:
> > On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> > > This patch removes the write parameter from __access_remote_vm() and
> > > replaces it with a gup_flags parameter as use of this function
> > > previously _implied_ FOLL_FORCE, whereas after this patch callers
> > > explicitly pass this flag.
> > >
> > > We make this explicit as use of FOLL_FORCE can result in surprising
> > > behaviour (and hence bugs) within the mm subsystem.
> > >
> > > Signed-off-by: Lorenzo Stoakes
> >
> > So I'm not convinced this (and the following two patches) is actually
> > helping much. By grepping for FOLL_FORCE we will easily see that any
> > caller of access_remote_vm() gets that semantics and can thus continue
> > search
>
> I am really wondering. Is there anything inherent that would require
> FOLL_FORCE for access_remote_vm? I mean FOLL_FORCE is a really
> non-trivial thing. It doesn't obey vma permissions so we should really
> minimize its usage. Do all of those users really need FOLL_FORCE?

I wonder about this also, for example by accessing /proc/self/mem you
trigger access_remote_vm() and consequently get_user_pages_remote()
meaning FOLL_FORCE is implied and you can use /proc/self/mem to override
any VMA permissions. I wonder if this is desirable behaviour or whether
this ought to be limited to ptrace system calls. Regardless, by making
the flag more visible it makes it easier to see that this is happening.

Note that this /proc/self/mem behaviour is what triggered the bug that
brought about this discussion in the first place -
https://marc.info/?l=linux-mm&m=147363447105059 - as using /proc/self/mem
this way on PROT_NONE memory broke some assumptions.

On the other hand I see your point Jan in that you know any caller of
these functions will have FOLL_FORCE implied, and you have to decide to
stop passing the flag at _some_ point in the stack, however it seems
fairly sane to have that stopping point exist outside of exported gup
functions so the gup interface forces explicitness but callers can wrap
things as they need.
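To make the end state of the series concrete, here is the shape of the
conversion in caller terms, modelled on the get_user_pages() hunks quoted
later in this thread; the wrapper function and variables are illustrative
only:

	#include <linux/mm.h>

	static long pin_one_page_for_write(unsigned long addr,
					   struct page **pages)
	{
		/* Old interface: bare write/force ints, with FOLL_FORCE
		 * assembled implicitly deeper in the gup machinery:
		 *
		 *	return get_user_pages(addr, 1, 1, 0, pages, NULL);
		 */

		/* New interface: the caller states exactly which semantics
		 * it wants, and a grep for FOLL_FORCE now finds every
		 * forced access in the tree. */
		return get_user_pages(addr, 1, FOLL_WRITE, pages, NULL);
	}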
Re: [v12, 0/8] Fix eSDHC host version register bug
On Wed, Oct 19, 2016 at 02:47:07AM, Y.B. Lu wrote:
> + Greg
>
> Hi Greg,
>
> I submitted this patchset for a MMC bug fix, and it introduces the below
> patch which needs your ACK.
>
> > Arnd Bergmann (1):
> >   base: soc: introduce soc_device_match() interface
>
> https://patchwork.kernel.org/patch/9342913/
>
> Could you help to review it and give some comments or ACK.
> Thank you very much.

Now acked.
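For context, a minimal sketch of how a driver can consume the
soc_device_match() interface being acked here; the quirk table contents and
the function name are made-up placeholders, not taken from the eSDHC series:

	#include <linux/sys_soc.h>

	/* Hypothetical quirk table, matched against the strings the
	 * platform code registers via soc_device_register(). */
	static const struct soc_device_attribute esdhc_quirk_socs[] = {
		{ .soc_id = "T4240", .revision = "1.0" },
		{ /* sentinel */ }
	};

	static bool esdhc_needs_version_quirk(void)
	{
		/* soc_device_match() returns the matching entry or NULL,
		 * so it can also serve as a lookup for per-SoC data. */
		return soc_device_match(esdhc_quirk_socs) != NULL;
	}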
Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags
On Wed 19-10-16 09:59:03, Jan Kara wrote:
> On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> > This patch removes the write parameter from __access_remote_vm() and
> > replaces it with a gup_flags parameter as use of this function
> > previously _implied_ FOLL_FORCE, whereas after this patch callers
> > explicitly pass this flag.
> >
> > We make this explicit as use of FOLL_FORCE can result in surprising
> > behaviour (and hence bugs) within the mm subsystem.
> >
> > Signed-off-by: Lorenzo Stoakes
>
> So I'm not convinced this (and the following two patches) is actually
> helping much. By grepping for FOLL_FORCE we will easily see that any
> caller of access_remote_vm() gets that semantics and can thus continue
> search

I am really wondering. Is there anything inherent that would require
FOLL_FORCE for access_remote_vm? I mean FOLL_FORCE is a really
non-trivial thing. It doesn't obey vma permissions so we should really
minimize its usage. Do all of those users really need FOLL_FORCE?

Anyway I would rather see the flag explicit and used at more places than
hidden behind a helper function.

-- 
Michal Hocko
SUSE Labs
Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags
On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> This patch removes the write parameter from __access_remote_vm() and
> replaces it with a gup_flags parameter as use of this function previously
> _implied_ FOLL_FORCE, whereas after this patch callers explicitly pass
> this flag.
>
> We make this explicit as use of FOLL_FORCE can result in surprising
> behaviour (and hence bugs) within the mm subsystem.
>
> Signed-off-by: Lorenzo Stoakes

So I'm not convinced this (and the following two patches) is actually
helping much. By grepping for FOLL_FORCE we will easily see that any
caller of access_remote_vm() gets that semantics and can thus continue
the search accordingly (it is much simpler than searching for all
get_user_pages() users and extracting from parameter lists what they
actually pass as the 'force' argument). Sure, it makes it somewhat more
visible to callers of access_remote_vm() that they get FOLL_FORCE
semantics, but OTOH it also opens a space for issues where a caller of
access_remote_vm() actually wants FOLL_FORCE (and currently all of them
want it) and just mistakenly does not set it. All in all I'd prefer to
keep access_remote_vm() and friends as is...

								Honza

> ---
>  mm/memory.c | 23 +++
>  mm/nommu.c  |  9 ++---
>  2 files changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 20a9adb..79ebed3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3869,14 +3869,11 @@ EXPORT_SYMBOL_GPL(generic_access_phys);
>   * given task for page fault accounting.
>   */
>  static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> -		unsigned long addr, void *buf, int len, int write)
> +		unsigned long addr, void *buf, int len, unsigned int gup_flags)
>  {
>  	struct vm_area_struct *vma;
>  	void *old_buf = buf;
> -	unsigned int flags = FOLL_FORCE;
> -
> -	if (write)
> -		flags |= FOLL_WRITE;
> +	int write = gup_flags & FOLL_WRITE;
>
>  	down_read(&mm->mmap_sem);
>  	/* ignore errors, just check how much was successfully transferred */
> @@ -3886,7 +3883,7 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
>  		struct page *page = NULL;
>
>  		ret = get_user_pages_remote(tsk, mm, addr, 1,
> -				flags, &page, &vma);
> +				gup_flags, &page, &vma);
>  		if (ret <= 0) {
>  #ifndef CONFIG_HAVE_IOREMAP_PROT
>  			break;
> @@ -3945,7 +3942,12 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
>  int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  		void *buf, int len, int write)
>  {
> -	return __access_remote_vm(NULL, mm, addr, buf, len, write);
> +	unsigned int flags = FOLL_FORCE;
> +
> +	if (write)
> +		flags |= FOLL_WRITE;
> +
> +	return __access_remote_vm(NULL, mm, addr, buf, len, flags);
>  }
>
>  /*
> @@ -3958,12 +3960,17 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
>  {
>  	struct mm_struct *mm;
>  	int ret;
> +	unsigned int flags = FOLL_FORCE;
>
>  	mm = get_task_mm(tsk);
>  	if (!mm)
>  		return 0;
>
> -	ret = __access_remote_vm(tsk, mm, addr, buf, len, write);
> +	if (write)
> +		flags |= FOLL_WRITE;
> +
> +	ret = __access_remote_vm(tsk, mm, addr, buf, len, flags);
> +
>  	mmput(mm);
>
>  	return ret;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 70cb844..bde7df3 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1809,9 +1809,10 @@ void filemap_map_pages(struct fault_env *fe,
>  EXPORT_SYMBOL(filemap_map_pages);
>
>  static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> -		unsigned long addr, void *buf, int len, int write)
> +		unsigned long addr, void *buf, int len, unsigned int gup_flags)
>  {
>  	struct vm_area_struct *vma;
> +	int write = gup_flags & FOLL_WRITE;
>
>  	down_read(&mm->mmap_sem);
>
> @@ -1853,7 +1854,8 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
>  int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  		void *buf, int len, int write)
>  {
> -	return __access_remote_vm(NULL, mm, addr, buf, len, write);
> +	return __access_remote_vm(NULL, mm, addr, buf, len,
> +			write ? FOLL_WRITE : 0);
>  }
>
>  /*
> @@ -1871,7 +1873,8 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
>  	if (!mm)
>  		return 0;
>
> -	len = __access_remote_vm(tsk, mm, addr, buf, len, write);
> +	len = __access_remote_vm(tsk, mm, addr, buf, len,
> +			write ? FOLL_WRITE : 0);
>
>  	mmput(mm);
>  	return len;
> --
> 2.10.0

-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH 07/10] mm: replace get_user_pages_remote() write/force parameters with gup_flags
On Thu 13-10-16 01:20:17, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from
> get_user_pages_remote() and replaces them with a gup_flags parameter to
> make the use of FOLL_FORCE explicit in callers as use of this flag can
> result in surprising behaviour (and hence bugs) within the mm subsystem.
>
> Signed-off-by: Lorenzo Stoakes

Looks good. You can add:

Reviewed-by: Jan Kara

								Honza

> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gem.c   |  7 +--
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  6 +-
>  drivers/infiniband/core/umem_odp.c      |  7 +--
>  fs/exec.c                               |  9 +++--
>  include/linux/mm.h                      |  2 +-
>  kernel/events/uprobes.c                 |  6 --
>  mm/gup.c                                | 22 +++---
>  mm/memory.c                             |  6 +-
>  security/tomoyo/domain.c                |  2 +-
>  9 files changed, 40 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> index 5ce3603..0370b84 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> @@ -748,19 +748,22 @@ static struct page **etnaviv_gem_userptr_do_get_pages(
>  	int ret = 0, pinned, npages = etnaviv_obj->base.size >> PAGE_SHIFT;
>  	struct page **pvec;
>  	uintptr_t ptr;
> +	unsigned int flags = 0;
>
>  	pvec = drm_malloc_ab(npages, sizeof(struct page *));
>  	if (!pvec)
>  		return ERR_PTR(-ENOMEM);
>
> +	if (!etnaviv_obj->userptr.ro)
> +		flags |= FOLL_WRITE;
> +
>  	pinned = 0;
>  	ptr = etnaviv_obj->userptr.ptr;
>
>  	down_read(&mm->mmap_sem);
>  	while (pinned < npages) {
>  		ret = get_user_pages_remote(task, mm, ptr, npages - pinned,
> -					    !etnaviv_obj->userptr.ro, 0,
> -					    pvec + pinned, NULL);
> +					    flags, pvec + pinned, NULL);
>  		if (ret < 0)
>  			break;
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index e537930..c6f780f 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -508,6 +508,10 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
>  	pvec = drm_malloc_gfp(npages, sizeof(struct page *), GFP_TEMPORARY);
>  	if (pvec != NULL) {
>  		struct mm_struct *mm = obj->userptr.mm->mm;
> +		unsigned int flags = 0;
> +
> +		if (!obj->userptr.read_only)
> +			flags |= FOLL_WRITE;
>
>  		ret = -EFAULT;
>  		if (atomic_inc_not_zero(&mm->mm_users)) {
> @@ -517,7 +521,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
>  				(work->task, mm,
>  				 obj->userptr.ptr + pinned * PAGE_SIZE,
>  				 npages - pinned,
> -				 !obj->userptr.read_only, 0,
> +				 flags,
>  				 pvec + pinned, NULL);
>  			if (ret < 0)
>  				break;
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 75077a0..1f0fe32 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -527,6 +527,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
>  	u64 off;
>  	int j, k, ret = 0, start_idx, npages = 0;
>  	u64 base_virt_addr;
> +	unsigned int flags = 0;
>
>  	if (access_mask == 0)
>  		return -EINVAL;
> @@ -556,6 +557,9 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
>  		goto out_put_task;
>  	}
>
> +	if (access_mask & ODP_WRITE_ALLOWED_BIT)
> +		flags |= FOLL_WRITE;
> +
>  	start_idx = (user_virt - ib_umem_start(umem)) >> PAGE_SHIFT;
>  	k = start_idx;
>
> @@ -574,8 +578,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
>  		 */
>  		npages = get_user_pages_remote(owning_process, owning_mm,
>  				user_virt, gup_num_pages,
> -				access_mask & ODP_WRITE_ALLOWED_BIT,
> -				0, local_page_list, NULL);
> +				flags, local_page_list, NULL);
>  		up_read(&owning_mm->mmap_sem);
>
>  		if (npages < 0)
> diff --git a/fs/exec.c b/fs/exec.c
> index 6fcfb3f..4e497b9 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -191,6 +191,7 @@ static
Re: [PATCH 06/10] mm: replace get_user_pages() write/force parameters with gup_flags
On Thu 13-10-16 01:20:16, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages()
> and replaces them with a gup_flags parameter to make the use of
> FOLL_FORCE explicit in callers as use of this flag can result in
> surprising behaviour (and hence bugs) within the mm subsystem.
>
> Signed-off-by: Lorenzo Stoakes

The patch looks good. You can add:

Reviewed-by: Jan Kara

								Honza

> ---
>  arch/cris/arch-v32/drivers/cryptocop.c                 |  4 +---
>  arch/ia64/kernel/err_inject.c                          |  2 +-
>  arch/x86/mm/mpx.c                                      |  5 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c                |  7 +--
>  drivers/gpu/drm/radeon/radeon_ttm.c                    |  3 ++-
>  drivers/gpu/drm/via/via_dmablit.c                      |  4 ++--
>  drivers/infiniband/core/umem.c                         |  6 +-
>  drivers/infiniband/hw/mthca/mthca_memfree.c            |  2 +-
>  drivers/infiniband/hw/qib/qib_user_pages.c             |  3 ++-
>  drivers/infiniband/hw/usnic/usnic_uiom.c               |  5 -
>  drivers/media/v4l2-core/videobuf-dma-sg.c              |  7 +--
>  drivers/misc/mic/scif/scif_rma.c                       |  3 +--
>  drivers/misc/sgi-gru/grufault.c                        |  2 +-
>  drivers/platform/goldfish/goldfish_pipe.c              |  3 ++-
>  drivers/rapidio/devices/rio_mport_cdev.c               |  3 ++-
>  .../vc04_services/interface/vchiq_arm/vchiq_2835_arm.c |  3 +--
>  .../vc04_services/interface/vchiq_arm/vchiq_arm.c      |  3 +--
>  drivers/virt/fsl_hypervisor.c                          |  4 ++--
>  include/linux/mm.h                                     |  2 +-
>  mm/gup.c                                               | 12 +++-
>  mm/mempolicy.c                                         |  2 +-
>  mm/nommu.c                                             | 18 --
>  22 files changed, 49 insertions(+), 54 deletions(-)
>
> diff --git a/arch/cris/arch-v32/drivers/cryptocop.c b/arch/cris/arch-v32/drivers/cryptocop.c
> index b5698c8..099e170 100644
> --- a/arch/cris/arch-v32/drivers/cryptocop.c
> +++ b/arch/cris/arch-v32/drivers/cryptocop.c
> @@ -2722,7 +2722,6 @@ static int cryptocop_ioctl_process(struct inode *inode, struct file *filp, unsig
>  	err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
>  			     noinpages,
>  			     0,  /* read access only for in data */
> -			     0, /* no force */
>  			     inpages,
>  			     NULL);
>
> @@ -2736,8 +2735,7 @@ static int cryptocop_ioctl_process(struct inode *inode, struct file *filp, unsig
>  	if (oper.do_cipher){
>  		err = get_user_pages((unsigned long int)oper.cipher_outdata,
>  				     nooutpages,
> -				     1, /* write access for out data */
> -				     0, /* no force */
> +				     FOLL_WRITE, /* write access for out data */
>  				     outpages,
>  				     NULL);
>  		up_read(&current->mm->mmap_sem);
> diff --git a/arch/ia64/kernel/err_inject.c b/arch/ia64/kernel/err_inject.c
> index 09f8457..5ed0ea9 100644
> --- a/arch/ia64/kernel/err_inject.c
> +++ b/arch/ia64/kernel/err_inject.c
> @@ -142,7 +142,7 @@ store_virtual_to_phys(struct device *dev, struct device_attribute *attr,
>  	u64 virt_addr=simple_strtoull(buf, NULL, 16);
>  	int ret;
>
> -	ret = get_user_pages(virt_addr, 1, VM_READ, 0, NULL, NULL);
> +	ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL);
>  	if (ret<=0) {
>  #ifdef ERR_INJ_DEBUG
>  		printk("Virtual address %lx is not existing.\n",virt_addr);
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index 8047687..e4f8009 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -544,10 +544,9 @@ static int mpx_resolve_fault(long __user *addr, int write)
>  {
>  	long gup_ret;
>  	int nr_pages = 1;
> -	int force = 0;
>
> -	gup_ret = get_user_pages((unsigned long)addr, nr_pages, write,
> -			force, NULL, NULL);
> +	gup_ret = get_user_pages((unsigned long)addr, nr_pages,
> +			write ? FOLL_WRITE : 0, NULL, NULL);
>  	/*
>  	 * get_user_pages() returns number of pages gotten.
>  	 * 0 means we failed to fault in and get anything,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 887483b..dcaf691 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -555,10 +555,13 @@ struct amdgpu_ttm_tt {
>  int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page **pages)
>  {
Re: [PATCH 0/7] build updates
On Wed, 19 Oct 2016 14:15:53 +1100 Nicholas Piggin wrote:
> [*] Building allyesconfig still requires KALLSYMS_EXTRA_PASS=1, which
> I'm yet to look into.

Oh, it's because the kallsyms payload increases kernel image size and that
causes more linker stubs to be generated, which have symbols, which go
into kallsyms... What a nightmare.

We can use --no-emit-stub-syms, but it's kind of nice to have names for
things.

Thanks,
Nick
Re: [PATCH 05/10] mm: replace get_vaddr_frames() write/force parameters with gup_flags
On Thu 13-10-16 01:20:15, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_vaddr_frames()
> and replaces them with a gup_flags parameter to make the use of
> FOLL_FORCE explicit in callers as use of this flag can result in
> surprising behaviour (and hence bugs) within the mm subsystem.
>
> Signed-off-by: Lorenzo Stoakes

Looks good. You can add:

Reviewed-by: Jan Kara

								Honza

> ---
>  drivers/gpu/drm/exynos/exynos_drm_g2d.c    |  3 ++-
>  drivers/media/platform/omap/omap_vout.c    |  2 +-
>  drivers/media/v4l2-core/videobuf2-memops.c |  6 +-
>  include/linux/mm.h                         |  2 +-
>  mm/frame_vector.c                          | 13 ++---
>  5 files changed, 11 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> index aa92dec..fbd13fa 100644
> --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> @@ -488,7 +488,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct drm_device *drm_dev,
>  		goto err_free;
>  	}
>
> -	ret = get_vaddr_frames(start, npages, true, true, g2d_userptr->vec);
> +	ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
> +			g2d_userptr->vec);
>  	if (ret != npages) {
>  		DRM_ERROR("failed to get user pages from userptr.\n");
>  		if (ret < 0)
> diff --git a/drivers/media/platform/omap/omap_vout.c b/drivers/media/platform/omap/omap_vout.c
> index e668dde..a31b95c 100644
> --- a/drivers/media/platform/omap/omap_vout.c
> +++ b/drivers/media/platform/omap/omap_vout.c
> @@ -214,7 +214,7 @@ static int omap_vout_get_userptr(struct videobuf_buffer *vb, u32 virtp,
>  	if (!vec)
>  		return -ENOMEM;
>
> -	ret = get_vaddr_frames(virtp, 1, true, false, vec);
> +	ret = get_vaddr_frames(virtp, 1, FOLL_WRITE, vec);
>  	if (ret != 1) {
>  		frame_vector_destroy(vec);
>  		return -EINVAL;
> diff --git a/drivers/media/v4l2-core/videobuf2-memops.c b/drivers/media/v4l2-core/videobuf2-memops.c
> index 3c3b517..1cd322e 100644
> --- a/drivers/media/v4l2-core/videobuf2-memops.c
> +++ b/drivers/media/v4l2-core/videobuf2-memops.c
> @@ -42,6 +42,10 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>  	unsigned long first, last;
>  	unsigned long nr;
>  	struct frame_vector *vec;
> +	unsigned int flags = FOLL_FORCE;
> +
> +	if (write)
> +		flags |= FOLL_WRITE;
>
>  	first = start >> PAGE_SHIFT;
>  	last = (start + length - 1) >> PAGE_SHIFT;
> @@ -49,7 +53,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>  	vec = frame_vector_create(nr);
>  	if (!vec)
>  		return ERR_PTR(-ENOMEM);
> -	ret = get_vaddr_frames(start & PAGE_MASK, nr, write, true, vec);
> +	ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
>  	if (ret < 0)
>  		goto out_destroy;
>  	/* We accept only complete set of PFNs */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ab538..5ff084f6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1305,7 +1305,7 @@ struct frame_vector {
>  struct frame_vector *frame_vector_create(unsigned int nr_frames);
>  void frame_vector_destroy(struct frame_vector *vec);
>  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> -		bool write, bool force, struct frame_vector *vec);
> +		unsigned int gup_flags, struct frame_vector *vec);
>  void put_vaddr_frames(struct frame_vector *vec);
>  int frame_vector_to_pages(struct frame_vector *vec);
>  void frame_vector_to_pfns(struct frame_vector *vec);
> diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> index 81b6749..db77dcb 100644
> --- a/mm/frame_vector.c
> +++ b/mm/frame_vector.c
> @@ -11,10 +11,7 @@
>   * get_vaddr_frames() - map virtual addresses to pfns
>   * @start:	starting user address
>   * @nr_frames:	number of pages / pfns from start to map
> - * @write:	whether pages will be written to by the caller
> - * @force:	whether to force write access even if user mapping is
> - *		readonly. See description of the same argument of
> - *		get_user_pages().
> + * @gup_flags:	flags modifying lookup behaviour
>   * @vec:	structure which receives pages / pfns of the addresses mapped.
>   *		It should have space for at least nr_frames entries.
>   *
> @@ -34,23 +31,17 @@
>   * This function takes care of grabbing mmap_sem as necessary.
>   */
>  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> -		bool write, bool force, struct frame_vector *vec)
> +		unsigned int gup_flags, struct frame_vector *vec)
>  {
>  	struct mm_struct *mm = current->mm;
>
Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags
On Thu 13-10-16 01:20:14, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from
> get_user_pages_locked() and replaces them with a gup_flags parameter to
> make the use of FOLL_FORCE explicit in callers as use of this flag can
> result in surprising behaviour (and hence bugs) within the mm subsystem.
>
> Signed-off-by: Lorenzo Stoakes

After our discussion the patch looks good to me. You can add:

Reviewed-by: Jan Kara

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags
On Tue 18-10-16 14:56:09, Lorenzo Stoakes wrote:
> On Tue, Oct 18, 2016 at 02:54:25PM +0200, Jan Kara wrote:
> > > @@ -1282,7 +1282,7 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
> > >  		int write, int force, struct page **pages,
> > >  		struct vm_area_struct **vmas);
> > >  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> > > -		int write, int force, struct page **pages, int *locked);
> > > +		unsigned int gup_flags, struct page **pages, int *locked);
> >
> > Hum, the prototype is inconsistent with e.g. __get_user_pages_unlocked()
> > where gup_flags come after **pages argument. Actually it makes more sense
> > to have it before **pages so that input arguments come first and output
> > arguments second but I don't care that much. But it definitely should be
> > consistent...
>
> It was difficult to decide quite how to arrange parameters as there was
> inconsistency with regards to parameter ordering already - for example
> __get_user_pages() places its flags argument before pages whereas, as you
> note, __get_user_pages_unlocked() puts them afterwards.
>
> I ended up compromising by trying to match the existing ordering of the
> function as much as I could by replacing write, force pairs with gup_flags
> in the same location (with the exception of get_user_pages_unlocked()
> which I felt should match __get_user_pages_unlocked() in signature) or if
> there was already a gup_flags parameter as in the case of
> __get_user_pages_unlocked() I simply removed the write, force pair and
> left the flags as the last parameter.
>
> I am happy to rearrange parameters as needed, however I am not sure if
> it'd be worthwhile for me to do so (I am keen to try to avoid adding too
> much noise here :)
>
> If we were to rearrange parameters for consistency I'd suggest adjusting
> __get_user_pages_unlocked() to put gup_flags before pages and do the same
> with get_user_pages_unlocked(), let me know what you think.

Yeah, ok. If the inconsistency is already there, just leave it for now.

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH v4 0/5] implement vcpu preempted check
On 10/19/2016 12:20 PM, Pan Xinhui wrote:
> change from v3:
> 	add x86 vcpu preempted check patch

If you want you could add the s390 patch that I provided for your last
version. I also gave my Acked-by for all previous patches.

> change from v2:
> 	no code change, fix typos, update some comments
> change from v1:
> 	a simpler definition of default vcpu_is_preempted
> 	skip machine type check on ppc, and add config. remove dedicated macro.
> 	add one patch to drop overload of rwsem_spin_on_owner and
> 	mutex_spin_on_owner.
> 	add more comments
> 	thanks boqun and Peter's suggestion.
>
> This patch set aims to fix lock holder preemption issues.
>
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
>
> 18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
> 12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
>  5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
>  3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
>  3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
>  3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
>  2.49% sched-messaging [kernel.vmlinux] [k] system_call
>
> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
> some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused rcu stalls before we applied
> this patch set.
>
> We have also observed some performance improvements.
>
> PPC test result:
>
> 1 copy - 0.94%
> 2 copy - 7.17%
> 4 copy - 11.9%
> 8 copy - 3.04%
> 16 copy - 15.11%
>
> details below:
> Without patch:
>
> 1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks 768000.0 KBps (30.1 s, 1 samples)
>
> With patch:
>
> 1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks 904762.0 KBps (30.0 s, 1 samples)
>
> X86 test result:
> test-case                               after-patch       before-patch
> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
> Process Creation                       |    29881.2 lps  |    28572.8 lps
> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>
> Pan Xinhui (5):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, kvm: support vcpu preempted check
>
>  arch/powerpc/include/asm/spinlock.h   |  8
>  arch/x86/include/asm/paravirt_types.h |  6 ++
>  arch/x86/include/asm/spinlock.h       |  8
>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>  arch/x86/kernel/kvm.c                 | 11 +++
>  arch/x86/kernel/paravirt.c            | 11 +++
>  arch/x86/kvm/x86.c                    | 12
>  include/linux/sched.h                 | 12
>  kernel/locking/mutex.c                | 15 +--
>  kernel/locking/osq_lock.c             | 10 +-
>  kernel/locking/rwsem-xadd.c           | 16 +---
>  11 files changed, 105 insertions(+), 7 deletions(-)
>
[PATCH v4 5/5] x86, kvm: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.

Currently the kernel has an interface of bool vcpu_is_preempted(int cpu).
It takes the cpu as a parameter and returns true if that cpu is preempted.
The kernel can then break out of spin loops based on the return value of
vcpu_is_preempted. As the kernel already uses this interface, let's
support it.

We use one field of struct kvm_steal_time to indicate whether a vcpu is
running or not.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case                               after-patch       before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui
---
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/spinlock.h       |  8
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c                 | 11 +++
 arch/x86/kernel/paravirt.c            | 11 +++
 arch/x86/kvm/x86.c                    | 12
 6 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..b1c7937 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -98,6 +98,10 @@ struct pv_time_ops {
 	unsigned long long (*steal_clock)(int cpu);
 };

+struct pv_vcpu_ops {
+	bool (*vcpu_is_preempted)(int cpu);
+};
+
 struct pv_cpu_ops {
 	/* hooks for various privileged instructions */
 	unsigned long (*get_debugreg)(int regno);
@@ -318,6 +322,7 @@ struct pv_lock_ops {
 struct paravirt_patch_template {
 	struct pv_init_ops pv_init_ops;
 	struct pv_time_ops pv_time_ops;
+	struct pv_vcpu_ops pv_vcpu_ops;
 	struct pv_cpu_ops pv_cpu_ops;
 	struct pv_irq_ops pv_irq_ops;
 	struct pv_mmu_ops pv_mmu_ops;
@@ -327,6 +332,7 @@ struct paravirt_patch_template {
 extern struct pv_info pv_info;
 extern struct pv_init_ops pv_init_ops;
 extern struct pv_time_ops pv_time_ops;
+extern struct pv_vcpu_ops pv_vcpu_ops;
 extern struct pv_cpu_ops pv_cpu_ops;
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..52fd942 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@ extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_PARAVIRT
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return pv_vcpu_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include

 /*
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..e9c12a1 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,8 @@ struct kvm_steal_time {
 	__u64 steal;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u32 preempted;
+	__u32 pad[11];
 };

 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0011bef 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }

+static bool kvm_vcpu_is_preempted(int cpu)
+{
+	struct kvm_steal_time *src;
+
+	src = &per_cpu(steal_time, cpu);
+
+	return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
 	kvm_guest_cpu_init();
 #endif

+	pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+
 	/*
 	 * Hard lockup detection is enabled by default. Disable it, as guests
 	 * can get false positives too easily, for example if the host is
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bbf3d59..7adb7e9 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -122,6 +122,7 @@ static void *get_call_destination(u8 type)
 	struct
[PATCH v4 4/5] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.

Currently the kernel has an interface of bool vcpu_is_preempted(int cpu).
It takes the cpu as a parameter and returns true if that cpu is preempted.
The kernel can then break out of spin loops based on the return value of
vcpu_is_preempted. As the kernel already uses this interface, let's
support it here.

Only pSeries needs to support it, but PowerNV is built into the same
kernel image as pSeries, so we would need to return false when running as
PowerNV. However, lppaca->yield_count stays zero on PowerNV, so we can
just skip the machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index abb6b0f..af4285b 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif

+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-- 
2.4.11
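For readers unfamiliar with the lppaca protocol this relies on: the
hypervisor bumps yield_count on every dispatch and preemption, so the
counter is even while the vCPU is running and odd while it is preempted
(the existing splpar spinlock yield code in arch/powerpc uses the same
test). The same function as in the hunk above, with that reasoning spelled
out in comments:

	static inline bool vcpu_is_preempted(int cpu)
	{
		/* yield_count is even while the vCPU is dispatched and odd
		 * while it is preempted, so bit 0 is the whole answer. On
		 * PowerNV the (big-endian) field stays 0, which reads as
		 * "not preempted" without any machine type check. */
		return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
	}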
[PATCH v4 3/5] kernel/locking: Drop the overload of {mutex, rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs has a heavy overload
in the two spin_on_owner loops. This is caused by the lock holder
preemption issue.

The kernel has an interface bool vcpu_is_preempted(int cpu) to see if a
vCPU is currently running or not. So break the spin loops when it
returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 4.12% sched-messaging [kernel.vmlinux] [k] system_call
 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

after patch:
 9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 5.28% sched-messaging [unknown] [H] 0xc00768e0
 4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7
 3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7
 3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
 3.02% sched-messaging [kernel.vmlinux] [k] system_call
 2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task

Signed-off-by: Pan Xinhui
---
 kernel/locking/mutex.c      | 15 +--
 kernel/locking/rwsem-xadd.c | 16 +---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..8927e96 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();

-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detect lock holder preemption
+		 * and break. vcpu_is_preempted is a macro defined as false
+		 * if the arch does not support the vcpu preempted check.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)

 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * Due to the lock holder preemption issue, we skip spinning if the
+	 * task is not on a cpu or its cpu is preempted.
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..ad0b5bb 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}

-	ret = owner->on_cpu;
+	/*
+	 * Due to the lock holder preemption issue, we skip spinning if the
+	 * task is not on a cpu or its cpu is preempted.
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();

-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Abort spinning when need_resched, or the owner is not
+		 * running, or the owner's cpu is preempted.
+		 * vcpu_is_preempted is a macro defined as false if the arch
+		 * does not support the vcpu preempted check.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
-- 
2.4.11
[PATCH v4 2/5] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs has a heavy overload in
osq_lock(). This is because vCPU A holds the osq lock and yields out,
while vCPU B waits for per_cpu node->locked to be set. IOW, vCPU B waits
for vCPU A to run and unlock the osq lock.

The kernel has an interface bool vcpu_is_preempted(int cpu) to see if a
vCPU is currently running or not. So break the spin loop when it returns
true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
 2.49% sched-messaging [kernel.vmlinux] [k] system_call

after patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 4.12% sched-messaging [kernel.vmlinux] [k] system_call
 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

Suggested-by: Boqun Feng
Signed-off-by: Pan Xinhui
---
 kernel/locking/osq_lock.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }

+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
		 * Use vcpu_is_preempted to detect lock holder preemption
		 * and break. vcpu_is_preempted is a macro defined as false
		 * if the arch does not support the vcpu preempted check.
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;

 		cpu_relax_lowlatency();
-- 
2.4.11
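A note on the node_cpu() helper added above, since the -1 can be puzzling
at first sight: osq stores CPU numbers biased by one so that the value 0
can mean "unlocked / no CPU" (see encode_cpu() in the same file), hence
decoding subtracts the bias again. The helper with that spelled out:

	/* osq encodes CPU numbers as cpu_nr + 1 so that 0 can serve as
	 * the unlocked value; subtract the bias again before handing the
	 * number to vcpu_is_preempted(). */
	static inline int node_cpu(struct optimistic_spin_node *node)
	{
		return node->cpu - 1;
	}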
[PATCH v4 0/5] implement vcpu preempted check
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and
	mutex_spin_on_owner.
	add more comments
	thanks boqun and Peter's suggestion.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
 2.49% sched-messaging [kernel.vmlinux] [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused rcu stalls before we applied
this patch set.

We have also observed some performance improvements.

PPC test result:

1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy - 3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks 768000.0 KBps (30.1 s, 1 samples)

With patch:

1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks 904762.0 KBps (30.0 s, 1 samples)

X86 test result:

test-case                               after-patch       before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Pan Xinhui (5):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, kvm: support vcpu preempted check

 arch/powerpc/include/asm/spinlock.h   |  8
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/spinlock.h       |  8
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c                 | 11 +++
 arch/x86/kernel/paravirt.c            | 11 +++
 arch/x86/kvm/x86.c                    | 12
 include/linux/sched.h                 | 12
 kernel/locking/mutex.c                | 15 +--
 kernel/locking/osq_lock.c             | 10 +-
 kernel/locking/rwsem-xadd.c           | 16 +---
 11 files changed, 105 insertions(+), 7 deletions(-)

-- 
2.4.11
[PATCH v4 1/5] kernel/sched: introduce vcpu preempted check interface
This patch supports fixing the lock holder preemption issue. For kernel
users, we can use bool vcpu_is_preempted(int cpu) to detect whether a
vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimise it out if the arch does not support such a vcpu preempted
check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 include/linux/sched.h | 12
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)

 #endif /* CONFIG_SMP */

+/*
+ * In order to deal with various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

-- 
2.4.11
Re: [PATCH 3/7] powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors
On Wed, 19 Oct 2016 15:28:40 +1100 Balbir Singh wrote:
> On 19/10/16 14:15, Nicholas Piggin wrote:
> > Direct branches from code below __end_interrupts to code above
> > __end_interrupts when built with CONFIG_RELOCATABLE are disallowed
> > because they will break when the kernel is not located at 0.
> >
> > Sample output:
> >
> > WARNING: Unrelocated relative branches
> > c118 bl-> 0xc0038fb8
> > c13c b-> 0xc01068a4
> > c148 b-> 0xc003919c
> > c14c b-> 0xc003923c
> > c5a4 b-> 0xc0106ffc
> > c0001af0 b-> 0xc0106ffc
> > c0001b24 b-> 0xc0106ffc
> > c0001b58 b-> 0xc0106ffc
> >
> > Signed-off-by: Nicholas Piggin
> > ---
>
> If this is the same script I gave you, you can add my SOB line as well

I can't remember now, but I know you had something... I'm usually careful
to get attributions right. I'll add your SOB.