Re: [PATCH] cpufreq: powernv: Use node ID in init_chip_info

2016-10-19 Thread Viresh Kumar
+ a few IBM guys who have been working on this.

On 19-10-16, 15:02, Emily Shaffer wrote:
> Fixed assumption that node_id==chip_id in powernv-cpufreq.c:init_chip_info;
> explicitly use node ID where necessary.
> 
> Tested: All CPUs report in /sys/devices/system/cpu*/cpufreq/throttle_stats
> 
> Effort: platforms/arch-powerpc
> Google-Bug-Id: 26979978

Is this relevant upstream?

> 
> Signed-off-by: Emily Shaffer 
> Change-Id: I22eb626b32fbb8053b3bbb9c75e677c700d0c2fb

Gerrit ID isn't required for upstream.

> ---
>  drivers/cpufreq/powernv-cpufreq.c | 27 +--
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c
> b/drivers/cpufreq/powernv-cpufreq.c
> index d3ffde8..3750b58 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -911,32 +911,47 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
> 
>  static int init_chip_info(void)
>  {
> -   unsigned int chip[256];
> +   int rc = 0;
> unsigned int cpu, i;
> unsigned int prev_chip_id = UINT_MAX;
> +   unsigned int *chip, *node;
> +
> +   chip = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
> +   node = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
> +   if (!chip || !node) {
> +   rc = -ENOMEM;
> +   goto out;
> +   }
> 
> for_each_possible_cpu(cpu) {
> unsigned int id = cpu_to_chip_id(cpu);
> 
> if (prev_chip_id != id) {
> prev_chip_id = id;
> -   chip[nr_chips++] = id;
> +   node[nr_chips] = cpu_to_node(cpu);
> +   chip[nr_chips] = id;
> +   nr_chips++;
> }
> }
> 
> chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
> -   if (!chips)
> -   return -ENOMEM;
> +   if (!chips) {
> +   rc = -ENOMEM;
> +   goto out;
> +   }
> 
> for (i = 0; i < nr_chips; i++) {
> chips[i].id = chip[i];
> -   cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
> +   cpumask_copy(&chips[i].mask, cpumask_of_node(node[i]));
> INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
> for_each_cpu(cpu, &chips[i].mask)
> per_cpu(chip_info, cpu) = &chips[i];
> }
> 
> -   return 0;
> +out:
> +   kfree(node);
> +   kfree(chip);
> +   return rc;
>  }
> 
>  static inline void clean_chip_info(void)
> -- 
> 2.8.0.rc3.226.g39d4020

-- 
viresh


[PATCH] KVM: PPC: Book3S HV: Fix build error when SMP=n

2016-10-19 Thread Michael Ellerman
Commit 5d375199ea96 ("KVM: PPC: Book3S HV: Set server for passed-through
interrupts") broke the SMP=n build:

  arch/powerpc/kvm/book3s_hv_rm_xics.c:758:2: error: implicit declaration of 
function 'get_hard_smp_processor_id'

That is because we lost the implicit include of asm/smp.h, so include it
explicitly to get the definition for get_hard_smp_processor_id().

Fixes: 5d375199ea96 ("KVM: PPC: Book3S HV: Set server for passed-through 
interrupts")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 82ff5de8b1e7..a0ea63ac2b52 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <asm/smp.h>
 
 #include "book3s_xics.h"
 
-- 
2.7.4



Re: [PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set

2016-10-19 Thread Nicholas Piggin
On Wed, 19 Oct 2016 22:00:06 +1100
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > Enable thin archives build for powerpc when COMPILE_TEST is set.
> > Thin archives are explained in this commit:
> >
> >   a5967db9af51a84f5e181600954714a9e4c69f1f
> >
> > This is a gradual way to introduce the option to testers.  
> 
> I think I'd rather this was actually a user-selectable option, eg.
> 
> config USE_THIN_ARCHIVES
> 	bool "Build the kernel using thin archives"
> 	default n
> 	select THIN_ARCHIVES
> 	help
> 	  Build the kernel using thin archives.
> 	  If you're unsure say N.

Okay, that should be easy. I'll respin the series.

Thanks,
Nick


Re: [PATCH v4 3/5] powerpc/mm: allow memory hotplug into a memoryless node

2016-10-19 Thread Balbir Singh


On 07/10/16 05:36, Reza Arbab wrote:
> Remove the check which prevents us from hotplugging into an empty node.
> 
> This limitation has been questioned before [1], and judging by the
> response, there doesn't seem to be a reason we can't remove it. No issues
> have been found in light testing.
> 
> [1] 
> http://lkml.kernel.org/r/cagzkibrmksa1yyhbf5hwgxubcjse5smksmy4tpanerme2ug...@mail.gmail.com
> 
> http://lkml.kernel.org/r/20160511215051.gf22...@arbab-laptop.austin.ibm.com
> 
> Signed-off-by: Reza Arbab 
> Reviewed-by: Aneesh Kumar K.V 
> Acked-by: Balbir Singh 
> Cc: Nathan Fontenot 
> Cc: Bharata B Rao 
> ---
>  arch/powerpc/mm/numa.c | 13 +
>  1 file changed, 1 insertion(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 75b9cd6..d7ac419 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1121,7 +1121,7 @@ static int hot_add_node_scn_to_nid(unsigned long 
> scn_addr)
>  int hot_add_scn_to_nid(unsigned long scn_addr)
>  {
>   struct device_node *memory = NULL;
> - int nid, found = 0;
> + int nid;
>  
>   if (!numa_enabled || (min_common_depth < 0))
>   return first_online_node;
> @@ -1137,17 +1137,6 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
>   if (nid < 0 || !node_online(nid))
>   nid = first_online_node;
>  
> - if (NODE_DATA(nid)->node_spanned_pages)
> - return nid;
> -
> - for_each_online_node(nid) {
> - if (NODE_DATA(nid)->node_spanned_pages) {
> - found = 1;
> - break;
> - }
> - }
> -
> - BUG_ON(!found);
>   return nid;

FYI, these checks were temporary to begin with

I found this in git history

b226e462124522f2f23153daff31c311729dfa2f (powerpc: don't add memory to empty 
node/zone)

Balbir Singh.


[PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container

2016-10-19 Thread Alexey Kardashevskiy
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better cache @mm and use it later when the process is gone
(@current or @current->mm is NULL).

This references mm and stores the pointer in the container; this is done
when a container is just created so checking for !current->mm in other
places becomes pointless.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from registering memory in other
processes.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 127 
 1 file changed, 71 insertions(+), 56 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index d0c38b2..6b0b121 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -31,49 +31,46 @@
 static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
 
-static long try_increment_locked_vm(long npages)
+static long try_increment_locked_vm(struct mm_struct *mm, long npages)
 {
long ret = 0, locked, lock_limit;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
if (!npages)
return 0;
 
-   down_write(&current->mm->mmap_sem);
-   locked = current->mm->locked_vm + npages;
+   down_write(&mm->mmap_sem);
+   locked = mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
-   current->mm->locked_vm += npages;
+   mm->locked_vm += npages;
 
pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
npages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT,
+   mm->locked_vm << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK),
ret ? " - exceeded" : "");
 
-   up_write(&current->mm->mmap_sem);
+   up_write(&mm->mmap_sem);
 
return ret;
 }
 
-static void decrement_locked_vm(long npages)
+static void decrement_locked_vm(struct mm_struct *mm, long npages)
 {
-   if (!current || !current->mm || !npages)
+   if (!mm || !npages)
return; /* process exited */
 
-   down_write(&current->mm->mmap_sem);
-   if (WARN_ON_ONCE(npages > current->mm->locked_vm))
-   npages = current->mm->locked_vm;
-   current->mm->locked_vm -= npages;
+   down_write(&mm->mmap_sem);
+   if (WARN_ON_ONCE(npages > mm->locked_vm))
+   npages = mm->locked_vm;
+   mm->locked_vm -= npages;
pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
npages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT,
+   mm->locked_vm << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
-   up_write(&current->mm->mmap_sem);
+   up_write(&mm->mmap_sem);
 }
 
 /*
@@ -98,6 +95,7 @@ struct tce_container {
bool enabled;
bool v2;
unsigned long locked_pages;
+   struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
 };
@@ -113,11 +111,11 @@ static long tce_iommu_unregister_pages(struct 
tce_container *container,
if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
return -EINVAL;
 
-   mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
+   mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
if (!mem)
return -ENOENT;
 
-   return mm_iommu_put(current->mm, mem);
+   return mm_iommu_put(container->mm, mem);
 }
 
 static long tce_iommu_register_pages(struct tce_container *container,
@@ -134,7 +132,7 @@ static long tce_iommu_register_pages(struct tce_container 
*container,
((vaddr + size) < vaddr))
return -EINVAL;
 
-   ret = mm_iommu_get(current->mm, vaddr, entries, &mem);
+   ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
if (ret)
return ret;
 
@@ -143,7 +141,8 @@ static long tce_iommu_register_pages(struct tce_container 
*container,
return 0;
 }
 
-static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
+static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
+   struct mm_struct *mm)
 {
unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
tbl->it_size, PAGE_SIZE);
@@ -152,13 +151,13 @@ static long tce_iommu_userspace_view_alloc(struct 
iommu_table *tbl)
 
BUG_ON(tbl->it_userspace);
 
-   ret = try_increment_locked_vm(cb >> PAGE_SHIFT);
+   ret = 

[PATCH kernel v3 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx

2016-10-19 Thread Alexey Kardashevskiy
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu_context.h | 16 ++--
 arch/powerpc/mm/mmu_context_iommu.c| 46 +-
 drivers/vfio/vfio_iommu_spapr_tce.c| 14 ---
 3 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 424844b..b9e3f0a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -19,16 +19,18 @@ extern void destroy_context(struct mm_struct *mm);
 struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);/* from internal.h */
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern long mm_iommu_put(struct mm_struct *mm,
+   struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-   unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-   unsigned long entries);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+   unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index ad2e575..4c6db09 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -56,7 +56,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
}
 
pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-   current->pid,
+   current ? current->pid : 0,
incr ? '+' : '-',
npages << PAGE_SHIFT,
mm->locked_vm << PAGE_SHIFT,
@@ -66,12 +66,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-   if (!current || !current->mm)
-   return false;
-
-   return !list_empty(&current->mm->context.iommu_group_mem_list);
+   return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
@@ -124,19 +121,16 @@ static int mm_iommu_move_page_from_cma(struct page *page)
return 0;
 }
 
-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
entries,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
long i, j, ret = 0, locked_entries = 0;
struct page *page = NULL;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
mutex_lock(&mem_list_mutex);
 
-   list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+   list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
next) {
if ((mem->ua == ua) && (mem->entries == entries)) {
++mem->used;
@@ -154,7 +148,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
}
 
-   ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+   ret = mm_iommu_adjust_locked_vm(mm, entries, true);
if (ret)
goto unlock_exit;
 
@@ -215,11 +209,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
mem->entries = entries;
*pmem = mem;
 
-   list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+   list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
 unlock_exit:
if (locked_entries 

[PATCH kernel v3 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers

2016-10-19 Thread Alexey Kardashevskiy
We are going to get rid of @current references in mmu_context_book3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct which does have
the reference counter.

This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/mmu_context.h | 4 ++--
 arch/powerpc/kernel/setup-common.c | 2 +-
 arch/powerpc/mm/mmu_context_book3s64.c | 4 ++--
 arch/powerpc/mm/mmu_context_iommu.c| 9 +
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 5c45114..424844b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -23,8 +23,8 @@ extern bool mm_iommu_preregistered(void);
 extern long mm_iommu_get(unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
-extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern void mm_iommu_init(struct mm_struct *mm);
+extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 270ee30..f516ac5 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -915,7 +915,7 @@ void __init setup_arch(char **cmdline_p)
init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&init_mm.context);
+   mm_iommu_init(&init_mm);
 #endif
irqstack_early_init();
exc_lvl_early_init();
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index b114f8b..ad82735 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
mm->context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&mm->context);
+   mm_iommu_init(mm);
 #endif
return 0;
 }
@@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct 
*mm)
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(&mm->context);
+   mm_iommu_cleanup(mm);
 #endif
 
 #ifdef CONFIG_PPC_ICSWX
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index e0f1c33..ad2e575 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -373,16 +373,17 @@ void mm_iommu_mapped_dec(struct 
mm_iommu_table_group_mem_t *mem)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
 
-void mm_iommu_init(mm_context_t *ctx)
+void mm_iommu_init(struct mm_struct *mm)
 {
-   INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
+   INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
 
-void mm_iommu_cleanup(mm_context_t *ctx)
+void mm_iommu_cleanup(struct mm_struct *mm)
 {
struct mm_iommu_table_group_mem_t *mem, *tmp;
 
-   list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
+   list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
+   next) {
list_del_rcu(&mem->next);
mm_iommu_do_free(mem);
}
-- 
2.5.0.rc3



[PATCH kernel v3 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-10-19 Thread Alexey Kardashevskiy
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem: the MM of the userspace process
may live longer than the userspace process itself, as kernel threads
use the MM of whatever userspace process was running on the CPU where
the kernel thread was scheduled. If this happens, the MM remains
referenced until that exact kernel thread wakes up again
and releases the very last reference to the MM; on an idle system this
can take hours.

This moves the tracking of preregistered regions from the MM to VFIO; instead of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases the regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However, it should not
have any practical effect, as the only userspace tool available now
registers a memory region only once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Nicholas Piggin 
---
Changes:
v3:
* moved tce_iommu_prereg_free() call out of list_for_each_entry()

v2:
* updated commit log
---
 arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
 arch/powerpc/mm/mmu_context_iommu.c| 11 
 drivers/vfio/vfio_iommu_spapr_tce.c| 49 +-
 3 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index ad82735..1a07969 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct mm_struct 
*mm)
 
 void destroy_context(struct mm_struct *mm)
 {
-#ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(mm);
-#endif
-
 #ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 4c6db09..104bad0 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
 {
INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(struct mm_struct *mm)
-{
-   struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-   list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
-   next) {
-   list_del_rcu(&mem->next);
-   mm_iommu_do_free(mem);
-   }
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 6b0b121..3e2f757 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -86,6 +86,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered region it has
+ * referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+   struct list_head next;
+   struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -98,12 +107,27 @@ struct tce_container {
struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
+   struct list_head prereg_list;
 };
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+   struct tce_iommu_prereg *tcemem)
+{
+   long ret;
+
+   list_del(&tcemem->next);
+   ret = mm_iommu_put(container->mm, tcemem->mem);
+   kfree(tcemem);
+
+   return ret;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
__u64 vaddr, __u64 size)
 {
struct mm_iommu_table_group_mem_t *mem;
+   struct tce_iommu_prereg *tcemem;
+   bool found = false;
 
if (!current || !current->mm)
return -ESRCH; /* process exited */
@@ -115,7 +139,17 @@ static long tce_iommu_unregister_pages(struct 
tce_container *container,
if (!mem)
return -ENOENT;
 
-   return mm_iommu_put(container->mm, mem);
+   list_for_each_entry(tcemem, &container->prereg_list, next) {
+   if (tcemem->mem == mem) {
+   found = true;
+   break;
+   }
+   }
+
+   if (!found)
+   return -ENOENT;
+
+   return tce_iommu_prereg_free(container, tcemem);
 }
 
 static long 

[PATCH kernel v3 0/4] powerpc/spapr/vfio: Put pages on VFIO container shutdown

2016-10-19 Thread Alexey Kardashevskiy
These patches fix a bug where pages stay pinned for hours
after the QEMU process which requested the pinning has exited. This time it is
4 patches, for easier reviewing.

Please comment. Thanks.


Alexey Kardashevskiy (4):
  powerpc/iommu: Pass mm_struct to init/cleanup helpers
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  vfio/spapr: Cache mm in tce_container
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

 arch/powerpc/include/asm/mmu_context.h |  20 ++--
 arch/powerpc/kernel/setup-common.c |   2 +-
 arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
 arch/powerpc/mm/mmu_context_iommu.c|  60 ---
 drivers/vfio/vfio_iommu_spapr_tce.c| 180 +++--
 5 files changed, 156 insertions(+), 112 deletions(-)

-- 
2.5.0.rc3



[PATCH] cpufreq: powernv: Use node ID in init_chip_info

2016-10-19 Thread Emily Shaffer
Fixed assumption that node_id==chip_id in powernv-cpufreq.c:init_chip_info;
explicitly use node ID where necessary.

Tested: All CPUs report in /sys/devices/system/cpu*/cpufreq/throttle_stats

Effort: platforms/arch-powerpc
Google-Bug-Id: 26979978

Signed-off-by: Emily Shaffer 
Change-Id: I22eb626b32fbb8053b3bbb9c75e677c700d0c2fb
---
 drivers/cpufreq/powernv-cpufreq.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c
b/drivers/cpufreq/powernv-cpufreq.c
index d3ffde8..3750b58 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -911,32 +911,47 @@ static struct cpufreq_driver powernv_cpufreq_driver = {

 static int init_chip_info(void)
 {
-   unsigned int chip[256];
+   int rc = 0;
unsigned int cpu, i;
unsigned int prev_chip_id = UINT_MAX;
+   unsigned int *chip, *node;
+
+   chip = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
+   node = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
+   if (!chip || !node) {
+   rc = -ENOMEM;
+   goto out;
+   }

for_each_possible_cpu(cpu) {
unsigned int id = cpu_to_chip_id(cpu);

if (prev_chip_id != id) {
prev_chip_id = id;
-   chip[nr_chips++] = id;
+   node[nr_chips] = cpu_to_node(cpu);
+   chip[nr_chips] = id;
+   nr_chips++;
}
}

chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
-   if (!chips)
-   return -ENOMEM;
+   if (!chips) {
+   rc = -ENOMEM;
+   goto out;
+   }

for (i = 0; i < nr_chips; i++) {
chips[i].id = chip[i];
-   cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
+   cpumask_copy(&chips[i].mask, cpumask_of_node(node[i]));
INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
for_each_cpu(cpu, &chips[i].mask)
per_cpu(chip_info, cpu) = &chips[i];
}

-   return 0;
+out:
+   kfree(node);
+   kfree(chip);
+   return rc;
 }

 static inline void clean_chip_info(void)
-- 
2.8.0.rc3.226.g39d4020


Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check

2016-10-19 Thread Pan Xinhui


On 2016/10/20 01:24, Radim Krčmář wrote:

2016-10-19 06:20-0400, Pan Xinhui:

This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.
The kernel currently has an interface, bool vcpu_is_preempted(int cpu), which
takes the cpu as a parameter and returns true if that cpu is preempted. The
kernel can then break out of the spin loops based on the retval of
vcpu_is_preempted.

As the kernel already uses this interface, let's support it here.

We use one field of struct kvm_steal_time to indicate whether a vcpu
is running or not.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps
Process Creation   |29881.2 lps  |28572.8 lps
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui 
---
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
@@ -98,6 +98,10 @@ struct pv_time_ops {
unsigned long long (*steal_clock)(int cpu);
 };

+struct pv_vcpu_ops {
+   bool (*vcpu_is_preempted)(int cpu);
+};
+


(I would put it into pv_lock_ops to save the plumbing.)


hi, Radim
thanks for your reply.

yes, a new struct leads the patch into unnecessary line changes.
I did that just because I was not sure which existing xxx_ops I should place
vcpu_is_preempted in.


diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,8 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 preempted;


Why __u32 instead of __u8?


I thought it is 32-bits aligned...
yes, u8 is good to store the preempt status.


+   __u32 pad[11];
 };


Please document the change in Documentation/virtual/kvm/msr.txt, section
MSR_KVM_STEAL_TIME.


okay, I totally forgot to do that. thanks!


diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+   struct kvm_steal_time *src;
+
+   src = &per_cpu(steal_time, cpu);
+
+   return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
kvm_guest_cpu_init();
 #endif

+   pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;


Would be nicer to assign conditionally in the KVM_FEATURE_STEAL_TIME
block.  The steal_time structure has to be zeroed, so this code would
work, but the native function (return false) is better if we know that
the kvm_vcpu_is_preempted() would always return false anyway.


yes, agree. Will do that.

I once thought we could patch the code at runtime,
replacing the binary code
"call 0x #pv_vcpu_ops.vcpu_is_preempted"
with
"xor eax, eax".
However it is not worth doing that; the performance improvement might be very
small.


Old KVMs won't have the feature, so we could also assign only when KVM
reports it, but that requires extra definitions and the performance gain
is fairly small, so I'm ok with this.


diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time
return;

+   vcpu->arch.st.steal.preempted = 0;
+
if (vcpu->arch.st.steal.version & 1)
vcpu->arch.st.steal.version += 1;  /* first time write, random 
junk */

@@ -2812,6 +2814,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
+   if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal,
+   sizeof(struct kvm_steal_time)) == 0) {
+   vcpu->arch.st.steal.preempted = 1;
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal,
+   sizeof(struct kvm_steal_time));
+   }


Please name this block of code.  Something like
  kvm_steal_time_set_preempted(vcpu);


yep, my code style is ugly.

Re: [PATCH v4 5/5] x86, kvm: support vcpu preempted check

2016-10-19 Thread Radim Krčmář
2016-10-19 06:20-0400, Pan Xinhui:
> This is to fix some lock holder preemption issues. Some other lock
> implementations do a spin loop before acquiring the lock itself.
> The kernel currently has an interface, bool vcpu_is_preempted(int cpu), which
> takes the cpu as a parameter and returns true if that cpu is preempted. The
> kernel can then break out of the spin loops based on the retval of
> vcpu_is_preempted.
> 
> As the kernel already uses this interface, let's support it here.
> 
> We use one field of struct kvm_steal_time to indicate whether a vcpu
> is running or not.
> 
> unix benchmark result:
> host:  kernel 4.8.1, i5-4570, 4 cpus
> guest: kernel 4.8.1, 8 vcpus
> 
>   test-case   after-patch   before-patch
> Execl Throughput   |18307.9 lps  |11701.6 lps 
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
> Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
> Process Creation   |29881.2 lps  |28572.8 lps 
> Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
> Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
> System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 
> 
> Signed-off-by: Pan Xinhui 
> ---
> diff --git a/arch/x86/include/asm/paravirt_types.h 
> b/arch/x86/include/asm/paravirt_types.h
> @@ -98,6 +98,10 @@ struct pv_time_ops {
>   unsigned long long (*steal_clock)(int cpu);
>  };
>  
> +struct pv_vcpu_ops {
> + bool (*vcpu_is_preempted)(int cpu);
> +};
> +

(I would put it into pv_lock_ops to save the plumbing.)

> diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
> b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -45,7 +45,8 @@ struct kvm_steal_time {
>   __u64 steal;
>   __u32 version;
>   __u32 flags;
> - __u32 pad[12];
> + __u32 preempted;

Why __u32 instead of __u8?

> + __u32 pad[11];
>  };

Please document the change in Documentation/virtual/kvm/msr.txt, section
MSR_KVM_STEAL_TIME.

> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> @@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
> +static bool kvm_vcpu_is_preempted(int cpu)
> +{
> + struct kvm_steal_time *src;
> +
> + src = &per_cpu(steal_time, cpu);
> +
> + return !!src->preempted;
> +}
> +
>  #ifdef CONFIG_SMP
>  static void __init kvm_smp_prepare_boot_cpu(void)
>  {
> @@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
>   kvm_guest_cpu_init();
>  #endif
>  
> + pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;

Would be nicer to assign conditionally in the KVM_FEATURE_STEAL_TIME
block.  The steal_time structure has to be zeroed, so this code would
work, but the native function (return false) is better if we know that
the kvm_vcpu_is_preempted() would always return false anyway.

Old KVMs won't have the feature, so we could also assign only when KVM
reports it, but that requires extra definitions and the performance gain
is fairly small, so I'm ok with this.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
>   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time
>   return;
>  
> + vcpu->arch.st.steal.preempted = 0;
> +
>   if (vcpu->arch.st.steal.version & 1)
>   vcpu->arch.st.steal.version += 1;  /* first time write, random 
> junk */
>  
> @@ -2812,6 +2814,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> + if (vcpu->arch.st.msr_val & KVM_MSR_ENABLED)
> + if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> + &vcpu->arch.st.steal,
> + sizeof(struct kvm_steal_time)) == 0) {
> + vcpu->arch.st.steal.preempted = 1;
> + kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> + &vcpu->arch.st.steal,
> + sizeof(struct kvm_steal_time));
> + }

Please name this block of code.  Something like
  kvm_steal_time_set_preempted(vcpu);
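
For reference, a sketch of what such a named helper might look like (an
illustration derived from the hunk quoted above, not the final upstream code):

static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
		return;

	if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
				  &vcpu->arch.st.steal,
				  sizeof(struct kvm_steal_time)))
		return;

	/* Tell the guest this vCPU has been scheduled out. */
	vcpu->arch.st.steal.preempted = 1;
	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
			       &vcpu->arch.st.steal,
			       sizeof(struct kvm_steal_time));
}

kvm_arch_vcpu_put() would then simply call kvm_steal_time_set_preempted(vcpu).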

Thanks.


Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-19 Thread Dave Hansen
On 10/19/2016 10:01 AM, Michal Hocko wrote:
> The question I had earlier was whether this has to be an explicit FOLL
> flag used by g-u-p users or we can just use it internally when mm !=
> current->mm

The reason I chose not to do that was that deferred work gets run under
a basically random 'current'.  If we just use 'mm != current->mm', then
the deferred work will sometimes have pkeys enforced and sometimes not,
basically randomly.

We want to be consistent with whether they are enforced or not, so we
explicitly indicate that by calling the remote variant vs. plain.


Re: [PATCH v4 0/5] implement vcpu preempted check

2016-10-19 Thread Pan Xinhui


On 2016/10/19 23:58, Juergen Gross wrote:

On 19/10/16 12:20, Pan Xinhui wrote:

change from v3:
add x86 vcpu preempted check patch
change from v2:
no code change, fix typos, update some comments
change from v1:
a simpler definition of the default vcpu_is_preempted
skip machine type check on ppc, and add config. remove dedicated macro.
add one patch to drop overload of rwsem_spin_on_owner and 
mutex_spin_on_owner.
add more comments
thanks to Boqun's and Peter's suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin
loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused rcu stalls before we applied this patch set.
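
For illustration, a minimal sketch (not part of the posted series) of how an
owner-spin loop can consult the new interface; the owner->on_cpu loop is
modelled on the existing mutex_spin_on_owner(), and vcpu_is_preempted() is the
helper this series introduces:

#include <linux/sched.h>

/* Sketch only: stop spinning once the lock holder's vCPU is preempted. */
static bool spin_on_owner_sketch(struct task_struct *owner, int owner_cpu)
{
	while (READ_ONCE(owner->on_cpu)) {
		/* Spinning is pointless while the holder has no physical CPU. */
		if (need_resched() || vcpu_is_preempted(owner_cpu))
			return false;

		cpu_relax();
	}

	return true;
}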

We have also observed some performance improvements.

PPC test result:

1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 
1 samples)

With patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 
1 samples)

X86 test result:
test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps
Process Creation   |29881.2 lps  |28572.8 lps
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps

Pan Xinhui (5):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, kvm: support vcpu preempted check


The attached patch adds Xen support for x86. Please tell me whether you
want to add this patch to your series or if I should post it when your
series has been accepted.


hi, Juergen
Your patch is pretty small and nice :) thanks!
I can include your patch in my next patchset after this patchset is reviewed. :)



You can add my

Tested-by: Juergen Gross 

for patches 1-3 and 5 (paravirt parts only).



Thanks a lot!

xinhui



Juergen



 arch/powerpc/include/asm/spinlock.h   |  8 
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/spinlock.h   |  8 
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c | 11 +++
 arch/x86/kernel/paravirt.c| 11 +++
 arch/x86/kvm/x86.c| 12 
 include/linux/sched.h | 12 
 kernel/locking/mutex.c| 15 +--
 kernel/locking/osq_lock.c | 10 +-
 kernel/locking/rwsem-xadd.c   | 16 +---
 11 files changed, 105 insertions(+), 7 deletions(-)







Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-19 Thread Michal Hocko
On Wed 19-10-16 09:49:43, Dave Hansen wrote:
> On 10/19/2016 02:07 AM, Michal Hocko wrote:
> > On Wed 19-10-16 09:58:15, Lorenzo Stoakes wrote:
> >> On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
> >>> I am wondering whether we can go further. E.g. it is not really clear to
> >>> me whether we need an explicit FOLL_REMOTE when we can in fact check
> >>> mm != current->mm and imply that. Maybe there are some contexts which
> >>> wouldn't work, I haven't checked.
> >>
> >> This flag is set even when /proc/self/mem is used. I've not looked deeply 
> >> into
> >> this flag but perhaps accessing your own memory this way can be considered
> >> 'remote' since you're not accessing it directly. On the other hand, 
> >> perhaps this
> >> is just mistaken in this case?
> > 
> > My understanding of the flag is quite limited as well. All I know it is
> > related to protection keys and it is needed to bypass protection check.
> > See arch_vma_access_permitted. See also 1b2ee1266ea6 ("mm/core: Do not
> > enforce PKEY permissions on remote mm access").
> 
> Yeah, we need the flag to tell us when PKEYs should be applied or not.
> The current task's PKRU (pkey rights register) should really only be
> used to impact access to the task's memory, but has no bearing on how a
> given task should access remote memory.

The question I had earlier was whether this has to be an explicit FOLL
flag used by g-u-p users or we can just use it internally when mm !=
current->mm

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 0/5] implement vcpu preempted check

2016-10-19 Thread Pan Xinhui



On 2016/10/19 14:47, Christian Borntraeger wrote:

On 10/19/2016 12:20 PM, Pan Xinhui wrote:

change from v3:
add x86 vcpu preempted check patch


If you want you could add the s390 patch that I provided for your last version.
I also gave my Acked-by for all previous patches.


hi, Christian
Thanks a lot!
I can include your new s390 patch in my next patchset (if v5 is needed).

xinhui





change from v2:
no code change, fix typos, update some comments
change from v1:
a simpler definition of the default vcpu_is_preempted
skip machine type check on ppc, and add config. remove dedicated macro.
add one patch to drop overload of rwsem_spin_on_owner and 
mutex_spin_on_owner.
add more comments
thanks to Boqun's and Peter's suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin
loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused rcu stalls before we applied this patch set.

We have also observed some performance improvements.

PPC test result:

1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 
1 samples)

With patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 
1 samples)

X86 test result:
test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps
Process Creation   |29881.2 lps  |28572.8 lps
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps

Pan Xinhui (5):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, kvm: support vcpu preempted check

 arch/powerpc/include/asm/spinlock.h   |  8 
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/spinlock.h   |  8 
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c | 11 +++
 arch/x86/kernel/paravirt.c| 11 +++
 arch/x86/kvm/x86.c| 12 
 include/linux/sched.h | 12 
 kernel/locking/mutex.c| 15 +--
 kernel/locking/osq_lock.c | 10 +-
 kernel/locking/rwsem-xadd.c   | 16 +---
 11 files changed, 105 insertions(+), 7 deletions(-)







Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-19 Thread Dave Hansen
On 10/19/2016 02:07 AM, Michal Hocko wrote:
> On Wed 19-10-16 09:58:15, Lorenzo Stoakes wrote:
>> On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
>>> I am wondering whether we can go further. E.g. it is not really clear to
>>> me whether we need an explicit FOLL_REMOTE when we can in fact check
>>> mm != current->mm and imply that. Maybe there are some contexts which
>>> wouldn't work, I haven't checked.
>>
>> This flag is set even when /proc/self/mem is used. I've not looked deeply 
>> into
>> this flag but perhaps accessing your own memory this way can be considered
>> 'remote' since you're not accessing it directly. On the other hand, perhaps 
>> this
>> is just mistaken in this case?
> 
> My understanding of the flag is quite limited as well. All I know it is
> related to protection keys and it is needed to bypass protection check.
> See arch_vma_access_permitted. See also 1b2ee1266ea6 ("mm/core: Do not
> enforce PKEY permissions on remote mm access").

Yeah, we need the flag to tell us when PKEYs should be applied or not.
The current task's PKRU (pkey rights register) should really only be
used to impact access to the task's memory, but has no bearing on how a
given task should access remote memory.



[PATCH wq/for-4.10] workqueue: move wq_numa_init() to workqueue_init()

2016-10-19 Thread Tejun Heo
Hello, Michael.

I simplified the patch a bit and applied the following to for-4.10 and
updated for-next accordingly.  The coming next snapshot should be
fine.

Thanks!

-- 8< --
From 2186d9f940b6a04f263a3bacd48f2a7ba96df4cf Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Wed, 19 Oct 2016 12:01:27 -0400

While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early().  Unfortunately, on
some archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.

  Unable to handle kernel paging request for data at address 0x0038
  Faulting instruction address: 0xc00fc0cc
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.8.0-compiler_gcc-6.2.0-next-20161005 #94
  task: c007f540 task.stack: c01ffc084000
  NIP: c00fc0cc LR: c00ed928 CTR: c00fbfd0
  REGS: c01ffc087780 TRAP: 0300   Not tainted  
(4.8.0-compiler_gcc-6.2.0-next-20161005)
  MSR: 92009033   CR: 48000424  XER: 

  CFAR: c00089dc DAR: 0038 DSISR: 4000 SOFTE: 0
  GPR00: c00ed928 c01ffc087a00 c0e63200 c00010d6d600
  GPR04: c007f5409200 0021 0748e08c 001f
  GPR08:  0021 0748f1f8 
  GPR12: 28000422 cfb8 c000e0c8 
  GPR16:   0021 0001
  GPR20: afb50401  c00010d6d600 ba7e
  GPR24: ba7e c0d8bc58 afb504000afb5041 0001
  GPR28:  0004 c007f5409280 
  NIP [c00fc0cc] enqueue_task_fair+0xfc/0x18b0
  LR [c00ed928] activate_task+0x78/0xe0
  Call Trace:
  [c01ffc087a00] [c007f5409200] 0xc007f5409200 (unreliable)
  [c01ffc087b10] [c00ed928] activate_task+0x78/0xe0
  [c01ffc087b50] [c00ede58] ttwu_do_activate+0x68/0xc0
  [c01ffc087b90] [c00ef1b8] try_to_wake_up+0x208/0x4f0
  [c01ffc087c10] [c00d3484] create_worker+0x144/0x250
  [c01ffc087cb0] [c0cd72d0] workqueue_init+0x124/0x150
  [c01ffc087d00] [c0cc0e74] kernel_init_freeable+0x158/0x360
  [c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160
  [c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc
  Instruction dump:
  62940401 3b80 3aa0 7f17c378 3a61 3b61 6000 6000
  6042 72490021 ebfe0150 2f890001  419e0de0 7fbee840 419e0e58
  ---[ end trace  ]---

Fix it by moving wq_numa_init() to workqueue_init().  As this means
that the early initialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().

Signed-off-by: Tejun Heo 
Reported-by: Michael Ellerman 
Link: http://lkml.kernel.org/r/87twck5wqo@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo 
---
 kernel/workqueue.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ad0cd43..c56479b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5495,8 +5495,6 @@ int __init workqueue_init_early(void)
 
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-   wq_numa_init();
-
/* initialize CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;
@@ -5566,9 +5564,32 @@ int __init workqueue_init_early(void)
  */
 int __init workqueue_init(void)
 {
+   struct workqueue_struct *wq;
struct worker_pool *pool;
int cpu, bkt;
 
+   /*
+* It'd be simpler to initialize NUMA in workqueue_init_early() but
+* CPU to node mapping may not be available that early on some
+* archs such as power and arm64.  As per-cpu pools created
+* previously could be missing node hint and unbound pools NUMA
+* affinity, fix them up.
+*/
+   wq_numa_init();
+
+   mutex_lock(&wq_pool_mutex);
+
+   for_each_possible_cpu(cpu) {
+   for_each_cpu_worker_pool(pool, cpu) {
+   pool->node = cpu_to_node(cpu);
+   }
+   }
+
+   list_for_each_entry(wq, &workqueues, list)
+   wq_update_unbound_numa(wq, smp_processor_id(), true);
+
+   mutex_unlock(&wq_pool_mutex);
+
/* create the initial 

Re: [PATCH v4 0/5] implement vcpu preempted check

2016-10-19 Thread Juergen Gross
On 19/10/16 12:20, Pan Xinhui wrote:
> change from v3:
>   add x86 vcpu preempted check patch
> change from v2:
>   no code change, fix typos, update some comments
> change from v1:
>   a simpler definition of the default vcpu_is_preempted
>   skip machine type check on ppc, and add config. remove dedicated macro.
>   add one patch to drop overload of rwsem_spin_on_owner and 
> mutex_spin_on_owner. 
>   add more comments
>   thanks to Boqun's and Peter's suggestions.
> 
> This patch set aims to fix lock holder preemption issues.
> 
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
> 
> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
> 
> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin
> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused rcu stalls before we applied this patch
> set.
> 
> We have also observed some performance improvements.
> 
> PPC test result:
> 
> 1 copy - 0.94%
> 2 copy - 7.17%
> 4 copy - 11.9%
> 8 copy -  3.04%
> 16 copy - 15.11%
> 
> details below:
> Without patch:
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 
> 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 
> 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 
> 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 
> 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 
> s, 1 samples)
> 
> With patch: 
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 
> 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 
> 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 
> 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 
> 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 
> s, 1 samples)
> 
> X86 test result:
>   test-case   after-patch   before-patch
> Execl Throughput   |18307.9 lps  |11701.6 lps 
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
> Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
> Process Creation   |29881.2 lps  |28572.8 lps 
> Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
> Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
> System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 
> 
> Pan Xinhui (5):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, kvm: support vcpu preempted check

The attached patch adds Xen support for x86. Please tell me whether you
want to add this patch to your series or if I should post it when your
series has been accepted.

You can add my

Tested-by: Juergen Gross 

for patches 1-3 and 5 (paravirt parts only).


Juergen

> 
>  arch/powerpc/include/asm/spinlock.h   |  8 
>  arch/x86/include/asm/paravirt_types.h |  6 ++
>  arch/x86/include/asm/spinlock.h   |  8 
>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>  arch/x86/kernel/kvm.c | 11 +++
>  arch/x86/kernel/paravirt.c| 11 +++
>  arch/x86/kvm/x86.c| 12 
>  include/linux/sched.h | 12 
>  kernel/locking/mutex.c| 15 +--
>  kernel/locking/osq_lock.c | 10 +-
>  kernel/locking/rwsem-xadd.c   | 16 +---
>  11 files changed, 105 insertions(+), 7 deletions(-)
> 

From c79b86d00a812d6207ef788d453e2d0289ef22a0 Mon Sep 17 00:00:00 2001
From: Juergen Gross 
Date: Wed, 19 Oct 2016 15:30:59 +0200
Subject: [PATCH] x86, xen: support vcpu preempted check

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical 

RE: [RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals

2016-10-19 Thread David Laight
From: Balbir Singh
> Sent: 19 October 2016 15:00
...
> Here is an example
> 
> - *slot = retbuf[0];
> + *slot = retvals.v[0];
> 
> Could we hide retvals.v[0] under a macro like
> 
> *slot = hcalls_ret_val(retvals, 0);

Ugg..

> Since we could end up with similar issues if
> someone dereferenced retvals.v[4]

The compiler will detect that these days.

David



Re: [PATCH] net: fs_enet: Use net_device_stats from struct net_device

2016-10-19 Thread David Miller
From: Tobias Klauser 
Date: Wed, 19 Oct 2016 11:24:57 +0200

> Instead of using a private copy of struct net_device_stats in struct
> fs_enet_private, use stats from struct net_device. Also remove the now
> unnecessary .ndo_get_stats function.
> 
> Signed-off-by: Tobias Klauser 

Applied to net-next, thanks.


Re: [RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals

2016-10-19 Thread Balbir Singh


On 19/10/16 22:47, Michael Ellerman wrote:
> Balbir Singh  writes:
> 
>> On 18/10/16 19:40, Michael Ellerman wrote:
>>> We have now had two nasty stack corruption bugs caused by incorrect
>>> sizing of the return buffer for plpar_hcall()/plpar_hcall9().
>>>
>>> To avoid any more such bugs, define a type which encodes the size of the
>>> return buffer, and change the argument of plpar_hcall() to be of that
>>> type, meaning the compiler will check for us that we passed the right
>>> size buffer.
>>>
>>> There isn't an easy way to do this incrementally, without introducing a
>>> new function name, eg. plpar_hcall_with_struct(), which is ugly as hell.
>>> So just do it in one tree-wide change.
>>>
> Conceptually looks good, but I think we need to abstract the return values
>> as well. I'll test and see if I can send you something on top of this
> 
> Not sure I know what you mean.

Here is an example

-   *slot = retbuf[0];
+   *slot = retvals.v[0];

Could we hide retvals.v[0] under a macro like 

*slot = hcalls_ret_val(retvals, 0);

Since we could end up with similar issues if
someone dereferenced retvals.v[4]

Since we are abstracting under retvals, I was wondering
if we want to further abstract the return values
as well and make retvals opaque to the user


Balbir Singh.





RE: [PATCH] powerpc: link error on orphan sections

2016-10-19 Thread David Laight
From: Michael Ellerman [mailto:m...@ellerman.id.au]
> > In the past I've caused orphan sections to error by assigning them
> > to the same address as something that exists.
> > Works with all linkers, even if the error message isn't as useful.
> 
> How do you assign them an address without knowing what they are? With a
> wild card rule?

Right at the end of the ldscript I have:

/* Anything from any unexpected section ends up here and
 * generates an error because this overlaps another section */
unwanted _gp : { *(*) }
}

David



Re: [PATCH v2] powerpc/hash64: Be more careful when generating tlbiel

2016-10-19 Thread Michael Ellerman
Michael Ellerman  writes:

> From: Balbir Singh 
...
>
> Although there may be very old toolchains which don't understand tlbiel, we 
> have
> other code in the tree which has been using tlbiel for over five years, and no
> one has reported any build failures, so just let the assembler generate the
> instructions.

Turns out this part is not true, it depends on the -mcpu you pass. So
we'll have to go back to using the macros for generating the two
argument form. v3 to come.

cheers


RE: [PATCH] powerpc: link error on orphan sections

2016-10-19 Thread Michael Ellerman
David Laight  writes:

> From: Michael Ellerman
>> Sent: 14 October 2016 01:46
> ...
>> > +LDFLAGS_vmlinux   := $(LDFLAGS_vmlinux-y) --orphan-handling=error
>> 
>> At least some old(er) toolchains don't support that:
>> 
>>   /opt/cross/kisskb/gcc-4.6.3-nolibc/powerpc-linux/bin/powerpc-linux-ld:
>>   unrecognized option '--orphan-handling=error'
>> 
>> So we'll need an ld-option wrapper around it.
>
> In the past I've caused orphan sections to error by assigning them
> to the same address as something that exists.
> Works with all linkers, even if the error message isn't as useful.

How do you assign them an address without knowing what they are? With a
wild card rule?

Anyway I think I'm happy for this to only work on newer toolchains.
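The simplest way is probably to wrap it in the existing ld-option helper so
that older linkers just don't get the flag, something like (untested sketch):

  # arch/powerpc/Makefile
  LDFLAGS_vmlinux := $(LDFLAGS_vmlinux-y) $(call ld-option, --orphan-handling=error)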

cheers


Re: [Patch v5 06/12] powerpc/virtex: Use generic xilinx irqchip driver

2016-10-19 Thread Michael Ellerman
Zubair Lutfullah Kakakhel  writes:

> The Xilinx interrupt controller driver is now available in drivers/irqchip.
> Switch to using that driver.
>
> Signed-off-by: Zubair Lutfullah Kakakhel 
>
> ---
> V5 New patch
>
> Tested on virtex440-ml507 using qemu

I don't have one of these to test on, and the patch looks sane, so
that's good enough for me.

Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [RFC PATCH 2/3] powerpc/pseries: Define & use a type for the plpar_hcall() retvals

2016-10-19 Thread Michael Ellerman
Balbir Singh  writes:

> On 18/10/16 19:40, Michael Ellerman wrote:
>> We have now had two nasty stack corruption bugs caused by incorrect
>> sizing of the return buffer for plpar_hcall()/plpar_hcall9().
>> 
>> To avoid any more such bugs, define a type which encodes the size of the
>> return buffer, and change the argument of plpar_hcall() to be of that
>> type, meaning the compiler will check for us that we passed the right
>> size buffer.
>> 
>> There isn't an easy way to do this incrementally, without introducing a
>> new function name, eg. plpar_hcall_with_struct(), which is ugly as hell.
>> So just do it in one tree-wide change.
>> 
> Conceptually looks good, but I think we need to abstract the return values
> as well. I'll test and see if I can send you something on top of this

Not sure I know what you mean.

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-19 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote:
>> That doesn't compile, wq doesn't exist.
>> 
>> I guessed that you meant:
>> 
>> +   wq_numa_init();
>> +   list_for_each_entry(wq, &workqueues, list)
>> +   wq_update_unbound_numa(wq, smp_processor_id(), true);
>
> Yeap, sorry about that.

No worries.

>> And that does boot.
>> 
>> The sysrq-t output is below, it's rather large.
>
> Didn't expect that many cpus but it looks good.

I have another system here with twice as many CPUs if that would help ;)

> I'll post proper patches soon.

Thanks. Can you pull the current series out of linux-next for now until
you merge the new series?

cheers


Re: [PATCH 10/10] mm: replace access_process_vm() write parameter with gup_flags

2016-10-19 Thread Michael Ellerman
Lorenzo Stoakes  writes:

> diff --git a/arch/powerpc/kernel/ptrace32.c b/arch/powerpc/kernel/ptrace32.c
> index f52b7db3..010b7b3 100644
> --- a/arch/powerpc/kernel/ptrace32.c
> +++ b/arch/powerpc/kernel/ptrace32.c
> @@ -74,7 +74,7 @@ long compat_arch_ptrace(struct task_struct *child, 
> compat_long_t request,
>   break;
>  
>   copied = access_process_vm(child, (u64)addrOthers, &tmp,
> - sizeof(tmp), 0);
> + sizeof(tmp), FOLL_FORCE);
>   if (copied != sizeof(tmp))
>   break;
>   ret = put_user(tmp, (u32 __user *)data);

LGTM.

> @@ -179,7 +179,8 @@ long compat_arch_ptrace(struct task_struct *child, 
> compat_long_t request,
>   break;
>   ret = 0;
>   if (access_process_vm(child, (u64)addrOthers, &tmp,
> - sizeof(tmp), 1) == sizeof(tmp))
> + sizeof(tmp),
> + FOLL_FORCE | FOLL_WRITE) == sizeof(tmp))
>   break;
>   ret = -EIO;
>   break;

If you're respinning this anyway, can you format that as:

if (access_process_vm(child, (u64)addrOthers, &tmp, sizeof(tmp),
  FOLL_FORCE | FOLL_WRITE) == sizeof(tmp))
break;

I realise you probably deliberately didn't do that to make the diff clearer.

Either way:

Acked-by: Michael Ellerman  (powerpc)


cheers


Re: [PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set

2016-10-19 Thread Michael Ellerman
Nicholas Piggin  writes:

> Enable thin archives build for powerpc when COMPILE_TEST is set.
> Thin archives are explained in this commit:
>
>   a5967db9af51a84f5e181600954714a9e4c69f1f
>
> This is a gradual way to introduce the option to testers.

I think I'd rather this was actually a user-selectable option, eg.

config USE_THIN_ARCHIVES
bool "Build the kernel using thin archives"
default n
select THIN_ARCHIVES
help
  Build the kernel using thin archives.
  If you're unsure say N.


cheers


Re: [PATCH 3/7] powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors

2016-10-19 Thread Michael Ellerman
Nicholas Piggin  writes:

> Direct banches from code below __end_interrupts to code above
> __end_interrupts when built with CONFIG_RELOCATABLE are disallowed
> because they will break when the kernel is not located at 0.
>
> Sample output:
>
> WARNING: Unrelocated relative branches
> c118 bl-> 0xc0038fb8 
> c13c b-> 0xc01068a4 
> c148 b-> 0xc003919c 
> c14c b-> 0xc003923c 
> c5a4 b-> 0xc0106ffc 
> c0001af0 b-> 0xc0106ffc 
> c0001b24 b-> 0xc0106ffc 
> c0001b58 b-> 0xc0106ffc 
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/Makefile.postlink   |  9 -
>  arch/powerpc/tools/unrel_branch_check.sh | 56 
> 

arch/powerpc/scripts is meant for these kinds of scripts (there's only
one there ATM, and yes we should move others in there).

cheers


[PATCH] net: fs_enet: Use net_device_stats from struct net_device

2016-10-19 Thread Tobias Klauser
Instead of using a private copy of struct net_device_stats in struct
fs_enet_private, use stats from struct net_device. Also remove the now
unnecessary .ndo_get_stats function.

Signed-off-by: Tobias Klauser 
---
 .../net/ethernet/freescale/fs_enet/fs_enet-main.c  | 43 +-
 drivers/net/ethernet/freescale/fs_enet/fs_enet.h   |  1 -
 2 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c 
b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
index 925d1bca6a26..44f50e168703 100644
--- a/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
+++ b/drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c
@@ -118,22 +118,22 @@ static int fs_enet_napi(struct napi_struct *napi, int 
budget)
  BD_ENET_TX_RL | BD_ENET_TX_UN | BD_ENET_TX_CSL)) {
 
if (sc & BD_ENET_TX_HB) /* No heartbeat */
-   fep->stats.tx_heartbeat_errors++;
+   dev->stats.tx_heartbeat_errors++;
if (sc & BD_ENET_TX_LC) /* Late collision */
-   fep->stats.tx_window_errors++;
+   dev->stats.tx_window_errors++;
if (sc & BD_ENET_TX_RL) /* Retrans limit */
-   fep->stats.tx_aborted_errors++;
+   dev->stats.tx_aborted_errors++;
if (sc & BD_ENET_TX_UN) /* Underrun */
-   fep->stats.tx_fifo_errors++;
+   dev->stats.tx_fifo_errors++;
if (sc & BD_ENET_TX_CSL)/* Carrier lost */
-   fep->stats.tx_carrier_errors++;
+   dev->stats.tx_carrier_errors++;
 
if (sc & (BD_ENET_TX_LC | BD_ENET_TX_RL | 
BD_ENET_TX_UN)) {
-   fep->stats.tx_errors++;
+   dev->stats.tx_errors++;
do_restart = 1;
}
} else
-   fep->stats.tx_packets++;
+   dev->stats.tx_packets++;
 
if (sc & BD_ENET_TX_READY) {
dev_warn(fep->dev,
@@ -145,7 +145,7 @@ static int fs_enet_napi(struct napi_struct *napi, int 
budget)
 * but we eventually sent the packet OK.
 */
if (sc & BD_ENET_TX_DEF)
-   fep->stats.collisions++;
+   dev->stats.collisions++;
 
/* unmap */
if (fep->mapped_as_page[dirtyidx])
@@ -212,19 +212,19 @@ static int fs_enet_napi(struct napi_struct *napi, int 
budget)
 */
if (sc & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_CL |
  BD_ENET_RX_NO | BD_ENET_RX_CR | BD_ENET_RX_OV)) {
-   fep->stats.rx_errors++;
+   dev->stats.rx_errors++;
/* Frame too long or too short. */
if (sc & (BD_ENET_RX_LG | BD_ENET_RX_SH))
-   fep->stats.rx_length_errors++;
+   dev->stats.rx_length_errors++;
/* Frame alignment */
if (sc & (BD_ENET_RX_NO | BD_ENET_RX_CL))
-   fep->stats.rx_frame_errors++;
+   dev->stats.rx_frame_errors++;
/* CRC Error */
if (sc & BD_ENET_RX_CR)
-   fep->stats.rx_crc_errors++;
+   dev->stats.rx_crc_errors++;
/* FIFO overrun */
if (sc & BD_ENET_RX_OV)
-   fep->stats.rx_crc_errors++;
+   dev->stats.rx_crc_errors++;
 
skbn = fep->rx_skbuff[curidx];
} else {
@@ -233,9 +233,9 @@ static int fs_enet_napi(struct napi_struct *napi, int 
budget)
/*
 * Process the incoming frame.
 */
-   fep->stats.rx_packets++;
+   dev->stats.rx_packets++;
pkt_len = CBDR_DATLEN(bdp) - 4; /* remove CRC */
-   fep->stats.rx_bytes += pkt_len + 4;
+   dev->stats.rx_bytes += pkt_len + 4;
 
if (pkt_len <= fpi->rx_copybreak) {
/* +2 to make IP header L1 cache aligned */
@@ -277,7 +277,7 @@ static int fs_enet_napi(struct napi_struct *napi, int 
budget)
received++;
netif_receive_skb(skb);
} else {
-   fep->stats.rx_dropped++;
+   

Re: [PATCH 6/7] powerpc: switch to using thin archives if COMPILE_TEST is set

2016-10-19 Thread Nicholas Piggin
On Wed, 19 Oct 2016 16:57:23 +1100
Stephen Rothwell  wrote:

> Hi Nick,
> 
> On Wed, 19 Oct 2016 14:15:59 +1100 Nicholas Piggin  wrote:
> >
> > Enable thin archives build for powerpc when COMPILE_TEST is set.
> > Thin archives are explained in this commit:
> > 
> >   a5967db9af51a84f5e181600954714a9e4c69f1f  
> 
> This reference should be like this:
> 
>   a5967db9af51 ("kbuild: allow architectures to use thin archives instead of 
> ld -r")
> 

Hi Stephen,

Good point, I'll change it.

Thanks,
Nick


Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-19 Thread Michal Hocko
On Wed 19-10-16 10:06:46, Lorenzo Stoakes wrote:
> On Wed, Oct 19, 2016 at 10:52:05AM +0200, Michal Hocko wrote:
> > yes this is the desirable and expected behavior.
> >
> > > wonder if this is desirable behaviour or whether this ought to be limited 
> > > to
> > > ptrace system calls. Regardless, by making the flag more visible it makes 
> > > it
> > > easier to see that this is happening.
> >
> > mem_open already enforces PTRACE_MODE_ATTACH
> 
> Ah I missed this, that makes a lot of sense, thanks!
> 
> I still wonder whether other invocations of access_remote_vm() in 
> fs/proc/base.c
> (the principal caller of this function) need FOLL_FORCE, for example the 
> various
> calls that simply read data from other processes, so I think the point stands
> about keeping this explicit.

I do agree. Making them explicit will help to clean them up later,
should there be a need.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-19 Thread Michal Hocko
On Wed 19-10-16 09:58:15, Lorenzo Stoakes wrote:
> On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
> > I am wondering whether we can go further. E.g. it is not really clear to
> > me whether we need an explicit FOLL_REMOTE when we can in fact check
> > mm != current->mm and imply that. Maybe there are some contexts which
> > wouldn't work, I haven't checked.
> 
> This flag is set even when /proc/self/mem is used. I've not looked deeply into
> this flag but perhaps accessing your own memory this way can be considered
> 'remote' since you're not accessing it directly. On the other hand, perhaps 
> this
> is just mistaken in this case?

My understanding of the flag is quite limited as well. All I know is that it is
related to protection keys and that it is needed to bypass the protection check.
See arch_vma_access_permitted. See also 1b2ee1266ea6 ("mm/core: Do not
enforce PKEY permissions on remote mm access").

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-19 Thread Lorenzo Stoakes
On Wed, Oct 19, 2016 at 10:52:05AM +0200, Michal Hocko wrote:
> yes this is the desirable and expected behavior.
>
> > wonder if this is desirable behaviour or whether this ought to be limited to
> > ptrace system calls. Regardless, by making the flag more visible it makes it
> > easier to see that this is happening.
>
> mem_open already enforces PTRACE_MODE_ATTACH

Ah I missed this, that makes a lot of sense, thanks!

I still wonder whether other invocations of access_remote_vm() in fs/proc/base.c
(the principal caller of this function) need FOLL_FORCE, for example the various
calls that simply read data from other processes, so I think the point stands
about keeping this explicit.


Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-19 Thread Lorenzo Stoakes
On Tue, Oct 18, 2016 at 05:30:50PM +0200, Michal Hocko wrote:
> I am wondering whether we can go further. E.g. it is not really clear to
> me whether we need an explicit FOLL_REMOTE when we can in fact check
> mm != current->mm and imply that. Maybe there are some contexts which
> wouldn't work, I haven't checked.

This flag is set even when /proc/self/mem is used. I've not looked deeply into
this flag but perhaps accessing your own memory this way can be considered
'remote' since you're not accessing it directly. On the other hand, perhaps this
is just mistaken in this case?

> I guess there is more work in that area and I do not want to impose all
> that work on you, but I couldn't resist once I saw you playing in that
> area ;) Definitely a good start!

Thanks, I am more than happy to go as far down this rabbit hole as is helpful,
no imposition at all :)


Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-19 Thread Michal Hocko
On Wed 19-10-16 09:40:45, Lorenzo Stoakes wrote:
> On Wed, Oct 19, 2016 at 10:13:52AM +0200, Michal Hocko wrote:
> > On Wed 19-10-16 09:59:03, Jan Kara wrote:
> > > On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> > > > This patch removes the write parameter from __access_remote_vm() and 
> > > > replaces it
> > > > with a gup_flags parameter as use of this function previously _implied_
> > > > FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
> > > >
> > > > We make this explicit as use of FOLL_FORCE can result in surprising 
> > > > behaviour
> > > > (and hence bugs) within the mm subsystem.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes 
> > >
> > > So I'm not convinced this (and the following two patches) is actually
> > > helping much. By grepping for FOLL_FORCE we will easily see that any 
> > > caller
> > > of access_remote_vm() gets that semantics and can thus continue search
> >
> > I am really wondering. Is there anything inherent that would require
> > FOLL_FORCE for access_remote_vm? I mean FOLL_FORCE is a really
> > non-trivial thing. It doesn't obey vma permissions so we should really
> > minimize its usage. Do all of those users really need FOLL_FORCE?
> 
> I wonder about this also, for example by accessing /proc/self/mem you trigger
> access_remote_vm() and consequently get_user_pages_remote() meaning FOLL_FORCE
> is implied and you can use /proc/self/mem to override any VMA permissions. I

yes this is the desirable and expected behavior. 

> wonder if this is desirable behaviour or whether this ought to be limited to
> ptrace system calls. Regardless, by making the flag more visible it makes it
> easier to see that this is happening.

mem_open already enforces PTRACE_MODE_ATTACH

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-19 Thread Lorenzo Stoakes
On Wed, Oct 19, 2016 at 10:13:52AM +0200, Michal Hocko wrote:
> On Wed 19-10-16 09:59:03, Jan Kara wrote:
> > On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> > > This patch removes the write parameter from __access_remote_vm() and 
> > > replaces it
> > > with a gup_flags parameter as use of this function previously _implied_
> > > FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
> > >
> > > We make this explicit as use of FOLL_FORCE can result in surprising 
> > > behaviour
> > > (and hence bugs) within the mm subsystem.
> > >
> > > Signed-off-by: Lorenzo Stoakes 
> >
> > So I'm not convinced this (and the following two patches) is actually
> > helping much. By grepping for FOLL_FORCE we will easily see that any caller
> > of access_remote_vm() gets that semantics and can thus continue search
>
> I am really wondering. Is there anything inherent that would require
> FOLL_FORCE for access_remote_vm? I mean FOLL_FORCE is a really
> non-trivial thing. It doesn't obey vma permissions so we should really
> minimize its usage. Do all of those users really need FOLL_FORCE?

I wonder about this also, for example by accessing /proc/self/mem you trigger
access_remote_vm() and consequently get_user_pages_remote() meaning FOLL_FORCE
is implied and you can use /proc/self/mem to override any VMA permissions. I
wonder if this is desirable behaviour or whether this ought to be limited to
ptrace system calls. Regardless, by making the flag more visible it makes it
easier to see that this is happening.

Note that this /proc/self/mem behaviour is what triggered the bug that brought
about this discussion in the first place -
https://marc.info/?l=linux-mm&m=147363447105059 - as using /proc/self/mem this
way on PROT_NONE memory broke some assumptions.

On the other hand I see your point Jan in that you know any caller of these
functions will have FOLL_FORCE implied, and you have to decide to stop passing
the flag at _some_ point in the stack, however it seems fairly sane to have that
stopping point exist outside of exported gup functions so the gup interface
forces explicitness but callers can wrap things as they need.
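
To make the difference concrete at a typical call site, before vs. after the
series (illustrative fragment only, not taken from a specific driver):

	/* before: 'force' is a bare int and FOLL_FORCE is applied on the
	 * caller's behalf */
	ret = get_user_pages(addr, 1, 1 /* write */, 1 /* force */,
			     &page, NULL);

	/* after: the caller spells out exactly which gup flags it wants */
	ret = get_user_pages(addr, 1, FOLL_WRITE | FOLL_FORCE, &page, NULL);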


Re: [v12, 0/8] Fix eSDHC host version register bug

2016-10-19 Thread gre...@linuxfoundation.org
On Wed, Oct 19, 2016 at 02:47:07AM +, Y.B. Lu wrote:
> + Greg
> 
> Hi Greg,
> 
> I submitted this patchset for an MMC bug fix, and it introduces the patch
> below, which needs your ACK.
> > > Arnd Bergmann (1):
> > >   base: soc: introduce soc_device_match() interface
> https://patchwork.kernel.org/patch/9342913/
> 
> Could you help to review it and give some comments or ACK.
> Thank you very much.

Now acked.



Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-19 Thread Michal Hocko
On Wed 19-10-16 09:59:03, Jan Kara wrote:
> On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> > This patch removes the write parameter from __access_remote_vm() and 
> > replaces it
> > with a gup_flags parameter as use of this function previously _implied_
> > FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
> > 
> > We make this explicit as use of FOLL_FORCE can result in surprising 
> > behaviour
> > (and hence bugs) within the mm subsystem.
> > 
> > Signed-off-by: Lorenzo Stoakes 
> 
> So I'm not convinced this (and the following two patches) is actually
> helping much. By grepping for FOLL_FORCE we will easily see that any caller
> of access_remote_vm() gets that semantics and can thus continue search

I am really wondering. Is there anything inherent that would require
FOLL_FORCE for access_remote_vm? I mean FOLL_FORCE is a really
non-trivial thing. It doesn't obey vma permissions so we should really
minimize its usage. Do all of those users really need FOLL_FORCE?

Anyway I would rather see the flag explicit and used at more places than
hidden behind a helper function.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> This patch removes the write parameter from __access_remote_vm() and replaces 
> it
> with a gup_flags parameter as use of this function previously _implied_
> FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
> 
> We make this explicit as use of FOLL_FORCE can result in surprising behaviour
> (and hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes 

So I'm not convinced this (and the following two patches) is actually
helping much. By grepping for FOLL_FORCE we will easily see that any caller
of access_remote_vm() gets that semantics and can thus continue search
accordingly (it is much simpler than searching for all get_user_pages()
users and extracting from parameter lists what they actually pass as
'force' argument). Sure it makes somewhat more visible to callers of
access_remote_vm() that they get FOLL_FORCE semantics but OTOH it also
opens a space for issues where a caller of access_remote_vm() actually
wants FOLL_FORCE (and currently all of them want it) and just mistakenly
does not set it. All in all I'd prefer to keep access_remote_vm() and
friends as is...
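
To be clear about what I mean by keeping the flag hidden: the helper itself
can keep supplying it so its callers never see it, roughly (sketch only, and
more or less what your patch already does at the access_remote_vm() level):

	int access_remote_vm(struct mm_struct *mm, unsigned long addr,
			     void *buf, int len, int write)
	{
		unsigned int flags = FOLL_FORCE;

		if (write)
			flags |= FOLL_WRITE;

		return __access_remote_vm(NULL, mm, addr, buf, len, flags);
	}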

Honza

> ---
>  mm/memory.c | 23 +++
>  mm/nommu.c  |  9 ++---
>  2 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 20a9adb..79ebed3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3869,14 +3869,11 @@ EXPORT_SYMBOL_GPL(generic_access_phys);
>   * given task for page fault accounting.
>   */
>  static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> - unsigned long addr, void *buf, int len, int write)
> + unsigned long addr, void *buf, int len, unsigned int gup_flags)
>  {
>   struct vm_area_struct *vma;
>   void *old_buf = buf;
> - unsigned int flags = FOLL_FORCE;
> -
> - if (write)
> - flags |= FOLL_WRITE;
> + int write = gup_flags & FOLL_WRITE;
>  
>   down_read(&mm->mmap_sem);
>   /* ignore errors, just check how much was successfully transferred */
> @@ -3886,7 +3883,7 @@ static int __access_remote_vm(struct task_struct *tsk, 
> struct mm_struct *mm,
>   struct page *page = NULL;
>  
>   ret = get_user_pages_remote(tsk, mm, addr, 1,
> - flags, &page, &vma);
> + gup_flags, &page, &vma);
>   if (ret <= 0) {
>  #ifndef CONFIG_HAVE_IOREMAP_PROT
>   break;
> @@ -3945,7 +3942,12 @@ static int __access_remote_vm(struct task_struct *tsk, 
> struct mm_struct *mm,
>  int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>   void *buf, int len, int write)
>  {
> - return __access_remote_vm(NULL, mm, addr, buf, len, write);
> + unsigned int flags = FOLL_FORCE;
> +
> + if (write)
> + flags |= FOLL_WRITE;
> +
> + return __access_remote_vm(NULL, mm, addr, buf, len, flags);
>  }
>  
>  /*
> @@ -3958,12 +3960,17 @@ int access_process_vm(struct task_struct *tsk, 
> unsigned long addr,
>  {
>   struct mm_struct *mm;
>   int ret;
> + unsigned int flags = FOLL_FORCE;
>  
>   mm = get_task_mm(tsk);
>   if (!mm)
>   return 0;
>  
> - ret = __access_remote_vm(tsk, mm, addr, buf, len, write);
> + if (write)
> + flags |= FOLL_WRITE;
> +
> + ret = __access_remote_vm(tsk, mm, addr, buf, len, flags);
> +
>   mmput(mm);
>  
>   return ret;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 70cb844..bde7df3 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1809,9 +1809,10 @@ void filemap_map_pages(struct fault_env *fe,
>  EXPORT_SYMBOL(filemap_map_pages);
>  
>  static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> - unsigned long addr, void *buf, int len, int write)
> + unsigned long addr, void *buf, int len, unsigned int gup_flags)
>  {
>   struct vm_area_struct *vma;
> + int write = gup_flags & FOLL_WRITE;
>  
>   down_read(&mm->mmap_sem);
>  
> @@ -1853,7 +1854,8 @@ static int __access_remote_vm(struct task_struct *tsk, 
> struct mm_struct *mm,
>  int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>   void *buf, int len, int write)
>  {
> - return __access_remote_vm(NULL, mm, addr, buf, len, write);
> + return __access_remote_vm(NULL, mm, addr, buf, len,
> + write ? FOLL_WRITE : 0);
>  }
>  
>  /*
> @@ -1871,7 +1873,8 @@ int access_process_vm(struct task_struct *tsk, unsigned 
> long addr, void *buf, in
>   if (!mm)
>   return 0;
>  
> - len = __access_remote_vm(tsk, mm, addr, buf, len, write);
> + len = __access_remote_vm(tsk, mm, addr, buf, len,
> + write ? FOLL_WRITE : 0);
>  
>   mmput(mm);
>   return len;
> -- 
> 2.10.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 07/10] mm: replace get_user_pages_remote() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:17, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages_remote()
> and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
> explicit in callers as use of this flag can result in surprising behaviour 
> (and
> hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gem.c   |  7 +--
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  6 +-
>  drivers/infiniband/core/umem_odp.c  |  7 +--
>  fs/exec.c   |  9 +++--
>  include/linux/mm.h  |  2 +-
>  kernel/events/uprobes.c |  6 --
>  mm/gup.c| 22 +++---
>  mm/memory.c |  6 +-
>  security/tomoyo/domain.c|  2 +-
>  9 files changed, 40 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c 
> b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> index 5ce3603..0370b84 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> @@ -748,19 +748,22 @@ static struct page **etnaviv_gem_userptr_do_get_pages(
>   int ret = 0, pinned, npages = etnaviv_obj->base.size >> PAGE_SHIFT;
>   struct page **pvec;
>   uintptr_t ptr;
> + unsigned int flags = 0;
>  
>   pvec = drm_malloc_ab(npages, sizeof(struct page *));
>   if (!pvec)
>   return ERR_PTR(-ENOMEM);
>  
> + if (!etnaviv_obj->userptr.ro)
> + flags |= FOLL_WRITE;
> +
>   pinned = 0;
>   ptr = etnaviv_obj->userptr.ptr;
>  
>   down_read(&mm->mmap_sem);
>   while (pinned < npages) {
>   ret = get_user_pages_remote(task, mm, ptr, npages - pinned,
> - !etnaviv_obj->userptr.ro, 0,
> - pvec + pinned, NULL);
> + flags, pvec + pinned, NULL);
>   if (ret < 0)
>   break;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index e537930..c6f780f 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -508,6 +508,10 @@ __i915_gem_userptr_get_pages_worker(struct work_struct 
> *_work)
>   pvec = drm_malloc_gfp(npages, sizeof(struct page *), GFP_TEMPORARY);
>   if (pvec != NULL) {
>   struct mm_struct *mm = obj->userptr.mm->mm;
> + unsigned int flags = 0;
> +
> + if (!obj->userptr.read_only)
> + flags |= FOLL_WRITE;
>  
>   ret = -EFAULT;
>   if (atomic_inc_not_zero(&mm->mm_users)) {
> @@ -517,7 +521,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct 
> *_work)
>   (work->task, mm,
>obj->userptr.ptr + pinned * PAGE_SIZE,
>npages - pinned,
> -  !obj->userptr.read_only, 0,
> +  flags,
>pvec + pinned, NULL);
>   if (ret < 0)
>   break;
> diff --git a/drivers/infiniband/core/umem_odp.c 
> b/drivers/infiniband/core/umem_odp.c
> index 75077a0..1f0fe32 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -527,6 +527,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
> user_virt, u64 bcnt,
>   u64 off;
>   int j, k, ret = 0, start_idx, npages = 0;
>   u64 base_virt_addr;
> + unsigned int flags = 0;
>  
>   if (access_mask == 0)
>   return -EINVAL;
> @@ -556,6 +557,9 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
> user_virt, u64 bcnt,
>   goto out_put_task;
>   }
>  
> + if (access_mask & ODP_WRITE_ALLOWED_BIT)
> + flags |= FOLL_WRITE;
> +
>   start_idx = (user_virt - ib_umem_start(umem)) >> PAGE_SHIFT;
>   k = start_idx;
>  
> @@ -574,8 +578,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
> user_virt, u64 bcnt,
>*/
>   npages = get_user_pages_remote(owning_process, owning_mm,
>   user_virt, gup_num_pages,
> - access_mask & ODP_WRITE_ALLOWED_BIT,
> - 0, local_page_list, NULL);
> + flags, local_page_list, NULL);
>   up_read(&owning_mm->mmap_sem);
>  
>   if (npages < 0)
> diff --git a/fs/exec.c b/fs/exec.c
> index 6fcfb3f..4e497b9 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -191,6 +191,7 @@ static 

Re: [PATCH 06/10] mm: replace get_user_pages() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:16, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages() and
> replaces them with a gup_flags parameter to make the use of FOLL_FORCE 
> explicit
> in callers as use of this flag can result in surprising behaviour (and hence
> bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes 

The patch looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  arch/cris/arch-v32/drivers/cryptocop.c |  4 +---
>  arch/ia64/kernel/err_inject.c  |  2 +-
>  arch/x86/mm/mpx.c  |  5 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  7 +--
>  drivers/gpu/drm/radeon/radeon_ttm.c|  3 ++-
>  drivers/gpu/drm/via/via_dmablit.c  |  4 ++--
>  drivers/infiniband/core/umem.c |  6 +-
>  drivers/infiniband/hw/mthca/mthca_memfree.c|  2 +-
>  drivers/infiniband/hw/qib/qib_user_pages.c |  3 ++-
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |  5 -
>  drivers/media/v4l2-core/videobuf-dma-sg.c  |  7 +--
>  drivers/misc/mic/scif/scif_rma.c   |  3 +--
>  drivers/misc/sgi-gru/grufault.c|  2 +-
>  drivers/platform/goldfish/goldfish_pipe.c  |  3 ++-
>  drivers/rapidio/devices/rio_mport_cdev.c   |  3 ++-
>  .../vc04_services/interface/vchiq_arm/vchiq_2835_arm.c |  3 +--
>  .../vc04_services/interface/vchiq_arm/vchiq_arm.c  |  3 +--
>  drivers/virt/fsl_hypervisor.c  |  4 ++--
>  include/linux/mm.h |  2 +-
>  mm/gup.c   | 12 +++-
>  mm/mempolicy.c |  2 +-
>  mm/nommu.c | 18 
> --
>  22 files changed, 49 insertions(+), 54 deletions(-)
> 
> diff --git a/arch/cris/arch-v32/drivers/cryptocop.c 
> b/arch/cris/arch-v32/drivers/cryptocop.c
> index b5698c8..099e170 100644
> --- a/arch/cris/arch-v32/drivers/cryptocop.c
> +++ b/arch/cris/arch-v32/drivers/cryptocop.c
> @@ -2722,7 +2722,6 @@ static int cryptocop_ioctl_process(struct inode *inode, 
> struct file *filp, unsig
>   err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
>noinpages,
>0,  /* read access only for in data */
> -  0, /* no force */
>inpages,
>NULL);
>  
> @@ -2736,8 +2735,7 @@ static int cryptocop_ioctl_process(struct inode *inode, 
> struct file *filp, unsig
>   if (oper.do_cipher){
>   err = get_user_pages((unsigned long int)oper.cipher_outdata,
>nooutpages,
> -  1, /* write access for out data */
> -  0, /* no force */
> +  FOLL_WRITE, /* write access for out data */
>outpages,
>NULL);
>   up_read(&current->mm->mmap_sem);
> diff --git a/arch/ia64/kernel/err_inject.c b/arch/ia64/kernel/err_inject.c
> index 09f8457..5ed0ea9 100644
> --- a/arch/ia64/kernel/err_inject.c
> +++ b/arch/ia64/kernel/err_inject.c
> @@ -142,7 +142,7 @@ store_virtual_to_phys(struct device *dev, struct 
> device_attribute *attr,
>   u64 virt_addr=simple_strtoull(buf, NULL, 16);
>   int ret;
>  
> - ret = get_user_pages(virt_addr, 1, VM_READ, 0, NULL, NULL);
> + ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL);
>   if (ret<=0) {
>  #ifdef ERR_INJ_DEBUG
>   printk("Virtual address %lx is not existing.\n",virt_addr);
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index 8047687..e4f8009 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -544,10 +544,9 @@ static int mpx_resolve_fault(long __user *addr, int 
> write)
>  {
>   long gup_ret;
>   int nr_pages = 1;
> - int force = 0;
>  
> - gup_ret = get_user_pages((unsigned long)addr, nr_pages, write,
> - force, NULL, NULL);
> + gup_ret = get_user_pages((unsigned long)addr, nr_pages,
> + write ? FOLL_WRITE : 0, NULL, NULL);
>   /*
>* get_user_pages() returns number of pages gotten.
>* 0 means we failed to fault in and get anything,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 887483b..dcaf691 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -555,10 +555,13 @@ struct amdgpu_ttm_tt {
>  int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page **pages)
>  {
>   

Re: [PATCH 0/7] build updates

2016-10-19 Thread Nicholas Piggin
On Wed, 19 Oct 2016 14:15:53 +1100
Nicholas Piggin  wrote:

> [*] Building allyesconfig still requires KALLSYMS_EXTRA_PASS=1, which
> I'm yet to look into.

Oh, it's because the kallsyms payload increases kernel image size and that
causes more linker stubs to be generated, which have symbols, which go into
kallsyms... What a nightmare.

We can use --no-emit-stub-syms, but it's kind of nice to have names for
things.

Thanks,
Nick


Re: [PATCH 05/10] mm: replace get_vaddr_frames() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:15, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_vaddr_frames() and
> replaces them with a gup_flags parameter to make the use of FOLL_FORCE 
> explicit
> in callers as use of this flag can result in surprising behaviour (and hence
> bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  drivers/gpu/drm/exynos/exynos_drm_g2d.c|  3 ++-
>  drivers/media/platform/omap/omap_vout.c|  2 +-
>  drivers/media/v4l2-core/videobuf2-memops.c |  6 +-
>  include/linux/mm.h |  2 +-
>  mm/frame_vector.c  | 13 ++---
>  5 files changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c 
> b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> index aa92dec..fbd13fa 100644
> --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> @@ -488,7 +488,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct 
> drm_device *drm_dev,
>   goto err_free;
>   }
>  
> - ret = get_vaddr_frames(start, npages, true, true, g2d_userptr->vec);
> + ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
> + g2d_userptr->vec);
>   if (ret != npages) {
>   DRM_ERROR("failed to get user pages from userptr.\n");
>   if (ret < 0)
> diff --git a/drivers/media/platform/omap/omap_vout.c 
> b/drivers/media/platform/omap/omap_vout.c
> index e668dde..a31b95c 100644
> --- a/drivers/media/platform/omap/omap_vout.c
> +++ b/drivers/media/platform/omap/omap_vout.c
> @@ -214,7 +214,7 @@ static int omap_vout_get_userptr(struct videobuf_buffer 
> *vb, u32 virtp,
>   if (!vec)
>   return -ENOMEM;
>  
> - ret = get_vaddr_frames(virtp, 1, true, false, vec);
> + ret = get_vaddr_frames(virtp, 1, FOLL_WRITE, vec);
>   if (ret != 1) {
>   frame_vector_destroy(vec);
>   return -EINVAL;
> diff --git a/drivers/media/v4l2-core/videobuf2-memops.c 
> b/drivers/media/v4l2-core/videobuf2-memops.c
> index 3c3b517..1cd322e 100644
> --- a/drivers/media/v4l2-core/videobuf2-memops.c
> +++ b/drivers/media/v4l2-core/videobuf2-memops.c
> @@ -42,6 +42,10 @@ struct frame_vector *vb2_create_framevec(unsigned long 
> start,
>   unsigned long first, last;
>   unsigned long nr;
>   struct frame_vector *vec;
> + unsigned int flags = FOLL_FORCE;
> +
> + if (write)
> + flags |= FOLL_WRITE;
>  
>   first = start >> PAGE_SHIFT;
>   last = (start + length - 1) >> PAGE_SHIFT;
> @@ -49,7 +53,7 @@ struct frame_vector *vb2_create_framevec(unsigned long 
> start,
>   vec = frame_vector_create(nr);
>   if (!vec)
>   return ERR_PTR(-ENOMEM);
> - ret = get_vaddr_frames(start & PAGE_MASK, nr, write, true, vec);
> + ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
>   if (ret < 0)
>   goto out_destroy;
>   /* We accept only complete set of PFNs */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ab538..5ff084f6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1305,7 +1305,7 @@ struct frame_vector {
>  struct frame_vector *frame_vector_create(unsigned int nr_frames);
>  void frame_vector_destroy(struct frame_vector *vec);
>  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> -  bool write, bool force, struct frame_vector *vec);
> +  unsigned int gup_flags, struct frame_vector *vec);
>  void put_vaddr_frames(struct frame_vector *vec);
>  int frame_vector_to_pages(struct frame_vector *vec);
>  void frame_vector_to_pfns(struct frame_vector *vec);
> diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> index 81b6749..db77dcb 100644
> --- a/mm/frame_vector.c
> +++ b/mm/frame_vector.c
> @@ -11,10 +11,7 @@
>   * get_vaddr_frames() - map virtual addresses to pfns
>   * @start:   starting user address
>   * @nr_frames:   number of pages / pfns from start to map
> - * @write:   whether pages will be written to by the caller
> - * @force:   whether to force write access even if user mapping is
> - *   readonly. See description of the same argument of
> - get_user_pages().
> + * @gup_flags:   flags modifying lookup behaviour
>   * @vec: structure which receives pages / pfns of the addresses mapped.
>   *   It should have space for at least nr_frames entries.
>   *
> @@ -34,23 +31,17 @@
>   * This function takes care of grabbing mmap_sem as necessary.
>   */
>  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> -  bool write, bool force, struct frame_vector *vec)
> +  unsigned int gup_flags, struct frame_vector *vec)
>  {
>   struct mm_struct *mm = current->mm;
>

Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:14, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages_locked()
> and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
> explicit in callers as use of this flag can result in surprising behaviour 
> (and
> hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes 

After our discussion the patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Tue 18-10-16 14:56:09, Lorenzo Stoakes wrote:
> On Tue, Oct 18, 2016 at 02:54:25PM +0200, Jan Kara wrote:
> > > @@ -1282,7 +1282,7 @@ long get_user_pages(unsigned long start, unsigned 
> > > long nr_pages,
> > >   int write, int force, struct page **pages,
> > >   struct vm_area_struct **vmas);
> > >  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> > > - int write, int force, struct page **pages, int *locked);
> > > + unsigned int gup_flags, struct page **pages, int *locked);
> >
> > Hum, the prototype is inconsistent with e.g. __get_user_pages_unlocked()
> > where gup_flags come after **pages argument. Actually it makes more sense
> > to have it before **pages so that input arguments come first and output
> > arguments second but I don't care that much. But it definitely should be
> > consistent...
> 
> It was difficult to decide quite how to arrange parameters as there was
> inconsistency with regards to parameter ordering already - for example
> __get_user_pages() places its flags argument before pages whereas, as you 
> note,
> __get_user_pages_unlocked() puts them afterwards.
> 
> I ended up compromising by trying to match the existing ordering of the 
> function
> as much as I could by replacing write, force pairs with gup_flags in the same
> location (with the exception of get_user_pages_unlocked() which I felt should
> match __get_user_pages_unlocked() in signature) or if there was already a
> gup_flags parameter as in the case of __get_user_pages_unlocked() I simply
> removed the write, force pair and left the flags as the last parameter.
> 
> I am happy to rearrange parameters as needed, however I am not sure if it'd be
> worthwhile for me to do so (I am keen to try to avoid adding too much noise 
> here
> :)
> 
> If we were to rearrange parameters for consistency I'd suggest adjusting
> __get_user_pages_unlocked() to put gup_flags before pages and do the same with
> get_user_pages_unlocked(), let me know what you think.

Yeah, ok. If the inconsistency is already there, just leave it for now.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 0/5] implement vcpu preempted check

2016-10-19 Thread Christian Borntraeger
On 10/19/2016 12:20 PM, Pan Xinhui wrote:
> change from v3:
>   add x86 vcpu preempted check patch

If you want you could add the s390 patch that I provided for your last version.
I also gave my Acked-by for all previous patches.



> change from v2:
>   no code change, fix typos, update some comments
> change from v1:
>   a simpler definition of default vcpu_is_preempted
>   skip machine type check on ppc, and add config. remove dedicated macro.
>   add one patch to drop the overload of rwsem_spin_on_owner and
> mutex_spin_on_owner.
>   add more comments
>   thanks to Boqun's and Peter's suggestions.
> 
> This patch set aims to fix lock holder preemption issues.
> 
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
> 
> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
> 
> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin
> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused RCU stalls before we applied this
> patch set.
> 
> We also have observed some performance improvements.
> 
> PPC test result:
> 
> 1 copy - 0.94%
> 2 copy - 7.17%
> 4 copy - 11.9%
> 8 copy -  3.04%
> 16 copy - 15.11%
> 
> details below:
> Without patch:
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 
> 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 
> 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 
> 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 
> 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 
> s, 1 samples)
> 
> With patch: 
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 
> 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 
> 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 
> 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 
> 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 
> s, 1 samples)
> 
> X86 test result:
>   test-case   after-patch   before-patch
> Execl Throughput   |18307.9 lps  |11701.6 lps 
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
> Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
> Process Creation   |29881.2 lps  |28572.8 lps 
> Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
> Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
> System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 
> 
> Pan Xinhui (5):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, kvm: support vcpu preempted check
> 
>  arch/powerpc/include/asm/spinlock.h   |  8 
>  arch/x86/include/asm/paravirt_types.h |  6 ++
>  arch/x86/include/asm/spinlock.h   |  8 
>  arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
>  arch/x86/kernel/kvm.c | 11 +++
>  arch/x86/kernel/paravirt.c| 11 +++
>  arch/x86/kvm/x86.c| 12 
>  include/linux/sched.h | 12 
>  kernel/locking/mutex.c| 15 +--
>  kernel/locking/osq_lock.c | 10 +-
>  kernel/locking/rwsem-xadd.c   | 16 +---
>  11 files changed, 105 insertions(+), 7 deletions(-)
> 



[PATCH v4 5/5] x86, kvm: support vcpu preempted check

2016-10-19 Thread Pan Xinhui
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.
The kernel now has an interface, bool vcpu_is_preempted(int cpu), which
takes a cpu as parameter and returns true if that cpu is preempted. The
kernel can then break out of the spin loops based on the return value of
vcpu_is_preempted.

As the kernel already uses this interface, let's support it here as well.

We use one field of struct kvm_steal_time to indicate whether a vcpu
is running or not.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps 
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
Process Creation   |29881.2 lps  |28572.8 lps 
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 

Signed-off-by: Pan Xinhui 
---
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/spinlock.h   |  8 
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c | 11 +++
 arch/x86/kernel/paravirt.c| 11 +++
 arch/x86/kvm/x86.c| 12 
 6 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..b1c7937 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -98,6 +98,10 @@ struct pv_time_ops {
unsigned long long (*steal_clock)(int cpu);
 };
 
+struct pv_vcpu_ops {
+   bool (*vcpu_is_preempted)(int cpu);
+};
+
 struct pv_cpu_ops {
/* hooks for various privileged instructions */
unsigned long (*get_debugreg)(int regno);
@@ -318,6 +322,7 @@ struct pv_lock_ops {
 struct paravirt_patch_template {
struct pv_init_ops pv_init_ops;
struct pv_time_ops pv_time_ops;
+   struct pv_vcpu_ops pv_vcpu_ops;
struct pv_cpu_ops pv_cpu_ops;
struct pv_irq_ops pv_irq_ops;
struct pv_mmu_ops pv_mmu_ops;
@@ -327,6 +332,7 @@ struct paravirt_patch_template {
 extern struct pv_info pv_info;
 extern struct pv_init_ops pv_init_ops;
 extern struct pv_time_ops pv_time_ops;
+extern struct pv_vcpu_ops pv_vcpu_ops;
 extern struct pv_cpu_ops pv_cpu_ops;
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..52fd942 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_PARAVIRT
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return pv_vcpu_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include 
 
 /*
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..e9c12a1 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,8 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u32 preempted;
+   __u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0011bef 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+   struct kvm_steal_time *src;
+
+   src = &per_cpu(steal_time, cpu);
+
+   return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -488,6 +497,8 @@ void __init kvm_guest_init(void)
kvm_guest_cpu_init();
 #endif
 
+   pv_vcpu_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+
/*
 * Hard lockup detection is enabled by default. Disable it, as guests
 * can get false positives too easily, for example if the host is
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bbf3d59..7adb7e9 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -122,6 +122,7 @@ static void *get_call_destination(u8 type)
struct 

[PATCH v4 4/5] powerpc/spinlock: support vcpu preempted check

2016-10-19 Thread Pan Xinhui
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.
The kernel now has an interface, bool vcpu_is_preempted(int cpu), which
takes a cpu as parameter and returns true if that cpu is preempted. The
kernel can then break out of the spin loops based on the return value of
vcpu_is_preempted.

As the kernel already uses this interface, let's support it here as well.

Only pSeries needs to support it. PowerNV is built into the same kernel
image as pSeries, so we need to return false when running on powerNV.
Another fact is that lppaca->yield_count stays zero on powerNV, so we can
just skip the machine type check.

Suggested-by: Boqun Feng 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
---
 arch/powerpc/include/asm/spinlock.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index abb6b0f..af4285b 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-- 
2.4.11



[PATCH v4 3/5] kernel/locking: Drop the overload of {mutex, rwsem}_spin_on_owner

2016-10-19 Thread Pan Xinhui
An over-committed guest with more vCPUs than pCPUs spends a lot of time in
the two spin_on_owner loops. This is due to the lock holder preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a vCPU
is currently running or not, so break out of the spin loops when it returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown] [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui 
---
 kernel/locking/mutex.c  | 15 +--
 kernel/locking/rwsem-xadd.c | 16 +---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..8927e96 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct 
task_struct *owner)
 */
barrier();
 
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* Use vcpu_is_preempted to detect the lock holder preemption issue
+* and break. vcpu_is_preempted is a macro defined as false if the
+* arch does not support the vcpu preempted check.
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
ret = false;
break;
}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex 
*lock)
 
rcu_read_lock();
owner = READ_ONCE(lock->owner);
+
+   /*
+* As lock holder preemption issue, we both skip spinning if task is not
+* on cpu or its cpu is preempted
+*/
if (owner)
-   retval = owner->on_cpu;
+   retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
rcu_read_unlock();
/*
 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..ad0b5bb 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem)
goto done;
}
 
-   ret = owner->on_cpu;
+   /*
+* As lock holder preemption issue, we both skip spinning if task is not
+* on cpu or its cpu is preempted
+*/
+   ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
rcu_read_unlock();
return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct 
rw_semaphore *sem)
 */
barrier();
 
-   /* abort spinning when need_resched or owner is not running */
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* abort spinning when need_resched or owner is not running or
+* owner's cpu is preempted. vcpu_is_preempted is a macro
+* defined by false if arch does not support vcpu preempted
+* check
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
rcu_read_unlock();
return false;
}
-- 
2.4.11



[PATCH v4 2/5] locking/osq: Drop the overload of osq_lock()

2016-10-19 Thread Pan Xinhui
An over-committed guest with more vCPUs than pCPUs spends a lot of time in
osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B waits
for the per-cpu node->locked to be set. IOW, vCPU B waits for vCPU A to run
and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a vCPU
is currently running or not, so break out of the spin loops when it returns true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng 
Signed-off-by: Pan Xinhui 
---
 kernel/locking/osq_lock.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+   return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
 * If we need to reschedule bail... so we can block.
+* Use vcpu_is_preempted() to detect lock holder preemption and
+* break. vcpu_is_preempted() is a macro that defaults to false
+* if the arch does not support the check.
 */
-   if (need_resched())
+   if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
goto unqueue;
 
cpu_relax_lowlatency();
-- 
2.4.11



[PATCH v4 0/5] implement vcpu preempted check

2016-10-19 Thread Pan Xinhui
change from v3:
add x86 vcpu preempted check patch
change from v2:
no code change, fix typos, update some comments
change from v1:
a simpler definition of the default vcpu_is_preempted
skip the machine type check on ppc, and add a config option; remove the dedicated macro
add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner
add more comments
thanks to Boqun and Peter for their suggestions

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the
spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants could also cause RCU stalls before this patch set.

We have also observed some performance improvements.

PPC test result:

1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 1 samples)

With patch: 

1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 1 samples)

X86 test result:
test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps 
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
Process Creation   |29881.2 lps  |28572.8 lps 
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 
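
The x86 side (patch 5/5, not quoted in this excerpt) hooks into paravirt so
that vcpu_is_preempted() resolves to a KVM guest helper. A rough guest-side
sketch, assuming the series adds a preempted flag to the per-cpu
kvm_steal_time area (the exact field name and plumbing in v4 may differ):

    /* Hypothetical sketch only; the real paravirt hook registration and
     * the host-side write of the flag are in the x86 patch itself. */
    static bool kvm_vcpu_is_preempted(int cpu)
    {
            struct kvm_steal_time *src;

            src = &per_cpu(steal_time, cpu);
            return !!src->preempted;   /* assumed flag, set by the host */
    }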

Pan Xinhui (5):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, kvm: support vcpu preempted check

 arch/powerpc/include/asm/spinlock.h   |  8 
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/spinlock.h   |  8 
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c | 11 +++
 arch/x86/kernel/paravirt.c| 11 +++
 arch/x86/kvm/x86.c| 12 
 include/linux/sched.h | 12 
 kernel/locking/mutex.c| 15 +--
 kernel/locking/osq_lock.c | 10 +-
 kernel/locking/rwsem-xadd.c   | 16 +---
 11 files changed, 105 insertions(+), 7 deletions(-)

-- 
2.4.11



[PATCH v4 1/5] kernel/sched: introduce vcpu preempted check interface

2016-10-19 Thread Pan Xinhui
This patch adds support for fixing the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether a
vCPU is currently preempted.

The default implementation is a macro that evaluates to false, so the
compiler can optimise the check away if the arch does not support vCPU
preemption detection.

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
---
 include/linux/sched.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with various lock holder preemption issues, provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu) false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11
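
Because the fallback above is the compile-time constant false, every call
site added in patches 2/5 and 3/5 degenerates to the old checks on
architectures without an override. A minimal illustration, using a
hypothetical helper that is not part of the series:

    /* Hypothetical helper, for illustration only. */
    static inline bool spin_should_stop(struct task_struct *owner)
    {
            /* Without an arch override, vcpu_is_preempted() is the constant
             * 'false' and the compiler drops the extra test entirely. */
            return !owner->on_cpu || need_resched() ||
                   vcpu_is_preempted(task_cpu(owner));
    }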



Re: [PATCH 3/7] powerpc/64s: tool to flag direct branches from unrelocated interrupt vectors

2016-10-19 Thread Nicholas Piggin
On Wed, 19 Oct 2016 15:28:40 +1100
Balbir Singh  wrote:

> On 19/10/16 14:15, Nicholas Piggin wrote:
> > Direct branches from code below __end_interrupts to code above
> > __end_interrupts when built with CONFIG_RELOCATABLE are disallowed
> > because they will break when the kernel is not located at 0.
> > 
> > Sample output:
> > 
> > WARNING: Unrelocated relative branches
> > c118 bl-> 0xc0038fb8 
> > c13c b-> 0xc01068a4 
> > c148 b-> 0xc003919c 
> > c14c b-> 0xc003923c 
> > c5a4 b-> 0xc0106ffc 
> > c0001af0 b-> 0xc0106ffc 
> > c0001b24 b-> 0xc0106ffc 
> > c0001b58 b-> 0xc0106ffc 
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---  
> 
> If this is the same script I gave you, you can add my SOB line as well

I can't remember now, but I know you had something... I'm usually careful
to get attributions right. I'll add your SOB.