Re: Missing _restvr_20 and _savevr_20 subroutines for lib/raid6/altivec8.o

2013-10-22 Thread Kumar Gala

On Oct 19, 2013, at 5:24 PM, Ben Hutchings wrote:

> When building lib/raid6/altivec8.o with gcc 4.8 on Debian, the compiler
> is generating references to two new runtime subroutines which are
> apparently not included in the kernel:
> 
> ERROR: "_restvr_20" [lib/raid6/raid6_pq.ko] undefined!
> ERROR: "_savevr_20" [lib/raid6/raid6_pq.ko] undefined!
> 
> The save/restore subroutines are specified in
> http://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.7.1.html#SAVE-RESTORE
> and we do have the _restgpr_* and _savegpr_* subroutines in
> arch/powerpc/boot/crtsavres.S.  I'm not sure whether these subroutines
> should be added or whether this indicates the compiler is doing
> something wrong.
> 
> A configuration that triggers this is included below.
> 
> Ben.

Try with CONFIG_CC_OPTIMIZE_FOR_SIZE=n.  A feature was added to gcc for -Os to 
"outline" the save/restore routines.  I'm surprised this hasn't shown up sooner.

We'll need to add _restvr_* / _savevr_* to the version in lib/crtsavres.S.

http://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgcc/config/rs6000/crtrestvr.S;hb=HEAD
http://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgcc/config/rs6000/crtsavevr.S;hb=HEAD
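
For anyone who wants to reproduce this outside the kernel build, here is a
sketch (my assumption about what triggers it; the exact behaviour depends on
the gcc version and flags, e.g. something like "gcc -Os -maltivec -S vr.c"
on powerpc64):

#include <altivec.h>

extern void touch(vector unsigned int *a);

vector unsigned int sum(vector unsigned int *a)
{
	/* Keep twelve vector values live across the call so the compiler
	 * must use the non-volatile VRs (v20..v31).  At -Os, gcc 4.8 may
	 * save/restore them via out-of-line _savevr_NN/_restvr_NN calls
	 * instead of emitting the stores and loads inline. */
	vector unsigned int t0 = a[0], t1 = a[1], t2 = a[2], t3 = a[3];
	vector unsigned int t4 = a[4], t5 = a[5], t6 = a[6], t7 = a[7];
	vector unsigned int t8 = a[8], t9 = a[9], ta = a[10], tb = a[11];

	touch(a);

	return vec_add(vec_add(vec_add(t0, t1), vec_add(t2, t3)),
	       vec_add(vec_add(vec_add(t4, t5), vec_add(t6, t7)),
		       vec_add(vec_add(t8, t9), vec_add(ta, tb))));
}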

- k


Re: [PATCH] [RFC] Emulate "lwsync" to run standard user land on e500 cores

2013-10-22 Thread Kumar Gala

On Oct 18, 2013, at 2:38 AM, Wolfgang Denk wrote:

> Default Debian PowerPC doesn't work on e500 because the code contains
> "lwsync" instructions, which are unsupported on this core.  As a
> result, applications using this will crash with an "unhandled signal 4"
> "Illegal instruction" error.
> 
> As a workaround we add code to emulate this insn.  This is expensive
> performance-wise, but allows running standard user-land code.
> 
> Signed-off-by: Wolfgang Denk 
> Cc: Benjamin Herrenschmidt 
> Cc: Scott Wood 
> ---
> I am aware that the clean solution to the problem is to build user
> space with compiler options that match the target architecture.
> However, sometimes this is just too much effort.
> 
> Also, of course the performance of such an emulation sucks. But the
> occurrence of such instructions is so rare that no significant
> slowdown can be observed.
> 
> I'm not sure if this should / could go into mainline.  I'm posting it
> primarily so it can be found should anybody else need this.
> - wd
> 
> arch/powerpc/kernel/traps.c | 7 +++
> 1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index f783c93..f330374 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -986,6 +986,13 @@ static int emulate_instruction(struct pt_regs *regs)
>   return 0;
>   }
> 
> + /* Emulating the lwsync insn as a sync insn */
> + if (instword == PPC_INST_LWSYNC) {
> + PPC_WARN_EMULATED(lwsync, regs);
> + asm volatile("sync" : : : "memory");

Do we really need the inline asm?  Doesn't the fact of just taking an exception
and returning from it equate to a sync?

> + return 0;
> + }
> +
>   /* Emulate the mcrxr insn.  */
>   if ((instword & PPC_INST_MCRXR_MASK) == PPC_INST_MCRXR) {
>   int shift = (instword >> 21) & 0x1c;
> -- 
> 1.8.3.1
> 


Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-22 Thread Preeti U Murthy
On 10/23/2013 09:30 AM, Preeti U Murthy wrote:
> Hi Peter,
> 
> On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
>> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
>>>  kernel/sched/fair.c |   19 +--
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7c70201..12f0eab 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>>>  
>>> rcu_read_lock();
>>> for_each_domain(cpu, sd) {
>>> +   struct sched_domain *sd_parent = sd->parent;
>>> +   struct sched_group *sg;
>>> +   struct sched_group_power *sgp;
>>> +   int nr_busy;
>>> +
>>> +   if (sd_parent) {
>>> +   sg = sd_parent->groups;
>>> +   sgp = sg->sgp;
>>> +   nr_busy = atomic_read(&sgp->nr_busy_cpus);
>>> +
>>> +   if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>>> +   goto need_kick_unlock;
>>> +   }
>>>  
>>> if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>>> && (cpumask_first_and(nohz.idle_cpus_mask,
>>>
>>
>> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?
> 
> You are right, sorry about this. The idea was to correct the nr_busy
> computation before the patch that would remove its usage in the second
> patch. But that would mean the condition nr_busy != sg->group_weight
> would be invalid with this patch. The second patch needs to go first to
> avoid this confusion.
> 
>>
>> Also, this made me look at the nr_busy stuff again, and somehow that
>> entire thing makes me a little sad.
>>
>> Can't we do something like the below and cut that nr_busy sd iteration
>> short?
> 
> We can surely cut the nr_busy sd iteration, but not the way it is done
> in this patch. You stop the nr_busy computation at the sched domain
> that has the SD_SHARE_PKG_RESOURCES flag set, but nohz_kick_needed()
> would want to know the nr_busy for one level above this.
>    Consider a core, and assume it is the highest domain with this flag
> set. The nr_busy of its groups, which are the logical threads, is set
> to 1/0 each. But nohz_kick_needed() would like to know the sum of the
> nr_busy parameter of all the groups, i.e. the threads in a core, before
> it decides if it can kick nohz_idle balancing. The information about an
> individual group's nr_busy is of no relevance here.
> 
> That's why the above patch tries to get
> sd->parent->groups->sgp->nr_busy_cpus. This will translate correctly to
> the core's busy cpus in this example. But the below patch stops before
> updating this parameter at the sd->parent level, where sd is the highest
> level sched domain with the SD_SHARE_PKG_RESOURCES flag set.
> 
> But we can get around all this confusion if we can move the nr_busy
> parameter to be included in the sched_domain structure rather than the
> sched_groups_power structure. Anyway the only place where nr_busy is
> used, that is at nohz_kick_needed(), is done to know the total number of
> busy cpus at a sched domain level which has the SD_SHARE_PKG_RESOURCES
> set and not at a sched group level.
> 
> So why not move nr_busy into struct sched_domain, and have the below
> patch just update this parameter for the sched domain sd_busy?

Oh this can't be done :( Domain structures are per cpu!

Regards
Preeti U Murthy



Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-22 Thread Preeti U Murthy
Hi Peter,

On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
> On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
>>  kernel/sched/fair.c |   19 +--
>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7c70201..12f0eab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>>  
>>  rcu_read_lock();
>>  for_each_domain(cpu, sd) {
>> +struct sched_domain *sd_parent = sd->parent;
>> +struct sched_group *sg;
>> +struct sched_group_power *sgp;
>> +int nr_busy;
>> +
>> +if (sd_parent) {
>> +sg = sd_parent->groups;
>> +sgp = sg->sgp;
>> +nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> +
>> +if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> +goto need_kick_unlock;
>> +}
>>  
>>  if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>>  && (cpumask_first_and(nohz.idle_cpus_mask,
>>
> 
> Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?

You are right, sorry about this. The idea was to correct the nr_busy
computation before the patch that would remove its usage in the second
patch. But that would mean the condition nr_busy != sg->group_weight
would be invalid with this patch. The second patch needs to go first to
avoid this confusion.

> 
> Also, this made me look at the nr_busy stuff again, and somehow that
> entire thing makes me a little sad.
> 
> Can't we do something like the below and cut that nr_busy sd iteration
> short?

We can surely cut the nr_busy sd iteration, but not the way it is done
in this patch. You stop the nr_busy computation at the sched domain
that has the SD_SHARE_PKG_RESOURCES flag set, but nohz_kick_needed()
would want to know the nr_busy for one level above this.
   Consider a core, and assume it is the highest domain with this flag
set. The nr_busy of its groups, which are the logical threads, is set
to 1/0 each. But nohz_kick_needed() would like to know the sum of the
nr_busy parameter of all the groups, i.e. the threads in a core, before
it decides if it can kick nohz_idle balancing. The information about an
individual group's nr_busy is of no relevance here.

That's why the above patch tries to get
sd->parent->groups->sgp->nr_busy_cpus. This will translate correctly to
the core's busy cpus in this example. But the below patch stops before
updating this parameter at the sd->parent level, where sd is the highest
level sched domain with the SD_SHARE_PKG_RESOURCES flag set.

But we can get around all this confusion if we can move the nr_busy
parameter to be included in the sched_domain structure rather than the
sched_groups_power structure. Anyway the only place where nr_busy is
used, that is at nohz_kick_needed(), is done to know the total number of
busy cpus at a sched domain level which has the SD_SHARE_PKG_RESOURCES
set and not at a sched group level.

So why not move nr_busy into struct sched_domain, and have the below
patch just update this parameter for the sched domain sd_busy?
This will avoid iterating through all the levels of sched domains and
should resolve the scalability issue. We also don't need to get to
sd->parent to get the nr_busy parameter for the sake of nohz_kick_needed().

What do you think?

Regards
Preeti U Murthy
> 
> This nohz stuff really needs to be re-thought and made more scalable --
> its a royal pain :/
> 
> 
>  kernel/sched/core.c  |  4 
>  kernel/sched/fair.c  | 21 +++--
>  kernel/sched/sched.h |  5 ++---
>  3 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..89db8dc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
> 
>  static void update_top_cache_domain(int cpu)
>  {
> @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)
> 
>   sd = lowest_flag_domain(cpu, SD_NUMA);
>   rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> +
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING);
> + rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
>  }
> 
>  /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 813dd61..3d5141e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6512,19 +6512,23 @@ static inline void nohz_balance_exit_idle(int cpu)
>   }
>  }
> 
> -static inline void set_cpu_sd_state_busy(void)
> +static inline void set_cpu_sd_state_busy(int cpu)
>  {
>   struct sched_domain *sd;

perf events ring buffer memory barrier on powerpc

2013-10-22 Thread Michael Neuling
Frederic,

In the perf ring buffer code we have this in perf_output_get_handle():

if (!local_dec_and_test(&rb->nest))
goto out;

/*
 * Publish the known good head. Rely on the full barrier implied
 * by atomic_dec_and_test() order the rb->head read and this
 * write.
 */
rb->user_page->data_head = head;

The comment says atomic_dec_and_test() but the code is
local_dec_and_test().

On powerpc, local_dec_and_test() doesn't have a memory barrier but
atomic_dec_and_test() does.  Is the comment wrong, or is
local_dec_and_test() supposed to imply a memory barrier too and we have
it wrongly implemented in powerpc?
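
For reference, the pairing that seems to be required here looks roughly like
this (an illustrative sketch, not the actual perf code; the reader side is
the user-space consumer of the mmap'ed ring buffer):

/* kernel-side writer, publishing new data: */
	head = local_read(&rb->head);		/* all data below head is written */
	smp_mb();				/* order that read before ...     */
	rb->user_page->data_head = head;	/* ... this publishing store      */

/* user-space reader: */
	head = pc->data_head;			/* read the published head        */
	smp_rmb();				/* pairs with the writer's barrier */
	/* ... consume records up to head, then update pc->data_tail ... */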

My guess is that local_dec_and_test() is correct but we need to add an
explicit memory barrier like below:

(Kudos to Victor Kaplansky for finding this)

Mikey

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..95768c6 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,10 @@ again:
goto out;
 
/*
-* Publish the known good head. Rely on the full barrier implied
-* by atomic_dec_and_test() order the rb->head read and this
-* write.
+* Publish the known good head. We need a memory barrier to order
+* the rb->head read and this write.
 */
+   smp_mb();
rb->user_page->data_head = head;
 
/*


Re: [PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

2013-10-22 Thread Peter Zijlstra
On Mon, Oct 21, 2013 at 05:15:02PM +0530, Vaidyanathan Srinivasan wrote:
>  kernel/sched/fair.c |   18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 828ed97..bbcd96b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  {
>   int ld_moved, cur_ld_moved, active_balance = 0;
>   struct sched_group *group;
> + struct sched_domain *child;
> + int share_pkg_res = 0;
>   struct rq *busiest;
>   unsigned long flags;
>   struct cpumask *cpus = __get_cpu_var(load_balance_mask);
> @@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  
>   schedstat_inc(sd, lb_count[idle]);
>  
> + child = sd->child;
> + if (child && child->flags & SD_SHARE_PKG_RESOURCES)
> + share_pkg_res = 1;
> +
>  redo:
>   if (!should_we_balance(&env)) {
>   *continue_balancing = 0;
> @@ -5202,6 +5208,7 @@ redo:
>   goto out_balanced;
>   }
>  
> +redo_grp:
>   busiest = find_busiest_queue(&env, group);
>   if (!busiest) {
>   schedstat_inc(sd, lb_nobusyq[idle]);
> @@ -5292,6 +5299,11 @@ more_balance:
>   if (!cpumask_empty(cpus)) {
>   env.loop = 0;
>   env.loop_break = sched_nr_migrate_break;
> + if (share_pkg_res &&
> + cpumask_intersects(cpus,
> + to_cpumask(group->cpumask)))

sched_group_cpus()

> + goto redo_grp;
> +
>   goto redo;
>   }
>   goto out_balanced;
> @@ -5318,9 +5330,15 @@ more_balance:
>*/
>   if (!cpumask_test_cpu(this_cpu,
>   tsk_cpus_allowed(busiest->curr))) {
> + cpumask_clear_cpu(cpu_of(busiest), cpus);
>   raw_spin_unlock_irqrestore(&busiest->lock,
>   flags);
>   env.flags |= LBF_ALL_PINNED;
> + if (share_pkg_res &&
> + cpumask_intersects(cpus,
> + to_cpumask(group->cpumask)))
> + goto redo_grp;
> +
>   goto out_one_pinned;
>   }

Man this retry logic is getting annoying.. isn't there anything saner we
can do?


Re: [PATCH 2/3] sched: Fix asymmetric scheduling for POWER7

2013-10-22 Thread Peter Zijlstra
On Mon, Oct 21, 2013 at 05:14:52PM +0530, Vaidyanathan Srinivasan wrote:
>  kernel/sched/fair.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12f0eab..828ed97 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>   goto need_kick_unlock;
>   }
>  
> - if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
> - && (cpumask_first_and(nohz.idle_cpus_mask,
> + if (sd->flags & SD_ASYM_PACKING &&
> + (cpumask_first_and(nohz.idle_cpus_mask,
> sched_domain_span(sd)) < cpu))
>   goto need_kick_unlock;
>  
> 

Ahh, so here you remove the nr_busy usage.. this patch should really go
before the first one that makes this all weird and funny.




Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-22 Thread Peter Zijlstra
On Mon, Oct 21, 2013 at 05:14:42PM +0530, Vaidyanathan Srinivasan wrote:
>  kernel/sched/fair.c |   19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7c70201..12f0eab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>  
>   rcu_read_lock();
>   for_each_domain(cpu, sd) {
> + struct sched_domain *sd_parent = sd->parent;
> + struct sched_group *sg;
> + struct sched_group_power *sgp;
> + int nr_busy;
> +
> + if (sd_parent) {
> + sg = sd_parent->groups;
> + sgp = sg->sgp;
> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
> +
> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> + goto need_kick_unlock;
> + }
>  
>   if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>   && (cpumask_first_and(nohz.idle_cpus_mask,
> 

Almost I'd say; what happens on !sd_parent && SD_ASYM_PACKING ?

Also, this made me look at the nr_busy stuff again, and somehow that
entire thing makes me a little sad.

Can't we do something like the below and cut that nr_busy sd iteration
short?

This nohz stuff really needs to be re-thought and made more scalable --
its a royal pain :/


 kernel/sched/core.c  |  4 
 kernel/sched/fair.c  | 21 +++--
 kernel/sched/sched.h |  5 ++---
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c06b8d3..89db8dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
+DEFINE_PER_CPU(struct sched_domain *, sd_busy);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)
 
sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
+
+   sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING);
+   rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
 }
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 813dd61..3d5141e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6512,19 +6512,23 @@ static inline void nohz_balance_exit_idle(int cpu)
}
 }
 
-static inline void set_cpu_sd_state_busy(void)
+static inline void set_cpu_sd_state_busy(int cpu)
 {
struct sched_domain *sd;
+   struct rq *rq = cpu_rq(cpu);
 
rcu_read_lock();
-   sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+   sd = rcu_dereference_check_sched_domain(rq->sd);
 
if (!sd || !sd->nohz_idle)
goto unlock;
sd->nohz_idle = 0;
 
-   for (; sd; sd = sd->parent)
+   for (; sd; sd = sd->parent) {
atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+   if (sd == per_cpu(sd_busy, cpu))
+   break;
+   }
 unlock:
rcu_read_unlock();
 }
@@ -6532,16 +6536,21 @@ static inline void set_cpu_sd_state_busy(void)
 void set_cpu_sd_state_idle(void)
 {
struct sched_domain *sd;
+   int cpu = smp_processor_id();
+   struct rq *rq = cpu_rq(cpu);
 
rcu_read_lock();
-   sd = rcu_dereference_check_sched_domain(this_rq()->sd);
+   sd = rcu_dereference_check_sched_domain(rq->sd);
 
if (!sd || sd->nohz_idle)
goto unlock;
sd->nohz_idle = 1;
 
-   for (; sd; sd = sd->parent)
+   for (; sd; sd = sd->parent) {
atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+   if (sd == per_cpu(sd_busy, cpu))
+   break;
+   }
 unlock:
rcu_read_unlock();
 }
@@ -6756,7 +6765,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
* We may be recently in ticked or tickless idle mode. At the first
* busy tick after returning from idle, we will update the busy stats.
*/
-   set_cpu_sd_state_busy();
+   set_cpu_sd_state_busy(cpu);
nohz_balance_exit_idle(cpu);
 
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffc7087..80c5fd2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -599,9 +599,8 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
struct sched_domain *sd, *hsd = NULL;
 
for_each_domain(cpu, sd) {
-   if (!(sd->flags & flag))
-   break;
-   hsd = sd;
+   if (sd->flags & flag)
+   hsd = sd;
}
 
return hsd;


[RFC PATCH 2/9] powerpc: Free up _PAGE_COHERENCE for numa fault use later

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Always set memory coherence on hash64 configs. If a platform cannot
have memory coherence always set, it can infer the right value from
_PAGE_NO_CACHE and _PAGE_WRITETHRU, as lpar does. So we don't really
need a separate bit for tracking _PAGE_COHERENCE.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pte-hash64.h |  2 +-
 arch/powerpc/mm/hash_low_64.S | 15 ---
 arch/powerpc/mm/hash_utils_64.c   |  7 ---
 arch/powerpc/mm/hugepage-hash64.c |  6 +-
 arch/powerpc/mm/hugetlbpage-hash64.c  |  4 
 5 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h
index 0419eeb..55aea0c 100644
--- a/arch/powerpc/include/asm/pte-hash64.h
+++ b/arch/powerpc/include/asm/pte-hash64.h
@@ -19,7 +19,7 @@
#define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */
#define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */
 #define _PAGE_GUARDED  0x0008
-#define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */
+/* We can derive Memory coherence from _PAGE_NO_CACHE */
 #define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */
 #define _PAGE_WRITETHRU0x0040 /* W: cache write-through */
 #define _PAGE_DIRTY0x0080 /* C: page changed */
diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index d3cbda6..1136d26 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -148,7 +148,10 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
and r0,r0,r4/* _PAGE_RW & _PAGE_DIRTY ->r0 bit 30*/
andcr0,r30,r0   /* r0 = pte & ~r0 */
rlwimi  r3,r0,32-1,31,31/* Insert result into PP lsb */
-   ori r3,r3,HPTE_R_C  /* Always add "C" bit for perf. */
+   /*
+* Always add "C" bit for perf. Memory coherence is always enabled
+*/
+   ori r3,r3,HPTE_R_C | HPTE_R_M
 
/* We eventually do the icache sync here (maybe inline that
 * code rather than call a C function...) 
@@ -457,7 +460,10 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
and r0,r0,r4/* _PAGE_RW & _PAGE_DIRTY ->r0 bit 30*/
andcr0,r3,r0/* r0 = pte & ~r0 */
rlwimi  r3,r0,32-1,31,31/* Insert result into PP lsb */
-   ori r3,r3,HPTE_R_C  /* Always add "C" bit for perf. */
+   /*
+* Always add "C" bit for perf. Memory coherence is always enabled
+*/
+   ori r3,r3,HPTE_R_C | HPTE_R_M
 
/* We eventually do the icache sync here (maybe inline that
 * code rather than call a C function...)
@@ -795,7 +801,10 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
and r0,r0,r4/* _PAGE_RW & _PAGE_DIRTY ->r0 bit 30*/
andcr0,r30,r0   /* r0 = pte & ~r0 */
rlwimi  r3,r0,32-1,31,31/* Insert result into PP lsb */
-   ori r3,r3,HPTE_R_C  /* Always add "C" bit for perf. */
+   /*
+* Always add "C" bit for perf. Memory coherence is always enabled
+*/
+   ori r3,r3,HPTE_R_C | HPTE_R_M
 
/* We eventually do the icache sync here (maybe inline that
 * code rather than call a C function...)
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index bde8b55..fb176e9 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -169,9 +169,10 @@ static unsigned long htab_convert_pte_flags(unsigned long pteflags)
if ((pteflags & _PAGE_USER) && !((pteflags & _PAGE_RW) &&
 (pteflags & _PAGE_DIRTY)))
rflags |= 1;
-
-   /* Always add C */
-   return rflags | HPTE_R_C;
+   /*
+* Always add "C" bit for perf. Memory coherence is always enabled
+*/
+   return rflags | HPTE_R_C | HPTE_R_M;
 }
 
 int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
index 34de9e0..826893f 100644
--- a/arch/powerpc/mm/hugepage-hash64.c
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -127,7 +127,11 @@ repeat:
 
/* Add in WIMG bits */
rflags |= (new_pmd & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
- _PAGE_COHERENT | _PAGE_GUARDED));
+ _PAGE_GUARDED));
+   /*
+* enable the memory coherence always
+*/
+   rflags |= HPTE_R_M;
 
/* Insert into the hash table, primary slot */
slot = ppc_md.hpte_insert(hpte_group, vpn, pa, rflags, 0,
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
index 0b7fb67..a5bcf93 10

Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-22 Thread Preeti U Murthy
Hi Kamalesh,

On 10/22/2013 08:05 PM, Kamalesh Babulal wrote:
> * Vaidyanathan Srinivasan  [2013-10-21 17:14:42]:
> 
>>  for_each_domain(cpu, sd) {
>> -struct sched_group *sg = sd->groups;
>> -struct sched_group_power *sgp = sg->sgp;
>> -int nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> -
>> -if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> -goto need_kick_unlock;
>> +struct sched_domain *sd_parent = sd->parent;
>> +struct sched_group *sg;
>> +struct sched_group_power *sgp;
>> +int nr_busy;
>> +
>> +if (sd_parent) {
>> +sg = sd_parent->groups;
>> +sgp = sg->sgp;
>> +nr_busy = atomic_read(&sgp->nr_busy_cpus);
>> +
>> +if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
>> +goto need_kick_unlock;
>> +}
>>
>>  if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>>  && (cpumask_first_and(nohz.idle_cpus_mask,
> 
> CC'ing Suresh Siddha and Vincent Guittot
> 
> Please correct me if my understanding of idle balancing is wrong.
> With the proposed approach, will the idle load balancer not kick in
> even if there are busy cpus across groups, or if there are 2 busy cpus
> spread across sockets?

Yes, load balancing will happen on busy cpus periodically.

Wrt idle balancing there are two points here. One, when a CPU is just
about to go idle, it will enter idle_balance(), and trigger load
balancing with itself being the destination CPU to begin with. It will
load balance at every level of the sched domain that it belongs to. If
it manages to pull tasks, good, else it will enter an idle state.

nohz_idle_balancing is triggered by a busy cpu at every tick if it has
more than one task in its runqueue, or if it belongs to a group that
shares the package resources and has more than one cpu busy.
"nohz_idle_balance triggered" means the busy cpu sends an IPI to the
ilb_cpu to do load balancing on behalf of the idle cpus in the nohz
mask.

So to answer your question wrt this patch, if there is one busy cpu with
say 2 tasks in one socket and another busy cpu with 1 task on another
socket, the former busy cpu can kick nohz_idle_balance since it has more
than one task in its runqueue. An idle cpu in either socket could be
woken up to balance tasks with it.

The usual idle load balancer that runs on a CPU about to become idle
could pull from either cpu depending on who is more busy as it begins to
load balance across all levels of sched domain that it belongs to.
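
To make the two paths concrete, here is a rough sketch (simplified; the real
logic lives in idle_balance() and nohz_kick_needed() in kernel/sched/fair.c,
and siblings_share_pkg_and_busy() below is a made-up helper for
illustration):

/* 1. A CPU about to go idle balances on its own behalf, pulling
 *    toward itself at every domain level it belongs to. */
static void cpu_going_idle(int cpu, struct rq *rq)
{
	idle_balance(cpu, rq);		/* pull tasks, else really go idle */
}

/* 2. A busy CPU, on each scheduler tick, may instead kick the idle
 *    load balancer to work on behalf of the idle CPUs in nohz mode. */
static void tick_on_busy_cpu(int cpu, struct rq *rq)
{
	if (rq->nr_running > 1 || siblings_share_pkg_and_busy(cpu))
		nohz_balancer_kick(cpu);	/* IPI the ilb_cpu */
}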
> 
> Consider a 2-socket machine with 4 processors each (MC and NUMA
> domains). If the machine is partially loaded such that cpus 0,4,5,6,7
> are busy, nohz balancing is still triggered, because with this approach
> (NUMA)->groups->sgp->nr_busy_cpus is taken into account for the nohz
> kick while iterating over the MC domain.

For the example that you mention, you will have a CPU domain and a NUMA
domain. When the sockets are NUMA nodes, each socket will belong to a
CPU domain. If the sockets are non-numa nodes, then the domain
encompassing both the nodes will be a CPU domain, possibly with each
socket being an MC domain.
> 
> Isn't the idle load balancer not supposed to kick in, even in the case
> of two busy cpus in a dual-core single-socket system?

nohz_idle_balancing is a special case. It is triggered when the
conditions mentioned in nohz_kick_needed() are true. A CPU just about to
go idle will trigger load balancing without any pre-conditions.

In a single socket machine, there will be a CPU domain encompassing the
socket and the MC domain will encompass a core. nohz_idle load balancer
will kick in if both the threads in the core have tasks running on them.
This is fair enough because the threads share the resources of the core.

Regards
Preeti U Murthy
> 
> Thanks,
> Kamalesh.
> 



Re: [PATCH 1/3] powerpc: sync ppc64, ppc64e and pseries configs

2013-10-22 Thread Nathan Fontenot
On 10/21/2013 07:44 PM, Anton Blanchard wrote:
> 
> Run savedefconfig over the ppc64, ppc64e and pseries configs.
> 
> Signed-off-by: Anton Blanchard 
> ---
> 
> Index: b/arch/powerpc/configs/ppc64_defconfig
> ===
> --- a/arch/powerpc/configs/ppc64_defconfig
> +++ b/arch/powerpc/configs/ppc64_defconfig
> @@ -2,7 +2,6 @@ CONFIG_PPC64=y
>  CONFIG_ALTIVEC=y
>  CONFIG_VSX=y
>  CONFIG_SMP=y
> -CONFIG_EXPERIMENTAL=y
>  CONFIG_SYSVIPC=y
>  CONFIG_POSIX_MQUEUE=y
>  CONFIG_IRQ_DOMAIN_DEBUG=y
> @@ -25,7 +24,6 @@ CONFIG_MODULE_UNLOAD=y
>  CONFIG_MODVERSIONS=y
>  CONFIG_MODULE_SRCVERSION_ALL=y
>  CONFIG_PARTITION_ADVANCED=y
> -CONFIG_EFI_PARTITION=y
>  CONFIG_PPC_SPLPAR=y
>  CONFIG_SCANLOG=m
>  CONFIG_PPC_SMLPAR=y
> @@ -50,12 +48,10 @@ CONFIG_CPU_FREQ_PMAC64=y
>  CONFIG_HZ_100=y
>  CONFIG_BINFMT_MISC=m
>  CONFIG_PPC_TRANSACTIONAL_MEM=y
> -CONFIG_HOTPLUG_CPU=y

It looks like you're disabling hotplug cpu, is that correct?

You did this for all three config files.

-Nathan

>  CONFIG_KEXEC=y
>  CONFIG_IRQ_ALL_CPUS=y
>  CONFIG_MEMORY_HOTREMOVE=y
>  CONFIG_SCHED_SMT=y
> -CONFIG_PPC_DENORMALISATION=y
>  CONFIG_PCCARD=y
>  CONFIG_ELECTRA_CF=y
>  CONFIG_HOTPLUG_PCI=y
> @@ -89,7 +85,6 @@ CONFIG_NF_CONNTRACK_PPTP=m
>  CONFIG_NF_CONNTRACK_SIP=m
>  CONFIG_NF_CONNTRACK_TFTP=m
>  CONFIG_NF_CT_NETLINK=m
> -CONFIG_NETFILTER_TPROXY=m
>  CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
>  CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
>  CONFIG_NETFILTER_XT_TARGET_DSCP=m
> @@ -131,7 +126,6 @@ CONFIG_NETFILTER_XT_MATCH_STRING=m
>  CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
>  CONFIG_NETFILTER_XT_MATCH_U32=m
>  CONFIG_NF_CONNTRACK_IPV4=m
> -CONFIG_IP_NF_QUEUE=m
>  CONFIG_IP_NF_IPTABLES=m
>  CONFIG_IP_NF_MATCH_AH=m
>  CONFIG_IP_NF_MATCH_ECN=m
> @@ -216,6 +210,7 @@ CONFIG_DUMMY=m
>  CONFIG_NETCONSOLE=y
>  CONFIG_NETPOLL_TRAP=y
>  CONFIG_TUN=m
> +CONFIG_VHOST_NET=m
>  CONFIG_VORTEX=y
>  CONFIG_ACENIC=m
>  CONFIG_ACENIC_OMIT_TIGON_I=y
> @@ -301,7 +296,6 @@ CONFIG_HID_GYRATION=y
>  CONFIG_HID_PANTHERLORD=y
>  CONFIG_HID_PETALYNX=y
>  CONFIG_HID_SAMSUNG=y
> -CONFIG_HID_SONY=y
>  CONFIG_HID_SUNPLUS=y
>  CONFIG_USB_HIDDEV=y
>  CONFIG_USB=y
> @@ -386,21 +380,19 @@ CONFIG_NLS_UTF8=y
>  CONFIG_CRC_T10DIF=y
>  CONFIG_MAGIC_SYSRQ=y
>  CONFIG_DEBUG_KERNEL=y
> +CONFIG_DEBUG_STACK_USAGE=y
> +CONFIG_DEBUG_STACKOVERFLOW=y
>  CONFIG_LOCKUP_DETECTOR=y
>  CONFIG_DEBUG_MUTEXES=y
> -CONFIG_DEBUG_STACK_USAGE=y
>  CONFIG_LATENCYTOP=y
>  CONFIG_SCHED_TRACER=y
>  CONFIG_BLK_DEV_IO_TRACE=y
> -CONFIG_DEBUG_STACKOVERFLOW=y
>  CONFIG_CODE_PATCHING_SELFTEST=y
>  CONFIG_FTR_FIXUP_SELFTEST=y
>  CONFIG_MSI_BITMAP_SELFTEST=y
>  CONFIG_XMON=y
>  CONFIG_BOOTX_TEXT=y
>  CONFIG_PPC_EARLY_DEBUG=y
> -CONFIG_PPC_EARLY_DEBUG_BOOTX=y
> -CONFIG_CRYPTO_NULL=m
>  CONFIG_CRYPTO_TEST=m
>  CONFIG_CRYPTO_PCBC=m
>  CONFIG_CRYPTO_HMAC=y
> @@ -422,4 +414,3 @@ CONFIG_CRYPTO_DEV_NX_ENCRYPT=m
>  CONFIG_VIRTUALIZATION=y
>  CONFIG_KVM_BOOK3S_64=m
>  CONFIG_KVM_BOOK3S_64_HV=y
> -CONFIG_VHOST_NET=m
> Index: b/arch/powerpc/configs/ppc64e_defconfig
> ===
> --- a/arch/powerpc/configs/ppc64e_defconfig
> +++ b/arch/powerpc/configs/ppc64e_defconfig
> @@ -1,7 +1,6 @@
>  CONFIG_PPC64=y
>  CONFIG_PPC_BOOK3E_64=y
>  CONFIG_SMP=y
> -CONFIG_EXPERIMENTAL=y
>  CONFIG_SYSVIPC=y
>  CONFIG_POSIX_MQUEUE=y
>  CONFIG_NO_HZ=y
> @@ -22,7 +21,6 @@ CONFIG_MODVERSIONS=y
>  CONFIG_MODULE_SRCVERSION_ALL=y
>  CONFIG_PARTITION_ADVANCED=y
>  CONFIG_MAC_PARTITION=y
> -CONFIG_EFI_PARTITION=y
>  CONFIG_P5020_DS=y
>  CONFIG_CPU_FREQ=y
>  CONFIG_CPU_FREQ_GOV_POWERSAVE=y
> @@ -61,7 +59,6 @@ CONFIG_NF_CONNTRACK_PPTP=m
>  CONFIG_NF_CONNTRACK_SIP=m
>  CONFIG_NF_CONNTRACK_TFTP=m
>  CONFIG_NF_CT_NETLINK=m
> -CONFIG_NETFILTER_TPROXY=m
>  CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
>  CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
>  CONFIG_NETFILTER_XT_TARGET_DSCP=m
> @@ -103,7 +100,6 @@ CONFIG_NETFILTER_XT_MATCH_STRING=m
>  CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
>  CONFIG_NETFILTER_XT_MATCH_U32=m
>  CONFIG_NF_CONNTRACK_IPV4=m
> -CONFIG_IP_NF_QUEUE=m
>  CONFIG_IP_NF_IPTABLES=m
>  CONFIG_IP_NF_MATCH_AH=m
>  CONFIG_IP_NF_MATCH_ECN=m
> @@ -193,7 +189,6 @@ CONFIG_PPP_SYNC_TTY=m
>  CONFIG_INPUT_EVDEV=m
>  CONFIG_INPUT_MISC=y
>  # CONFIG_SERIO_SERPORT is not set
> -CONFIG_VT_HW_CONSOLE_BINDING=y
>  CONFIG_SERIAL_8250=y
>  CONFIG_SERIAL_8250_CONSOLE=y
>  # CONFIG_HW_RANDOM is not set
> @@ -230,7 +225,6 @@ CONFIG_HID_NTRIG=y
>  CONFIG_HID_PANTHERLORD=y
>  CONFIG_HID_PETALYNX=y
>  CONFIG_HID_SAMSUNG=y
> -CONFIG_HID_SONY=y
>  CONFIG_HID_SUNPLUS=y
>  CONFIG_HID_GREENASIA=y
>  CONFIG_HID_SMARTJOYPLUS=y
> @@ -302,19 +296,18 @@ CONFIG_NLS_UTF8=y
>  CONFIG_CRC_T10DIF=y
>  CONFIG_MAGIC_SYSRQ=y
>  CONFIG_DEBUG_KERNEL=y
> +CONFIG_DEBUG_STACK_USAGE=y
> +CONFIG_DEBUG_STACKOVERFLOW=y
>  CONFIG_DETECT_HUNG_TASK=y
>  CONFIG_DEBUG_MUTEXES=y
> -CONFIG_DEBUG_STACK_USAGE=y
>  CONFIG_LATENCYTOP=y
>  CONFIG_IRQSOFF_TRACER=y
>  CONFIG_SCHED_TRACER=y
>  CONFIG_BLK_DEV_IO_TRACE=y
> -CONFIG_DEBUG

Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-22 Thread Kamalesh Babulal
* Vaidyanathan Srinivasan  [2013-10-21 17:14:42]:

>   for_each_domain(cpu, sd) {
> - struct sched_group *sg = sd->groups;
> - struct sched_group_power *sgp = sg->sgp;
> - int nr_busy = atomic_read(&sgp->nr_busy_cpus);
> -
> - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> - goto need_kick_unlock;
> + struct sched_domain *sd_parent = sd->parent;
> + struct sched_group *sg;
> + struct sched_group_power *sgp;
> + int nr_busy;
> +
> + if (sd_parent) {
> + sg = sd_parent->groups;
> + sgp = sg->sgp;
> + nr_busy = atomic_read(&sgp->nr_busy_cpus);
> +
> + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> + goto need_kick_unlock;
> + }
> 
>   if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
>   && (cpumask_first_and(nohz.idle_cpus_mask,

CC'ing Suresh Siddha and Vincent Guittot

Please correct me if my understanding of idle balancing is wrong.
With the proposed approach, will the idle load balancer not kick in
even if there are busy cpus across groups, or if there are 2 busy cpus
spread across sockets?

Consider a 2-socket machine with 4 processors each (MC and NUMA
domains). If the machine is partially loaded such that cpus 0,4,5,6,7
are busy, nohz balancing is still triggered, because with this approach
(NUMA)->groups->sgp->nr_busy_cpus is taken into account for the nohz
kick while iterating over the MC domain.

Isn't the idle load balancer not supposed to kick in, even in the case
of two busy cpus in a dual-core single-socket system?

Thanks,
Kamalesh.



[RFC PATCH 5/9] powerpc: mm: book3s: Enable _PAGE_NUMA for book3s

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We steal the _PAGE_COHERENCE bit and use that for indicating NUMA ptes.
This patch still disables numa hinting via pmd entries. That requires
further changes to the pmd entry format, which are done in later
patches.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h | 66 +-
 arch/powerpc/include/asm/pte-hash64.h  |  6 
 arch/powerpc/platforms/Kconfig.cputype |  1 +
 3 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 7d6eacf..9d87125 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -3,6 +3,7 @@
 #ifdef __KERNEL__
 
 #ifndef __ASSEMBLY__
+#include 
 #include  /* For TASK_SIZE */
 #include 
 #include 
@@ -33,10 +34,73 @@ static inline int pte_dirty(pte_t pte)	{ return pte_val(pte) & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte)	{ return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte)	{ return pte_val(pte) & _PAGE_FILE; }
 static inline int pte_special(pte_t pte)	{ return pte_val(pte) & _PAGE_SPECIAL; }
-static inline int pte_present(pte_t pte)	{ return pte_val(pte) & _PAGE_PRESENT; }
 static inline int pte_none(pte_t pte)	{ return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; }
 static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
 
+#ifdef CONFIG_NUMA_BALANCING
+
+static inline int pte_present(pte_t pte)
+{
+   return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
+}
+
+#define pte_numa pte_numa
+static inline int pte_numa(pte_t pte)
+{
+   return (pte_val(pte) &
+   (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+#define pte_mknonnuma pte_mknonnuma
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+   pte_val(pte) &= ~_PAGE_NUMA;
+   pte_val(pte) |=  _PAGE_PRESENT | _PAGE_ACCESSED;
+   return pte;
+}
+
+#define pte_mknuma pte_mknuma
+static inline pte_t pte_mknuma(pte_t pte)
+{
+   /*
+* We should not set _PAGE_NUMA on non present ptes. Also clear the
+* present bit so that hash_page will return 1 and we collect this
+* as numa fault.
+*/
+   if (pte_present(pte)) {
+   pte_val(pte) |= _PAGE_NUMA;
+   pte_val(pte) &= ~_PAGE_PRESENT;
+   } else
+   VM_BUG_ON(1);
+   return pte;
+}
+
+#define pmd_numa pmd_numa
+static inline int pmd_numa(pmd_t pmd)
+{
+   return 0;
+}
+
+#define pmd_mknonnuma pmd_mknonnuma
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+   return pmd;
+}
+
+#define pmd_mknuma pmd_mknuma
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+   return pmd;
+}
+
+# else
+
+static inline int pte_present(pte_t pte)
+{
+   return pte_val(pte) & _PAGE_PRESENT;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
  *
diff --git a/arch/powerpc/include/asm/pte-hash64.h b/arch/powerpc/include/asm/pte-hash64.h
index 55aea0c..2505d8e 100644
--- a/arch/powerpc/include/asm/pte-hash64.h
+++ b/arch/powerpc/include/asm/pte-hash64.h
@@ -27,6 +27,12 @@
 #define _PAGE_RW   0x0200 /* software: user write access allowed */
 #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */
 
+/*
+ * Used for tracking numa faults
+ */
+#define _PAGE_NUMA 0x0010 /* Gather numa placement stats */
+
+
 /* No separate kernel read-only */
 #define _PAGE_KERNEL_RW(_PAGE_RW | _PAGE_DIRTY) /* user access 
blocked by key */
 #define _PAGE_KERNEL_RO _PAGE_KERNEL_RW
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 6704e2e..c9d6223 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -72,6 +72,7 @@ config PPC_BOOK3S_64
select PPC_HAVE_PMU_SUPPORT
select SYS_SUPPORTS_HUGETLBFS
select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
+   select ARCH_SUPPORTS_NUMA_BALANCING
 
 config PPC_BOOK3E_64
bool "Embedded processors"
-- 
1.8.3.2



[RFC PATCH 6/9] powerpc: mm: book3s: Disable hugepaged pmd format for book3s

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

After commit e2b3d202d1dba8f3546ed28224ce485bc50010be we have the
below possible formats for a pmd entry:

(1) invalid (all zeroes)
(2) pointer to next table, as normal; bottom 6 bits == 0
(3) leaf pte for huge page, bottom two bits != 00
(4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table

On book3s we don't really use (4).  For NUMA balancing we need to tag
pmd entries that are pointers to the next table with _PAGE_NUMA, for
performance reasons (9532fec118d485ea37ab6e3ea372d68cd8b4cd0d). This
patch enables that by disabling hugepd support for book3s when
NUMA_BALANCING is enabled. We ideally want to get rid of the hugepd
pointer completely.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page.h | 11 +++
 arch/powerpc/mm/hugetlbpage.c   |  8 +++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index b9f4262..791ab56 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -369,11 +369,22 @@ typedef struct { signed long pd; } hugepd_t;
 #ifdef CONFIG_PPC_BOOK3S_64
 static inline int hugepd_ok(hugepd_t hpd)
 {
+#ifdef CONFIG_NUMA_BALANCING
+   /*
+* In order to enable batch handling of pte numa faults, Numa balancing
+* code use the _PAGE_NUMA bit even on pmd that is pointing to PTE PAGE.
+* 9532fec118d485ea37ab6e3ea372d68cd8b4cd0d. After commit
+* e2b3d202d1dba8f3546ed28224ce485bc50010be we really don't need to
+* support hugepd for ppc64.
+*/
+   return 0;
+#else
/*
 * hugepd pointer, bottom two bits == 00 and next 4 bits
 * indicate size of table
 */
return (((hpd.pd & 0x3) == 0x0) && ((hpd.pd & HUGEPD_SHIFT_MASK) != 0));
+#endif
 }
 #else
 static inline int hugepd_ok(hugepd_t hpd)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index d67db4b..71bd214 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -235,8 +235,14 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
if (!hpdp)
return NULL;
 
+#ifdef CONFIG_NUMA_BALANCING
+   /*
+* We cannot support hugepd format with numa balancing support
+* enabled.
+*/
+   return NULL;
+#endif
BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
-
if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, 
pshift))
return NULL;
 
-- 
1.8.3.2



[RFC PATCH 7/9] mm: numafaults: Use change_pmd_protnuma for updating _PAGE_NUMA for regular pmds

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Archs like ppc64 have a different layout for pmd entries pointing to a
PTE page. Hence add a separate function for modifying them.
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h | 17 +
 include/asm-generic/pgtable.h  | 20 
 mm/memory.c|  2 +-
 mm/mprotect.c  | 24 ++--
 4 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 9d87125..67ea8fb 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -75,6 +75,23 @@ static inline pte_t pte_mknuma(pte_t pte)
return pte;
 }
 
+#define change_pmd_protnuma change_pmd_protnuma
+static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp, int prot_numa)
+{
+   /*
+* We don't track the _PAGE_PRESENT bit here
+*/
+   unsigned long pmd_val;
+   pmd_val = pmd_val(*pmdp);
+   if (prot_numa)
+   pmd_val |= _PAGE_NUMA;
+   else
+   pmd_val &= ~_PAGE_NUMA;
+   pmd_set(pmdp, pmd_val);
+}
+
+
 #define pmd_numa pmd_numa
 static inline int pmd_numa(pmd_t pmd)
 {
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f330d28..568a8c4 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -697,6 +697,18 @@ static inline pmd_t pmd_mknuma(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 #endif
+
+#ifndef change_pmd_protnuma
+static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmd, int prot_numa)
+{
+   if (prot_numa)
+   set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
+   else
+   set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknonnuma(*pmd));
+}
+
+#endif
 #else
 extern int pte_numa(pte_t pte);
 extern int pmd_numa(pmd_t pmd);
@@ -704,6 +716,8 @@ extern pte_t pte_mknonnuma(pte_t pte);
 extern pmd_t pmd_mknonnuma(pmd_t pmd);
 extern pte_t pte_mknuma(pte_t pte);
 extern pmd_t pmd_mknuma(pmd_t pmd);
+extern void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
+   pmd_t *pmd, int prot_numa);
 #endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
 #else
 static inline int pmd_numa(pmd_t pmd)
@@ -735,6 +749,12 @@ static inline pmd_t pmd_mknuma(pmd_t pmd)
 {
return pmd;
 }
+
+static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmd, int prot_numa)
+{
+   BUG();
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_MMU */
diff --git a/mm/memory.c b/mm/memory.c
index ca00039..e930e50 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3605,7 +3605,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
if (pmd_numa(pmd)) {
-   set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+   change_pmd_protnuma(mm, _addr, pmdp, 0);
numa = true;
}
spin_unlock(&mm->page_table_lock);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..88de575 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -112,22 +112,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
return pages;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-   spin_lock(&mm->page_table_lock);
-   set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
-   spin_unlock(&mm->page_table_lock);
-}
-#else
-static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmd)
-{
-   BUG();
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -161,8 +145,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 * node. This allows a regular PMD to be handled as one fault
 * and effectively batches the taking of the PTL
 */
-   if (prot_numa && all_same_node)
-   change_pmd_protnuma(vma->vm_mm, addr, pmd);
+   if (prot_numa && all_same_node) {
+   spin_lock(&vma->vm_mm->page_table_lock);
+   change_pmd_protnuma(vma->vm_mm, addr, pmd, 1);
+   spin_unlock(&vma->vm_mm->page_table_lock);
+
+   }
} while (pmd++, addr = next, addr != end);
 

[RFC PATCH 9/9] powerpc: mm: Enable numa faulting for hugepages

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Provide numa related functions for updating pmd entries.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 67ea8fb..aa3add7 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -95,19 +95,19 @@ static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
 #define pmd_numa pmd_numa
 static inline int pmd_numa(pmd_t pmd)
 {
-   return 0;
+   return pte_numa(pmd_pte(pmd));
 }
 
 #define pmd_mknonnuma pmd_mknonnuma
 static inline pmd_t pmd_mknonnuma(pmd_t pmd)
 {
-   return pmd;
+   return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
 }
 
 #define pmd_mknuma pmd_mknuma
 static inline pmd_t pmd_mknuma(pmd_t pmd)
 {
-   return pmd;
+   return pte_pmd(pte_mknuma(pmd_pte(pmd)));
 }
 
 # else
-- 
1.8.3.2



[RFC PATCH 8/9] powerpc: mm: Support setting _PAGE_NUMA bit on pmd entry which are pointer to PTE page

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..f828944 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -150,8 +150,22 @@
 
 #define pmd_set(pmdp, pmdval)  (pmd_val(*(pmdp)) = (pmdval))
 #define pmd_none(pmd)  (!pmd_val(pmd))
-#define	pmd_bad(pmd)		(!is_kernel_addr(pmd_val(pmd)) \
-				 || (pmd_val(pmd) & PMD_BAD_BITS))
+
+static inline int pmd_bad(pmd_t pmd)
+{
+#ifdef CONFIG_NUMA_BALANCING
+   /*
+* For numa balancing we can have this set
+*/
+   if (pmd_val(pmd) & _PAGE_NUMA)
+   return 0;
+#endif
+   if (!is_kernel_addr(pmd_val(pmd)) ||
+   (pmd_val(pmd) & PMD_BAD_BITS))
+   return 1;
+   return 0;
+}
+
#define	pmd_present(pmd)	(pmd_val(pmd) != 0)
#define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
#define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
-- 
1.8.3.2



[RFC PATCH 3/9] mm: Move change_prot_numa outside CONFIG_ARCH_USES_NUMA_PROT_NONE

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

change_prot_numa should work even if _PAGE_NUMA != _PAGE_PROTNONE.
On archs like ppc64 that don't use _PAGE_PROTNONE and also have
a separate page table outside the Linux page table, we just need to
make sure that when calling change_prot_numa we flush the hardware
page table entry so that the next page access results in a numa
fault.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/mm.h | 3 ---
 mm/mempolicy.c | 9 -
 2 files changed, 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..5ab0e22 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1668,11 +1668,8 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
 }
 #endif
 
-#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
 unsigned long change_prot_numa(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
-#endif
-
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0472964..efb4300 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -612,7 +612,6 @@ static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
return 0;
 }
 
-#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
 /*
  * This is used to mark a range of virtual addresses to be inaccessible.
  * These are later cleared by a NUMA hinting fault. Depending on these
@@ -626,7 +625,6 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
 {
int nr_updated;
-   BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
 
nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
if (nr_updated)
@@ -634,13 +632,6 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 
return nr_updated;
 }
-#else
-static unsigned long change_prot_numa(struct vm_area_struct *vma,
-   unsigned long addr, unsigned long end)
-{
-   return 0;
-}
-#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
 
 /*
  * Walk through page tables and collect pages to be migrated.
-- 
1.8.3.2



[RFC PATCH 1/9] powerpc: Use HPTE constants when updating hpte bits

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

Even though we have the same value for Linux PTE bits and hash PTE bits,
use the hash PTE bits when updating the hash PTE.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/cell/beat_htab.c | 4 ++--
 arch/powerpc/platforms/pseries/lpar.c   | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/cell/beat_htab.c b/arch/powerpc/platforms/cell/beat_htab.c
index c34ee4e..d4d245c 100644
--- a/arch/powerpc/platforms/cell/beat_htab.c
+++ b/arch/powerpc/platforms/cell/beat_htab.c
@@ -111,7 +111,7 @@ static long beat_lpar_hpte_insert(unsigned long hpte_group,
DBG_LOW(" hpte_v=%016lx, hpte_r=%016lx\n", hpte_v, hpte_r);
 
if (rflags & _PAGE_NO_CACHE)
-   hpte_r &= ~_PAGE_COHERENT;
+   hpte_r &= ~HPTE_R_M;
 
raw_spin_lock(&beat_htab_lock);
lpar_rc = beat_read_mask(hpte_group);
@@ -337,7 +337,7 @@ static long beat_lpar_hpte_insert_v3(unsigned long hpte_group,
DBG_LOW(" hpte_v=%016lx, hpte_r=%016lx\n", hpte_v, hpte_r);
 
if (rflags & _PAGE_NO_CACHE)
-   hpte_r &= ~_PAGE_COHERENT;
+   hpte_r &= ~HPTE_R_M;
 
/* insert into not-volted entry */
lpar_rc = beat_insert_htab_entry3(0, hpte_group, hpte_v, hpte_r,
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 356bc75..c8fbef23 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -153,7 +153,8 @@ static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
 
/* Make pHyp happy */
if ((rflags & _PAGE_NO_CACHE) && !(rflags & _PAGE_WRITETHRU))
-   hpte_r &= ~_PAGE_COHERENT;
+   hpte_r &= ~HPTE_R_M;
+
if (firmware_has_feature(FW_FEATURE_XCMO) && !(hpte_r & HPTE_R_N))
flags |= H_COALESCE_CAND;
 
-- 
1.8.3.2



[RFC PATCH 4/9] powerpc: mm: Only check for _PAGE_PRESENT in set_pte/pmd functions

2013-10-22 Thread Aneesh Kumar K.V
From: "Aneesh Kumar K.V" 

We want to make sure we don't use these functions when updating a pte
or pmd entry that has a valid hpte entry, because these functions
don't invalidate it. So limit the check to the _PAGE_PRESENT bit.
The numa-fault core changes use these functions for updating _PAGE_NUMA
bits. That should be OK, because when _PAGE_NUMA is set we can be sure
that hpte entries are not present.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/pgtable.c| 2 +-
 arch/powerpc/mm/pgtable_64.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index edda589..10c09b6 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -187,7 +187,7 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte)
 {
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(pte_present(*ptep));
+   WARN_ON(pte_val(*ptep) & _PAGE_PRESENT);
 #endif
/* Note: mm->context.id might not yet have been assigned as
 * this context might not have been activated yet when this
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 536eec72..56b7586 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -686,7 +686,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef CONFIG_DEBUG_VM
-   WARN_ON(!pmd_none(*pmdp));
+   WARN_ON(pmd_val(*pmdp) & _PAGE_PRESENT);
assert_spin_locked(&mm->page_table_lock);
WARN_ON(!pmd_trans_huge(pmd));
 #endif
-- 
1.8.3.2



[RFC PATCH 0/9] powerpc: mm: Numa faults support for ppc64

2013-10-22 Thread Aneesh Kumar K.V
Hi,

This patch series adds support for numa faults on the ppc64 architecture.
We steal the _PAGE_COHERENCE bit and use that for indicating _PAGE_NUMA.
We clear the _PAGE_PRESENT bit and also invalidate the hpte entry on
setting _PAGE_NUMA. The next fault on that page will then be considered
a numa fault.
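
In code terms, the transition is roughly the following (condensed from
patch 5/9 in this series; _PAGE_NUMA reuses the old _PAGE_COHERENCE bit
value, 0x0010):

	/* marking a present pte for numa hinting flips two bits: */
	pte_val(pte) |= _PAGE_NUMA;
	pte_val(pte) &= ~_PAGE_PRESENT;	/* hash_page() now refuses access */

/* and the fault path tells a hinting fault apart from a genuinely
 * non-present pte by looking at both bits: */
static inline int pte_numa(pte_t pte)
{
	return (pte_val(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}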


NOTE:

Issue: I am finding large lock contention on page_table_lock with this
series on a 95-cpu 4-node box running the autonuma benchmark.

I will be out on vacation till Nov 6 without email access, hence I will
not be able to respond to review feedback till then.


lock_stat version 0.3
-----------------------------------------------------------------------------------------------------
  class name    con-bounces    contentions    waittime-min    waittime-max    waittime-total    acq-bounces    acquisitions    holdtime-min    holdtime-max    holdtime-total
-----------------------------------------------------------------------------------------------------

  &(&mm->page_table_lock)->rlock:    713531791    719610919    0.09    3038193.19    357867523236.3    729709189    750040162    0.0    236991.36    1159646899.68
  ------------------------------
  &(&mm->page_table_lock)->rlock          1    [] .anon_vma_prepare+0xb0/0x1e0
  &(&mm->page_table_lock)->rlock         93    [] .do_numa_page+0x4c/0x190
  &(&mm->page_table_lock)->rlock     301678    [] .change_protection+0x1d4/0x560
  &(&mm->page_table_lock)->rlock     244524    [] .change_protection+0x3e8/0x560
  ------------------------------
  &(&mm->page_table_lock)->rlock          1    [] .__do_fault+0x198/0x6b0
  &(&mm->page_table_lock)->rlock     704163    [] .change_protection+0x1d4/0x560
  &(&mm->page_table_lock)->rlock     207227    [] .change_protection+0x3e8/0x560
  &(&mm->page_table_lock)->rlock         95    [] .do_numa_page+0x4c/0x190
 
-aneesh



Re: [PATCHv1 8/8] Documentation: Add device tree bindings for Freescale VF610 sound.

2013-10-22 Thread Mark Brown
On Mon, Oct 21, 2013 at 07:24:56AM +, Xiubo Li-B47053 wrote:

> Yes, the "-- SGTL5000 pins:" should be in the CODEC binding.
> But, actually the CODEC binding hasn't any reference about this.

> So I added it here, but not very sure.

Please add them to the CODEC binding instead.



Re: Elbc device driver

2013-10-22 Thread Mercier Ivan
OK Scott,
now it works!
We had several hardware problems.
Thanks for your help.

2013/10/11 Scott Wood :
> On Fri, 2013-10-11 at 17:03 +0200, Mercier Ivan wrote:
>> Hi,
>> this should be correct (I'm using chip select 3 for this device)
>> lbc: localbus@ffe124000 {
>> reg = <0xf 0xfe124000 0 0x1000>;
>> ranges = <3 0 0xf 0xe000 0x0800>;
>>
>> a3p400{
>> #address-cells = <1>;
>> #size-cells = <1>;
>> compatible = "my_a3p_driver";
>> reg = <0x0 0x0 0x80>;
>> };
>> };
>
> Compatible describes the device, not the driver.  It takes the format
> "vendor,device".  The node name, OTOH, is normally a generic description
> of the device's functionality ("flash", "ethernet", "board-control",
> etc).
>
> You don't need #address-cells/#size-cells on the a3p400 node unless it
> has child nodes with reg or ranges.
>
> -Scott
>
>
>