Re: [PATCH 10/33] autonuma: CPU follows memory algorithm

2012-10-12 Thread Mel Gorman
On Fri, Oct 12, 2012 at 02:25:13AM +0200, Andrea Arcangeli wrote:
> On Thu, Oct 11, 2012 at 03:58:05PM +0100, Mel Gorman wrote:
> > On Thu, Oct 04, 2012 at 01:50:52AM +0200, Andrea Arcangeli wrote:
> > > This algorithm takes as input the statistical information filled by the
> > > knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> > > (p->task_autonuma), evaluates it for the current scheduled task, and
> > > compares it against every other running process to see if it should
> > > move the current task to another NUMA node.
> > > 
> > 
> > That sounds expensive if there are a lot of running processes in the
> > system. How often does this happen? Mention it here even though I
> > realised much later that it's obvious from the patch itself.
> 
> Ok I added:
> 
> ==
> This algorithm will run once every ~100msec,

~100msec (depending on the scheduler tick)

> and can be easily slowed
> down further

using the sysfs tunable 

>. Its computational complexity is O(nr_cpus) and it's
> executed by all CPUs. The number of running threads and processes is
> not going to alter the cost of this algorithm, only the online number
> of CPUs is. However, in practice this will very rarely hit all CPUs'
> runqueues. Most of the time it will only compute on local data in the
> task_autonuma struct (for example if convergence has been
> reached). Even if no convergence has been reached yet, it'll only scan
> the CPUs in the NUMA nodes where the local task_autonuma data is
> showing that they are worth migrating to.

Ok, this explains how things currently stand, which is better.

> ==
> 
> It's configurable through sysfs; 100msec is the default.
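
As an aside, a minimal sketch of what the ~100msec throttle driven by
a sysfs tunable could look like; scan_period_ms, last_balance and the
helper name are invented for illustration and are not the patch's
actual code:

static unsigned int scan_period_ms = 100;       /* hypothetical sysfs tunable */

/* Return true at most once every scan_period_ms for this caller. */
static bool autonuma_balance_due(unsigned long *last_balance)
{
        if (time_before(jiffies, *last_balance +
                                 msecs_to_jiffies(scan_period_ms)))
                return false;
        *last_balance = jiffies;
        return true;
}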
> 
> > > + * there is no affinity set for the task).
> > > + */
> > > +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> > > +{
> > 
> > nit, but elsewhere you have
> > 
> > static inline TYPE and here you have
> > static TYPE inline
> 
> Fixed.
> 
> > 
> > > + int task_selected_nid;
> > > + struct task_autonuma *task_autonuma = p->task_autonuma;
> > > +
> > > + if (!task_autonuma)
> > > + return true;
> > > +
> > > + task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
> > > + if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
> > > + return true;
> > > + else
> > > + return false;
> > > +}
> > 
> > no need for else.
> 
> Removed.
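
For reference, the helper with both nits applied (inline before the
return type, else dropped) reads:

static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
{
        int task_selected_nid;
        struct task_autonuma *task_autonuma = p->task_autonuma;

        if (!task_autonuma)
                return true;

        task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
        return task_selected_nid < 0 ||
               task_selected_nid == cpu_to_node(cpu);
}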
> 
> > 
> > > +
> > > +static inline void sched_autonuma_balance(void)
> > > +{
> > > + struct task_autonuma *ta = current->task_autonuma;
> > > +
> > > + if (ta && current->mm)
> > > + __sched_autonuma_balance();
> > > +}
> > > +
> > 
> > Ok, so this could do with a comment explaining where it is called from.
> > It is called during idle balancing at least so potentially this is every
> > scheduler tick. It'll be run from softirq context so the cost will not
> > be obvious to a process but the overhead will be there. What happens if
> > this takes longer than a scheduler tick to run? Is that possible?
> 
> softirqs can run for huge amounts of time, so it won't harm.
> 

They're allowed, but it's not free. It's not a stopper, but eventually
we'll want to move away from it.

> Nested IRQs could even run on top of the softirq, and they could take
> milliseconds too if they're hyper inefficient and we must still run
> perfectly rock solid (with horrible latency, but still stable).
> 
> I added:
> 
> /*
>  * This is called in the context of the SCHED_SOFTIRQ from
>  * run_rebalance_domains().
>  */
> 

Ok. A vague idea occurred to me while mulling this over that would avoid the
walk. I did not flesh this out at all so there will be major inaccuracies
but hopefully you'll get the general idea.

The scheduler already caches some information about domains, such as
sd_llc, which stores on a per-cpu basis a pointer to the highest
scheduler domain sharing the same last-level cache.

It should be possible to cache, on a per-NUMA-node domain basis, the
highest mm_numafault and task_mmfault and the PID within that domain
in sd_numa_mostconverged, with one entry per NUMA node. At a
scheduling tick, the current task does the for_each_online_node(),
calculates its values, compares them to sd_numa_mostconverged and
updates the cache if necessary.

With a view to integrating this better with CFS, this update should
happen in kernel/sched/fair.c in a function called
update_convergence_stats(), or possibly even be folded into one of the
existing CPU walkers like nohz_idle_balance, or maybe into
idle_balance itself, and moved out of kernel/sched/numa.c. It
shouldn't migrate tasks at this point, which keeps the overhead in the
idle balancer down.
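
Purely as a sketch of the idea (apart from sd_numa_mostconverged,
which is the name proposed above, every identifier below is invented,
and locking and the actual domain attachment are ignored):

/* Invented accessors standing in for the mm/task fault statistics. */
static unsigned long mm_numa_faults(struct task_struct *p, int nid);
static unsigned long task_numa_faults(struct task_struct *p, int nid);

struct numa_converged {
        unsigned long   mm_numafault;   /* highest per-mm fault count seen */
        unsigned long   task_mmfault;   /* highest per-task fault count seen */
        pid_t           pid;            /* task that posted those values */
};

/* One entry per NUMA node, conceptually hung off the NUMA domain. */
static struct numa_converged sd_numa_mostconverged[MAX_NUMNODES];

/* Called from the scheduler tick; note that no task is migrated here. */
static void update_convergence_stats(struct task_struct *p)
{
        int nid;

        for_each_online_node(nid) {
                struct numa_converged *nc = &sd_numa_mostconverged[nid];
                unsigned long mmf = mm_numa_faults(p, nid);
                unsigned long tf = task_numa_faults(p, nid);

                if (mmf > nc->mm_numafault || tf > nc->task_mmfault) {
                        nc->mm_numafault = mmf;
                        nc->task_mmfault = tf;
                        nc->pid = p->pid;
                }
        }
}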

This should integrate the whole of the following block into CFS.

/*
 * Identify the NUMA node where this thread (task_struct), and
 * the process (mm_struct) as a whole, has the largest number
 * of NUMA faults.
 */
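
In rough terms the block boils down to a loop like the following, run
once over the per-task and once over the per-mm fault counters (the
array-based interface is assumed for illustration):

static int best_fault_node(unsigned long *numa_fault)
{
        int nid, best_nid = -1;
        unsigned long best_faults = 0;

        for_each_online_node(nid) {
                if (numa_fault[nid] > best_faults) {
                        best_faults = numa_fault[nid];
                        best_nid = nid;
                }
        }
        return best_nid;
}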

It then later considers doing the task exchange, but only the
sd_numa_mostconverged values for each node are considered. This gets
rid of the for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) loop,
with the obvious caveat that there is no guarantee that the cached PID
is eligible for exchange, but I

Re: [PATCH 10/33] autonuma: CPU follows memory algorithm

2012-10-11 Thread Andrea Arcangeli
On Thu, Oct 11, 2012 at 03:58:05PM +0100, Mel Gorman wrote:
> On Thu, Oct 04, 2012 at 01:50:52AM +0200, Andrea Arcangeli wrote:
> > This algorithm takes as input the statistical information filled by the
> > knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> > (p->task_autonuma), evaluates it for the current scheduled task, and
> > compares it against every other running process to see if it should
> > move the current task to another NUMA node.
> > 
> 
> That sounds expensive if there are a lot of running processes in the
> system. How often does this happen? Mention it here even though I
> realised much later that it's obvious from the patch itself.

Ok I added:

==
This algorithm will run once every ~100msec, and can be easily slowed
down further. Its computational complexity is O(nr_cpus) and it's
executed by all CPUs. The number of running threads and processes is
not going to alter the cost of this algorithm, only the online number
of CPUs is. However, in practice this will very rarely hit all CPUs'
runqueues. Most of the time it will only compute on local data in the
task_autonuma struct (for example if convergence has been
reached). Even if no convergence has been reached yet, it'll only scan
the CPUs in the NUMA nodes where the local task_autonuma data is
showing that they are worth migrating to.
==

It's configurable through sysfs; 100msec is the default.

> > + * there is no affinity set for the task).
> > + */
> > +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> > +{
> 
> nit, but elsewhere you have
> 
> static inline TYPE and here you have
> static TYPE inline

Fixed.

> 
> > +   int task_selected_nid;
> > +   struct task_autonuma *task_autonuma = p->task_autonuma;
> > +
> > +   if (!task_autonuma)
> > +   return true;
> > +
> > +   task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
> > +   if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
> > +   return true;
> > +   else
> > +   return false;
> > +}
> 
> no need for else.

Removed.

> 
> > +
> > +static inline void sched_autonuma_balance(void)
> > +{
> > +   struct task_autonuma *ta = current->task_autonuma;
> > +
> > +   if (ta && current->mm)
> > +   __sched_autonuma_balance();
> > +}
> > +
> 
> Ok, so this could do with a comment explaining where it is called from.
> It is called during idle balancing at least so potentially this is every
> scheduler tick. It'll be run from softirq context so the cost will not
> be obvious to a process but the overhead will be there. What happens if
> this takes longer than a scheduler tick to run? Is that possible?

softirqs can run for huge amounts of time, so it won't harm.

Nested IRQs could even run on top of the softirq, and they could take
milliseconds too if they're hyper inefficient and we must still run
perfectly rock solid (with horrible latency, but still stable).

I added:

/*
 * This is called in the context of the SCHED_SOFTIRQ from
 * run_rebalance_domains().
 */

> > +/*
> > + * This function __sched_autonuma_balance() is responsible for
> 
> This function is far too short and could do with another few pages :P

:) I tried to split it once already but gave up in the middle.

> > + * "Full convergence" is achieved when all memory accesses by a task
> > + * are 100% local to the CPU it is running on. A task's "best node" is
> 
> I think this is the first time you defined convergence in the series.
> The explanation should be included in the documentation.

Ok. It's not an easy concept to explain in words. Here's a try:

 *
 * A workload converges when all the memory of a thread or a process
 * has been placed on the NUMA node of the CPU that the process or
 * thread is running on.
 *
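
Or, as an illustrative check (the per-node fault array passed in is
an assumed interface, not the actual field layout):

static bool task_converged(struct task_struct *p, unsigned long *numa_fault)
{
        int nid, this_nid = cpu_to_node(task_cpu(p));
        unsigned long local = 0, total = 0;

        for_each_online_node(nid) {
                total += numa_fault[nid];
                if (nid == this_nid)
                        local = numa_fault[nid];
        }
        /* fully converged: every recorded NUMA hinting fault was local */
        return total && local == total;
}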

> > + * other_diff: how much the current task is closer to fully converge
> > + * on the node of the other CPU than the other task that is currently
> > + * running in the other CPU.
> 
> In the changelog you talked about comparing a process with every other
> running process but here it looks like you intent to examine every
> process that is *currently running* on a remote node and compare that.
> What if the best process to swap with is not currently running? Do we
> miss it?

Correct, only currently running processes are checked. If a task in R
state goes to sleep immediately, it's not relevant where it runs. We
focus on "long running" compute tasks, i.e. tasks that are in R state
most of the time.

> > + * If both checks succeed it guarantees that we found a way to
> > + * multilaterally improve the system wide NUMA
> > + * convergence. Multilateral here means that the same checks will not
> > + * succeed again on those same two tasks, after the task exchange, so
> > + * there is no risk of ping-pong.
> > + *
> 
> At least not in that instance of time. A new CPU binding or change in
> behaviour (such as a computation finishing and a reduce step starting)
> might change that scoring.

Yes.
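
Schematically the two checks amount to the following, where
weight_on() is a hypothetical stand-in for the patch's per-node fault
statistics, not a real function:

static long weight_on(struct task_struct *p, int nid);  /* hypothetical */

static bool worth_exchanging(struct task_struct *p, int this_nid,
                             struct task_struct *other, int other_nid)
{
        /* 1) p converges better on the other node than where it is now */
        if (weight_on(p, other_nid) <= weight_on(p, this_nid))
                return false;

        /*
         * 2) p is closer to converging on other_nid than the task
         * currently running there. After an exchange the same checks
         * fail for the swapped pair, hence no ping-pong.
         */
        return weight_on(p, other_nid) > weight_on(other, other_nid);
}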

> > + * If a task exchange can happen because the two checks succeed, we
> > + * select the destination CPU that will give us the biggest increase
