Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Rik van Riel
On Wed, 2018-06-06 at 08:32 -0700, Srikar Dronamraju wrote:

> Yes its better to skip cpus if they are already in migration.
> And we are already doing it with the above patch. However as I said
> earlier 
> 
> - Task T1 sets Cpu 1 as best_cpu, 
> - Task T2 finds cpu1 and skips Cpu1
> - Task T1 finds cpu2 slightly better than cpu1.
> - Task T1 resets cpu1 as best_cpu, sets best_cpu as cpu2.
> - Task T2 finds cpu2 and skips cpu2
> - Task T1 finds cpu3 as best_cpu slightly better than cpu2.
> - Task T1 resets cpu2 as best_cpu, sets best_cpu as cpu3.
> - Task T2 finds cpu3 and skips cpu3
> 
> So after this T1 was able to find a cpu but T2 couldn't find a cpu
> even
> though there were 3 cpus that were available for 2 task to swap.
>
> Again, this is too corner case, that I am okay to drop.

Not only is that above race highly unlikely, it is also
still possible with your patch applied, if the scores
between cpu1, cpu2, and cpu3 differ by more than
SMALLIMP/2.

Not only is this patch some weird magic that makes the
code harder to maintain (since its purpose does not match
what the code actually does), but it also does not
reliably do what you intend it to do.

We may be better off not adding this bit of complexity.

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Rik van Riel
On Wed, 2018-06-06 at 08:32 -0700, Srikar Dronamraju wrote:

> Yes its better to skip cpus if they are already in migration.
> And we are already doing it with the above patch. However as I said
> earlier 
> 
> - Task T1 sets Cpu 1 as best_cpu, 
> - Task T2 finds cpu1 and skips Cpu1
> - Task T1 finds cpu2 slightly better than cpu1.
> - Task T1 resets cpu1 as best_cpu, sets best_cpu as cpu2.
> - Task T2 finds cpu2 and skips cpu2
> - Task T1 finds cpu3 as best_cpu slightly better than cpu2.
> - Task T1 resets cpu2 as best_cpu, sets best_cpu as cpu3.
> - Task T2 finds cpu3 and skips cpu3
> 
> So after this T1 was able to find a cpu but T2 couldn't find a cpu
> even
> though there were 3 cpus that were available for 2 task to swap.
>
> Again, this is too corner case, that I am okay to drop.

Not only is that above race highly unlikely, it is also
still possible with your patch applied, if the scores
between cpu1, cpu2, and cpu3 differ by more than
SMALLIMP/2.

Not only is this patch some weird magic that makes the
code harder to maintain (since its purpose does not match
what the code actually does), but it also does not
reliably do what you intend it to do.

We may be better off not adding this bit of complexity.

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Srikar Dronamraju
> > 
> > All tasks will not be stuck at task/cpu A.
> > 
> > "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> > cpu..." the first task to set cpu A as swap target will ensure
> > subsequent tasks wont be allowed to set cpu A as target for swap till
> > it
> > finds a better task/cpu. Because of this there a very very small
> > chance
> > of a second task unable to find a task to swap.
> 
> Would it not be better for task_numa_compare to skip
> from consideration CPUs that somebody else is already
> trying to migrate a task to, but still search for the
> best location, instead of settling for a worse location
> to migrate to?

Yes its better to skip cpus if they are already in migration.
And we are already doing it with the above patch. However as I said
earlier 

- Task T1 sets Cpu 1 as best_cpu, 
- Task T2 finds cpu1 and skips Cpu1
- Task T1 finds cpu2 slightly better than cpu1.
- Task T1 resets cpu1 as best_cpu, sets best_cpu as cpu2.
- Task T2 finds cpu2 and skips cpu2
- Task T1 finds cpu3 as best_cpu slightly better than cpu2.
- Task T1 resets cpu2 as best_cpu, sets best_cpu as cpu3.
- Task T2 finds cpu3 and skips cpu3

So after this T1 was able to find a cpu but T2 couldn't find a cpu even
though there were 3 cpus that were available for 2 task to swap.

Again, this is too corner case, that I am okay to drop.



Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Srikar Dronamraju
> > 
> > All tasks will not be stuck at task/cpu A.
> > 
> > "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> > cpu..." the first task to set cpu A as swap target will ensure
> > subsequent tasks wont be allowed to set cpu A as target for swap till
> > it
> > finds a better task/cpu. Because of this there a very very small
> > chance
> > of a second task unable to find a task to swap.
> 
> Would it not be better for task_numa_compare to skip
> from consideration CPUs that somebody else is already
> trying to migrate a task to, but still search for the
> best location, instead of settling for a worse location
> to migrate to?

Yes its better to skip cpus if they are already in migration.
And we are already doing it with the above patch. However as I said
earlier 

- Task T1 sets Cpu 1 as best_cpu, 
- Task T2 finds cpu1 and skips Cpu1
- Task T1 finds cpu2 slightly better than cpu1.
- Task T1 resets cpu1 as best_cpu, sets best_cpu as cpu2.
- Task T2 finds cpu2 and skips cpu2
- Task T1 finds cpu3 as best_cpu slightly better than cpu2.
- Task T1 resets cpu2 as best_cpu, sets best_cpu as cpu3.
- Task T2 finds cpu3 and skips cpu3

So after this T1 was able to find a cpu but T2 couldn't find a cpu even
though there were 3 cpus that were available for 2 task to swap.

Again, this is too corner case, that I am okay to drop.



Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Rik van Riel
On Wed, 2018-06-06 at 05:55 -0700, Srikar Dronamraju wrote:
> > > 
> > > While we can't complete avoid this, the second check will try to
> > > make
> > > sure we don't hop on/hop off just for small incremental numa
> > > improvement.
> > 
> > However, all those racing tasks start searching
> > the CPUs on a node from the same start position.
> > 
> > That means they may all get stuck on the same
> > task/cpu A, and not select the better task/cpu B.
> > 
> > What am I missing? 
> 
> All tasks will not be stuck at task/cpu A.
> 
> "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> cpu..." the first task to set cpu A as swap target will ensure
> subsequent tasks wont be allowed to set cpu A as target for swap till
> it
> finds a better task/cpu. Because of this there a very very small
> chance
> of a second task unable to find a task to swap.

Would it not be better for task_numa_compare to skip
from consideration CPUs that somebody else is already
trying to migrate a task to, but still search for the
best location, instead of settling for a worse location
to migrate to?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Rik van Riel
On Wed, 2018-06-06 at 05:55 -0700, Srikar Dronamraju wrote:
> > > 
> > > While we can't complete avoid this, the second check will try to
> > > make
> > > sure we don't hop on/hop off just for small incremental numa
> > > improvement.
> > 
> > However, all those racing tasks start searching
> > the CPUs on a node from the same start position.
> > 
> > That means they may all get stuck on the same
> > task/cpu A, and not select the better task/cpu B.
> > 
> > What am I missing? 
> 
> All tasks will not be stuck at task/cpu A.
> 
> "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> cpu..." the first task to set cpu A as swap target will ensure
> subsequent tasks wont be allowed to set cpu A as target for swap till
> it
> finds a better task/cpu. Because of this there a very very small
> chance
> of a second task unable to find a task to swap.

Would it not be better for task_numa_compare to skip
from consideration CPUs that somebody else is already
trying to migrate a task to, but still search for the
best location, instead of settling for a worse location
to migrate to?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Srikar Dronamraju
> > 
> > I thought about this. Lets say we evaluated that destination node can
> > allow movement. While we iterate through the list of cpus trying to
> > find
> > the best cpu node, we find a idle cpu towards the end of the list.
> > However if another task as already raced with us to move a task to
> > this
> > node, then we should bail out. Keeping the check in task_numa_compare
> > will allow us to do this.
> 
> Your check is called once for every invocation
> of task_numa_compare. It does not matter whether
> it is inside or outside, except on the outside
> the variable manipulation will be easier to read.
> 

Okay I mistook your comment; Basically you want the check to be moved
within the for-loop in task_numa_find_cpu.
I will do the needful.

> > 
> > While we can't complete avoid this, the second check will try to make
> > sure we don't hop on/hop off just for small incremental numa
> > improvement.
> 
> However, all those racing tasks start searching
> the CPUs on a node from the same start position.
> 
> That means they may all get stuck on the same
> task/cpu A, and not select the better task/cpu B.

> 
> What am I missing? 

All tasks will not be stuck at task/cpu A.

"[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
cpu..." the first task to set cpu A as swap target will ensure
subsequent tasks wont be allowed to set cpu A as target for swap till it
finds a better task/cpu. Because of this there a very very small chance
of a second task unable to find a task to swap.




Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-06 Thread Srikar Dronamraju
> > 
> > I thought about this. Lets say we evaluated that destination node can
> > allow movement. While we iterate through the list of cpus trying to
> > find
> > the best cpu node, we find a idle cpu towards the end of the list.
> > However if another task as already raced with us to move a task to
> > this
> > node, then we should bail out. Keeping the check in task_numa_compare
> > will allow us to do this.
> 
> Your check is called once for every invocation
> of task_numa_compare. It does not matter whether
> it is inside or outside, except on the outside
> the variable manipulation will be easier to read.
> 

Okay I mistook your comment; Basically you want the check to be moved
within the for-loop in task_numa_find_cpu.
I will do the needful.

> > 
> > While we can't complete avoid this, the second check will try to make
> > sure we don't hop on/hop off just for small incremental numa
> > improvement.
> 
> However, all those racing tasks start searching
> the CPUs on a node from the same start position.
> 
> That means they may all get stuck on the same
> task/cpu A, and not select the better task/cpu B.

> 
> What am I missing? 

All tasks will not be stuck at task/cpu A.

"[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
cpu..." the first task to set cpu A as swap target will ensure
subsequent tasks wont be allowed to set cpu A as target for swap till it
finds a better task/cpu. Because of this there a very very small chance
of a second task unable to find a task to swap.




Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-05 Thread Rik van Riel
On Mon, 2018-06-04 at 20:56 -0700, Srikar Dronamraju wrote:
> * Rik van Riel  [2018-06-04 16:05:55]:
> 
> > On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> > 
> > > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> > > task_numa_env *env,
> > >   if (READ_ONCE(dst_rq->numa_migrate_on))
> > >   return;
> > >  
> > > + if (*move && READ_ONCE(pgdat->active_node_migrate))
> > > + *move = false;
> > 
> > Why not do this check in task_numa_find_cpu?
> > 
> > That way you won't have to pass this in as a
> > pointer, and you also will not have to recalculate
> > NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.
> > 
> 
> I thought about this. Lets say we evaluated that destination node can
> allow movement. While we iterate through the list of cpus trying to
> find
> the best cpu node, we find a idle cpu towards the end of the list.
> However if another task as already raced with us to move a task to
> this
> node, then we should bail out. Keeping the check in task_numa_compare
> will allow us to do this.

Your check is called once for every invocation
of task_numa_compare. It does not matter whether
it is inside or outside, except on the outside
the variable manipulation will be easier to read.

> > > +  * task migration might only result in ping pong
> > > +  * of tasks and also hurt performance due to cache
> > > +  * misses.
> > > +  */
> > > + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP /
> > > 2)
> > > + goto unlock;
> > 
> > I can see a use for the first test, but why limit the
> > search for the best score once you are past the
> > threshold?
> > 
> > I don't understand the use for that second test.
> > 
> 
> Lets say few threads are racing with each other to find a cpu on the
> node. The first thread has already found a task/cpu 'A' to swap and
> finds another task/cpu 'B' thats slightly better than the current
> best_cpu which is 'A'. Currently we allow the second task/cpu 'B' to
> be
> set as best_cpu. However the second or subsequent threads cannot find
> the first task/cpu A because its suppose to be in active migration.
> By
> the time it reaches task/cpu B even that may look to be in active
> migration. It may never know that task/cpu A was cleared. In this
> way,
> the second and subsequent threads may not get a task/cpu in the
> preferred node to swap just because the first task kept hopping
> task/cpu
> as its choice of migration.
> 
> While we can't complete avoid this, the second check will try to make
> sure we don't hop on/hop off just for small incremental numa
> improvement.

However, all those racing tasks start searching
the CPUs on a node from the same start position.

That means they may all get stuck on the same
task/cpu A, and not select the better task/cpu B.

What am I missing? 

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-05 Thread Rik van Riel
On Mon, 2018-06-04 at 20:56 -0700, Srikar Dronamraju wrote:
> * Rik van Riel  [2018-06-04 16:05:55]:
> 
> > On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> > 
> > > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> > > task_numa_env *env,
> > >   if (READ_ONCE(dst_rq->numa_migrate_on))
> > >   return;
> > >  
> > > + if (*move && READ_ONCE(pgdat->active_node_migrate))
> > > + *move = false;
> > 
> > Why not do this check in task_numa_find_cpu?
> > 
> > That way you won't have to pass this in as a
> > pointer, and you also will not have to recalculate
> > NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.
> > 
> 
> I thought about this. Lets say we evaluated that destination node can
> allow movement. While we iterate through the list of cpus trying to
> find
> the best cpu node, we find a idle cpu towards the end of the list.
> However if another task as already raced with us to move a task to
> this
> node, then we should bail out. Keeping the check in task_numa_compare
> will allow us to do this.

Your check is called once for every invocation
of task_numa_compare. It does not matter whether
it is inside or outside, except on the outside
the variable manipulation will be easier to read.

> > > +  * task migration might only result in ping pong
> > > +  * of tasks and also hurt performance due to cache
> > > +  * misses.
> > > +  */
> > > + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP /
> > > 2)
> > > + goto unlock;
> > 
> > I can see a use for the first test, but why limit the
> > search for the best score once you are past the
> > threshold?
> > 
> > I don't understand the use for that second test.
> > 
> 
> Lets say few threads are racing with each other to find a cpu on the
> node. The first thread has already found a task/cpu 'A' to swap and
> finds another task/cpu 'B' thats slightly better than the current
> best_cpu which is 'A'. Currently we allow the second task/cpu 'B' to
> be
> set as best_cpu. However the second or subsequent threads cannot find
> the first task/cpu A because its suppose to be in active migration.
> By
> the time it reaches task/cpu B even that may look to be in active
> migration. It may never know that task/cpu A was cleared. In this
> way,
> the second and subsequent threads may not get a task/cpu in the
> preferred node to swap just because the first task kept hopping
> task/cpu
> as its choice of migration.
> 
> While we can't complete avoid this, the second check will try to make
> sure we don't hop on/hop off just for small incremental numa
> improvement.

However, all those racing tasks start searching
the CPUs on a node from the same start position.

That means they may all get stuck on the same
task/cpu A, and not select the better task/cpu B.

What am I missing? 

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-04 Thread Srikar Dronamraju
* Rik van Riel  [2018-06-04 16:05:55]:

> On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> 
> > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> > task_numa_env *env,
> > if (READ_ONCE(dst_rq->numa_migrate_on))
> > return;
> >  
> > +   if (*move && READ_ONCE(pgdat->active_node_migrate))
> > +   *move = false;
> 
> Why not do this check in task_numa_find_cpu?
> 
> That way you won't have to pass this in as a
> pointer, and you also will not have to recalculate
> NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.
> 

I thought about this. Lets say we evaluated that destination node can
allow movement. While we iterate through the list of cpus trying to find
the best cpu node, we find a idle cpu towards the end of the list.
However if another task as already raced with us to move a task to this
node, then we should bail out. Keeping the check in task_numa_compare
will allow us to do this.

> > /*
> > +* If the numa importance is less than SMALLIMP,
> 
>   ^^^ numa improvement
> 

okay

> > +* task migration might only result in ping pong
> > +* of tasks and also hurt performance due to cache
> > +* misses.
> > +*/
> > +   if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > +   goto unlock;
> 
> I can see a use for the first test, but why limit the
> search for the best score once you are past the
> threshold?
> 
> I don't understand the use for that second test.
> 

Lets say few threads are racing with each other to find a cpu on the
node. The first thread has already found a task/cpu 'A' to swap and
finds another task/cpu 'B' thats slightly better than the current
best_cpu which is 'A'. Currently we allow the second task/cpu 'B' to be
set as best_cpu. However the second or subsequent threads cannot find
the first task/cpu A because its suppose to be in active migration. By
the time it reaches task/cpu B even that may look to be in active
migration. It may never know that task/cpu A was cleared. In this way,
the second and subsequent threads may not get a task/cpu in the
preferred node to swap just because the first task kept hopping task/cpu
as its choice of migration.

While we can't complete avoid this, the second check will try to make
sure we don't hop on/hop off just for small incremental numa
improvement.



> What workload benefits from it?

> 
> -- 
> All Rights Reversed.




Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-04 Thread Srikar Dronamraju
* Rik van Riel  [2018-06-04 16:05:55]:

> On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> 
> > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> > task_numa_env *env,
> > if (READ_ONCE(dst_rq->numa_migrate_on))
> > return;
> >  
> > +   if (*move && READ_ONCE(pgdat->active_node_migrate))
> > +   *move = false;
> 
> Why not do this check in task_numa_find_cpu?
> 
> That way you won't have to pass this in as a
> pointer, and you also will not have to recalculate
> NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.
> 

I thought about this. Lets say we evaluated that destination node can
allow movement. While we iterate through the list of cpus trying to find
the best cpu node, we find a idle cpu towards the end of the list.
However if another task as already raced with us to move a task to this
node, then we should bail out. Keeping the check in task_numa_compare
will allow us to do this.

> > /*
> > +* If the numa importance is less than SMALLIMP,
> 
>   ^^^ numa improvement
> 

okay

> > +* task migration might only result in ping pong
> > +* of tasks and also hurt performance due to cache
> > +* misses.
> > +*/
> > +   if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > +   goto unlock;
> 
> I can see a use for the first test, but why limit the
> search for the best score once you are past the
> threshold?
> 
> I don't understand the use for that second test.
> 

Lets say few threads are racing with each other to find a cpu on the
node. The first thread has already found a task/cpu 'A' to swap and
finds another task/cpu 'B' thats slightly better than the current
best_cpu which is 'A'. Currently we allow the second task/cpu 'B' to be
set as best_cpu. However the second or subsequent threads cannot find
the first task/cpu A because its suppose to be in active migration. By
the time it reaches task/cpu B even that may look to be in active
migration. It may never know that task/cpu A was cleared. In this way,
the second and subsequent threads may not get a task/cpu in the
preferred node to swap just because the first task kept hopping task/cpu
as its choice of migration.

While we can't complete avoid this, the second check will try to make
sure we don't hop on/hop off just for small incremental numa
improvement.



> What workload benefits from it?

> 
> -- 
> All Rights Reversed.




Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-04 Thread Rik van Riel
On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:

> @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> task_numa_env *env,
>   if (READ_ONCE(dst_rq->numa_migrate_on))
>   return;
>  
> + if (*move && READ_ONCE(pgdat->active_node_migrate))
> + *move = false;

Why not do this check in task_numa_find_cpu?

That way you won't have to pass this in as a
pointer, and you also will not have to recalculate
NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.

>   /*
> +  * If the numa importance is less than SMALLIMP,

  ^^^ numa improvement

> +  * task migration might only result in ping pong
> +  * of tasks and also hurt performance due to cache
> +  * misses.
> +  */
> + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> + goto unlock;

I can see a use for the first test, but why limit the
search for the best score once you are past the
threshold?

I don't understand the use for that second test.

What workload benefits from it?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-04 Thread Rik van Riel
On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:

> @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct
> task_numa_env *env,
>   if (READ_ONCE(dst_rq->numa_migrate_on))
>   return;
>  
> + if (*move && READ_ONCE(pgdat->active_node_migrate))
> + *move = false;

Why not do this check in task_numa_find_cpu?

That way you won't have to pass this in as a
pointer, and you also will not have to recalculate
NODE_DATA(cpu_to_node(env->dst_cpu)) for every CPU.

>   /*
> +  * If the numa importance is less than SMALLIMP,

  ^^^ numa improvement

> +  * task migration might only result in ping pong
> +  * of tasks and also hurt performance due to cache
> +  * misses.
> +  */
> + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> + goto unlock;

I can see a use for the first test, but why limit the
search for the best score once you are past the
threshold?

I don't understand the use for that second test.

What workload benefits from it?

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


[PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-04 Thread Srikar Dronamraju
If a node is the destination for a task migration under numa balancing,
then any parallel movements to the node will be restricted. In such a
scenario, detect at the earliest and avoid evaluation for a task
movement.

While here, avoid task migration if the numa imbalance is very minimal.
Especially consider two tasks A and B racing with each other to find the
best cpu to swap. If task A already has found one task/cpu pair to
swap and trying to find a better cpu. Task B is yet to find a better
cpu/task to swap. Task A can race with task B and deprive it from
getting a task/cpu to swap.


Testcase   Time: Min Max Avg  StdDev
numa01.sh  Real:  493.19  672.88  597.51   59.38
numa01.sh   Sys:  150.09  245.48  207.76   34.26
numa01.sh  User:41928.5153779.1748747.06 3901.39
numa02.sh  Real:   60.63   62.87   61.220.83
numa02.sh   Sys:   16.64   27.97   20.254.06
numa02.sh  User: 5222.92 5309.60 5254.03   29.98
numa03.sh  Real:  821.52  902.15  863.60   32.41
numa03.sh   Sys:  112.04  130.66  118.357.08
numa03.sh  User:62245.1669165.1466443.04 2450.32
numa04.sh  Real:  414.53  519.57  476.25   37.00
numa04.sh   Sys:  181.84  335.67  280.41   54.07
numa04.sh  User:33924.5039115.3937343.78 1934.26
numa05.sh  Real:  408.30  441.45  417.90   12.05
numa05.sh   Sys:  233.41  381.60  295.58   57.37
numa05.sh  User:33301.3135972.5034335.19  938.94

Testcase   Time: Min Max Avg  StdDev %Change
numa01.sh  Real:  428.48  837.17  700.45  162.77 -14.6%
numa01.sh   Sys:   78.64  247.70  164.45   58.32 26.33%
numa01.sh  User:37487.2563728.0654399.2710088.13 -10.3%
numa02.sh  Real:   60.07   62.65   61.410.85 -0.30%
numa02.sh   Sys:   15.83   29.36   21.044.48 -3.75%
numa02.sh  User: 5194.27 5280.60 5236.55   28.01 0.333%
numa03.sh  Real:  814.33  881.93  849.69   27.06 1.637%
numa03.sh   Sys:  111.45  134.02  125.287.69 -5.53%
numa03.sh  User:63007.3668013.4665590.46 2023.37 1.299%
numa04.sh  Real:  412.19  438.75  424.439.28 12.20%
numa04.sh   Sys:  232.97  315.77  268.98   26.98 4.249%
numa04.sh  User:33997.3035292.8834711.66  415.78 7.582%
numa05.sh  Real:  394.88  449.45  424.30   22.53 -1.50%
numa05.sh   Sys:  262.03  390.10  314.53   51.01 -6.02%
numa05.sh  User:33389.0335684.4034561.34  942.34 -0.65%

Signed-off-by: Srikar Dronamraju 
---
 kernel/sched/fair.c | 37 +++--
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c388ecf..6851412 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,14 +1535,22 @@ static bool load_too_imbalanced(long src_load, long 
dst_load,
 }
 
 /*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP   30
+
+/*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
- long taskimp, long groupimp, bool move)
+ long taskimp, long groupimp, bool *move)
 {
+   pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
struct rq *dst_rq = cpu_rq(env->dst_cpu);
struct task_struct *cur;
long src_load, dst_load;
@@ -1554,6 +1562,9 @@ static void task_numa_compare(struct task_numa_env *env,
if (READ_ONCE(dst_rq->numa_migrate_on))
return;
 
+   if (*move && READ_ONCE(pgdat->active_node_migrate))
+   *move = false;
+
rcu_read_lock();
cur = task_rcu_dereference(_rq->curr);
if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1567,10 +1578,10 @@ static void task_numa_compare(struct task_numa_env *env,
goto unlock;
 
if (!cur) {
-   if (!move || imp <= env->best_imp)
-   goto unlock;
-   else
+   if (*move && moveimp >= env->best_imp)
goto assign;
+   else
+   goto unlock;
}
 
/*
@@ 

[PATCH 16/19] sched/numa: Detect if node actively handling migration

2018-06-04 Thread Srikar Dronamraju
If a node is the destination for a task migration under numa balancing,
then any parallel movements to the node will be restricted. In such a
scenario, detect at the earliest and avoid evaluation for a task
movement.

While here, avoid task migration if the numa imbalance is very minimal.
Especially consider two tasks A and B racing with each other to find the
best cpu to swap. If task A already has found one task/cpu pair to
swap and trying to find a better cpu. Task B is yet to find a better
cpu/task to swap. Task A can race with task B and deprive it from
getting a task/cpu to swap.


Testcase   Time: Min Max Avg  StdDev
numa01.sh  Real:  493.19  672.88  597.51   59.38
numa01.sh   Sys:  150.09  245.48  207.76   34.26
numa01.sh  User:41928.5153779.1748747.06 3901.39
numa02.sh  Real:   60.63   62.87   61.220.83
numa02.sh   Sys:   16.64   27.97   20.254.06
numa02.sh  User: 5222.92 5309.60 5254.03   29.98
numa03.sh  Real:  821.52  902.15  863.60   32.41
numa03.sh   Sys:  112.04  130.66  118.357.08
numa03.sh  User:62245.1669165.1466443.04 2450.32
numa04.sh  Real:  414.53  519.57  476.25   37.00
numa04.sh   Sys:  181.84  335.67  280.41   54.07
numa04.sh  User:33924.5039115.3937343.78 1934.26
numa05.sh  Real:  408.30  441.45  417.90   12.05
numa05.sh   Sys:  233.41  381.60  295.58   57.37
numa05.sh  User:33301.3135972.5034335.19  938.94

Testcase   Time: Min Max Avg  StdDev %Change
numa01.sh  Real:  428.48  837.17  700.45  162.77 -14.6%
numa01.sh   Sys:   78.64  247.70  164.45   58.32 26.33%
numa01.sh  User:37487.2563728.0654399.2710088.13 -10.3%
numa02.sh  Real:   60.07   62.65   61.410.85 -0.30%
numa02.sh   Sys:   15.83   29.36   21.044.48 -3.75%
numa02.sh  User: 5194.27 5280.60 5236.55   28.01 0.333%
numa03.sh  Real:  814.33  881.93  849.69   27.06 1.637%
numa03.sh   Sys:  111.45  134.02  125.287.69 -5.53%
numa03.sh  User:63007.3668013.4665590.46 2023.37 1.299%
numa04.sh  Real:  412.19  438.75  424.439.28 12.20%
numa04.sh   Sys:  232.97  315.77  268.98   26.98 4.249%
numa04.sh  User:33997.3035292.8834711.66  415.78 7.582%
numa05.sh  Real:  394.88  449.45  424.30   22.53 -1.50%
numa05.sh   Sys:  262.03  390.10  314.53   51.01 -6.02%
numa05.sh  User:33389.0335684.4034561.34  942.34 -0.65%

Signed-off-by: Srikar Dronamraju 
---
 kernel/sched/fair.c | 37 +++--
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c388ecf..6851412 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,14 +1535,22 @@ static bool load_too_imbalanced(long src_load, long 
dst_load,
 }
 
 /*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP   30
+
+/*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
- long taskimp, long groupimp, bool move)
+ long taskimp, long groupimp, bool *move)
 {
+   pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
struct rq *dst_rq = cpu_rq(env->dst_cpu);
struct task_struct *cur;
long src_load, dst_load;
@@ -1554,6 +1562,9 @@ static void task_numa_compare(struct task_numa_env *env,
if (READ_ONCE(dst_rq->numa_migrate_on))
return;
 
+   if (*move && READ_ONCE(pgdat->active_node_migrate))
+   *move = false;
+
rcu_read_lock();
cur = task_rcu_dereference(_rq->curr);
if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1567,10 +1578,10 @@ static void task_numa_compare(struct task_numa_env *env,
goto unlock;
 
if (!cur) {
-   if (!move || imp <= env->best_imp)
-   goto unlock;
-   else
+   if (*move && moveimp >= env->best_imp)
goto assign;
+   else
+   goto unlock;
}
 
/*
@@