Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
On Wed, 2018-06-06 at 08:32 -0700, Srikar Dronamraju wrote:
> Yes its better to skip cpus if they are already in migration.
> And we are already doing it with the above patch. However as I said
> earlier
>
> - Task T1 sets cpu1 as best_cpu.
> - Task T2 finds cpu1 and skips cpu1.
> - Task T1 finds cpu2 slightly better than cpu1.
> - Task T1 resets cpu1, sets best_cpu as cpu2.
> - Task T2 finds cpu2 and skips cpu2.
> - Task T1 finds cpu3 slightly better than cpu2.
> - Task T1 resets cpu2, sets best_cpu as cpu3.
> - Task T2 finds cpu3 and skips cpu3.
>
> So after this, T1 was able to find a cpu, but T2 couldn't find one
> even though there were 3 cpus available for 2 tasks to swap.
>
> Again, this is too much of a corner case, so I am okay to drop it.

Not only is the above race highly unlikely, it is also still possible
with your patch applied, if the scores between cpu1, cpu2, and cpu3
differ by more than SMALLIMP/2.

Not only is this patch some weird magic that makes the code harder to
maintain (since its purpose does not match what the code actually
does), but it also does not reliably do what you intend it to do.

We may be better off not adding this bit of complexity.

-- 
All Rights Reversed.
Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
> > All tasks will not be stuck at task/cpu A.
> >
> > "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> > cpu..." the first task to set cpu A as swap target will ensure
> > subsequent tasks won't be allowed to set cpu A as target for swap
> > till it finds a better task/cpu. Because of this there is a very
> > very small chance of a second task being unable to find a task to
> > swap.
>
> Would it not be better for task_numa_compare to skip from
> consideration CPUs that somebody else is already trying to migrate a
> task to, but still search for the best location, instead of settling
> for a worse location to migrate to?

Yes, it is better to skip cpus if they are already in migration, and we
are already doing that with the above patch. However, as I said
earlier:

- Task T1 sets cpu1 as best_cpu.
- Task T2 finds cpu1 and skips cpu1.
- Task T1 finds cpu2 slightly better than cpu1.
- Task T1 resets cpu1, sets best_cpu as cpu2.
- Task T2 finds cpu2 and skips cpu2.
- Task T1 finds cpu3 slightly better than cpu2.
- Task T1 resets cpu2, sets best_cpu as cpu3.
- Task T2 finds cpu3 and skips cpu3.

So after this, T1 was able to find a cpu, but T2 couldn't find one even
though there were 3 cpus available for 2 tasks to swap.

Again, this is too much of a corner case, so I am okay to drop it.
Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
On Wed, 2018-06-06 at 05:55 -0700, Srikar Dronamraju wrote:
> > > While we can't completely avoid this, the second check will try
> > > to make sure we don't hop on/hop off just for a small incremental
> > > numa improvement.
> >
> > However, all those racing tasks start searching the CPUs on a node
> > from the same start position.
> >
> > That means they may all get stuck on the same task/cpu A, and not
> > select the better task/cpu B.
> >
> > What am I missing?
>
> All tasks will not be stuck at task/cpu A.
>
> "[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
> cpu..." the first task to set cpu A as swap target will ensure
> subsequent tasks won't be allowed to set cpu A as target for swap
> till it finds a better task/cpu. Because of this there is a very
> very small chance of a second task being unable to find a task to
> swap.

Would it not be better for task_numa_compare to skip from consideration
CPUs that somebody else is already trying to migrate a task to, but
still search for the best location, instead of settling for a worse
location to migrate to?

-- 
All Rights Reversed.
Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
> > I thought about this. Let's say we evaluated that the destination
> > node can allow movement. While we iterate through the list of cpus
> > trying to find the best cpu, we find an idle cpu towards the end of
> > the list. However, if another task has already raced with us to
> > move a task to this node, then we should bail out. Keeping the
> > check in task_numa_compare will allow us to do this.
>
> Your check is called once for every invocation of task_numa_compare.
> It does not matter whether it is inside or outside, except on the
> outside the variable manipulation will be easier to read.

Okay, I mistook your comment; basically you want the check to be moved
within the for-loop in task_numa_find_cpu. I will do the needful.

> > While we can't completely avoid this, the second check will try to
> > make sure we don't hop on/hop off just for a small incremental numa
> > improvement.
>
> However, all those racing tasks start searching the CPUs on a node
> from the same start position.
>
> That means they may all get stuck on the same task/cpu A, and not
> select the better task/cpu B.
>
> What am I missing?

All tasks will not be stuck at task/cpu A.

"[PATCH 10/19] sched/numa: Stop multiple tasks from moving to the
cpu..." the first task to set cpu A as swap target will ensure
subsequent tasks won't be allowed to set cpu A as target for swap till
it finds a better task/cpu. Because of this there is a very very small
chance of a second task being unable to find a task to swap.
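For reference, moving the check into task_numa_find_cpu's scan loop
might look roughly like the sketch below. This is pseudocode based on
the posted patch, not final code: the surrounding setup is elided, and
the per-iteration placement preserves the mid-scan bail-out behaviour
while computing pgdat only once instead of once per compare.

```c
static void task_numa_find_cpu(struct task_numa_env *env,
			       long taskimp, long groupimp)
{
	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
	bool move = ...;	/* existing load_too_imbalanced() check */
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
		/*
		 * Re-check on every iteration, so a migration that
		 * starts mid-scan still downgrades us from a move to
		 * a swap, without passing move down as a pointer.
		 */
		if (move && READ_ONCE(pgdat->active_node_migrate))
			move = false;

		env->dst_cpu = cpu;
		task_numa_compare(env, taskimp, groupimp, move);
	}
}
```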
Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
On Mon, 2018-06-04 at 20:56 -0700, Srikar Dronamraju wrote:
> * Rik van Riel [2018-06-04 16:05:55]:
>
> > On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> >
> > > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct task_numa_env *env,
> > > 	if (READ_ONCE(dst_rq->numa_migrate_on))
> > > 		return;
> > >
> > > +	if (*move && READ_ONCE(pgdat->active_node_migrate))
> > > +		*move = false;
> >
> > Why not do this check in task_numa_find_cpu?
> >
> > That way you won't have to pass this in as a pointer, and you also
> > will not have to recalculate NODE_DATA(cpu_to_node(env->dst_cpu))
> > for every CPU.
>
> I thought about this. Let's say we evaluated that the destination
> node can allow movement. While we iterate through the list of cpus
> trying to find the best cpu, we find an idle cpu towards the end of
> the list. However, if another task has already raced with us to move
> a task to this node, then we should bail out. Keeping the check in
> task_numa_compare will allow us to do this.

Your check is called once for every invocation of task_numa_compare.
It does not matter whether it is inside or outside, except on the
outside the variable manipulation will be easier to read.

> > > +	 * task migration might only result in ping pong
> > > +	 * of tasks and also hurt performance due to cache
> > > +	 * misses.
> > > +	 */
> > > +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > > +		goto unlock;
> >
> > I can see a use for the first test, but why limit the search for
> > the best score once you are past the threshold?
> >
> > I don't understand the use for that second test.
>
> Let's say a few threads are racing with each other to find a cpu on
> the node. The first thread has already found a task/cpu 'A' to swap
> and finds another task/cpu 'B' that is slightly better than the
> current best_cpu, which is 'A'. Currently we allow the second
> task/cpu 'B' to be set as best_cpu. However, the second or subsequent
> threads cannot find the first task/cpu A because it is supposed to be
> in active migration. By the time a thread reaches task/cpu B, even
> that may look to be in active migration. It may never know that
> task/cpu A was cleared. In this way, the second and subsequent
> threads may not get a task/cpu in the preferred node to swap just
> because the first task kept hopping task/cpu as its choice of
> migration.
>
> While we can't completely avoid this, the second check will try to
> make sure we don't hop on/hop off just for a small incremental numa
> improvement.

However, all those racing tasks start searching the CPUs on a node from
the same start position.

That means they may all get stuck on the same task/cpu A, and not
select the better task/cpu B.

What am I missing?

-- 
All Rights Reversed.
Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
* Rik van Riel [2018-06-04 16:05:55]:

> On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
>
> > @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct task_numa_env *env,
> > 	if (READ_ONCE(dst_rq->numa_migrate_on))
> > 		return;
> >
> > +	if (*move && READ_ONCE(pgdat->active_node_migrate))
> > +		*move = false;
>
> Why not do this check in task_numa_find_cpu?
>
> That way you won't have to pass this in as a pointer, and you also
> will not have to recalculate NODE_DATA(cpu_to_node(env->dst_cpu)) for
> every CPU.

I thought about this. Let's say we evaluated that the destination node
can allow movement. While we iterate through the list of cpus trying to
find the best cpu, we find an idle cpu towards the end of the list.
However, if another task has already raced with us to move a task to
this node, then we should bail out. Keeping the check in
task_numa_compare will allow us to do this.

> > 	/*
> > +	 * If the numa importance is less than SMALLIMP,
>
>   ^^^ numa improvement

Okay.

> > +	 * task migration might only result in ping pong
> > +	 * of tasks and also hurt performance due to cache
> > +	 * misses.
> > +	 */
> > +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > +		goto unlock;
>
> I can see a use for the first test, but why limit the search for the
> best score once you are past the threshold?
>
> I don't understand the use for that second test.

Let's say a few threads are racing with each other to find a cpu on the
node. The first thread has already found a task/cpu 'A' to swap and
finds another task/cpu 'B' that is slightly better than the current
best_cpu, which is 'A'. Currently we allow the second task/cpu 'B' to
be set as best_cpu. However, the second or subsequent threads cannot
find the first task/cpu A because it is supposed to be in active
migration. By the time a thread reaches task/cpu B, even that may look
to be in active migration. It may never know that task/cpu A was
cleared. In this way, the second and subsequent threads may not get a
task/cpu in the preferred node to swap just because the first task kept
hopping task/cpu as its choice of migration.

While we can't completely avoid this, the second check will try to make
sure we don't hop on/hop off just for a small incremental numa
improvement.

> What workload benefits from it?
>
> -- 
> All Rights Reversed.
Re: [PATCH 16/19] sched/numa: Detect if node actively handling migration
On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> @@ -1554,6 +1562,9 @@ static void task_numa_compare(struct task_numa_env *env,
> 	if (READ_ONCE(dst_rq->numa_migrate_on))
> 		return;
>
> +	if (*move && READ_ONCE(pgdat->active_node_migrate))
> +		*move = false;

Why not do this check in task_numa_find_cpu?

That way you won't have to pass this in as a pointer, and you also will
not have to recalculate NODE_DATA(cpu_to_node(env->dst_cpu)) for every
CPU.

> 	/*
> +	 * If the numa importance is less than SMALLIMP,

  ^^^ numa improvement

> +	 * task migration might only result in ping pong
> +	 * of tasks and also hurt performance due to cache
> +	 * misses.
> +	 */
> +	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> +		goto unlock;

I can see a use for the first test, but why limit the search for the
best score once you are past the threshold?

I don't understand the use for that second test.

What workload benefits from it?

-- 
All Rights Reversed.
[PATCH 16/19] sched/numa: Detect if node actively handling migration
If a node is the destination for a task migration under numa balancing,
then any parallel movements to the node will be restricted. In such a
scenario, detect this as early as possible and avoid evaluating cpus on
that node for task movement.

While here, avoid task migration if the numa imbalance is very minimal.
In particular, consider two tasks, A and B, racing with each other to
find the best cpu to swap with. Task A has already found one task/cpu
pair to swap and is trying to find a better cpu. Task B is yet to find
a task/cpu to swap with. Task A can race with task B and deprive it of
getting a task/cpu to swap.

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      493.19      672.88      597.51       59.38
numa01.sh       Sys:      150.09      245.48      207.76       34.26
numa01.sh      User:    41928.51    53779.17    48747.06     3901.39
numa02.sh      Real:       60.63       62.87       61.22        0.83
numa02.sh       Sys:       16.64       27.97       20.25        4.06
numa02.sh      User:     5222.92     5309.60     5254.03       29.98
numa03.sh      Real:      821.52      902.15      863.60       32.41
numa03.sh       Sys:      112.04      130.66      118.35        7.08
numa03.sh      User:    62245.16    69165.14    66443.04     2450.32
numa04.sh      Real:      414.53      519.57      476.25       37.00
numa04.sh       Sys:      181.84      335.67      280.41       54.07
numa04.sh      User:    33924.50    39115.39    37343.78     1934.26
numa05.sh      Real:      408.30      441.45      417.90       12.05
numa05.sh       Sys:      233.41      381.60      295.58       57.37
numa05.sh      User:    33301.31    35972.50    34335.19      938.94

Testcase       Time:         Min         Max         Avg      StdDev   %Change
numa01.sh      Real:      428.48      837.17      700.45      162.77    -14.6%
numa01.sh       Sys:       78.64      247.70      164.45       58.32    26.33%
numa01.sh      User:    37487.25    63728.06    54399.27    10088.13    -10.3%
numa02.sh      Real:       60.07       62.65       61.41        0.85    -0.30%
numa02.sh       Sys:       15.83       29.36       21.04        4.48    -3.75%
numa02.sh      User:     5194.27     5280.60     5236.55       28.01    0.333%
numa03.sh      Real:      814.33      881.93      849.69       27.06    1.637%
numa03.sh       Sys:      111.45      134.02      125.28        7.69    -5.53%
numa03.sh      User:    63007.36    68013.46    65590.46     2023.37    1.299%
numa04.sh      Real:      412.19      438.75      424.43        9.28    12.20%
numa04.sh       Sys:      232.97      315.77      268.98       26.98    4.249%
numa04.sh      User:    33997.30    35292.88    34711.66      415.78    7.582%
numa05.sh      Real:      394.88      449.45      424.30       22.53    -1.50%
numa05.sh       Sys:      262.03      390.10      314.53       51.01    -6.02%
numa05.sh      User:    33389.03    35684.40    34561.34      942.34    -0.65%

Signed-off-by: Srikar Dronamraju
---
 kernel/sched/fair.c | 37 +++--
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c388ecf..6851412 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,14 +1535,22 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 }
 
 /*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP	30
+
+/*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
-			      long taskimp, long groupimp, bool move)
+			      long taskimp, long groupimp, bool *move)
 {
+	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long src_load, dst_load;
@@ -1554,6 +1562,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (READ_ONCE(dst_rq->numa_migrate_on))
 		return;
 
+	if (*move && READ_ONCE(pgdat->active_node_migrate))
+		*move = false;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1567,10 +1578,10 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 	if (!cur) {
-		if (!move || imp <= env->best_imp)
-			goto unlock;
-		else
+		if (*move && moveimp >= env->best_imp)
 			goto assign;
+		else
+			goto unlock;
 	}
 
 	/*
@@