Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-21 Thread Jirka Hladky
Resending - previously I forgot to send the message in plain-text
mode. I'm sorry.

Hi Mel,

Thanks a lot for looking into this!

Our results are mostly in line with what you see. We observe big gains
(20-50%) when the system is loaded to 1/3 of its maximum capacity, and
mixed results at full load - some workloads benefit from the patch at
full load, others do not, but the performance changes at full load are
mostly within the noise of the results (+/-5%). Overall, we think this
patch is helpful.

Jirka


On Mon, Sep 21, 2020 at 1:02 PM Mel Gorman  wrote:
>
> On Thu, Sep 10, 2020 at 11:50:04PM +0200, Jirka Hladky wrote:
> > 1) Thread stability has improved a lot. We see far fewer thread
> > migrations. See, for example, these heatmaps based on the mpstat
> > results collected while running the sp.C test from the NAS benchmark.
> > sp.C was run with 16 threads on a dual-socket AMD 7351 server with 8
> > NUMA nodes:
> > 5.9 vanilla 
> > https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
> > 5.9 with the patch
> > https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing
> >
> > The heatmaps are generated from the mpstat output (there are 5
> > different runs in one picture). We collect CPU utilization every 5
> > seconds. A lighter color corresponds to lower CPU utilization and
> > means that the thread may have run on different CPUs during those 5
> > seconds. Solid straight lines, on the other hand, mean that the
> > thread was running on the same CPU the whole time. The difference is
> > striking.
> >
> > We see significantly fewer thread migrations across many different
> > tests - NAS, SPECjbb2005, SPECjvm2008.
> >
>
> For all three, whether it's better or worse depends on overall
> activity. I looked at heatmaps for a variety of workloads and visually
> the bigger differences tend to appear when utilisation is relatively
> low (e.g. one third of CPUs active).
>
> > 2) We also see a performance improvement in terms of runtime,
> > especially in low-load scenarios (the number of threads roughly equal
> > to 2x the number of NUMA nodes). It fixes the performance drop we
> > have seen since the 5.7 kernel, discussed for example here:
> > https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/
> >
>
> This would make some sense given the original intent of allowing
> imbalances, which was later revised significantly. It depends on
> whether memory throughput is more important, so the impact varies with
> the level of utilisation. For example, on a 2-socket machine running
> SPECjbb 2005 I see
> I see
>
> specjbb
>                           5.9.0-rc4       5.9.0-rc4
>                             vanilla dstbalance-v1r1
> Hmean tput-1 46425.00 (   0.00%)43394.00 *  -6.53%*
> Hmean tput-2 98416.00 (   0.00%)96031.00 *  -2.42%*
> Hmean tput-3150184.00 (   0.00%)   148783.00 *  -0.93%*
> Hmean tput-4200683.00 (   0.00%)   197906.00 *  -1.38%*
> Hmean tput-5236305.00 (   0.00%)   245549.00 *   3.91%*
> Hmean tput-6281559.00 (   0.00%)   285692.00 *   1.47%*
> Hmean tput-7338558.00 (   0.00%)   334467.00 *  -1.21%*
> Hmean tput-8340745.00 (   0.00%)   372501.00 *   9.32%*
> Hmean tput-9424343.00 (   0.00%)   413006.00 *  -2.67%*
> Hmean tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
> Hmean tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
> Hmean tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
> Hmean tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
> Hmean tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
> Hmean tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
> Hmean tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
> Hmean tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
> Hmean tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
> Hmean tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
> Hmean tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
> Hmean tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
> Hmean tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
> Hmean tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
> Hmean tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
> Hmean tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
> Hmean tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
> Hmean tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
> Hmean tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
> Hmean tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
> Hmean tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
> Hmean tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
> Hmean tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
> Hmean tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
> Hmean tput-34  1090538.00 (   0.00%) 

Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-21 Thread Mel Gorman
On Thu, Sep 10, 2020 at 11:50:04PM +0200, Jirka Hladky wrote:
> 1) Thread stability has improved a lot. We see far fewer thread
> migrations. See, for example, these heatmaps based on the mpstat
> results collected while running the sp.C test from the NAS benchmark.
> sp.C was run with 16 threads on a dual-socket AMD 7351 server with 8
> NUMA nodes:
> 5.9 vanilla 
> https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
> 5.9 with the patch
> https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing
> 
> The heatmaps are generated from the mpstat output (there are 5
> different runs in one picture). We collect CPU utilization every 5
> seconds. A lighter color corresponds to lower CPU utilization and
> means that the thread may have run on different CPUs during those 5
> seconds. Solid straight lines, on the other hand, mean that the
> thread was running on the same CPU the whole time. The difference is
> striking.
> 
> We see significantly fewer thread migrations across many different
> tests - NAS, SPECjbb2005, SPECjvm2008.
> 

For all three, whether it's better or worse depends on overall
activity. I looked at heatmaps for a variety of workloads and visually
the bigger differences tend to appear when utilisation is relatively
low (e.g. one third of CPUs active).

> 2) We also see a performance improvement in terms of runtime,
> especially in low-load scenarios (the number of threads roughly equal
> to 2x the number of NUMA nodes). It fixes the performance drop we
> have seen since the 5.7 kernel, discussed for example here:
> https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/
> 

This would make some sense given the original intent of allowing
imbalances, which was later revised significantly. It depends on
whether memory throughput is more important, so the impact varies with
the level of utilisation. For example, on a 2-socket machine running
SPECjbb 2005 I see
I see

specjbb
                          5.9.0-rc4       5.9.0-rc4
                            vanilla dstbalance-v1r1
Hmean tput-1 46425.00 (   0.00%)43394.00 *  -6.53%*
Hmean tput-2 98416.00 (   0.00%)96031.00 *  -2.42%*
Hmean tput-3150184.00 (   0.00%)   148783.00 *  -0.93%*
Hmean tput-4200683.00 (   0.00%)   197906.00 *  -1.38%*
Hmean tput-5236305.00 (   0.00%)   245549.00 *   3.91%*
Hmean tput-6281559.00 (   0.00%)   285692.00 *   1.47%*
Hmean tput-7338558.00 (   0.00%)   334467.00 *  -1.21%*
Hmean tput-8340745.00 (   0.00%)   372501.00 *   9.32%*
Hmean tput-9424343.00 (   0.00%)   413006.00 *  -2.67%*
Hmean tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
Hmean tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
Hmean tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
Hmean tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
Hmean tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
Hmean tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
Hmean tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
Hmean tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
Hmean tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
Hmean tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
Hmean tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
Hmean tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
Hmean tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
Hmean tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
Hmean tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
Hmean tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
Hmean tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
Hmean tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
Hmean tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
Hmean tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
Hmean tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
Hmean tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
Hmean tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
Hmean tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
Hmean tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
Hmean tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
Hmean tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
Hmean tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
Hmean tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
Hmean tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
Hmean tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
Hmean tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
Hmean tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
Hmean tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
Hmean tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
Hmean tput-45  1326752.00 (   0.00%)  

Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-10 Thread Jirka Hladky
Hi Hillf and Mel,

thanks a lot for bringing this to my attention. We have tested the
proposed patch and we are getting excellent results so far!

1) Thread stability has improved a lot. We see far fewer thread
migrations. See, for example, these heatmaps based on the mpstat
results collected while running the sp.C test from the NAS benchmark.
sp.C was run with 16 threads on a dual-socket AMD 7351 server with 8
NUMA nodes:
5.9 vanilla 
https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
5.9 with the patch
https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing

The heatmaps are generated from the mpstat output (there are 5
different runs in one picture). We collect CPU utilization every 5
seconds. A lighter color corresponds to lower CPU utilization and
means that the thread may have run on different CPUs during those 5
seconds. Solid straight lines, on the other hand, mean that the
thread was running on the same CPU the whole time. The difference is
striking.

We see significantly fewer thread migrations across many different
tests - NAS, SPECjbb2005, SPECjvm2008.

2) We also see a performance improvement in terms of runtime,
especially in low-load scenarios (the number of threads roughly equal
to 2x the number of NUMA nodes). It fixes the performance drop we
have seen since the 5.7 kernel, discussed for example here:
https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/

The biggest improvements are for the NAS and the SPECjvm2008
benchmarks (typically between 20-50%). SPECjbb2005 also shows
improvements, around 10%. This is again on NUMA servers at low
utilization. You can find snapshots of some results here:
https://drive.google.com/drive/folders/1k3Gb-vlI7UjPZZcLkoL2W2VB_zqxIJ3_?usp=sharing

@Mel - could you please test the proposed patch? I know you have good
coverage for inter-process communication benchmarks, which may show
different behavior than the benchmarks we use. I hope that fewer
thread migrations might show positive effects for these tests as well.
Please give it a try.

Thanks a lot!
Jirka


On Tue, Sep 8, 2020 at 3:07 AM Hillf Danton  wrote:
>
>
> On Mon, 7 Sep 2020 18:01:06 +0530 Srikar Dronamraju wrote:
> > > > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > > > > Something is wrong. In find_busiest_group(), we are checking
> > > > > if src has higher load; however, in task_numa_find_cpu(), we
> > > > > are checking if dst will have higher load after balancing. It
> > > > > does not seem sensible to check src.
> > > > > It may cause a wrong imbalance value: for example, if
> > > > > dst_running = env->dst_stats.nr_running + 1 results in 3 or
> > > > > above, and
> > > > > src_running = env->src_stats.nr_running - 1 results in 1,
> > > > > the current code treats the imbalance as 0 since src_running
> > > > > is smaller than 2.
> > > > > This is inconsistent with the load balancer.
> > > > >
>
> Hi Srikar and Barry
> >
> > I have observed behaviour similar to what Barry Song has documented,
> > with a simple ebizzy run using fewer threads on a 2-node system
>
> Thanks for your testing.
> >
> > ebizzy -t 6 -S 100
> >
> > We see a couple of ebizzy threads moving back and forth between the
> > 2 nodes because the NUMA balancer and the load balancer try to do
> > the exact opposite.
> >
> > However, with Barry's patch, a couple of tests regress heavily (any
> > NUMA workload that has shared NUMA faults).
> > For example:
> > perf bench numa mem --no-data_rand_walk -p 1 -t 6 -G 0 -P 3072 -T 0 -l 50 -c
> >
> > I also don't understand the rationale behind checking dst_running in
> > the NUMA balancer path. This almost means no NUMA balancing in
> > lightly loaded scenarios.
> >
> > So agree with Mel that we should probably test more scenarios before
> > we accept this patch.
>
> Take a look at Jirka's work [1] please if you have any plan to do some
> more tests.
>
> [1] 
> https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/
> >
> > > >
> > > > It checks the conditions as if the move were to happen. Have you
> > > > evaluated this for a NUMA balancing load and confirmed that it
> > > > a) balances properly and b) does not increase the scan rate
> > > > trying to "fix" the problem?
> > >
> > > I think the original code was trying to check whether the numa
> > > migration would lead to a new imbalance in the load balancer.
> > > Suppose src is A, dst is B, and both of them have nr_running of 2.
> > > A moves one task to B; then A will have 1 and B will have 3. The
> > > load balancer will then try to pull a task back from B since B's
> > > nr_running exceeds the allowed imbalance. But the code reports
> > > imbalance=0 because A's nr_running is smaller than the minimum
> > > imbalance threshold.
> > >
> > > Will share more test data if you need.
> > >
> > > >
> > > > --
> > > > Mel Gorman
> > > > SUSE Labs
> > >
> > > Thanks
> > > Barry
> >
> > --
> > Thanks and Regards
> > Srikar Dronamraju
>
>


Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-07 Thread Srikar Dronamraju
> > 
> > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > > Something is wrong. In find_busiest_group(), we are checking if
> > > src has higher load; however, in task_numa_find_cpu(), we are
> > > checking if dst will have higher load after balancing. It does not
> > > seem sensible to check src.
> > > It may cause a wrong imbalance value: for example, if
> > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above,
> > > and
> > > src_running = env->src_stats.nr_running - 1 results in 1,
> > > the current code treats the imbalance as 0 since src_running is
> > > smaller than 2.
> > > This is inconsistent with the load balancer.
> > >

I have observed behaviour similar to what Barry Song has documented,
with a simple ebizzy run using fewer threads on a 2-node system

ebizzy -t 6 -S 100

We see a couple of ebizzy threads moving back and forth between the
2 nodes because the NUMA balancer and the load balancer try to do the
exact opposite.

However, with Barry's patch, a couple of tests regress heavily (any
NUMA workload that has shared NUMA faults).
For example:
perf bench numa mem --no-data_rand_walk -p 1 -t 6 -G 0 -P 3072 -T 0 -l 50 -c

I also don't understand the rationale behind checking dst_running in
the NUMA balancer path. This almost means no NUMA balancing in lightly
loaded scenarios.

So agree with Mel that we should probably test more scenarios before
we accept this patch.

> > 
> > It checks the conditions as if the move were to happen. Have you
> > evaluated this for a NUMA balancing load and confirmed that it
> > a) balances properly and b) does not increase the scan rate trying
> > to "fix" the problem?
> 
> I think the original code was trying to check whether the numa
> migration would lead to a new imbalance in the load balancer. Suppose
> src is A, dst is B, and both of them have nr_running of 2. A moves one
> task to B; then A will have 1 and B will have 3. The load balancer
> will then try to pull a task back from B since B's nr_running exceeds
> the allowed imbalance. But the code reports imbalance=0 because A's
> nr_running is smaller than the minimum imbalance threshold.
> 
> Will share more test data if you need.
> 
> > 
> > --
> > Mel Gorman
> > SUSE Labs
> 
> Thanks
> Barry

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-07 Thread Mel Gorman
On Mon, Sep 07, 2020 at 09:44:03AM +, Song Bao Hua (Barry Song) wrote:
> 
> 
> > -Original Message-
> > From: Mel Gorman [mailto:mgor...@suse.de]
> > Sent: Monday, September 7, 2020 9:27 PM
> > To: Song Bao Hua (Barry Song) 
> > Cc: mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> > vincent.guit...@linaro.org; dietmar.eggem...@arm.com;
> > bseg...@google.com; linux-kernel@vger.kernel.org; Linuxarm
> > ; Mel Gorman ;
> > Peter Zijlstra ; Valentin Schneider
> > ; Phil Auld ; Hillf Danton
> > ; Ingo Molnar 
> > Subject: Re: [PATCH] sched/fair: use dst group while checking imbalance for
> > NUMA balancer
> > 
> > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > > Something is wrong. In find_busiest_group(), we are checking if
> > > src has higher load; however, in task_numa_find_cpu(), we are
> > > checking if dst will have higher load after balancing. It does not
> > > seem sensible to check src.
> > > It may cause a wrong imbalance value: for example, if
> > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above,
> > > and
> > > src_running = env->src_stats.nr_running - 1 results in 1,
> > > the current code treats the imbalance as 0 since src_running is
> > > smaller than 2.
> > > This is inconsistent with the load balancer.
> > >
> > 
> > It checks the conditions as if the move were to happen. Have you
> > evaluated this for a NUMA balancing load and confirmed that it
> > a) balances properly and b) does not increase the scan rate trying
> > to "fix" the problem?
> 
> I think the original code was trying to check whether the numa
> migration would lead to a new imbalance in the load balancer. Suppose
> src is A, dst is B, and both of them have nr_running of 2. A moves one
> task to B; then A will have 1 and B will have 3. The load balancer
> will then try to pull a task back from B since B's nr_running exceeds
> the allowed imbalance. But the code reports imbalance=0 because A's
> nr_running is smaller than the minimum imbalance threshold.
> 
> Will share more test data if you need.
> 

Include the test description, data and details of the system you used
to evaluate the patch. I ask because the load/NUMA reconciliation took
a long time to cover all the corner cases and it's very easy to
reintroduce major regressions. At least one of those corner cases was
balancing in the wrong direction, because in some cases NUMA balancing
will allow a small imbalance if it makes sense from a locality point of
view. Another corner case was that if that small imbalance is too large
or allowed at the wrong time, performance regresses overall even though
locality is good, because of memory bandwidth limitations. This is
obviously far from ideal but it does mean that it's an area that needs
data backing up the changes.

-- 
Mel Gorman
SUSE Labs


RE: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mel Gorman [mailto:mgor...@suse.de]
> Sent: Monday, September 7, 2020 9:27 PM
> To: Song Bao Hua (Barry Song) 
> Cc: mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> vincent.guit...@linaro.org; dietmar.eggem...@arm.com;
> bseg...@google.com; linux-kernel@vger.kernel.org; Linuxarm
> ; Mel Gorman ;
> Peter Zijlstra ; Valentin Schneider
> ; Phil Auld ; Hillf Danton
> ; Ingo Molnar 
> Subject: Re: [PATCH] sched/fair: use dst group while checking imbalance for
> NUMA balancer
> 
> On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > Something is wrong. In find_busiest_group(), we are checking if src
> > has higher load; however, in task_numa_find_cpu(), we are checking
> > if dst will have higher load after balancing. It does not seem
> > sensible to check src.
> > It may cause a wrong imbalance value: for example, if
> > dst_running = env->dst_stats.nr_running + 1 results in 3 or above,
> > and
> > src_running = env->src_stats.nr_running - 1 results in 1,
> > the current code treats the imbalance as 0 since src_running is
> > smaller than 2.
> > This is inconsistent with the load balancer.
> >
> 
> It checks the conditions as if the move were to happen. Have you
> evaluated this for a NUMA balancing load and confirmed that it
> a) balances properly and b) does not increase the scan rate trying to
> "fix" the problem?

I think the original code was trying to check whether the numa
migration would lead to a new imbalance in the load balancer. Suppose
src is A, dst is B, and both of them have nr_running of 2. A moves one
task to B; then A will have 1 and B will have 3. The load balancer
will then try to pull a task back from B since B's nr_running exceeds
the allowed imbalance. But the code reports imbalance=0 because A's
nr_running is smaller than the minimum imbalance threshold.

Will share more test data if you need.

> 
> --
> Mel Gorman
> SUSE Labs

Thanks
Barry


Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-07 Thread Mel Gorman
On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> Something is wrong. In find_busiest_group(), we are checking if src
> has higher load; however, in task_numa_find_cpu(), we are checking if
> dst will have higher load after balancing. It does not seem sensible
> to check src.
> It may cause a wrong imbalance value: for example, if
> dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
> src_running = env->src_stats.nr_running - 1 results in 1,
> the current code treats the imbalance as 0 since src_running is
> smaller than 2.
> This is inconsistent with the load balancer.
> 

It checks the conditions as if the move were to happen. Have you
evaluated this for a NUMA balancing load and confirmed that it
a) balances properly and b) does not increase the scan rate trying to
"fix" the problem?

-- 
Mel Gorman
SUSE Labs


[PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-07 Thread Barry Song
Something is wrong. In find_busiest_group(), we are checking if src has
higher load; however, in task_numa_find_cpu(), we are checking if dst
will have higher load after balancing. It does not seem sensible to
check src.
It may cause a wrong imbalance value: for example, if
dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
src_running = env->src_stats.nr_running - 1 results in 1,
the current code treats the imbalance as 0 since src_running is smaller
than 2.
This is inconsistent with the load balancer.

Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Vincent Guittot 
Cc: Juri Lelli 
Cc: Dietmar Eggemann 
Cc: Valentin Schneider 
Cc: Phil Auld 
Cc: Hillf Danton 
Cc: Ingo Molnar 
Signed-off-by: Barry Song 
---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a68a05..90cfee7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1548,7 +1548,7 @@ struct task_numa_env {
 
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_util(int cpu);
-static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
+static inline long adjust_numa_imbalance(int imbalance, int nr_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1925,7 +1925,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		src_running = env->src_stats.nr_running - 1;
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
-		imbalance = adjust_numa_imbalance(imbalance, src_running);
+		imbalance = adjust_numa_imbalance(imbalance, dst_running);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -8957,7 +8957,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
+static inline long adjust_numa_imbalance(int imbalance, int nr_running)
 {
 	unsigned int imbalance_min;
 
@@ -8966,7 +8966,7 @@ static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
 	 * tasks that remain local when the source domain is almost idle.
 	 */
 	imbalance_min = 2;
-	if (src_nr_running <= imbalance_min)
+	if (nr_running <= imbalance_min)
 		return 0;
 
 	return imbalance;
-- 
2.7.4