Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-21 Thread Jirka Hladky
Resending - previously, I forgot to send the message in plain text
mode. I'm sorry.

Hi Mel,

Thanks a lot for looking into this!

Our results are mostly in line with what you see. We observe big gains
(20-50%) when the system is loaded to 1/3 of its maximum capacity, and
mixed results at full load - some workloads benefit from the patch at
full load, others do not, but the performance changes at full load are
mostly within the noise of the results (+/-5%). Overall, we think this
patch is helpful.

Jirka


On Mon, Sep 21, 2020 at 1:02 PM Mel Gorman  wrote:
>
> On Thu, Sep 10, 2020 at 11:50:04PM +0200, Jirka Hladky wrote:
> > 1) Thread stability has improved a lot. We see far fewer thread
> > migrations. Check, for example, this heatmap based on the mpstat results
> > collected while running the sp.C test from the NAS benchmark. sp.C was run
> > with 16 threads on a dual-socket AMD 7351 server with 8 NUMA nodes:
> > 5.9 vanilla 
> > https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
> > 5.9 with the patch
> > https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing
> >
> > The heatmaps are generated from the mpstat output (there are 5
> > different runs in one picture). We collect CPU utilization every 5
> > seconds. A lighter color corresponds to lower CPU utilization. A light
> > color means that the thread may have run on different CPUs during those
> > 5 seconds. Solid straight lines, on the other hand, mean that the thread
> > was running on the same CPU all the time. The difference is striking.
> >
> > We see significantly fewer thread migrations across many different
> > tests - NAS, SPECjbb2005, SPECjvm2008.
> >
>
> For all three, I see that the point where it's better or worse depends on
> overall activity. I looked at heatmaps for a variety of workloads and
> visually the bigger differences tend to be when utilisation is relatively
> low (e.g. one third of CPUs active).
>
> > 2) We also see a performance improvement in terms of runtime, especially
> > in low-load scenarios (number of threads roughly equal to 2x the
> > number of NUMA nodes). It fixes the performance drop we have seen
> > since the 5.7 kernel, discussed for example here:
> > https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/
> >
>
> This would make some sense given the original intent about allowing
> imbalances, which was later revised significantly. It depends on when
> memory throughput is more important, so the impact varies with the level
> of utilisation.  For example, on a 2-socket machine running SPECjbb2005
> I see
>
> specjbb
>                             5.9.0-rc4              5.9.0-rc4
>                               vanilla        dstbalance-v1r1
> Hmean     tput-1      46425.00 (   0.00%)    43394.00 *  -6.53%*
> Hmean     tput-2      98416.00 (   0.00%)    96031.00 *  -2.42%*
> Hmean     tput-3     150184.00 (   0.00%)   148783.00 *  -0.93%*
> Hmean     tput-4     200683.00 (   0.00%)   197906.00 *  -1.38%*
> Hmean     tput-5     236305.00 (   0.00%)   245549.00 *   3.91%*
> Hmean     tput-6     281559.00 (   0.00%)   285692.00 *   1.47%*
> Hmean     tput-7     338558.00 (   0.00%)   334467.00 *  -1.21%*
> Hmean     tput-8     340745.00 (   0.00%)   372501.00 *   9.32%*
> Hmean     tput-9     424343.00 (   0.00%)   413006.00 *  -2.67%*
> Hmean     tput-10    421854.00 (   0.00%)   434261.00 *   2.94%*
> Hmean     tput-11    493256.00 (   0.00%)   485330.00 *  -1.61%*
> Hmean     tput-12    549573.00 (   0.00%)   529959.00 *  -3.57%*
> Hmean     tput-13    593183.00 (   0.00%)   555010.00 *  -6.44%*
> Hmean     tput-14    588252.00 (   0.00%)   599166.00 *   1.86%*
> Hmean     tput-15    623065.00 (   0.00%)   642713.00 *   3.15%*
> Hmean     tput-16    703924.00 (   0.00%)   660758.00 *  -6.13%*
> Hmean     tput-17    666023.00 (   0.00%)   697675.00 *   4.75%*
> Hmean     tput-18    761502.00 (   0.00%)   758360.00 *  -0.41%*
> Hmean     tput-19    796088.00 (   0.00%)   798368.00 *   0.29%*
> Hmean     tput-20    733564.00 (   0.00%)   823086.00 *  12.20%*
> Hmean     tput-21    840980.00 (   0.00%)   856711.00 *   1.87%*
> Hmean     tput-22    804285.00 (   0.00%)   872238.00 *   8.45%*
> Hmean     tput-23    795208.00 (   0.00%)   889374.00 *  11.84%*
> Hmean     tput-24    848619.00 (   0.00%)   966783.00 *  13.92%*
> Hmean     tput-25    750848.00 (   0.00%)   903790.00 *  20.37%*
> Hmean     tput-26    780523.00 (   0.00%)   962254.00 *  23.28%*
> Hmean     tput-27   1042245.00 (   0.00%)   991544.00 *  -4.86%*
> Hmean     tput-28   1090580.00 (   0.00%)  1035926.00 *  -5.01%*
> Hmean     tput-29    999483.00 (   0.00%)  1082948.00 *   8.35%*
> Hmean   

Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer

2020-09-10 Thread Jirka Hladky
Hi Hillf and Mel,

thanks a lot for bringing this to my attention. We have tested the
proposed patch and we are getting excellent results so far!

1) Thread stability has improved a lot. We see far fewer thread
migrations. Check, for example, this heatmap based on the mpstat results
collected while running the sp.C test from the NAS benchmark. sp.C was run
with 16 threads on a dual-socket AMD 7351 server with 8 NUMA nodes:
5.9 vanilla 
https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
5.9 with the patch
https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing

The heatmaps are generated from the mpstat output (there are 5
different runs in one picture). We collect CPU utilization every 5
seconds. A lighter color corresponds to lower CPU utilization. A light
color means that the thread may have run on different CPUs during those
5 seconds. Solid straight lines, on the other hand, mean that the thread
was running on the same CPU all the time. The difference is striking.

We see significantly fewer thread migrations across many different
tests - NAS, SPECjbb2005, SPECjvm2008.

2) We also see a performance improvement in terms of runtime, especially
in low-load scenarios (number of threads roughly equal to 2x the
number of NUMA nodes). It fixes the performance drop we have seen
since the 5.7 kernel, discussed for example here:
https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/

The biggest improvements are for the NAS and SPECjvm2008 benchmarks
(typically between 20-50%). SPECjbb2005 also shows improvements, around
10%. This is again on NUMA servers at low utilization. You can find
snapshots of some results here:
https://drive.google.com/drive/folders/1k3Gb-vlI7UjPZZcLkoL2W2VB_zqxIJ3_?usp=sharing

@Mel - could you please test the proposed patch? I know you have good
coverage for the inter-process communication benchmarks, which may show
different behavior than the benchmarks we use. I hope that fewer
thread migrations might show positive effects for these tests as well.
Please give it a try.

Thanks a lot!
Jirka


On Tue, Sep 8, 2020 at 3:07 AM Hillf Danton  wrote:
>
>
> On Mon, 7 Sep 2020 18:01:06 +0530 Srikar Dronamraju wrote:
> > > > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote:
> > > > > Something is wrong. In find_busiest_group(), we are checking if src has
> > > > > higher load; however, in task_numa_find_cpu(), we are checking if dst
> > > > > will have higher load after balancing. It seems it is not sensible to
> > > > > check src.
> > > > > It may cause a wrong imbalance value. For example, if
> > > > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and
> > > > > src_running = env->src_stats.nr_running - 1 results in 1,
> > > > > the current code computes the imbalance as 0 since src_running is
> > > > > smaller than 2.
> > > > > This is inconsistent with the load balancer.
> > > > >
>
> Hi Srikar and Barry
> >
> > I have observed behaviour similar to what Barry Song has documented, with a
> > simple ebizzy run with fewer threads on a 2-node system
>
> Thanks for your testing.
> >
> > ebizzy -t 6 -S 100
> >
> > We see a couple of ebizzy threads moving back and forth between the 2 nodes
> > because the NUMA balancer and the load balancer try to do the exact opposite.
> >
> > However, with Barry's patch, a couple of tests regress heavily. (Any numa
> > workload that has shared numa faults).
> > For example:
> > perf bench numa mem --no-data_rand_walk -p 1 -t 6 -G 0 -P 3072 -T 0 -l 50 -c
> >
> > I also don't understand the rationale behind checking for dst_running in the
> > numa balancer path. This almost means no numa balancing in a lightly loaded
> > scenario.
> >
> > So I agree with Mel that we should probably test more scenarios before
> > we accept this patch.
>
> Take a look at Jirka's work [1] please if you have any plan to do some
> more tests.
>
> [1] 
> https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/
> >
> > > >
> > > > It checks the conditions if the move was to happen. Have you evaluated
> > > > this for a NUMA balancing load and confirmed it a) balances properly and
> > > > b) does not increase the scan rate trying to "fix" the problem?
> > >
> > > I think the original code was trying to check if the numa migration
> > > would lead to a new imbalance in the load balancer. Suppose src is A, dst is
> > > B, and both of them have nr_running of 2. A moves one task to B; then A
> > > will have 1 and B will have 3. In the load balancer, A will try to pull a
> > > task from B since B's nr_running is larger than min_imbalance. But the code
> > > reports imbalance=0 because it finds A's nr_running is smaller than
> > > min_imbalance.
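
To make the arithmetic in the example above concrete, here is a minimal
userspace sketch (an illustration only, not kernel code) of the 5.7+
computation; the threshold of 2 mirrors the imbalance_min hard-coded in
kernel/sched/fair.c:

===
/* Userspace illustration of the NUMA-balancer imbalance computation
 * discussed above.  src and dst both start with nr_running == 2. */
#include <stdio.h>

static long adjust_numa_imbalance(long imbalance, long src_nr_running)
{
	const long imbalance_min = 2;	/* hard-coded threshold in 5.7+ fair.c */

	if (src_nr_running <= imbalance_min)
		return 0;
	return imbalance;
}

int main(void)
{
	long src_nr_running = 2, dst_nr_running = 2;
	long src_running = src_nr_running - 1;		/* 1: src after the move */
	long dst_running = dst_nr_running + 1;		/* 3: dst after the move */
	long imbalance = dst_running - src_running;	/* 2 */

	if (imbalance < 0)
		imbalance = 0;
	/* Clamped to 0 because src_running <= 2, so the NUMA balancer allows
	 * the move even though the load balancer will see dst as overloaded. */
	imbalance = adjust_numa_imbalance(imbalance, src_running);

	printf("imbalance seen by the NUMA balancer: %ld\n", imbalance);
	return 0;
}
===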
> > >
> > > Will share more test data if you need.
> > >
> > > >
> > > > --
> > > > Mel Gorman
> > > > SUSE Labs
> > >
> > > Thanks
> > > Barry
> >
> > --
> > Thanks and Regards
> > Srikar Dronamraju
>
>


Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-20 Thread Jirka Hladky
I have an update on netperf-cstate-small-cross-socket results.

Reported performance degradation of 2.5% for the UDP stream throughput
and 0.6% for the TCP throughput is for message size of 16kB. For
smaller message sizes, the performance drop is higher - up to 5% for
UDP throughput for a message size of 64B. See the numbers below [1]

We still think that it's acceptable given the gains in other
situations (this is again compared to 5.7 vanilla):

* solved the performance drop of up to 20% with the single-instance
SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
Rome systems) => this performance drop was INCREASING with higher
thread counts (10% for 16 threads and 20% for 32 threads)
* solved the performance drop of up to 50% for low-load scenarios
(SPECjvm2008 and NAS)

[1]
Hillf's patch compared to 5.7 (rc4) vanilla:

TCP throughput
Message size (B)
64      -2.6%
128     -2.3%
256     -2.6%
1024    -2.7%
2048    -2.2%
3312    -2.4%
4096    -1.1%
8192    -0.4%
16384   -0.6%

UDP throughput
64      -5.0%
128     -3.0%
256     -3.0%
1024    -3.1%
2048    -3.3%
3312    -3.5%
4096    -4.0%
8192    -3.3%
16384   -2.6%


On Wed, May 20, 2020 at 3:58 PM Jirka Hladky  wrote:
>
> Hi Hillf, Mel and all,
>
> thanks for the patch! It has produced really GOOD results.
>
> 1) It has fixed the performance problems with the 5.7 vanilla kernel for
> single-tenant workloads and low system load scenarios, without
> performance degradation for multi-tenant tasks. It's producing the
> same results as the previous proof-of-concept patch where the
> adjust_numa_imbalance function was modified to be a no-op (returning
> the same value of imbalance as it gets on the input).
>
> 2) We have also added Mel's netperf-cstate-small-cross-socket test to
> our test battery:
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> Mel told me that he had seen significant performance improvements with
> 5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.
>
> Out of 6 different patches we have tested, your patch has performed
> the best for this scenario. Compared to vanilla, we see minimal
> performance degradation of 2.5% for the udp stream throughput and 0.6%
> for the tcp throughput. The testing was done on a dual-socket system
> with Gold 6132 CPU.
>
> @Mel - could you please test Hillf's patch with your full testing
> suite? So far, it looks very promising, but I would like to check the
> patch thoroughly to make sure it does not hurt performance in other
> areas.
>
> Thanks a lot!
> Jirka
>
>
>
>
>
>
>
>
>
>
>
>
> On Tue, May 19, 2020 at 6:32 AM Hillf Danton  wrote:
> >
> >
> > Hi Jirka
> >
> > On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> > >
> > > We have compared it against kernel with adjust_numa_imbalance disabled
> > > [1], and both kernels perform at the same level for the single-tenant
> > > jobs, but the proposed patch is bad for the multitenancy mode. The
> > > kernel with adjust_numa_imbalance disabled is a clear winner here.
> >
> > Double thanks to you for the tests!
> >
> > > We would be very interested in what others think about disabling
> > > adjust_numa_imbalance function. The patch is below. It would be great
> >
> > A minute...
> >
> > > to collect performance results for different scenarios to make sure
> > > the results are objective.
> >
> > I don't have another test case but a diff trying to confine the tool
> > in question back to the hard-coded 2's field.
> >
> > It's used in the first hunk below to detect imbalance before migrating
> > a task, and a small churn of code is added at another call site when
> > balancing idle CPUs.
> >
> > Thanks
> > Hillf
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
> >  * imbalance that would be overruled by the load balancer.
> >  */
> > if (env->dst_stats.node_type == node_has_spare) {
> > -   unsigned int imbalance;
> > -   int src_running, dst_running;
> > +   unsigned int imbalance = 2;
> >
> > -   /*
> > -* Would movement cause an imbalance? Note that if src has
> > -* more running tasks that the imbalance is ignored as the
> > -* move improves the imbalance from the perspective of the
> > -* CPU load balancer.
> > -* */
> > -  

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-20 Thread Jirka Hladky
Hi Hillf, Mel and all,

thanks for the patch! It has produced really GOOD results.

1) It has fixed the performance problems with the 5.7 vanilla kernel for
single-tenant workloads and low system load scenarios, without
performance degradation for multi-tenant tasks. It's producing the
same results as the previous proof-of-concept patch where the
adjust_numa_imbalance function was modified to be a no-op (returning
the same value of imbalance as it gets on the input).

2) We have also added Mel's netperf-cstate-small-cross-socket test to
our test battery:
https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket

Mel told me that he had seen significant performance improvements with
5.7 over 5.6 for the netperf-cstate-small-cross-socket scenario.

Out of 6 different patches we have tested, your patch has performed
the best for this scenario. Compared to vanilla, we see minimal
performance degradation of 2.5% for the udp stream throughput and 0.6%
for the tcp throughput. The testing was done on a dual-socket system
with Gold 6132 CPU.

@Mel - could you please test Hillf's patch with your full testing
suite? So far, it looks very promising, but I would like to check the
patch thoroughly to make sure it does not hurt performance in other
areas.

Thanks a lot!
Jirka












On Tue, May 19, 2020 at 6:32 AM Hillf Danton  wrote:
>
>
> Hi Jirka
>
> On Mon, 18 May 2020 16:52:52 +0200 Jirka Hladky wrote:
> >
> > We have compared it against kernel with adjust_numa_imbalance disabled
> > [1], and both kernels perform at the same level for the single-tenant
> > jobs, but the proposed patch is bad for the multitenancy mode. The
> > kernel with adjust_numa_imbalance disabled is a clear winner here.
>
> Double thanks to you for the tests!
>
> > We would be very interested in what others think about disabling
> > adjust_numa_imbalance function. The patch is below. It would be great
>
> A minute...
>
> > to collect performance results for different scenarios to make sure
> > the results are objective.
>
> I don't have another test case but a diff trying to confine the tool
> in question back to the hard-coded 2's field.
>
> It's used in the first hunk below to detect imbalance before migrating
> a task, and a small churn of code is added at another call site when
> balancing idle CPUs.
>
> Thanks
> Hillf
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1916,20 +1916,26 @@ static void task_numa_find_cpu(struct ta
>  * imbalance that would be overruled by the load balancer.
>  */
> if (env->dst_stats.node_type == node_has_spare) {
> -   unsigned int imbalance;
> -   int src_running, dst_running;
> +   unsigned int imbalance = 2;
>
> -   /*
> -* Would movement cause an imbalance? Note that if src has
> -* more running tasks that the imbalance is ignored as the
> -* move improves the imbalance from the perspective of the
> -* CPU load balancer.
> -* */
> -   src_running = env->src_stats.nr_running - 1;
> -   dst_running = env->dst_stats.nr_running + 1;
> -   imbalance = max(0, dst_running - src_running);
> -   imbalance = adjust_numa_imbalance(imbalance, src_running);
> +   //No imbalance computed without spare capacity
> +   if (env->dst_stats.node_type != env->src_stats.node_type)
> +   goto check_imb;
> +
> +   imbalance = adjust_numa_imbalance(imbalance,
> +   env->src_stats.nr_running);
> +
> +   //Do nothing without imbalance
> +   if (!imbalance) {
> +   imbalance = 2;
> +   goto check_imb;
> +   }
> +
> +   //Migrate task if it's likely to grow balance
> +   if (env->dst_stats.nr_running + 1 < env->src_stats.nr_running)
> +   imbalance = 0;
>
> +check_imb:
> /* Use idle CPU if there is no imbalance */
> if (!imbalance) {
> maymove = true;
> @@ -9011,12 +9017,13 @@ static inline void calculate_imbalance(s
> env->migration_type = migrate_task;
> env->imbalance = max_t(long, 0, (local->idle_cpus -
>  busiest->idle_cpus) >> 1);
> -   }
>
> -   /* Consider allowing a small imbalance between NUMA groups */
> -   if (env->sd->f

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-18 Thread Jirka Hladky
Hi Hillf,

thanks a lot for your patch!

Compared to 5.7 rc4 vanilla, we observe the following:
  * Single-tenant jobs show improvement up to 15% for SPECjbb2005 and
up to 100% for NAS in low thread mode. In other words, it fixes all
the problems we have reported in this thread.
  * Multitenancy jobs show performance degradation up to 30% for SPECjbb2005

While it fixes problems with single-tenant jobs and with performance
at low system load, it breaks multi-tenant tasks.

We have compared it against a kernel with adjust_numa_imbalance disabled
[1], and both kernels perform at the same level for the single-tenant
jobs, but the proposed patch is bad for the multitenancy mode. The
kernel with adjust_numa_imbalance disabled is a clear winner here.

We would be very interested in what others think about disabling the
adjust_numa_imbalance function. The patch is below. It would be great
to collect performance results for different scenarios to make sure
the results are objective.

Thanks a lot!
Jirka

[1] Patch to test kernel with adjust_numa_imbalance disabled:
===
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b..8c43d29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8907,14 +8907,6 @@ static inline long adjust_numa_imbalance(int
imbalance, int src_nr_running)
{
   unsigned int imbalance_min;

-   /*
-* Allow a small imbalance based on a simple pair of communicating
-* tasks that remain local when the source domain is almost idle.
-*/
-   imbalance_min = 2;
-   if (src_nr_running <= imbalance_min)
-   return 0;
-
   return imbalance;
}
===
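
For reference, the unmodified 5.7 function that the hunk above disables is
effectively the following (reconstructed from the removed lines; only the
allowance logic is shown):

===
static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	unsigned int imbalance_min;

	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 */
	imbalance_min = 2;
	if (src_nr_running <= imbalance_min)
		return 0;

	return imbalance;
}
===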





On Fri, May 8, 2020 at 5:47 AM Hillf Danton  wrote:
>
>
> On Thu, 7 May 2020 13:49:34 Phil Auld wrote:
> >
> > On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > we are not targeting just OMP applications. We see the performance
> > > degradation also for other workloads, like SPECjbb2005 and
> > > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> > > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> > > threads (the system has 64 CPUs in total). We observe this degradation
> > > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > > binaries in parallel, there is no change in performance between 5.6
> > > and 5.7.
> > >
> > > That's why we are asking for the kernel tunable, which we would add to
> > > the tuned profile. We don't expect users to change this frequently but
> > > rather to set the performance profile once based on the purpose of the
> > > server.
> > >
> > > If you could prepare a patch for us, we would be more than happy to
> > > test it extensively. Based on the results, we can then evaluate if
> > > it's the way to go. Thoughts?
> > >
> >
> > I'm happy to spin up a patch once I'm sure what exactly the tuning would
> > effect. At an initial glance I'm thinking it would be the imbalance_min
> > which is currently hardcoded to 2. But there may be something else...
>
> hrm... try to restore the old behavior by skipping task migrate in favor
> of the local node if there is no imbalance.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1928,18 +1928,16 @@ static void task_numa_find_cpu(struct ta
> src_running = env->src_stats.nr_running - 1;
> dst_running = env->dst_stats.nr_running + 1;
> imbalance = max(0, dst_running - src_running);
> -   imbalance = adjust_numa_imbalance(imbalance, src_running);
> +   imbalance = adjust_numa_imbalance(imbalance, src_running +1);
>
> -   /* Use idle CPU if there is no imbalance */
> +   /* No task migrate without imbalance */
> if (!imbalance) {
> -   maymove = true;
> -   if (env->dst_stats.idle_cpu >= 0) {
> -   env->dst_cpu = env->dst_stats.idle_cpu;
> -   task_numa_assign(env, NULL, 0);
> -   return;
> -   }
> +   env->best_cpu = -1;
> +   return;
> }
> -   } else {
> +   }
> +
> +   do {
> long src_load, dst_load, load;
> /*
>  * If the improvement from just moving env->p direction is 
> better
> @@ -1949,7 +1947,7 @@ static void task_n

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-15 Thread Jirka Hladky
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, 
> int wake_flags)
> struct rq_flags rf;
>  #if defined(CONFIG_SMP)
> -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), 
> cpu)) {
> +   if (sched_feat(TTWU_QUEUE)) {
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> ttwu_queue_remote(p, cpu, wake_flags);
> return;

Hi Mel,

we have performance results for the proposed patch above ^^^.
Unfortunately, it hasn't helped the performance.

Jirka


On Wed, May 13, 2020 at 5:30 PM Mel Gorman  wrote:
>
> On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we have tried the kernel with adjust_numa_imbalance() crippled to just
> > return the imbalance it's given.
> >
> > It has solved all the performance problems I have reported.
> > Performance is the same as with 5.6 kernel (before the patch was
> > applied).
> >
> > * solved the performance drop of up to 20% with the single-instance
> > SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> > Rome systems) => this performance drop was INCREASING with higher
> > thread counts (10% for 16 threads and 20% for 32 threads)
> > * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> >
> > Any suggestions on how to proceed? One approach is to turn
> > "imbalance_min" into the kernel tunable. Any other ideas?
> >
> > https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> >
>
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, 
> int wake_flags)
> struct rq_flags rf;
>
>  #if defined(CONFIG_SMP)
> -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), 
> cpu)) {
> +   if (sched_feat(TTWU_QUEUE)) {
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> ttwu_queue_remote(p, cpu, wake_flags);
> return;
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Jirka Hladky
THANK YOU!


On Thu, May 14, 2020 at 1:50 PM Mel Gorman  wrote:
>
> On Thu, May 14, 2020 at 12:22:05PM +0200, Jirka Hladky wrote:
> > Thanks!
> >
> > Do you have a link? I cannot find it on github
> > (https://github.com/gormanm/mmtests, searched for
> > config-network-netperf-cstate-small-cross-socket)
> >
>
> https://github.com/gormanm/mmtests/blob/master/configs/config-network-netperf-cstate-small-cross-socket
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-14 Thread Jirka Hladky
Thanks!

Do you have a link? I cannot find it on github
(https://github.com/gormanm/mmtests, searched for
config-network-netperf-cstate-small-cross-socket)


On Thu, May 14, 2020 at 12:08 PM Mel Gorman  wrote:
>
> On Thu, May 14, 2020 at 11:58:36AM +0200, Jirka Hladky wrote:
> > Thank you, Mel!
> >
> > We are using netperf as well, but AFAIK it's running on two different
> > hosts. I will check with colleagues, if they can
> > add network-netperf-unbound run on the localhost.
> >
> > Is this the right config?
> > https://github.com/gormanm/mmtests/blob/345f82bee77cbf09ba57f470a1cfc1ae413c97df/bin/generate-generic-configs
> > sed -e 's/NETPERF_BUFFER_SIZES=.*/NETPERF_BUFFER_SIZES=64/'
> > config-network-netperf-unbound > config-network-netperf-unbound-small
> >
>
> That's one I was using at the moment to have a quick test after
> the reconciliation series was completed. It has since changed to
> config-network-netperf-cstate-small-cross-socket to limit cstates, bind
> the client and server to two local CPUs and using one buffer size. It
> was necessary to get an ftrace function graph of the wakeup path that
> was readable and not too noisy due to migrations, cpuidle exit costs etc.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-13 Thread Jirka Hladky
Thank you, Mel!

I think I have to make sure we cover the scenario you have targeted
when developing adjust_numa_imbalance:

===
https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8910

/*
* Allow a small imbalance based on a simple pair of communicating
* tasks that remain local when the source domain is almost idle.
*/
===

Could you point me to a benchmark for this scenario? I have checked
https://github.com/gormanm/mmtests
and we use lots of the same benchmarks but I'm not sure if we cover
this particular scenario.

Jirka


On Wed, May 13, 2020 at 5:30 PM Mel Gorman  wrote:
>
> On Wed, May 13, 2020 at 04:57:15PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we have tried the kernel with adjust_numa_imbalance() crippled to just
> > return the imbalance it's given.
> >
> > It has solved all the performance problems I have reported.
> > Performance is the same as with 5.6 kernel (before the patch was
> > applied).
> >
> > * solved the performance drop of up to 20% with the single-instance
> > SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
> > Rome systems) => this performance drop was INCREASING with higher
> > thread counts (10% for 16 threads and 20% for 32 threads)
> > * solved the performance drop for low load scenarios (SPECjvm2008 and NAS)
> >
> > Any suggestions on how to proceed? One approach is to turn
> > "imbalance_min" into the kernel tunable. Any other ideas?
> >
> > https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914
> >
>
> Complete shot in the dark but restore adjust_numa_imbalance() and try
> this
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..0b31f4468d5b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, 
> int wake_flags)
> struct rq_flags rf;
>
>  #if defined(CONFIG_SMP)
> -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), 
> cpu)) {
> +   if (sched_feat(TTWU_QUEUE)) {
> sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> ttwu_queue_remote(p, cpu, wake_flags);
> return;
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-13 Thread Jirka Hladky
Hi Mel,

we have tried the kernel with adjust_numa_imbalance() crippled to just
return the imbalance it's given.

It has solved all the performance problems I have reported.
Performance is the same as with the 5.6 kernel (before the patch was
applied).

* solved the performance drop of up to 20% with the single-instance
SPECjbb2005 benchmark on 8 NUMA node servers (particularly on AMD EPYC
Rome systems) => this performance drop was INCREASING with higher
thread counts (10% for 16 threads and 20% for 32 threads)
* solved the performance drop for low load scenarios (SPECjvm2008 and NAS)

Any suggestions on how to proceed? One approach is to turn
"imbalance_min" into a kernel tunable. Any other ideas?

https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8914

Thanks a lot!
Jirka






On Fri, May 8, 2020 at 12:40 PM Jirka Hladky  wrote:
>
> Hi Mel,
>
> thanks for hints! We will try it.
>
> @Phil - could you please prepare a kernel build for me to test?
>
> Thank you!
> Jirka
>
> On Fri, May 8, 2020 at 11:22 AM Mel Gorman  
> wrote:
>>
>> On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
>> > Hi Mel,
>> >
>> > we are not targeting just OMP applications. We see the performance
>> > degradation also for other workloads, like SPECjbb2005 and
>> > SPECjvm2008. Even worse, it also affects a higher number of threads.
>> > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
>> > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
>> > threads (the system has 64 CPUs in total). We observe this degradation
>> > only when we run a single SPECjbb binary. When running 4 SPECjbb
>> > binaries in parallel, there is no change in performance between 5.6
>> > and 5.7.
>> >
>>
>> Minimally I suggest confirming that it's really due to
>> adjust_numa_imbalance() by making the function a no-op and retesting.
>> I have found odd artifacts with it but I'm unsure how to proceed without
>> causing problems elsewhere.
>>
>> For example, netperf on localhost in some cases reported a regression
>> when the client and server were running on the same node. The problem
>> appears to be that netserver completes its work faster when running
>> local and goes idle more regularly. The cost of going idle and waking up
>> builds up and a lower throughput is reported but I'm not sure if gaming
>> an artifact like that is a good idea.
>>
>> > That's why we are asking for the kernel tunable, which we would add to
>> > the tuned profile. We don't expect users to change this frequently but
>> > rather to set the performance profile once based on the purpose of the
>> > server.
>> >
>> > If you could prepare a patch for us, we would be more than happy to
>> > test it extensively. Based on the results, we can then evaluate if
>> > it's the way to go. Thoughts?
>> >
>>
>> I would suggest simply disabling that function first to ensure that is
>> really what is causing problems for you.
>>
>> --
>> Mel Gorman
>> SUSE Labs
>>
>
>
> --
> -Jirka



-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-08 Thread Jirka Hladky
Hi Mel,

thanks for the hints! We will try it.

@Phil - could you please prepare a kernel build for me to test?

Thank you!
Jirka

On Fri, May 8, 2020 at 11:22 AM Mel Gorman  wrote:
>
> On Thu, May 07, 2020 at 06:29:44PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > we are not targeting just OMP applications. We see the performance
> > degradation also for other workloads, like SPECjbb2005 and
> > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> > server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> > threads (the system has 64 CPUs in total). We observe this degradation
> > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > binaries in parallel, there is no change in performance between 5.6
> > and 5.7.
> >
>
> Minimally I suggest confirming that it's really due to
> adjust_numa_imbalance() by making the function a no-op and retesting.
> I have found odd artifacts with it but I'm unsure how to proceed without
> causing problems elsewhere.
>
> For example, netperf on localhost in some cases reported a regression
> when the client and server were running on the same node. The problem
> appears to be that netserver completes its work faster when running
> local and goes idle more regularly. The cost of going idle and waking up
> builds up and a lower throughput is reported but I'm not sure if gaming
> an artifact like that is a good idea.
>
> > That's why we are asking for the kernel tunable, which we would add to
> > the tuned profile. We don't expect users to change this frequently but
> > rather to set the performance profile once based on the purpose of the
> > server.
> >
> > If you could prepare a patch for us, we would be more than happy to
> > test it extensively. Based on the results, we can then evaluate if
> > it's the way to go. Thoughts?
> >
>
> I would suggest simply disabling that function first to ensure that is
> really what is causing problems for you.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-07 Thread Jirka Hladky
Hi Mel,

we are not targeting just OMP applications. We also see the performance
degradation for other workloads, like SPECjbb2005 and
SPECjvm2008. Even worse, it also affects higher numbers of threads.
For example, comparing 5.7.0-0.rc2 against the 5.6 kernel, on a 4-NUMA
server with 2x AMD 7351 CPUs, we see a performance degradation of 22% for 32
threads (the system has 64 CPUs in total). We observe this degradation
only when we run a single SPECjbb binary. When running 4 SPECjbb
binaries in parallel, there is no change in performance between 5.6
and 5.7.

That's why we are asking for a kernel tunable, which we would add to
the tuned profile. We don't expect users to change this frequently, but
rather to set the performance profile once based on the purpose of the
server.
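
To illustrate what we have in mind (a rough sketch only - the sysctl name
and the wiring are made up here, this is not the patch we are asking for),
the hard-coded threshold could be replaced by a tunable along these lines:

===
/* Hypothetical sketch: expose the NUMA imbalance threshold as a sysctl
 * instead of the value 2 hard-coded in kernel/sched/fair.c.  The name
 * sysctl_sched_numa_imbalance_min is invented for illustration; the
 * sysctl table entry is omitted. */
unsigned int sysctl_sched_numa_imbalance_min = 2;

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 * A profile (e.g. via tuned) could lower this to favour spreading
	 * for memory-bandwidth-bound, low-thread-count workloads.
	 */
	if (src_nr_running <= sysctl_sched_numa_imbalance_min)
		return 0;

	return imbalance;
}
===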

If you could prepare a patch for us, we would be more than happy to
test it extensively. Based on the results, we can then evaluate if
it's the way to go. Thoughts?

Thanks a lot!
Jirka

On Thu, May 7, 2020 at 5:54 PM Mel Gorman  wrote:
>
> On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > > servers).
> > >
> > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > be left idle.
> >
> > we have discussed today with my colleagues the performance drop for
> > some workloads at low thread counts (roughly up to 2x the number of NUMA
> > nodes). We are worried that it can be a severe issue for some use
> > cases, which require a full memory bandwidth even when only part of
> > CPUs is used.
> >
> > We understand that scheduler cannot distinguish this type of workload
> > from others automatically. However, there was an idea for a * new
> > kernel tunable to control the imbalance threshold *. Based on the
> > purpose of the server, users could set this tunable. See the tuned
> > project, which allows creating performance profiles [1].
> >
>
> I'm not completely opposed to it but given that the setting is global,
> I imagine it could have other consequences if two applications ran
> at different times have different requirements. Given that it's OMP,
> I would have imagined that an application that really cared about this
> would specify what was needed using OMP_PLACES. Why would someone prefer
> kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> specific knowledge of the application even to know that a particular
> tuned profile is needed.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-07 Thread Jirka Hladky
Hi Mel,

> > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > servers).
>
> Ok, so we know it's within the imbalance threshold where a NUMA node can
> be left idle.

we have discussed today with my colleagues the performance drop for
some workloads at low thread counts (roughly up to 2x the number of NUMA
nodes). We are worried that it can be a severe issue for some use
cases which require full memory bandwidth even when only part of the
CPUs is used.

We understand that the scheduler cannot distinguish this type of workload
from others automatically. However, there was an idea for a *new
kernel tunable to control the imbalance threshold*. Based on the
purpose of the server, users could set this tunable. See the tuned
project, which allows creating performance profiles [1].

What do you think about this approach?

Thanks a lot!
Jirka

[1] https://tuned-project.org


On Fri, Mar 20, 2020 at 5:38 PM Mel Gorman  wrote:
>
> On Fri, Mar 20, 2020 at 04:30:08PM +0100, Jirka Hladky wrote:
> > >
> > > MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> > > gain on a 4-node machine for bt_C and a 3.88% regression on 8 nodes. I
> > > think it must be OMP you are using because I found I had to disable UA
> > > for MPI at some point in the past for reasons I no longer remember.
> >
> >
> > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number of
> > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > servers).
> >
>
> Ok, so we know it's within the imbalance threshold where a NUMA node can
> be left idle.
>
> > One possibility would be to spread wide always at clone time and assume
> > > wake_affine will pull related tasks but it's fragile because it breaks
> > > if the cloned task execs and then allocates memory from a remote node
> > > only to migrate to a local node immediately.
> >
> >
> > I think the only way to find out how it performs is to test it. If you
> > could prepare a patch like that, I'm more than happy to give it a try!
> >
>
> When the initial spreading was prevented, it was for pipelines mainly --
> even basic shell scripts. In that case it was observed that a shell would
> fork/exec two tasks connected via pipe that started on separate nodes and
> had allocated remote data before being pulled close. The processes were
> typically too short-lived for NUMA balancing to fix it up, and by exec time
> the information on where the fork happened was lost.  See 2c83362734da
> ("sched/fair: Consider SD_NUMA when selecting the most idle group to
> schedule on"). Now the logic has probably been partially broken since then
> because of how SD_NUMA is now treated, but the concern about spreading
> wide prematurely remains.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka



Group Imbalance Bug - performance drop by factor 10x on NUMA boxes with cgroups

2018-10-27 Thread Jirka Hladky
Hi Mel and Srikar,

I would like to ask you if you could look into the Group Imbalance Bug
described in this paper

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

in chapter 3.1. See also comment [1]. The paper describes the bug on a
workload which involves different ssh sessions and assumes
kernel.sched_autogroup_enabled=1. We have found out that it can be
reproduced more easily with cgroups.

The reproducer consists of this workload:
* 2 separate "stress --cpu 1" processes. Each stress process needs 1 CPU.
* the NAS benchmark (https://www.nas.nasa.gov/publications/npb.html), from
which I use the lu.C.x binary (Lower-Upper Gauss-Seidel solver) in
Open Multi-Processing (OMP) mode.

We run the workload in two modes:

NORMAL - both stress and lu.C.x are run in the same control group
GROUP  - each binary is run in a separate control group:
stress, first instance: cpu:test_group_1
stress, second instance: cpu:test_group_2
lu.C.x: cpu:test_group_main

I run lu.C.x with different numbers of threads - for example, on a 4-node
NUMA server with 4x Xeon Gold 6126 CPUs (96 CPUs in total) I run lu.C.x
with 72, 80, 88, and 92 threads. Since the server has 96 CPUs in
total, even with 92 threads for lu.C.x and two stress processes the server
is still not fully loaded.

Here are the runtimes in seconds for lu.C.x for different numbers of threads:

#Threads  NORMAL  GROUP
72        21.27   30.01
80        15.32   164
88        17.91   367
92        19.22   432

As you can see, already for 72 threads lu.C.x is significantly slower
when executed in a dedicated cgroup, and it gets much worse with an
increasing number of threads (slowdown by a factor of 10x and greater).

Some more details are below.

Please let me know if it sounds interesting and if you would like to
look into it. I can provide you with the reproducer plus some
supplementary python scripts to further analyze the results.

Thanks a lot!
Jirka

Some more details on the case with 80 threads for lu.C.x and 2 stress
processes, run on a 96-CPU server with 4 NUMA nodes.

Analyzing ps output is very interesting (here for 5 subsequent runs of
the workload):

Average number of threads scheduled for NUMA NODE          0      1      2      3

lu.C.x_80_NORMAL_1.ps.numa.hist         Average    21.25  21.00  19.75  18.00
lu.C.x_80_NORMAL_1.stress.ps.numa.hist  Average     1.00   1.00
lu.C.x_80_NORMAL_2.ps.numa.hist         Average    20.50  20.75  18.00  20.75
lu.C.x_80_NORMAL_2.stress.ps.numa.hist  Average     1.00   0.75   0.25
lu.C.x_80_NORMAL_3.ps.numa.hist         Average    21.75  22.00  18.75  17.50
lu.C.x_80_NORMAL_3.stress.ps.numa.hist  Average     1.00   1.00
lu.C.x_80_NORMAL_4.ps.numa.hist         Average    21.50  21.00  18.75  18.75
lu.C.x_80_NORMAL_4.stress.ps.numa.hist  Average     1.00   1.00
lu.C.x_80_NORMAL_5.ps.numa.hist         Average    18.00  23.33  19.33  19.33
lu.C.x_80_NORMAL_5.stress.ps.numa.hist  Average     1.00   1.00


As you can see, in NORMAL mode lu.C.x is uniformly scheduled over NUMA nodes.

Compare it with cgroups mode:

Average number of threads scheduled for NUMA NODE          0      1      2      3

lu.C.x_80_GROUP_1.ps.numa.hist          Average    13.05  13.54  27.65  25.76
lu.C.x_80_GROUP_1.stress.ps.numa.hist   Average     1.00   1.00
lu.C.x_80_GROUP_2.ps.numa.hist          Average    12.18  14.85  27.56  25.41
lu.C.x_80_GROUP_2.stress.ps.numa.hist   Average     1.00   1.00
lu.C.x_80_GROUP_3.ps.numa.hist          Average    15.32  13.23  26.52  24.94
lu.C.x_80_GROUP_3.stress.ps.numa.hist   Average     1.00   1.00
lu.C.x_80_GROUP_4.ps.numa.hist          Average    13.82  14.86  25.64  25.68
lu.C.x_80_GROUP_4.stress.ps.numa.hist   Average     1.00   1.00
lu.C.x_80_GROUP_5.ps.numa.hist          Average    15.12  13.03  25.12  26.73
lu.C.x_80_GROUP_5.stress.ps.numa.hist   Average     1.00   1.00

In cgroup mode, the scheduler is moving lu.C.x away from nodes #0
and #1, where the stress processes are running. It does this to such an
extent that NUMA nodes #2 and #3 are overcommitted - these nodes have
more NAS threads scheduled than CPUs available (there are 24 CPUs in
each NUMA node).

Here is the detailed report:
$more lu.C.x_80_GROUP_1.ps.numa.hist
#Date                  NUMA 0  NUMA 1  NUMA 2  NUMA 3
2018-Oct-27_04h39m57s   6       7       37      30
2018-Oct-27_04h40m02s   16      15      23      26
2018-Oct-27_04h40m08s   13      12      27      28
2018-Oct-27_04h40m13s   9       15      29      27
2018-Oct-27_04h40m18s   16      13      27      24
2018-Oct-27_04h40m23s   16      14      25      25
2018-Oct-27_04h40m28s   16      15      24      25
2018-Oct-27_04h40m33s   10      11      34      25
2018-Oct-27_04h40m38s   16      13      25      26
2018-Oct-27_04h40m43s   10      10      32      28

Re: [SCHEDULER] Performance drop in 4.19 compared to 4.18 kernel

2018-09-17 Thread Jirka Hladky
Resending in plain text mode.

> I'm travelling at the moment but when I get back, I'll see what's in the
> tip tree with respect to Srikar's patches and then rebase the fast-migration
> patches on top and reconfirm they still behave as expected. Assuming
> they do, I'll resend them.


Sounds great, thank you!

If you want me to retest the rebased patch set, just let me know, we
would be more than happy to run the tests on our side.

Jirka

On Mon, Sep 17, 2018 at 3:06 PM, Mel Gorman  wrote:
> On Fri, Sep 14, 2018 at 04:50:20PM +0200, Jirka Hladky wrote:
>> Hi Peter and Srikar,
>>
>> > I have bounced the 5 patches to you, (one of the 6 has not been applied by
>> > Peter) so I have skipped that.
>> > They can also be fetched from
>> > http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-sri...@linux.vnet.ibm.com
>>
>> I'm sorry for the delay; we finally have the results for the above
>> kernel. The performance results look good compared to 4.19 vanilla and
>> are about the same as Mel's sched-numa-fast-crossnode-v1r12 patch set
>> for 4.18:
>>
>> Compared to kernel-4.19.0-0.rc1.1
>>
>>   * Improvement up to 20% for the SPECjbb2005 and SPECjvm2008 benchmarks
>>   * Improvement up to 50% for the stream benchmark
>>   * Improvement up to 100% for the NAS benchmark (sp_C subtest, 8
>> threads on a 4 NUMA system with 4x E5-4610 v2 @ 2.30GHz, 64 cores in
>> total)
>>
>> When I compare it against Mel's patchset
>> (git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>> sched-numa-fast-crossnode-v1r12)
>>
>>   * Mel's kernel is about 15% faster for stream benchmarks
>>   * The other benchmarks show very similar results with both kernels
>>
>> Mel's patchset eliminates NUMA migration rate limits; this is
>> presumably the reason for the good stream results.
>>
>> Do you have any update when the current patchset could be merged into
>> the upstream kernel?
>>
>
> I'm travelling at the moment but when I get back, I'll see what's in the
> tip tree with respect to Srikar's patches and then rebase the fast-migration
> patches on top and reconfirm they still behave as expected. Assuming
> they do, I'll resend them.



Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

2018-09-14 Thread Jirka Hladky
Hi Mel,

we have tried to revert the following 2 commits:

305c1fac3225
2d4056fafa196e1ab

We had to revert 10864a9e222048a862da2c21efa28929a4dfed15 as well.

The performance of the kernel was better than when only
2d4056fafa196e1ab was reverted, but still worse than the performance of
the 4.18 kernel.

Since the patch series from Srikar shows very good results, we will
wait until it's merged into the mainline kernel and stop the bisecting
efforts for now. Your patch series sched-numa-fast-crossnode-v1r12 (on
top of 4.18) is in some cases giving slightly better results than
Srikar's series, so it would be really great if both series could be
merged together. Removing the NUMA migration rate limit helps performance.

Thanks a lot for your help on this!
Jirka


On Fri, Sep 7, 2018 at 10:09 AM, Jirka Hladky  wrote:
>> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
>> condition in terms of idle CPU handling that has been problematic.
>
>
> We will try that, thanks!
>
>>  I would suggest contacting Srikar directly.
>
>
> I will do that right away. Whom should I put on Cc? Just you and
> linux-kernel@vger.kernel.org ? Should I put Ingo and Peter on Cc as
> well?
>
> $scripts/get_maintainer.pl -f kernel/sched
> Ingo Molnar  (maintainer:SCHEDULER)
> Peter Zijlstra  (maintainer:SCHEDULER)
> linux-kernel@vger.kernel.org (open list:SCHEDULER)
>
> Jirka
>
> On Thu, Sep 6, 2018 at 2:58 PM, Mel Gorman  
> wrote:
>> On Thu, Sep 06, 2018 at 10:16:28AM +0200, Jirka Hladky wrote:
>>> Hi Mel,
>>>
>>> we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.
>>>
>>>   * Compared to 4.18, there is still performance regression -
>>> especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
>>> systems, regression is around 10-15%
>>>   * Compared to 4.19rc1 there is a clear gain across all benchmarks around 
>>> 20%
>>>
>>
>> Ok.
>>
>>> While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
>>> lot there is another issue as well. Could you please recommend some
>>> commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?
>>>
>>
>> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
>> condition in terms of idle CPU handling that has been problematic.
>>
>>> Regarding the current results, how do we proceed? Could you please
>>> contact Srikar and ask for the advice or should we contact him
>>> directly?
>>>
>>
>> I would suggest contacting Srikar directly. While I'm working on a
>> series that touches off some similar areas, there is no guarantee it'll
>> be a success as I'm not primarily upstream focused at the moment.
>>
>> Restarting the thread would also end up with a much more sensible cc
>> list.
>>
>> --
>> Mel Gorman
>> SUSE Labs



Re: [SCHEDULER] Performance drop in 4.19 compared to 4.18 kernel

2018-09-14 Thread Jirka Hladky
Hi Peter and Srikar,

> I have bounced the 5 patches to you, (one of the 6 has not been applied by
> Peter) so I have skipped that.
> They can also be fetched from
> http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-sri...@linux.vnet.ibm.com

I'm sorry for the delay; we finally have the results for the above
kernel. The performance results look good compared to 4.19 vanilla and
are about the same as with Mel's sched-numa-fast-crossnode-v1r12 patch
set for 4.18:

Compared to kernel-4.19.0-0.rc1.1

  * Improvement up to 20% for the SPECjbb2005 and SPECjvm2008 benchmarks
  * Improvement up to 50% for the stream benchmark
  * Improvement up to 100% for the NAS benchmark (sp_C subtest, 8
threads on a 4-NUMA-node system with 4x E5-4610 v2 @ 2.30GHz, 64 cores
in total)

When I compare it against Mel's patchset
(git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
sched-numa-fast-crossnode-v1r12)

  * Mel's kernel is about 15% faster for stream benchmarks
  * The other benchmarks show very similar results with both kernels

Mel's patchset eliminates the NUMA migration rate limits, which is
presumably the reason for the good stream results.

Do you have any update on when the current patchset could be merged
into the upstream kernel?

Thanks a lot!
Jirka

On Sun, Sep 9, 2018 at 4:03 PM, Jirka Hladky  wrote:
>
> Hi Peter and Srikar,
>
> thanks a lot for the information and for the patches to test!
>
> > I have bounced the 5 patches to you, (one of the 6 has not been applied by
> > Peter) so I have skipped that.
> > They can also be fetched from
> > http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-sri...@linux.vnet.ibm.com
>
> We have started the benchmarks, I will report the results on Monday.
>
> > I generally run specjbb2005 (single and multi instance).
> We also run a single and multi-instance specjbb2005 test.
>
> > I have tried running NAS but I couldn't set it up properly.
> We run the OMP variant and we control the number of threads with the
> OMP_NUM_THREADS env variable. The setup is quite simple:
>
> cd NPB_sources/config/
> mv suite_x86_64.def suite.def
> cd ..
> make suite
>
> FYI - starting from 4.17 kernel there is a significant performance
> drop compared to 4.16 kernel. Mel has come up with a
> sched-numa-fast-crossnode-v1r12 patch series
> git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git which we have
> tested extensively and with it, the benchmarks results are back at
> 4.16 level. As I understand it, Mel's patch series depends on your
> patch series and can be only merged when your patches are completed.
>
> Thanks!
> Jirka
>
>
> On Fri, Sep 7, 2018 at 3:52 PM, Peter Zijlstra  wrote:
> >
> > On Fri, Sep 07, 2018 at 07:14:20PM +0530, Srikar Dronamraju wrote:
> > > * Peter Zijlstra  [2018-09-07 15:19:23]:
> > >
> > > > On Fri, Sep 07, 2018 at 06:26:49PM +0530, Srikar Dronamraju wrote:
> > > >
> > > > > Can you please pick
> > > > >
> > > > >
> > > > > 1. 69bb3230297e881c797bbc4b3dbf73514078bc9d sched/numa: Stop multiple 
> > > > > tasks
> > > > > from moving to the cpu at the same time
> > > > > 2. dc62cfdac5e5b7a61cd8a2bd4190e80b9bb408fc sched/numa: Avoid task 
> > > > > migration
> > > > > for small numa improvement
> > > > > 3. 76e18a67cdd9e3609716c8a074c03168734736f9 sched/numa: Pass 
> > > > > destination cpu as
> > > > > a parameter to migrate_task_rq
> > > > > 4.  489c19b440ebdbabffe530b9a41389d0a8b315d9 sched/numa: Reset scan 
> > > > > rate
> > > > > whenever task moves across nodes
> > > > > 5.  b7e9ae1ae3825f35cd0f38f1f0c8e91ea145bc30 sched/numa: Limit the
> > > > > conditions where scan period is reset
> > > > >
> > > > > from 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched
> > > >
> > > > That is not a stable tree; the whole thing is re-generated from my quilt
> > > > set every time I feel like it.
> > > >
> > > > It is likely those commit ids will no longer exist in a few hours.
> > > >
> > >
> > > Okay, I will forward the relevant mails to Jirka.
> >
> > Or he can click on that link and find new IDs :-)


Re: [SCHEDULER] Performance drop in 4.19 compared to 4.18 kernel

2018-09-09 Thread Jirka Hladky
Hi Peter and Srikar,

thanks a lot for the information and for the patches to test!

> I have bounced the 5 patches to you, (one of the 6 has not been applied by
> Peter) so I have skipped that.
> They can also be fetched from
> http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-sri...@linux.vnet.ibm.com

We have started the benchmarks, I will report the results on Monday.

> I generally run specjbb2005 (single and multi instance).
We also run single-instance and multi-instance specjbb2005 tests.

> I have tried running NAS but I couldn't set it up properly.
We run the OMP variant and control the number of threads with the
OMP_NUM_THREADS env variable. The setup is quite simple:

cd NPB_sources/config/
mv suite_x86_64.def suite.def
cd ..
make suite
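
A typical run then looks roughly like this (a sketch only - the
16-thread count, the bin/ path and the sp.C.x binary are just examples;
run whatever suite.def built):

# pick the thread count, then start one of the built NPB binaries
export OMP_NUM_THREADS=16
./bin/sp.C.x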

FYI - starting from the 4.17 kernel there is a significant performance
drop compared to the 4.16 kernel. Mel has come up with the
sched-numa-fast-crossnode-v1r12 patch series
(git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git), which we have
tested extensively; with it, the benchmark results are back at the
4.16 level. As I understand it, Mel's patch series depends on your
patch series and can only be merged once your patches are completed.
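
In case you want to try it, the branch can be checked out roughly like
this (a sketch; the remote URL and the branch name are simply the ones
mentioned above):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
cd linux
git checkout sched-numa-fast-crossnode-v1r12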

Thanks!
Jirka


On Fri, Sep 7, 2018 at 3:52 PM, Peter Zijlstra  wrote:
>
> On Fri, Sep 07, 2018 at 07:14:20PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra  [2018-09-07 15:19:23]:
> >
> > > On Fri, Sep 07, 2018 at 06:26:49PM +0530, Srikar Dronamraju wrote:
> > >
> > > > Can you please pick
> > > >
> > > >
> > > > 1. 69bb3230297e881c797bbc4b3dbf73514078bc9d sched/numa: Stop multiple 
> > > > tasks
> > > > from moving to the cpu at the same time
> > > > 2. dc62cfdac5e5b7a61cd8a2bd4190e80b9bb408fc sched/numa: Avoid task 
> > > > migration
> > > > for small numa improvement
> > > > 3. 76e18a67cdd9e3609716c8a074c03168734736f9 sched/numa: Pass 
> > > > destination cpu as
> > > > a parameter to migrate_task_rq
> > > > 4.  489c19b440ebdbabffe530b9a41389d0a8b315d9 sched/numa: Reset scan rate
> > > > whenever task moves across nodes
> > > > 5.  b7e9ae1ae3825f35cd0f38f1f0c8e91ea145bc30 sched/numa: Limit the
> > > > conditions where scan period is reset
> > > >
> > > > from 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched
> > >
> > > That is not a stable tree; the whole thing is re-generated from my quilt
> > > set every time I feel like it.
> > >
> > > It is likely those commit ids will no longer exist in a few hours.
> > >
> >
> > Okay, I will forward the relevant mails to Jirka.
>
> Or he can click on that link and find new IDs :-)


[SCHEDULER] Performance drop in 4.19 compared to 4.18 kernel

2018-09-07 Thread Jirka Hladky
Hi Srikar,

I work at Red Hat in the Kernel Performance Team. I would like to ask
you for help.

We have detected a significant performance drop (20% and more) with
4.19rc1 relative to 4.18 vanilla. We see the regression on various
2-NUMA-node and 4-NUMA-node boxes with pretty much all the benchmarks
we use - NAS, Stream, SPECjbb2005, SPECjvm2008.

Mel Gorman has suggested checking commit
2d4056fafa196e1ab4e7161bae4df76f9602d56d - with it reverted we got some
of the performance back, but not all of it:

  * Compared to 4.18, there is still a performance regression -
especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4-NUMA-node
systems, the regression is around 10-15%.
  * Compared to 4.19rc1 there is a clear gain across all benchmarks, up to 20%.

We are investigating the issue further; Mel has suggested checking
305c1fac3225dfa7eeb89bfe91b7335a6edd5172 next.

Do you have any further recommendations on which commits could have
caused the performance degradation?

I would like to discuss with you how we can collaborate on performance
testing for the upstream kernel. Does your testing also show a
performance drop in 4.19? If so, do you have any plans for a fix? If
not, can we send you some more information about our tests so that you
can try to reproduce the issue?

We would also be more than happy to test new patches for performance -
please let us know if you are interested. We have a pool of boxes with
1 up to 8 NUMA nodes for that, both AMD and Intel, covering different
CPU generations from Sandy Bridge to Skylake.

I'm looking forward to hearing from you.
Jirka


Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

2018-09-07 Thread Jirka Hladky
> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
> condition in terms of idle CPU handling that has been problematic.


We will try that, thanks!

>  I would suggest contacting Srikar directly.


I will do that right away. Whom should I put on Cc? Just you and
linux-kernel@vger.kernel.org? Should I put Ingo and Peter on Cc as
well?

$scripts/get_maintainer.pl -f kernel/sched
Ingo Molnar  (maintainer:SCHEDULER)
Peter Zijlstra  (maintainer:SCHEDULER)
linux-kernel@vger.kernel.org (open list:SCHEDULER)

Jirka

On Thu, Sep 6, 2018 at 2:58 PM, Mel Gorman  wrote:
> On Thu, Sep 06, 2018 at 10:16:28AM +0200, Jirka Hladky wrote:
>> Hi Mel,
>>
>> we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.
>>
>>   * Compared to 4.18, there is still performance regression -
>> especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
>> systems, regression is around 10-15%
>>   * Compared to 4.19rc1 there is a clear gain across all benchmarks around 
>> 20%
>>
>
> Ok.
>
>> While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
>> lot there is another issue as well. Could you please recommend some
>> commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?
>>
>
> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
> condition in terms of idle CPU handling that has been problematic.
>
>> Regarding the current results, how do we proceed? Could you please
>> contact Srikar and ask for the advice or should we contact him
>> directly?
>>
>
> I would suggest contacting Srikar directly. While I'm working on a
> series that touches off some similar areas, there is no guarantee it'll
> be a success as I'm not primarily upstream focused at the moment.
>
> Restarting the thread would also end up with a much more sensible cc
> list.
>
> --
> Mel Gorman
> SUSE Labs


Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

2018-09-06 Thread Jirka Hladky
Hi Mel,

we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.

  * Compared to 4.18, there is still a performance regression -
especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4-NUMA-node
systems, the regression is around 10-15%.
  * Compared to 4.19rc1 there is a clear gain of around 20% across all benchmarks.

While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
lot, there seems to be another issue as well. Could you please
recommend some commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d
to try?

Regarding the current results, how do we proceed? Could you please
contact Srikar and ask for advice, or should we contact him
directly?

Thanks a lot!
Jirka

On Tue, Sep 4, 2018 at 12:07 PM, Jirka Hladky  wrote:
> Hi Mel,
>
> thanks for sharing the background information! We will check if
> 2d4056fafa196e1ab4e7161bae4df76f9602d56d is causing the current
> regression in 4.19 rc1 and let you know the outcome.
>
> Jirka
>
> On Tue, Sep 4, 2018 at 11:00 AM, Mel Gorman  
> wrote:
>> On Mon, Sep 03, 2018 at 05:07:15PM +0200, Jirka Hladky wrote:
>>> Resending in the plain text mode.
>>>
>>> > My own testing completed and the results are within expectations and I
>>> > saw no red flags. Unfortunately, I consider it unlikely they'll be merged
>>> > for 4.18. Srikar Dronamraju's series is likely to need another update
>>> > and I would need to rebase my patches on top of that. Given the scope
>>> > and complexity, I find it unlikely they would be accepted for an -rc,
>>> > particularly this late of an rc. Whether we hit the 4.19 merge window or
>>> > not will depend on when Srikar's series gets updated.
>>>
>>>
>>> Hi Mel,
>>>
>>> we have collaborated back in July on the scheduler patch, improving
>>> the performance by allowing faster memory migration. You came up with
>>> the "sched-numa-fast-crossnode-v1r12" series here:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>>>
>>> which has shown good performance results both in your and our testing.
>>>
>>
>> I remember.
>>
>>> Do you have some update on the latest status? Is there any plan to
>>> merge this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1
>>> and based on the results it seems that the patch is not included (and
>>> I don't see it listed in  git shortlog v4.18..v4.19-rc1
>>> ./kernel/sched)
>>>
>>
>> Srikar's series that mine depended upon was only partially merged due to
>> a review bottleneck. He posted a v2 but it was during the merge window
>> and likely will need a v3 to avoid falling through the cracks. When it
>> is merged, I'll rebase my series on top and post it. While I didn't
>> check against 4.19-rc1, I did find that rebasing on top of the partial
>> series in 4.18 did not have as big an improvement.
>>
>>> With 4.19rc1 we see performance drop
>>>   * up to 40% (NAS bench) relatively to  4.18 + 
>>> sched-numa-fast-crossnode-v1r12
>>>   * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 
>>> vanilla
>>> The performance is dropping. It's quite unclear what are the next
>>> steps - should we wait for "sched-numa-fast-crossnode-v1r12" to be
>>> merged or should we start looking at what has caused the drop in
>>> performance going from 4.19rc1 to 4.18?
>>>
>>
>> Both are valid options. If you take the latter option, I suggest looking
>> at whether 2d4056fafa196e1ab4e7161bae4df76f9602d56d is the source of the
>> issue as at least one auto-bisection found that it may be problematic.
>> Whether it is an issue or not depends heavily on the number of threads
>> relative to a socket size.
>>
>> --
>> Mel Gorman
>> SUSE Labs


Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

2018-09-04 Thread Jirka Hladky
Hi Mel,

thanks for sharing the background information! We will check if
2d4056fafa196e1ab4e7161bae4df76f9602d56d is causing the current
regression in 4.19 rc1 and let you know the outcome.

Jirka

On Tue, Sep 4, 2018 at 11:00 AM, Mel Gorman  wrote:
> On Mon, Sep 03, 2018 at 05:07:15PM +0200, Jirka Hladky wrote:
>> Resending in the plain text mode.
>>
>> > My own testing completed and the results are within expectations and I
>> > saw no red flags. Unfortunately, I consider it unlikely they'll be merged
>> > for 4.18. Srikar Dronamraju's series is likely to need another update
>> > and I would need to rebase my patches on top of that. Given the scope
>> > and complexity, I find it unlikely they would be accepted for an -rc,
>> > particularly this late of an rc. Whether we hit the 4.19 merge window or
>> > not will depend on when Srikar's series gets updated.
>>
>>
>> Hi Mel,
>>
>> we have collaborated back in July on the scheduler patch, improving
>> the performance by allowing faster memory migration. You came up with
>> the "sched-numa-fast-crossnode-v1r12" series here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>>
>> which has shown good performance results both in your and our testing.
>>
>
> I remember.
>
>> Do you have some update on the latest status? Is there any plan to
>> merge this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1
>> and based on the results it seems that the patch is not included (and
>> I don't see it listed in  git shortlog v4.18..v4.19-rc1
>> ./kernel/sched)
>>
>
> Srikar's series that mine depended upon was only partially merged due to
> a review bottleneck. He posted a v2 but it was during the merge window
> and likely will need a v3 to avoid falling through the cracks. When it
> is merged, I'll rebase my series on top and post it. While I didn't
> check against 4.19-rc1, I did find that rebasing on top of the partial
> series in 4.18 did not have as big an improvement.
>
>> With 4.19rc1 we see performance drop
>>   * up to 40% (NAS bench) relatively to  4.18 + 
>> sched-numa-fast-crossnode-v1r12
>>   * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 
>> vanilla
>> The performance is dropping. It's quite unclear what are the next
>> steps - should we wait for "sched-numa-fast-crossnode-v1r12" to be
>> merged or should we start looking at what has caused the drop in
>> performance going from 4.19rc1 to 4.18?
>>
>
> Both are valid options. If you take the latter option, I suggest looking
> at whether 2d4056fafa196e1ab4e7161bae4df76f9602d56d is the source of the
> issue as at least one auto-bisection found that it may be problematic.
> Whether it is an issue or not depends heavily on the number of threads
> relative to a socket size.
>
> --
> Mel Gorman
> SUSE Labs


Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

2018-09-03 Thread Jirka Hladky
Resending in plain text mode.

> My own testing completed and the results are within expectations and I
> saw no red flags. Unfortunately, I consider it unlikely they'll be merged
> for 4.18. Srikar Dronamraju's series is likely to need another update
> and I would need to rebase my patches on top of that. Given the scope
> and complexity, I find it unlikely they would be accepted for an -rc,
> particularly this late of an rc. Whether we hit the 4.19 merge window or
> not will depend on when Srikar's series gets updated.


Hi Mel,

we have collaborated back in July on the scheduler patch, improving
the performance by allowing faster memory migration. You came up with
the "sched-numa-fast-crossnode-v1r12" series here:

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git

which has shown good performance results both in your and our testing.

Do you have some update on the latest status? Is there any plan to
merge this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1
and based on the results it seems that the patch is not included (and
I don't see it listed in  git shortlog v4.18..v4.19-rc1
./kernel/sched)

With 4.19rc1 we see a performance drop:
  * up to 40% (NAS bench) relative to 4.18 + sched-numa-fast-crossnode-v1r12
  * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relative to 4.18 vanilla
The performance is dropping. It's quite unclear what the next steps
are - should we wait for "sched-numa-fast-crossnode-v1r12" to be
merged, or should we start looking at what has caused the drop in
performance going from 4.18 to 4.19rc1?

We would appreciate any guidance on how to proceed.

Thanks a lot!
Jirka

On Mon, Sep 3, 2018 at 5:04 PM, Jirka Hladky  wrote:
>> My own testing completed and the results are within expectations and I
>> saw no red flags. Unfortunately, I consider it unlikely they'll be merged
>> for 4.18. Srikar Dronamraju's series is likely to need another update
>> and I would need to rebase my patches on top of that. Given the scope
>> and complexity, I find it unlikely they would be accepted for an -rc,
>> particularly this late of an rc. Whether we hit the 4.19 merge window or
>> not will depend on when Srikar's series gets updated.
>
>
> Hi Mel,
>
> we have collaborated back in July on the scheduler patch, improving the
> performance by allowing faster memory migration. You came up with the
> "sched-numa-fast-crossnode-v1r12" series here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>
> which has shown good performance results both in your and our testing.
>
> Do you have some update on the latest status? Is there any plan to merge
> this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1 and based
> on the results it seems that the patch is not included (and I don't see it
> listed in  git shortlog v4.18..v4.19-rc1 ./kernel/sched)
>
> With 4.19rc1 we see performance drop
>
> up to 40% (NAS bench) relatively to  4.18 + sched-numa-fast-crossnode-v1r12
> up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 vanilla
>
> The performance is dropping. It's quite unclear what are the next steps -
> should we wait for "sched-numa-fast-crossnode-v1r12" to be merged or should
> we start looking at what has caused the drop in performance going from
> 4.19rc1 to 4.18?
>
> We would appreciate any guidance on how to proceed.
>
> Thanks a lot!
> Jirka
>
>
>
>
> On Tue, Jul 17, 2018 at 12:03 PM, Mel Gorman 
> wrote:
>>
>> On Tue, Jul 17, 2018 at 10:45:51AM +0200, Jirka Hladky wrote:
>> > Hi Mel,
>> >
>> > we have compared 4.18 + git://
>> > git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>> > sched-numa-fast-crossnode-v1r12 against 4.16 kernel and performance
>> > results
>> > look very good!
>> >
>>
>> Excellent, thanks to both Kamil and yourself for collecting the data.
>> It's helpful to have independent verification.
>>
>> > We see performance gains about 10-20% for SPECjbb2005. NAS results are a
>> > little bit noisy but show overall performance gains as well (total
>> > runtime
>> > for reduced from 6 hours 34 minutes to 6 hours 26 minutes to give you a
>> > specific example).
>>
>> Great.
>>
>> > The only benchmark showing a slight regression is stream
>> > - but the regression is just a few percents ( upto 10%) and I think it's
>> > not a real concern given that it's an artificial benchmark.
>> >
>>
>> Agreed.
>>
>> > How is your testing going? Do you think
>> > that sched-numa-fast-crossnode-v1r12 series can make it into the 4.18?
>> >
>>
>> My own 


Group Imbalance bug - performance drop upto factor 10x

2017-02-06 Thread Jirka Hladky
Hello,

we observe that the group imbalance bug can cause a performance
degradation of up to a factor of 10 on a 4-NUMA-node server.

I have opened Bug 194231
https://bugzilla.kernel.org/show_bug.cgi?id=194231
for this issue.

The problem was first described in this paper

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

in chapter 3.1. The scheduler does not balance the load correctly on a
4-NUMA-node server in the following scenario:
 * there are three independent ssh connections
 * the first two ssh connections are running a single-threaded CPU-intensive workload
 * the last ssh session is running a multi-threaded application which
requires almost all cores in the system.

We have used
* stress --cpu 1 as the single-threaded CPU-intensive workload
http://people.seas.harvard.edu/~apw/stress/
and
* the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the
multi-threaded application
https://www.nas.nasa.gov/publications/npb.html

Version-Release number of selected component (if applicable):
Reproduced on

kernel 4.10.0-0.rc6


How reproducible:

It requires a server with at least 2 NUMA nodes. The problem gets worse on a 4-NUMA-node server.


Steps to Reproduce:
1. start 3 ssh connections to the server
2. in the first two ssh connections run stress --cpu 1
3. in the third ssh connection run the lu.C.x benchmark with the number
of threads equal to the number of CPUs in the system minus 4 (see the
combined sketch after this list)
4. run either Intel's numatop
echo "N" | numatop -d log >/dev/null 2>&1 &
or mpstat -P ALL 5 and check the load distribution across the NUMA
nodes. The mpstat output can be processed by the mpstat2node.py utility
to aggregate the data across NUMA nodes
https://github.com/jhladka/MPSTAT2NODE/blob/master/mpstat2node.py

mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)

5. Compare the results against the same workload started from ONE ssh
session (all processes are in one group)
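
Putting it together, the reproducer boils down to something like this
(a sketch only - it assumes a 64-CPU box and the NPB OMP binaries built
as described in the other threads; adjust the thread count to your CPU
count):

# ssh session 1 and ssh session 2, one instance in each:
stress --cpu 1

# ssh session 3: all CPUs minus 4 as OpenMP threads
OMP_NUM_THREADS=$(( $(nproc) - 4 )) ./lu.C.x

# any other terminal: watch the per-NUMA-node utilization
mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)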


Actual results:

Uneven load across NUMA nodes:
Average:    NODE    %usr   %idle
Average:     all   66.12   33.51
Average:       0   37.97   61.74
Average:       1   31.67   68.15
Average:       2   97.50    1.98
Average:       3   97.33    2.19

Please note that while the number of CPU-intensive threads is 62 on
this 64-CPU system, NUMA nodes #0 and #1 are underutilized.

The real runtime of the lu.C.x benchmark went up from 114 seconds
to 846 seconds!

Expected results:

Load evenly balanced across all NUMA nodes. The real runtime of the
lu.C.x benchmark should be the same regardless of whether the jobs were
started from one ssh session or from multiple ssh sessions.

Additional info:

See
https://github.com/jplozi/wastedcores/blob/master/patches/group_imbalance_linux_4.1.patch
for a proposed patch against kernel 4.1.

I will upload a reproducer to the Bug report
https://bugzilla.kernel.org/show_bug.cgi?id=194231

Thanks a lot!
Jirka


Re: [PATCH] x86/topology: Fallback to SMT level only once

2016-08-29 Thread Jirka Hladky
Hi Peter,

yes, initially I reported the issue to occur on an Intel E5v3 CPU
(that CPU does not have CoD), but it has turned out to be a fluctuation
of the results. After repeating the test 10 times, it turned out that
the Intel E5v3 CPU is not affected. I'm sorry for that.

I have then rerun the test on an Opteron 6272 (the same CPU as used by
the authors of the paper) and there was a performance degradation by a
factor of 9. Jirka has then provided the patch.

Thanks a lot
Jirka

On Mon, Aug 29, 2016 at 12:40 PM, Peter Zijlstra  wrote:
> On Sun, Aug 28, 2016 at 08:19:46PM +0200, Jiri Olsa wrote:
>> Jirka, Peter and Jean-Pierre reported performance drop on
>> some cpus after making cpu offline and online again.
>>
>> The reason is the kernel logic that falls back to SMT
>> level topology if more than one node is detected within
>> CPU package. During the system boot this logic cuts out
>> the DIE topology level and numa code adds NUMA level
>> on top of this.
>
> Its not SMT topology, and back when I asked if he had CoD enabled or
> such he said not.
>
>
> See also:
>
> http://lkml.kernel.org/r/1471559812-19967-3-git-send-email-srinivas.pandruv...@linux.intel.com
>
> Arguably, that should have been split in two patches, but alas..


Re: Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-07-28 Thread Jirka Hladky
Hi Peter,

I have an update regarding the performance degradation after disabling
and re-enabling a core.

It turns out that the lu.C.x results show quite a big variation; the
tests have to be repeated several times and the mean value of the real
time has to be used to get reliable results.

There is NO regression on the following CPUs:

4x Xeon(R) CPU E5-4610 v2 @ 2.30GHz
4x Xeon(R) CPU E5-2690 v3 @ 2.60GHz

but there is a regression (a slowdown by a factor of 6) on

AMD Opteron(TM) Processor 6272

Kernel 4.7.0-0.rc7.git0.1.el7.x86_64

Real time to run the ./lu.C.x benchmark (mean value out of 10 runs):

Right after boot: 273 seconds
After disabling and enabling a core: 1702 seconds!
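
For reference, the measurement was of this style (a sketch only - the
CPU number, the GNU time invocation and the 10-run loop are
illustrative, not the exact script we used):

# take one core offline and back online (cpu1 is just an example)
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

# run the benchmark 10 times and average the wall-clock time
rm -f times.txt
for i in $(seq 10); do /usr/bin/time -f "%e" -a -o times.txt ./lu.C.x > /dev/null; done
awk '{ s += $1; n++ } END { print "mean real time:", s/n, "s" }' times.txt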

So you were right that it's related to the COD technology.

> The Opteron 6272, which they use, is an Interlagos, that has something
> similar in that each package contains two nodes.

Lauro Venancio is now working on a fix.

Jirka


On Tue, Jul 12, 2016 at 11:04 AM, Jirka Hladky <jhla...@redhat.com> wrote:
> Hi Peter,
>
> have you a chance to look into this? Is there anything I can do to
> help you to fix it?
>
> Thanks a lot!
> Jirka
>
>
> On Wed, Jun 29, 2016 at 11:58 AM, Peter Zijlstra <pet...@infradead.org> wrote:
>> On Wed, Jun 29, 2016 at 11:47:56AM +0200, Jirka Hladky wrote:
>>> Hi Peter,
>>>
>>> I think Cluster on Die technology was introduced in Haswell generation. The
>>> server I'm using is equipped with 4x Intel E5-4610 v2 (Ivy Bridge). I have
>>> double checked the BIOS and there is no cluster on die setting.
>>
>> Oh right, that's E5v3..
>>
>>> The authors of the paper have reported the issue on AMD Bulldozer CPU which
>>> also does not have COD technology.
>>
>> The Opteron 6272, which they use, is an Interlagos, that has something
>> similar in that each package contains two nodes.
>>
>> And their patch touches exactly that part of the x86 topo setup, the
>> match_die() && !same_node() condition, IOW same package, different node.
>>
>> That's not a path an Intel chip would trigger without COD support.


Re: Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-07-12 Thread Jirka Hladky
Hi Peter,

have you a chance to look into this? Is there anything I can do to
help you to fix it?

Thanks a lot!
Jirka


On Wed, Jun 29, 2016 at 11:58 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Wed, Jun 29, 2016 at 11:47:56AM +0200, Jirka Hladky wrote:
>> Hi Peter,
>>
>> I think Cluster on Die technology was introduced in Haswell generation. The
>> server I'm using is equipped with 4x Intel E5-4610 v2 (Ivy Bridge). I have
>> double checked the BIOS and there is no cluster on die setting.
>
> Oh right, that's E5v3..
>
>> The authors of the paper have reported the issue on AMD Bulldozer CPU which
>> also does not have COD technology.
>
> The Opteron 6272, which they use, is an Interlagos, that has something
> similar in that each package contains two nodes.
>
> And their patch touches exactly that part of the x86 topo setup, the
> match_die() && !same_node() condition, IOW same package, different node.
>
> That's not a path an Intel chip would trigger without COD support.


Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-06-28 Thread Jirka Hladky
Hello,

On a NUMA-enabled server equipped with 4 Intel E5-4610 v2 CPUs we
observe the following performance degradation:

Runtime of the "lu.C.x" test from the NAS Parallel Benchmarks right after
booting the kernel:

real  1m57.834s
user  113m51.520s

Then we disable and re-enable one core:

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

and rerun the same test. The runtime is now degraded (by 40% for user time
and by 30% for real (wall-clock) time) using all 64 cores:

real 2m47.746s
user 160m46.109s

The issue was first reported in "The Linux Scheduler: a Decade of
Wasted Cores" paper
http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
https://github.com/jplozi/wastedcores/issues/1

How to reproduce the issue:

A) Get benchmark and compile it:

1) wget http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
2) tar zxvf NPB3.3.1.tar.gz
3) cd ~/NPB3.3.1/NPB3.3-OMP/config/
4) ln -sf NAS.samples/make.def.gcc_x86 make.def (assuming the gcc compiler is used)
5) ln -sf NAS.samples/suite.def.lu suite.def
6) cd ~/NPB3.3.1/NPB3.3-OMP
7) make suite
8) The directory ~/NPB3.3.1/NPB3.3-OMP/bin should now contain the lu.*
benchmarks. The binaries sort alphabetically by runtime, with "lu.A.x"
having the shortest runtime.

B) Reproducing the issue (see also attached script)

Remark: we have done the tests with autogroup disabled
sysctl -w kernel.sched_autogroup_enabled=0
to avoid this issue on 4.7 kernel:
https://bugzilla.kernel.org/show_bug.cgi?id=120481

The test was conducted on a NUMA server with 4 nodes, using all 64
available cores.

1) (time bin/lu.C.x) |& tee $(uname
-r)_lu.C.x.log_before_reenable_kernel.sched_autogroup_enabled=0

2) disable and re-enable one core
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

3) (time bin/lu.C.x) |& tee $(uname
-r)_lu.C.x.log_after_reenable_kernel.sched_autogroup_enabled=0

grep "real\|user" *lu.C*

You will see a significant difference in both real and user time.

According to the authors of the paper, the root cause of the problem is
a missing call to regenerate the scheduling domains inside NUMA nodes after
re-enabling a CPU. The problem was introduced in the 3.19 kernel. The
authors of the paper have proposed a patch which applies to the 4.1 kernel.
Here is the link:
https://github.com/jplozi/wastedcores/blob/master/patches/missing_sched_domains_linux_4.1.patch

===For completeness, here are the results with the 4.6 kernel===

AFTER BOOT
real 1m31.639s
user 89m24.657s

AFTER core has been disabled and re-enabled
real 2m44.566s
user 157m59.814s

Please note that the problem is much more visible with the 4.6 kernel
than with the 4.7-rc5 kernel.

At the same time, the 4.6 kernel delivers much better performance after
boot than the 4.7-rc5 kernel, which might indicate that another problem
is in play.
=

I have also tested the kernel provided by Peter Zijlstra on Friday, June
24th, which provides a fix for
https://bugzilla.kernel.org/show_bug.cgi?id=120481. It does not fix
this issue, and that kernel performs worse right after boot than the 4.6
kernel right after boot, so we may in fact be facing two problems here.

Results with 4.7.0-02548776ded1185e6e16ad0a475481e982741ee9 kernel=
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
$ git rev-parse HEAD
02548776ded1185e6e16ad0a475481e982741ee9

AFTER BOOT
real 1m58.549s
user 113m31.448s

AFTER core has been disabled and re-enabled
real 2m35.930s
user 148m20.795s
=

Thanks a lot!
Jirka

PS: I have opened this BZ to track this issue
Bug 121121 - Kernel v4.7-rc5 - performance degradation upto 40% after
disabling and re-enabling a core
https://bugzilla.kernel.org/show_bug.cgi?id=121121


reproduce.sh
Description: Bourne shell script
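
The reproduce.sh attachment itself is not included above; the following is
a minimal sketch of the procedure from the report, for reference. The NPB
path, the lu.C.x binary, the autogroup sysctl and the cpu1 offline/online
toggle are taken from the report; the file names and error handling are
only illustrative.

#!/bin/bash
# Sketch: measure lu.C.x before and after a CPU offline/online cycle.
set -e
BIN=~/NPB3.3.1/NPB3.3-OMP/bin/lu.C.x    # built as in step A) above

# Avoid the autogroup issue mentioned in the remark above
sysctl -w kernel.sched_autogroup_enabled=0

# Run 1: right after boot
(time "$BIN") |& tee "$(uname -r)_lu.C.x.log_before_reenable"

# Disable and re-enable one core
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

# Run 2: after the offline/online cycle
(time "$BIN") |& tee "$(uname -r)_lu.C.x.log_after_reenable"

# Compare the real and user times of the two runs
grep "real\|user" *lu.C*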


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-24 Thread Jirka Hladky
Hi Peter,

I have compiled your version of the Linux kernel and run the SPECjvm2008
tests. The results are fine; performance is at the level of the 4.6 kernel.

$ git rev-parse HEAD
02548776ded1185e6e16ad0a475481e982741ee9

Jirka




On Fri, Jun 24, 2016 at 5:54 PM, Peter Zijlstra  wrote:
> On Fri, Jun 24, 2016 at 03:42:26PM +0200, Peter Zijlstra wrote:
>> On Fri, Jun 24, 2016 at 02:44:07PM +0200, Vincent Guittot wrote:
>> > > --- a/kernel/sched/fair.c
>> > > +++ b/kernel/sched/fair.c
>> > > @@ -2484,7 +2484,7 @@ static inline long calc_tg_weight(struct 
>> > > task_group *tg, struct cfs_rq *cfs_rq)
>> > >  */
>> > > tg_weight = atomic_long_read(&tg->load_avg);
>> > > tg_weight -= cfs_rq->tg_load_avg_contrib;
>> > > -   tg_weight += cfs_rq->load.weight;
>> > > +   tg_weight += cfs_rq->avg.load_avg;
>> >
>> > IIUC, you are reverting
>> > commit  fde7d22e01aa (sched/fair: Fix overly small weight for
>> > interactive group entities)
>>
>> Hurm.. looking at that commit again, that seems to wreck
>> effective_load(), since that doesn't compensate.
>>
>> Maybe I'll remove calc_tg_weight and open code its slightly different
>> usages in the two sites.
>
> OK, sorry for not actually posting, but I need to run. Please find the
> two patches in:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-24 Thread Jirka Hladky
Hi Peter,

the proposed patch has fixed the performance issue. I have applied the
patch to v4.7-rc4

Jirka

On Fri, Jun 24, 2016 at 2:44 PM, Vincent Guittot
<vincent.guit...@linaro.org> wrote:
> Hi Peter,
>
> On 24 June 2016 at 14:02, Peter Zijlstra <pet...@infradead.org> wrote:
>> On Fri, Jun 24, 2016 at 09:44:41AM +0200, Jirka Hladky wrote:
>>> Hi Peter,
>>>
>>> thanks a lot for looking into it!
>>>
>>> I have tried to disable autogroups
>>>
>>> sysctl -w kernel.sched_autogroup_enabled=0
>>>
>>> and I can confirm that performance is then back at level as in 4.6 kernel.
>>
>> So unless the heat has made me do really silly things, the below seems
>> to cure things. Could you please verify?
>>
>>
>> ---
>>  kernel/sched/fair.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 22d64b3f5876..d4f6fb2f3057 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2484,7 +2484,7 @@ static inline long calc_tg_weight(struct task_group 
>> *tg, struct cfs_rq *cfs_rq)
>>  */
>> tg_weight = atomic_long_read(&tg->load_avg);
>> tg_weight -= cfs_rq->tg_load_avg_contrib;
>> -   tg_weight += cfs_rq->load.weight;
>> +   tg_weight += cfs_rq->avg.load_avg;
>
> IIUC, you are reverting
> commit  fde7d22e01aa (sched/fair: Fix overly small weight for
> interactive group entities)
>
> I have one question regarding the use of cfs_rq->avg.load_avg
> cfs_rq->tg_load_avg_contrib is the sampling of cfs_rq->avg.load_avg so
> I'm curious to understand why you use cfs_rq->avg.load_avg instead of
> keeping cfs_rq->tg_load_avg_contrib. Do you think that the sampling is
> not accurate enough to prevent any significant difference between both
> when we use tg->load_avg ?
>
>
>>
>> return tg_weight;
>>  }
>> @@ -2494,7 +2494,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, 
>> struct task_group *tg)
>> long tg_weight, load, shares;
>>
>> tg_weight = calc_tg_weight(tg, cfs_rq);
>> -   load = cfs_rq->load.weight;
>> +   load = cfs_rq->avg.load_avg;
>>
>> shares = (tg->shares * load);
>> if (tg_weight)


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-24 Thread Jirka Hladky
OK, I have applied it to v4.7-rc4 via git am.

Compiling the kernel; I should have the results soon.

Jirka

On Fri, Jun 24, 2016 at 2:09 PM, Jirka Hladky <jhla...@redhat.com> wrote:
> Thank you Peter!
>
> Should I apply it to v4.7-rc4 ?
>
> Jirka
>
> On Fri, Jun 24, 2016 at 2:02 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>> On Fri, Jun 24, 2016 at 09:44:41AM +0200, Jirka Hladky wrote:
>>> Hi Peter,
>>>
>>> thanks a lot for looking into it!
>>>
>>> I have tried to disable autogroups
>>>
>>> sysctl -w kernel.sched_autogroup_enabled=0
>>>
>>> and I can confirm that performance is then back at level as in 4.6 kernel.
>>
>> So unless the heat has made me do really silly things, the below seems
>> to cure things. Could you please verify?
>>
>>
>> ---
>>  kernel/sched/fair.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 22d64b3f5876..d4f6fb2f3057 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2484,7 +2484,7 @@ static inline long calc_tg_weight(struct task_group 
>> *tg, struct cfs_rq *cfs_rq)
>>  */
>> tg_weight = atomic_long_read(&tg->load_avg);
>> tg_weight -= cfs_rq->tg_load_avg_contrib;
>> -   tg_weight += cfs_rq->load.weight;
>> +   tg_weight += cfs_rq->avg.load_avg;
>>
>> return tg_weight;
>>  }
>> @@ -2494,7 +2494,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, 
>> struct task_group *tg)
>> long tg_weight, load, shares;
>>
>> tg_weight = calc_tg_weight(tg, cfs_rq);
>> -   load = cfs_rq->load.weight;
>> +   load = cfs_rq->avg.load_avg;
>>
>> shares = (tg->shares * load);
>> if (tg_weight)


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-24 Thread Jirka Hladky
Thank you Peter!

Should I apply it to v4.7-rc4 ?

Jirka

On Fri, Jun 24, 2016 at 2:02 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Fri, Jun 24, 2016 at 09:44:41AM +0200, Jirka Hladky wrote:
>> Hi Peter,
>>
>> thanks a lot for looking into it!
>>
>> I have tried to disable autogroups
>>
>> sysctl -w kernel.sched_autogroup_enabled=0
>>
>> and I can confirm that performance is then back at level as in 4.6 kernel.
>
> So unless the heat has made me do really silly things, the below seems
> to cure things. Could you please verify?
>
>
> ---
>  kernel/sched/fair.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 22d64b3f5876..d4f6fb2f3057 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2484,7 +2484,7 @@ static inline long calc_tg_weight(struct task_group 
> *tg, struct cfs_rq *cfs_rq)
>  */
> tg_weight = atomic_long_read(&tg->load_avg);
> tg_weight -= cfs_rq->tg_load_avg_contrib;
> -   tg_weight += cfs_rq->load.weight;
> +   tg_weight += cfs_rq->avg.load_avg;
>
> return tg_weight;
>  }
> @@ -2494,7 +2494,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, 
> struct task_group *tg)
> long tg_weight, load, shares;
>
> tg_weight = calc_tg_weight(tg, cfs_rq);
> -   load = cfs_rq->load.weight;
> +   load = cfs_rq->avg.load_avg;
>
> shares = (tg->shares * load);
> if (tg_weight)


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-24 Thread Jirka Hladky
I had a look and

CONFIG_SCHED_AUTOGROUP=y

is used both in RHEL6 and RHEL7. We compile the upstream kernels with a
config derived from the RHEL7 config file.

Jirka

On Fri, Jun 24, 2016 at 10:08 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Fri, Jun 24, 2016 at 09:44:41AM +0200, Jirka Hladky wrote:
>> I have double checked default settings and
>>
>> kernel.sched_autogroup_enabled
>>
>> is by default ON both in 4.6 and 4.7 kernel.
>
> Yeah, if you enable that CONFIG its default enabled. In any case, I'll
> go trawl through the cgroup code now. I spend yesterday looking at the
> 'wrong' part things.
>
>


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-24 Thread Jirka Hladky
Hi Peter,

thanks a lot for looking into it!

I have tried to disable autogroups

sysctl -w kernel.sched_autogroup_enabled=0

and I can confirm that performance is then back at the level of the 4.6 kernel.

I have double checked default settings and

kernel.sched_autogroup_enabled

is by default ON both in 4.6 and 4.7 kernel.

Jirka

On Thu, Jun 23, 2016 at 8:43 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Jun 23, 2016 at 08:33:18PM +0200, Peter Zijlstra wrote:
>> On Fri, Jun 17, 2016 at 01:04:23AM +0200, Jirka Hladky wrote:
>>
>> > > What kind of config and userspace setup? Do you run this cruft in a
>> > > cgroup of sorts?
>> >
>> >  No, we don't do any special setup except to control the number of threads.
>>
>> OK, so I'm fairly certain you _do_ run in a cgroup, because its made
>> almost impossible not to these days.
>>
>> Run:
>>
>>   grep java /proc/sched_debug
>>
>> while the thing is running. That'll show you the actual cgroup the stuff
>> is running in.
>
> That'll end up looking something like:
>
> root@ivb-ep:/usr/src/linux-2.6# grep java /proc/sched_debug
> java  2714 18270.63492589   120 0.00  
>1.490023 0.00 0 0 /user.slice/user-0.slice/session-2.scope
> java  2666 18643.629673 2   120 0.00  
>0.063129 0.00 0 0 /user.slice/user-0.slice/session-2.scope
> java  2676 18655.652878 3   120 0.00  
>0.077127 0.00 0 0 /user.slice/user-0.slice/session-2.scope
> java  2680 18655.683384 3   120 0.00  
>0.082993 0.00 0 0 /user.slice/user-0.slice/session-2.scope
>
> which shows a 3 deep hierarchy. Clearly these people haven't the
> faintest clue about the cost of what they're doing. This stuff ain't
> free.
>
>
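
A small sketch combining the two checks from this exchange - the autogroup
toggle above and the /proc/sched_debug inspection from the quoted reply.
The awk/sort summary at the end is only an illustration and assumes the
cgroup path is the last field of each task line:

# Is autogroup currently enabled? (1 = on)
cat /proc/sys/kernel/sched_autogroup_enabled

# Disable it for the measurement, as done above
sysctl -w kernel.sched_autogroup_enabled=0

# While the benchmark is running, show which cgroup each java thread is in;
# a deep /user.slice/.../session-*.scope path means a nested hierarchy.
grep java /proc/sched_debug | awk '{print $NF}' | sort | uniq -c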


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-22 Thread Jirka Hladky
Hi Peter,

the kernel I got from bisecting does not work - I'm getting a kernel
panic during boot.

In any case, the regression was introduced between
git bisect good 64b7aad
git bisect bad 2159197

This commit is good:
64b7aad - Ingo Molnar, 7 weeks ago : Merge branch 'sched/urgent' into
sched/core, to pick up fixes before applying new changes

This commit is bad:
2159197 - Peter Zijlstra, 8 weeks ago : sched/core: Enable increased
load resolution on 64-bit kernels

Could you please have a look?

Thanks a lot!
Jirka
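
For reference, the bisect loop being described in this thread looks roughly
like the sketch below. The good/bad commits are the ones listed above; the
build, install and benchmark steps are only placeholders, since each
candidate kernel has to be installed, booted and benchmarked by hand:

cd linux
git bisect start
git bisect good 64b7aad
git bisect bad 2159197

# For every revision git bisect checks out:
#   1. build and install it, e.g. make -j"$(nproc)" && make modules_install install
#   2. reboot into the new kernel
#   3. run the benchmark (SPECjvm2008 xml.* or NPB lu.C.x) and compare scores
#   4. mark the result:
git bisect good        # or: git bisect bad
# Repeat until git bisect names the first bad commit, then clean up:
git bisect reset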


On Wed, Jun 22, 2016 at 2:46 PM, Jirka Hladky <jhla...@redhat.com> wrote:
> OK, I have reviewed my results once again:
>
> This commit is fine:
> 64b7aad - Ingo Molnar, 7 weeks ago : Merge branch 'sched/urgent' into
> sched/core, to pick up fixes before applying new changes
>
> This version has already a problem:
> 2159197 - Peter Zijlstra, 8 weeks ago : sched/core: Enable increased
> load resolution on 64-bit kernels
>
> git bisect start
> git bisect good 64b7aad
> git bisect bad 2159197
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [eb58075149b7f0300ff19142e6245fe75db2a081] sched/core: Introduce
> 'struct rq_flags'
>
> I should have results pretty soon.
>
> Jirka
>
>
> On Wed, Jun 22, 2016 at 2:37 PM, Jirka Hladky <jhla...@redhat.com> wrote:
>> Hi Peter,
>>
>> crap - I have done bisecting manually (not using git bisect) and I
>> have probably done some mistake.
>>
>> Commits (git checkout ) for which I got BAD results:
>>
>> 2159197d66770ec01f75c93fb11dc66df81fd45b
>> 6ecdd74962f246dfe8750b7bea481a1c0816315d
>>
>> Commits (git checkout ) for which I got GOOD results:
>> 21e96f88776deead303ecd30a17d1d7c2a1776e3
>> 64b7aad579847852e110878ccaae4c3aaa34
>> e7904a28f5331c21d17af638cb477c83662e3cb6
>>
>> I will try to use git bisect now.
>> 
>> Jirka
>>
>> On Wed, Jun 22, 2016 at 1:12 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>>> On Wed, Jun 22, 2016 at 11:52:45AM +0200, Jirka Hladky wrote:
>>>> Hi Peter,
>>>>
>>>> the performance regression has been caused by this commit
>>>>
>>>> =
>>>> commit 6ecdd74962f246dfe8750b7bea481a1c0816315d
>>>> Author: Yuyang Du <yuyang...@intel.com>
>>>> Date:   Tue Apr 5 12:12:26 2016 +0800
>>>>
>>>> sched/fair: Generalize the load/util averages resolution definition
>>>> =
>>>>
>>>> Could you please have a look?
>>>
>>> That patch looks like a NO-OP to me.
>>>
>>> In any case, the good news it that I can run the benchmark, the bad news
>>> is that the patch you fingered doesn't appear to be it.
>>>
>>>
>>> v4.60:
>>> ./4.6.0/2016-Jun-22_11h11m07s.log:Score on xml.transform: 2007.18 ops/m
>>> ./4.6.0/2016-Jun-22_11h11m07s.log:Score on xml.validation: 2999.44 ops/m
>>>
>>> tip/master:
>>> ./4.7.0-rc4-00345-gf6e78bb/2016-Jun-22_11h30m27s.log:Score on 
>>> xml.transform: 1283.14 ops/m
>>> ./4.7.0-rc4-00345-gf6e78bb/2016-Jun-22_11h30m27s.log:Score on 
>>> xml.validation: 2008.62 ops/m
>>>
>>> patch^1
>>> ./4.6.0-rc5-00034-g2159197/2016-Jun-22_12h38m50s.log:Score on 
>>> xml.transform: 1196.18 ops/m
>>> ./4.6.0-rc5-00034-g2159197/2016-Jun-22_12h38m50s.log:Score on 
>>> xml.validation: 2055.11 ops/m
>>>
>>> patch^1 + patch
>>> ./4.6.0-rc5-00034-g2159197-dirty/2016-Jun-22_12h55m43s.log:Score on 
>>> xml.transform: 1294.59 ops/m
>>> ./4.6.0-rc5-00034-g2159197-dirty/2016-Jun-22_12h55m43s.log:Score on 
>>> xml.validation: 2140.02 ops/m
>>>
>>>


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-22 Thread Jirka Hladky
OK, I have reviewed my results once again:

This commit is fine:
64b7aad - Ingo Molnar, 7 weeks ago : Merge branch 'sched/urgent' into
sched/core, to pick up fixes before applying new changes

This version already has a problem:
2159197 - Peter Zijlstra, 8 weeks ago : sched/core: Enable increased
load resolution on 64-bit kernels

git bisect start
git bisect good 64b7aad
git bisect bad 2159197
Bisecting: 1 revision left to test after this (roughly 1 step)
[eb58075149b7f0300ff19142e6245fe75db2a081] sched/core: Introduce
'struct rq_flags'

I should have results pretty soon.

Jirka


On Wed, Jun 22, 2016 at 2:37 PM, Jirka Hladky <jhla...@redhat.com> wrote:
> Hi Peter,
>
> crap - I have done bisecting manually (not using git bisect) and I
> have probably done some mistake.
>
> Commits (git checkout ) for which I got BAD results:
>
> 2159197d66770ec01f75c93fb11dc66df81fd45b
> 6ecdd74962f246dfe8750b7bea481a1c0816315d
>
> Commits (git checkout ) for which I got GOOD results:
> 21e96f88776deead303ecd30a17d1d7c2a1776e3
> 64b7aad579847852e110878ccaae4c3aaa34
> e7904a28f5331c21d17af638cb477c83662e3cb6
>
> I will try to use git bisect now.
> 
> Jirka
>
> On Wed, Jun 22, 2016 at 1:12 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>> On Wed, Jun 22, 2016 at 11:52:45AM +0200, Jirka Hladky wrote:
>>> Hi Peter,
>>>
>>> the performance regression has been caused by this commit
>>>
>>> =
>>> commit 6ecdd74962f246dfe8750b7bea481a1c0816315d
>>> Author: Yuyang Du <yuyang...@intel.com>
>>> Date:   Tue Apr 5 12:12:26 2016 +0800
>>>
>>> sched/fair: Generalize the load/util averages resolution definition
>>> =
>>>
>>> Could you please have a look?
>>
>> That patch looks like a NO-OP to me.
>>
>> In any case, the good news it that I can run the benchmark, the bad news
>> is that the patch you fingered doesn't appear to be it.
>>
>>
>> v4.60:
>> ./4.6.0/2016-Jun-22_11h11m07s.log:Score on xml.transform: 2007.18 ops/m
>> ./4.6.0/2016-Jun-22_11h11m07s.log:Score on xml.validation: 2999.44 ops/m
>>
>> tip/master:
>> ./4.7.0-rc4-00345-gf6e78bb/2016-Jun-22_11h30m27s.log:Score on xml.transform: 
>> 1283.14 ops/m
>> ./4.7.0-rc4-00345-gf6e78bb/2016-Jun-22_11h30m27s.log:Score on 
>> xml.validation: 2008.62 ops/m
>>
>> patch^1
>> ./4.6.0-rc5-00034-g2159197/2016-Jun-22_12h38m50s.log:Score on xml.transform: 
>> 1196.18 ops/m
>> ./4.6.0-rc5-00034-g2159197/2016-Jun-22_12h38m50s.log:Score on 
>> xml.validation: 2055.11 ops/m
>>
>> patch^1 + patch
>> ./4.6.0-rc5-00034-g2159197-dirty/2016-Jun-22_12h55m43s.log:Score on 
>> xml.transform: 1294.59 ops/m
>> ./4.6.0-rc5-00034-g2159197-dirty/2016-Jun-22_12h55m43s.log:Score on 
>> xml.validation: 2140.02 ops/m
>>
>>


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-22 Thread Jirka Hladky
Hi Peter,

crap - I have done the bisecting manually (not using git bisect) and I
have probably made some mistake.

Commits (git checkout ) for which I got BAD results:

2159197d66770ec01f75c93fb11dc66df81fd45b
6ecdd74962f246dfe8750b7bea481a1c0816315d

Commits (git checkout ) for which I got GOOD results:
21e96f88776deead303ecd30a17d1d7c2a1776e3
64b7aad579847852e110878ccaae4c3aaa34
e7904a28f5331c21d17af638cb477c83662e3cb6

I will try to use git bisect now.

Jirka

On Wed, Jun 22, 2016 at 1:12 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Wed, Jun 22, 2016 at 11:52:45AM +0200, Jirka Hladky wrote:
>> Hi Peter,
>>
>> the performance regression has been caused by this commit
>>
>> =
>> commit 6ecdd74962f246dfe8750b7bea481a1c0816315d
>> Author: Yuyang Du <yuyang...@intel.com>
>> Date:   Tue Apr 5 12:12:26 2016 +0800
>>
>> sched/fair: Generalize the load/util averages resolution definition
>> =
>>
>> Could you please have a look?
>
> That patch looks like a NO-OP to me.
>
> In any case, the good news it that I can run the benchmark, the bad news
> is that the patch you fingered doesn't appear to be it.
>
>
> v4.60:
> ./4.6.0/2016-Jun-22_11h11m07s.log:Score on xml.transform: 2007.18 ops/m
> ./4.6.0/2016-Jun-22_11h11m07s.log:Score on xml.validation: 2999.44 ops/m
>
> tip/master:
> ./4.7.0-rc4-00345-gf6e78bb/2016-Jun-22_11h30m27s.log:Score on xml.transform: 
> 1283.14 ops/m
> ./4.7.0-rc4-00345-gf6e78bb/2016-Jun-22_11h30m27s.log:Score on xml.validation: 
> 2008.62 ops/m
>
> patch^1
> ./4.6.0-rc5-00034-g2159197/2016-Jun-22_12h38m50s.log:Score on xml.transform: 
> 1196.18 ops/m
> ./4.6.0-rc5-00034-g2159197/2016-Jun-22_12h38m50s.log:Score on xml.validation: 
> 2055.11 ops/m
>
> patch^1 + patch
> ./4.6.0-rc5-00034-g2159197-dirty/2016-Jun-22_12h55m43s.log:Score on 
> xml.transform: 1294.59 ops/m
> ./4.6.0-rc5-00034-g2159197-dirty/2016-Jun-22_12h55m43s.log:Score on 
> xml.validation: 2140.02 ops/m
>
>


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-22 Thread Jirka Hladky
Hi Peter,

the performance regression has been caused by this commit

=
commit 6ecdd74962f246dfe8750b7bea481a1c0816315d
Author: Yuyang Du 
Date:   Tue Apr 5 12:12:26 2016 +0800

sched/fair: Generalize the load/util averages resolution definition
=

Could you please have a look?

Thanks a lot!
Jirka


On Wed, Jun 22, 2016 at 9:54 AM, Peter Zijlstra  wrote:
> On Wed, Jun 22, 2016 at 09:49:41AM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 22, 2016 at 09:16:01AM +0200, Peter Zijlstra wrote:
>> > WTF a benchmark needs that crap is beyond me, but whatever, I have
>> > numbers.
>>
>> Oh, shaft me harder, its XML shite :/ How is a sane person ever going to
>> get numbers out.
>>
>> I'm >.< close to giving up on this site and declaring the thing
>> -EDONTCARE.
>
> OK, done.. have a look at this:
>
>
> /tmp/SPECjvm2008/compiler.compiler/compiler/src/share/classes/javax/lang/model/element/Name.java:54:
>  cannot access java.lang.CharSequence
> bad class file: 
> spec.benchmarks.compiler.SpecFileManager$CachedFileObject@1c06fce6
> bad constant pool tag: 18 at 10
> Please remove or make sure it appears in the correct subdirectory of the 
> classpath.
> public interface Name extends CharSequence {
>   ^
> ERROR: compiler exit code: 1
>
> Warmup (120s) begins: Wed Jun 22 09:45:33 CEST 2016
> /tmp/SPECjvm2008/compiler.compiler/compiler/src/share/classes/javax/lang/model/element/Name.java:54:
>  cannot access java.lang.CharSequence
> bad class file: 
> spec.benchmarks.compiler.SpecFileManager$CachedFileObject@1c06fce6
> bad constant pool tag: 18 at 10
> Please remove or make sure it appears in the correct subdirectory of the 
> classpath.
> public interface Name extends CharSequence {
>   ^
> /tmp/SPECjvm2008/compiler.compiler/compiler/src/share/classes/javax/lang/model/element/Name.java:54:
>  cannot access java.lang.CharSequence
> bad class file: 
> spec.benchmarks.compiler.SpecFileManager$CachedFileObject@1c06fce6
> bad constant pool tag: 18 at 10
> Please remove or make sure it appears in the correct subdirectory of the 
> classpath.
>
>
>
> Clearly this stuff just isn't made to be used.
>
>
> /me goes do something useful.


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-22 Thread Jirka Hladky
Hi Branimir,

I don't think it's related. The regression happened in one of
these two commits:

$ git log --pretty=oneline
e7904a28f5331c21d17af638cb477c83662e3cb6..6ecdd74962f246dfe8750b7bea481a1c0816315d
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the
load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased
load resolution on 64-bit kernels

Please see
https://bugzilla.kernel.org/show_bug.cgi?id=120481
for the details.

Jirka




On Wed, Jun 22, 2016 at 9:37 AM, Branimir Maksimovic
<branimir.maksimo...@gmail.com> wrote:
> Could it be related to this:
>
> https://www.phoronix.com/scan.php?page=news_item=P-State-Possible-4.6-Regression
>
>
> On Thu, 16 Jun 2016 18:40:01 +0200
> Jirka Hladky <jhla...@redhat.com> wrote:
>
>> Hello,
>>
>> we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
>> benchmarks starting from 4.7.0-0.rc0 kernel compared to 4.6 kernel.
>>
>> We have tested kernels 4.7.0-0.rc1 and 4.7.0-0.rc3 and these are as
>> well affected.
>>
>> We have observed the drop on variety of different x86_64 servers with
>> different configuration (different CPU models, RAM sizes, both with
>> Hyper Threading ON and OFF, different NUMA configurations (2 and 4
>> NUMA nodes)
>>
>> Linpack and Stream benchmarks do not show any performance drop.
>>
>> The performance drop increases with higher number of threads. The
>> maximum number of threads in each benchmark is the same as number of
>> CPUs.
>>
>> We have opened a BZ to track the progress:
>> https://bugzilla.kernel.org/show_bug.cgi?id=120481
>>
>> You can find more details along with graphs and tables there.
>>
>> Do you have any hints which commit should we try to reverse?
>>
>> Thanks a lot!
>> Jirka
>


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-22 Thread Jirka Hladky
Hi Peter,

please find the reproducer script attached. My command to reproduce the bug is:

./run-specjvm.sh --benchmarkThreads 32 --iterations 1 --iterationTime
180 --warmuptime 90 xml.transform xml.validation

I run just xml benchmarks to speed up the runtime.

Please check
 https://bugzilla.kernel.org/show_bug.cgi?id=120481#c9
for some details on how to run the benchmark.

The benchmark needs a window manager to be installed to create graphs.
However, you can run the script from an ssh terminal. I don't know
exactly why that is, but I know that Python's matplotlib library has the
same requirement.

last known good commit: e7904a28f5331c21d17af638cb477c83662e3cb6
first known bad commit: 6ecdd74962f246dfe8750b7bea481a1c0816315d

Last two commits to be checked:

 git log --pretty=oneline
e7904a28f5331c21d17af638cb477c83662e3cb6..6ecdd74962f246dfe8750b7bea481a1c0816315d
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the
load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased
load resolution on 64-bit kernels

I use the following command to review the results produced by the reproduce.sh script.

find ./ -name "*log" | xargs grep -H Score | grep xml.validation |
grep "[0-9]\{4\}[.][0-9]\{2\} ops/m"

Jirka
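
The comparison workflow from this message, collected into one sketch for
reference; run-specjvm.sh, its options and the result grep are exactly as
given above, and repeating the run once per kernel under test is the only
addition:

# On each kernel under test (e.g. the last known good and the first known
# bad commit listed above), run the reduced benchmark set:
./run-specjvm.sh --benchmarkThreads 32 --iterations 1 --iterationTime 180 \
    --warmuptime 90 xml.transform xml.validation

# Afterwards, collect the xml.validation scores from all result logs:
find ./ -name "*log" | xargs grep -H Score | grep xml.validation | \
    grep "[0-9]\{4\}[.][0-9]\{2\} ops/m"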

On Wed, Jun 22, 2016 at 9:16 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Fri, Jun 17, 2016 at 01:04:23AM +0200, Jirka Hladky wrote:
>> > > we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
>> > Blergh, of course I don't have those.. :/
>>
>> SPECjvm2008 is publicly available.
>> https://www.spec.org/download.html
>
> Urgh, I _so_ hate java.
>
> Why does it have to pop up windows split between my screens, total fail.
>
> In any case, I run it like:
>
>java -jar SPECjvm2008.jar --benchmarkThreads 40
>
> because I have 40 cpus (2 sockets * 10 cores/socket * 2 threads/core).
>
> It seems to produce numbers, but then ends with a splat:
>
> Error while creating report: Assistive Technology not found: 
> org.GNOME.Accessibility.AtkWrapper
> java.awt.AWTError: Assistive Technology not found: 
> org.GNOME.Accessibility.AtkWrapper
> at java.awt.Toolkit.loadAssistiveTechnologies(Toolkit.java:807)
> at java.awt.Toolkit.getDefaultToolkit(Toolkit.java:886)
> at 
> sun.swing.SwingUtilities2.getSystemMnemonicKeyMask(SwingUtilities2.java:2020)
> at 
> javax.swing.plaf.basic.BasicLookAndFeel.initComponentDefaults(BasicLookAndFeel.java:1158)
> at 
> javax.swing.plaf.metal.MetalLookAndFeel.initComponentDefaults(MetalLookAndFeel.java:431)
> at 
> javax.swing.plaf.basic.BasicLookAndFeel.getDefaults(BasicLookAndFeel.java:148)
> at 
> javax.swing.plaf.metal.MetalLookAndFeel.getDefaults(MetalLookAndFeel.java:1577)
> at javax.swing.UIManager.setLookAndFeel(UIManager.java:539)
> at javax.swing.UIManager.setLookAndFeel(UIManager.java:579)
> at javax.swing.UIManager.initializeDefaultLAF(UIManager.java:1349)
> at javax.swing.UIManager.initialize(UIManager.java:1459)
> at javax.swing.UIManager.maybeInitialize(UIManager.java:1426)
> at javax.swing.UIManager.getDefaults(UIManager.java:659)
> at javax.swing.UIManager.getColor(UIManager.java:701)
> at org.jfree.chart.JFreeChart.<init>(JFreeChart.java:246)
> at 
> org.jfree.chart.ChartFactory.createXYLineChart(ChartFactory.java:1478)
> at spec.reporter.BenchmarkChart.<init>(BenchmarkChart.java:47)
> at 
> spec.reporter.ReportGenerator.handleBenchmarkResult(ReportGenerator.java:141)
> at 
> spec.reporter.ReportGenerator.handleBenchmarksResults(ReportGenerator.java:105)
> at spec.reporter.ReportGenerator.<init>(ReportGenerator.java:87)
> at spec.reporter.ReportGenerator.main2(ReportGenerator.java:750)
> at spec.reporter.Reporter.main2(Reporter.java:51)
> at spec.harness.Launch.createReport(Launch.java:307)
> at spec.harness.Launch.runBenchmarkSuite(Launch.java:250)
> at spec.harness.Launch.main(Launch.java:452)
>
> WTF a benchmark needs that crap is beyond me, but whatever, I have
> numbers.
>
> I'll try and reproduce.


reproduce.sh
Description: Bourne shell script


Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-21 Thread Jirka Hladky
Hi Peter,

I have an update for this performance issue. I have tested several
kernels, I'm not at the parent of

  2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels

and I still see the performance regression for multithreaded workloads.

There are only 27 commits remaining between v4.6 (last known to be OK)
and current HEAD (6ecdd74962f246dfe8750b7bea481a1c0816315d)
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the
load/util averages resolution definition

See below [0].

Any hint which commit should I try now?

Thanks a lot!
Jirka

[0]
$ git log --pretty=oneline v4.6..HEAD kernel/sched
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the
load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased
load resolution on 64-bit kernels
e7904a28f5331c21d17af638cb477c83662e3cb6 locking/lockdep, sched/core:
Implement a better lock pinning scheme
eb58075149b7f0300ff19142e6245fe75db2a081 sched/core: Introduce 'struct rq_flags'
3e71a462dd483ce508a723356b293731e7d788ea sched/core: Move
task_rq_lock() out of line
64b7aad579847852e110878ccaae4c3aaa34 Merge branch 'sched/urgent'
into sched/core, to pick up fixes before applying new changes
f98db6013c557c216da5038d9c52045be55cd039 sched/core: Add
switch_mm_irqs_off() and use it in the scheduler
594dd290cf5403a9a5818619dfff42d8e8e0518e sched/cpufreq: Optimize
cpufreq update kicker to avoid update multiple times
fec148c000d0f9ac21679601722811eb60b4cc52 sched/deadline: Fix a bug in
dl_overflow()
9fd81dd5ce0b12341c9f83346f8d32ac68bd3841 sched/fair: Optimize
!CONFIG_NO_HZ_COMMON CPU load updates
1f41906a6fda1114debd3898668bd7ab6470ee41 sched/fair: Correctly handle
nohz ticks CPU load accounting
cee1afce3053e7aa0793fbd5f2e845fa2cef9e33 sched/fair: Gather CPU load
functions under a more conventional namespace
a2c6c91f98247fef0fe75216d607812485aeb0df sched/fair: Call cpufreq hook
in additional paths
41e0d37f7ac81297c07ba311e4ad39465b8c8295 sched/fair: Do not call
cpufreq hook unless util changed
21e96f88776deead303ecd30a17d1d7c2a1776e3 sched/fair: Move cpufreq hook
to update_cfs_rq_load_avg()
1f621e028baf391f6684003e32e009bc934b750f sched/fair: Fix asym packing
to select correct CPU
bd92883051a0228cc34996b8e766111ba10c9aac sched/cpuacct: Check for NULL
when using task_pt_regs()
2c923e94cd9c6acff3b22f0ae29cfe65e2658b40 sched/clock: Make
local_clock()/cpu_clock() inline
c78b17e28cc2c2df74264afc408bdc6aaf3fbcc8 sched/clock: Remove pointless
test in cpu_clock/local_clock
fb90a6e93c0684ab2629a42462400603aa829b9c sched/debug: Don't dump sched
debug info in SysRq-W
2b8c41daba327c633228169e8bd8ec067ab443f8 sched/fair: Initiate a new
task's util avg to a bounded value
1c3de5e19fc96206dd086e634129d08e5f7b1000 sched/fair: Update comments
after a variable rename
47252cfbac03644ee4a3adfa50c77896aa94f2bb sched/core: Add preempt
checks in preempt_schedule() code
bfdb198ccd99472c5bded689699eb30dd06316bb sched/numa: Remove
unnecessary NUMA dequeue update from non-SMP kernels
d02c071183e1c01a76811c878c8a52322201f81f sched/fair: Reset
nr_balance_failed after active balancing
d740037fac7052e49450f6fa1454f1144a103b55 sched/cpuacct: Split usage
accounting into user_usage and sys_usage
5ca3726af7f66a8cc71ce4414cfeb86deb784491 sched/cpuacct: Show all
possible CPUs in cpuacct output
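
Failing a better hint, the remaining window can also be narrowed mechanically
with git bisect, using the reproducer scores as the good/bad oracle
(a sketch):

git bisect start -- kernel/sched
git bisect bad 6ecdd74962f246dfe8750b7bea481a1c0816315d
git bisect good v4.6
# at every step: build, boot, run the SPECjvm reproducer, then mark it with
git bisect good    # or: git bisect bad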

On Fri, Jun 17, 2016 at 1:04 AM, Jirka Hladky <jhla...@redhat.com> wrote:
>> > we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
>> Blergh, of course I don't have those.. :/
>
> SPECjvm2008 is publicly available.
> https://www.spec.org/download.html
>
> We will prepare a reproducer and attach it to the BZ.
>
>> What kind of config and userspace setup? Do you run this cruft in a
>> cgroup of sorts?
>
>  No, we don't do any special setup except to control the number of threads.
>
> Thanks for the hints which commits are most likely the root cause for
> this. We will try to find the commit which has caused it.
>
> Jirka
>
>
>
> On Thu, Jun 16, 2016 at 7:22 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>> On Thu, Jun 16, 2016 at 06:38:50PM +0200, Jirka Hladky wrote:
>>> Hello,
>>>
>>> we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
>>
>> Blergh, of course I don't have those.. :/
>>
>>> benchmarks starting from 4.7.0-0.rc0 kernel compared to 4.6 kernel.
>>>
>>> We have tested kernels 4.7.0-0.rc1 and 4.7.0-0.rc3 and these are as
>>> well affected.
>>>
>>> We have observed the drop on variety of different x86_64 servers with
>>> different configuration (different CPU models, RAM sizes, both with
>>> Hyper Threading ON and OFF, different NUMA configurations (2 and 4
>>> NUMA nodes)
>>
>> What kind of config and userspace setup? Do you run this cruft in a
>> cgroup of sorts?

Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-16 Thread Jirka Hladky
> > we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
> Blergh, of course I don't have those.. :/

SPECjvm2008 is publicly available.
https://www.spec.org/download.html

We will prepare a reproducer and attach it to the BZ.

> What kind of config and userspace setup? Do you run this cruft in a
> cgroup of sorts?

 No, we don't do any special setup except to control the number of threads.

Thanks for the hints which commits are most likely the root cause for
this. We will try to find the commit which has caused it.

Jirka



On Thu, Jun 16, 2016 at 7:22 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Jun 16, 2016 at 06:38:50PM +0200, Jirka Hladky wrote:
>> Hello,
>>
>> we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
>
> Blergh, of course I don't have those.. :/
>
>> benchmarks starting from 4.7.0-0.rc0 kernel compared to 4.6 kernel.
>>
>> We have tested kernels 4.7.0-0.rc1 and 4.7.0-0.rc3 and these are as
>> well affected.
>>
>> We have observed the drop on variety of different x86_64 servers with
>> different configuration (different CPU models, RAM sizes, both with
>> Hyper Threading ON and OFF, different NUMA configurations (2 and 4
>> NUMA nodes)
>
> What kind of config and userspace setup? Do you run this cruft in a
> cgroup of sorts?
>
> If so, does it change anything if you run it in the root cgroup?
>
>> Linpack and Stream benchmarks do not show any performance drop.
>>
>> The performance drop increases with higher number of threads. The
>> maximum number of threads in each benchmark is the same as number of
>> CPUs.
>>
>> We have opened a BZ to track the progress:
>> https://bugzilla.kernel.org/show_bug.cgi?id=120481
>>
>> You can find more details along with graphs and tables there.
>>
>> Do you have any hints which commit should we try to reverse?
>
> There were only 66 commits or so, and I think we can rule out the
> hotplug changes, which should reduce it even further.
>
> You could see what the parent of this one does:
>
>   2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels
>
> If not that, maybe the parent of:
>
>   c58d25f371f5 sched/fair: Move record_wakee()
>
> After that I suppose you'll have to go bisect.
>


Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

2016-06-16 Thread Jirka Hladky
Hello,

we see a performance drop of 30-40% for the SPECjbb2005 and SPECjvm2008
benchmarks starting from the 4.7.0-0.rc0 kernel compared to the 4.6 kernel.

We have tested kernels 4.7.0-0.rc1 and 4.7.0-0.rc3 and these are
affected as well.

We have observed the drop on a variety of different x86_64 servers with
different configurations (different CPU models, RAM sizes, both with
Hyper-Threading ON and OFF, different NUMA configurations (2 and 4
NUMA nodes)).

Linpack and Stream benchmarks do not show any performance drop.

The performance drop increases with a higher number of threads. The
maximum number of threads in each benchmark is the same as the number of
CPUs.

We have opened a BZ to track the progress:
https://bugzilla.kernel.org/show_bug.cgi?id=120481

You can find more details along with graphs and tables there.

Do you have any hints which commit we should try to revert?

Thanks a lot!
Jirka


Re: sched : performance regression 24% between 4.4rc4 and 4.3 kernel

2015-12-17 Thread Jirka Hladky
Hi Peter,

I'm not sure how to do the bisecting and avoid landing at:

[2a595721a1fa6b684c1c818f379bef834ac3d65e] sched/numa: Convert
sched_numa_balancing to a static_branch

I have redone the bisecting but I have landed again at this commit.
Can you please help me to identify the commit which contains the fix for
2a595721a1fa6b684c1c818f379bef834ac3d65e? I think I will need to
start the bisecting from there.
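
One way to avoid landing there again might be to carry Peter's later fix
(commit b52da86e, quoted below) along while testing each bisect step,
roughly like this (a sketch):

git cherry-pick -n b52da86e0ad58f096710977fcda856fd84da9233
# build, boot and run the 4 parallel stream benchmarks on this kernel
git reset --hard    # drop the temporary fix again before marking the step
git bisect good     # or: git bisect bad, based on the runtimes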

Thanks
Jirka

>
>
> On Wed, Dec 16, 2015 at 6:04 PM, Jirka Hladky  wrote:
>>
>> Hi Peter,
>>
>> you are right the kernel  4.4-rc4 has it already fixed. It seems I
>> will need to redo the bisecting once again, starting with
>> 2a595721a1fa6b684c1c818f379bef834ac3d65e
>>
>> git bisect start -- kernel/sched
>> git bisect bad v4.4-rc4
>> git bisect good 2b49d84b259fc18e131026e5d38e7855352f71b9
>> Bisecting: 32 revisions left to test after this (roughly 5 steps)
>> [da7142e2ed735e1c1bef5a757dc55de35c65fbd6] sched/core: Simplify
>> preempt_count tests
>>
>> I will let you know the outcome.
>>
>> Jirka
>>
>>
>> On Wed, Dec 16, 2015 at 2:50 PM, Peter Zijlstra  wrote:
>> > On Wed, Dec 16, 2015 at 01:56:17PM +0100, Jirka Hladky wrote:
>> >> Hi Rik,
>> >>
>> >> I have redone the bisecting and have new results:
>> >>
>> >> # first bad commit: [2a595721a1fa6b684c1c818f379bef834ac3d65e]
>> >> sched/numa: Convert sched_numa_balancing to a static_branch
>> >>
>> >> Could you please have a look what went wrong?
>> >
>> > The below is obviously wrong, but your kernel should have that patch.
>> >
>> > So if you revert this patch (ie. go back to the regular variable) it
>> > works again?
>> >
>> > ---
>> >
>> > commit b52da86e0ad58f096710977fcda856fd84da9233
>> > Author: Srikar Dronamraju 
>> > Date:   Fri Oct 2 07:48:25 2015 +0530
>> >
>> > sched/numa: Fix task_tick_fair() from disabling numa_balancing
>> >
>> > If static branch 'sched_numa_balancing' is enabled, it should kickstart
>> > NUMA balancing through task_tick_numa(). However the following commit:
>> >
>> >   2a595721a1fa ("sched/numa: Convert sched_numa_balancing to a 
>> > static_branch")
>> >
>> > erroneously disables this.
>> >
>> > Fix this anomaly by enabling task_tick_numa() when the static branch
>> > 'sched_numa_balancing' is enabled.
>> >
>> > Signed-off-by: Srikar Dronamraju 
>> > Signed-off-by: Peter Zijlstra (Intel) 
>> > Cc: Linus Torvalds 
>> > Cc: Mel Gorman 
>> > Cc: Mike Galbraith 
>> > Cc: Peter Zijlstra 
>> > Cc: Rik van Riel 
>> > Cc: Thomas Gleixner 
>> > Cc: linux-kernel@vger.kernel.org
>> > Link: 
>> > http://lkml.kernel.org/r/1443752305-27413-1-git-send-email-sri...@linux.vnet.ibm.com
>> > Signed-off-by: Ingo Molnar 
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 4df37a48f499..3bdc3da7bc6a 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -7881,7 +7881,7 @@ static void task_tick_fair(struct rq *rq, struct 
>> > task_struct *curr, int queued)
>> > entity_tick(cfs_rq, se, queued);
>> > }
>> >
>> > -   if (!static_branch_unlikely(_numa_balancing))
>> > +   if (static_branch_unlikely(_numa_balancing))
>> > task_tick_numa(rq, curr);
>> >  }
>> >
>
>


Re: sched : performance regression 24% between 4.4rc4 and 4.3 kernel

2015-12-16 Thread Jirka Hladky
Hi Peter,

you are right, the 4.4-rc4 kernel already has it fixed. It seems I
will need to redo the bisecting once again, starting with
2a595721a1fa6b684c1c818f379bef834ac3d65e
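
A quick way to double-check that the fix is indeed already contained in the
tree under test (a sketch):

git merge-base --is-ancestor b52da86e0ad58f096710977fcda856fd84da9233 v4.4-rc4 \
    && echo "b52da86e is contained in v4.4-rc4"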

git bisect start -- kernel/sched
git bisect bad v4.4-rc4
git bisect good 2b49d84b259fc18e131026e5d38e7855352f71b9
Bisecting: 32 revisions left to test after this (roughly 5 steps)
[da7142e2ed735e1c1bef5a757dc55de35c65fbd6] sched/core: Simplify
preempt_count tests

I will let you know the outcome.

Jirka


On Wed, Dec 16, 2015 at 2:50 PM, Peter Zijlstra  wrote:
> On Wed, Dec 16, 2015 at 01:56:17PM +0100, Jirka Hladky wrote:
>> Hi Rik,
>>
>> I have redone the bisecting and have new results:
>>
>> # first bad commit: [2a595721a1fa6b684c1c818f379bef834ac3d65e]
>> sched/numa: Convert sched_numa_balancing to a static_branch
>>
>> Could you please have a look what went wrong?
>
> The below is obviously wrong, but your kernel should have that patch.
>
> So if you revert this patch (ie. go back to the regular variable) it
> works again?
>
> ---
>
> commit b52da86e0ad58f096710977fcda856fd84da9233
> Author: Srikar Dronamraju 
> Date:   Fri Oct 2 07:48:25 2015 +0530
>
> sched/numa: Fix task_tick_fair() from disabling numa_balancing
>
> If static branch 'sched_numa_balancing' is enabled, it should kickstart
> NUMA balancing through task_tick_numa(). However the following commit:
>
>   2a595721a1fa ("sched/numa: Convert sched_numa_balancing to a 
> static_branch")
>
> erroneously disables this.
>
> Fix this anomaly by enabling task_tick_numa() when the static branch
> 'sched_numa_balancing' is enabled.
>
> Signed-off-by: Srikar Dronamraju 
> Signed-off-by: Peter Zijlstra (Intel) 
> Cc: Linus Torvalds 
> Cc: Mel Gorman 
> Cc: Mike Galbraith 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: linux-kernel@vger.kernel.org
> Link: 
> http://lkml.kernel.org/r/1443752305-27413-1-git-send-email-sri...@linux.vnet.ibm.com
> Signed-off-by: Ingo Molnar 
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a48f499..3bdc3da7bc6a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7881,7 +7881,7 @@ static void task_tick_fair(struct rq *rq, struct 
> task_struct *curr, int queued)
> entity_tick(cfs_rq, se, queued);
> }
>
> -   if (!static_branch_unlikely(&sched_numa_balancing))
> +   if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>  }
>


Re: sched : performance regression 24% between 4.4rc4 and 4.3 kernel

2015-12-16 Thread Jirka Hladky
Hi Rik,

I have redone the bisecting and have new results:

# first bad commit: [2a595721a1fa6b684c1c818f379bef834ac3d65e]
sched/numa: Convert sched_numa_balancing to a static_branch

Could you please have a look what went wrong?

Thanks a lot!
Jirka

git bisect start '--' 'kernel/sched'
# good: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
git bisect good 6a13feb9c82803e2b815eca72fa7a9f5561d7861
# bad: [527e9316f8ec44bd53d90fb9f611fa752bb9] Linux 4.4-rc4
git bisect bad 527e9316f8ec44bd53d90fb9f611fa752bb9
# bad: [b99def8b961448f5b9a550dddeeb718e3975e7a6] sched/core: Rework
TASK_DEAD preemption exception
git bisect bad b99def8b961448f5b9a550dddeeb718e3975e7a6
# skip: [8cd5601c50603caa195ce86cc465cb04079ed488] sched/fair: Convert
arch_scale_cpu_capacity() from weak function to #define
git bisect skip 8cd5601c50603caa195ce86cc465cb04079ed488
# bad: [fe19159225d8516f3f57a5fe8f735c01684f0ddd] Merge branch
'sched/urgent' into sched/core, to pick up fixes before applying new
changes
git bisect bad fe19159225d8516f3f57a5fe8f735c01684f0ddd
# good: [78a9c54649ea220065aad9902460a1d137c7eafd] sched/numa: Rename
numabalancing_enabled to sched_numa_balancing
git bisect good 78a9c54649ea220065aad9902460a1d137c7eafd
# bad: [54a21385facbdcd89a78e8c3e5025f04c5f2b59c] sched/fair: Rename
scale() to cap_scale()
git bisect bad 54a21385facbdcd89a78e8c3e5025f04c5f2b59c
# bad: [9e91d61d9b0ca8d865dbd59af8d0d5c5b68003e9] sched/fair: Name
utilization related data and functions consistently
git bisect bad 9e91d61d9b0ca8d865dbd59af8d0d5c5b68003e9
# bad: [2a595721a1fa6b684c1c818f379bef834ac3d65e] sched/numa: Convert
sched_numa_balancing to a static_branch
git bisect bad 2a595721a1fa6b684c1c818f379bef834ac3d65e
# good: [2b49d84b259fc18e131026e5d38e7855352f71b9] sched/numa: Remove
the NUMA sched_feature
git bisect good 2b49d84b259fc18e131026e5d38e7855352f71b9
# first bad commit: [2a595721a1fa6b684c1c818f379bef834ac3d65e]
sched/numa: Convert sched_numa_balancing to a static_branch

On Tue, Dec 15, 2015 at 9:49 AM, Jirka Hladky  wrote:
> Hi Rik,
>
> I have reviewed the data and you are right. The trouble is that even
> with 4.3 kernel there is 20% change that results will be bad. I have
> repeated tests 100 times on 4.3 kernel over the night. In 20 cases I
> see that runtime went up from 12 seconds to 28 seconds due to the
> wrong NUMA placement. I will try to replay the bisect once again.
>
> Jirka
>
> On Tue, Dec 15, 2015 at 3:12 AM, Rik van Riel  wrote:
>> On 12/14/2015 06:52 PM, Jirka Hladky wrote:
>>> Hi all,
>>>
>>> I have the results of bisecting:
>>>
>>> first bad commit: [973759c80db96ed4b4c5cb85ac7d48107f801371] Merge tag
>>> 'v4.3-rc1' into sched/core, to refresh the branch
>>>
>>> Could you please have a look at this commit why it has caused the
>>> performance regression when running 4 stream benchmarks in parallel on 4
>>> NUMA node server?
>>
>> That is a merge commit. It contains no actual code changes.
>>
>>> Please let me know if you need additional data. git bisect log is bellow.
>>
>> It looks like "git bisect" may have led you astray.
>>
>> I am not sure what debugging tool to use to figure out which
>> of the patches from some merged-in branch caused the issue,
>> but hopefully one of the people reading this email know a trick.
>>
>> --
>> All rights reversed


Re: sched : performance regression 24% between 4.4rc4 and 4.3 kernel

2015-12-15 Thread Jirka Hladky
Hi Rik,

I have reviewed the data and you are right. The trouble is that even
with the 4.3 kernel there is a 20% chance that results will be bad. I have
repeated tests 100 times on 4.3 kernel over the night. In 20 cases I
see that runtime went up from 12 seconds to 28 seconds due to the
wrong NUMA placement. I will try to replay the bisect once again.
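
For reference, the overnight repetition boils down to a loop of roughly this
shape (a sketch; run-4x-stream.sh stands for a hypothetical wrapper that
starts the 4 stream processes and prints the total runtime in seconds):

for i in $(seq 1 100); do
    ./run-4x-stream.sh >> runtimes.txt
done
sort -n runtimes.txt | tail    # the ~28 s outliers are the badly placed runs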

Jirka

On Tue, Dec 15, 2015 at 3:12 AM, Rik van Riel  wrote:
> On 12/14/2015 06:52 PM, Jirka Hladky wrote:
>> Hi all,
>>
>> I have the results of bisecting:
>>
>> first bad commit: [973759c80db96ed4b4c5cb85ac7d48107f801371] Merge tag
>> 'v4.3-rc1' into sched/core, to refresh the branch
>>
>> Could you please have a look at this commit why it has caused the
>> performance regression when running 4 stream benchmarks in parallel on 4
>> NUMA node server?
>
> That is a merge commit. It contains no actual code changes.
>
>> Please let me know if you need additional data. git bisect log is bellow.
>
> It looks like "git bisect" may have led you astray.
>
> I am not sure what debugging tool to use to figure out which
> of the patches from some merged-in branch caused the issue,
> but hopefully one of the people reading this email know a trick.
>
> --
> All rights reversed


Re: sched : performance regression 24% between 4.4rc4 and 4.3 kernel

2015-12-14 Thread Jirka Hladky
Hi all,

I have the results of bisecting:

first bad commit: [973759c80db96ed4b4c5cb85ac7d48107f801371] Merge tag
'v4.3-rc1' into sched/core, to refresh the branch

Could you please have a look at why this commit has caused the
performance regression when running 4 stream benchmarks in parallel on a
4 NUMA node server?

Please let me know if you need additional data. The git bisect log is below.

Thanks a lot!
Jirka



$ git bisect log
git bisect start '--' 'kernel/sched'
# good: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
git bisect good 6a13feb9c82803e2b815eca72fa7a9f5561d7861
# bad: [527e9316f8ec44bd53d90fb9f611fa752bb9] Linux 4.4-rc4
git bisect bad 527e9316f8ec44bd53d90fb9f611fa752bb9
# bad: [b99def8b961448f5b9a550dddeeb718e3975e7a6] sched/core: Rework
TASK_DEAD preemption exception
git bisect bad b99def8b961448f5b9a550dddeeb718e3975e7a6
# skip: [8cd5601c50603caa195ce86cc465cb04079ed488] sched/fair: Convert
arch_scale_cpu_capacity() from weak function to #define
git bisect skip 8cd5601c50603caa195ce86cc465cb04079ed488
# bad: [fe19159225d8516f3f57a5fe8f735c01684f0ddd] Merge branch
'sched/urgent' into sched/core, to pick up fixes before applying new
changes
git bisect bad fe19159225d8516f3f57a5fe8f735c01684f0ddd
# bad: [78a9c54649ea220065aad9902460a1d137c7eafd] sched/numa: Rename
numabalancing_enabled to sched_numa_balancing
git bisect bad 78a9c54649ea220065aad9902460a1d137c7eafd
# bad: [6efdb105d392da3ad5cb4ef951aed373cd049813] sched/fair: Fix
switched_to_fair()'s per entity load tracking
git bisect bad 6efdb105d392da3ad5cb4ef951aed373cd049813
# bad: [50a2a3b246149d041065a67ccb3e98145f780a2f] sched/fair: Have
task_move_group_fair() unconditionally add the entity load to the
runqueue
git bisect bad 50a2a3b246149d041065a67ccb3e98145f780a2f
# bad: [973759c80db96ed4b4c5cb85ac7d48107f801371] Merge tag 'v4.3-rc1'
into sched/core, to refresh the branch
git bisect bad 973759c80db96ed4b4c5cb85ac7d48107f801371
# first bad commit: [973759c80db96ed4b4c5cb85ac7d48107f801371] Merge
tag 'v4.3-rc1' into sched/core, to refresh the branch

On Sat, Dec 12, 2015 at 3:37 PM, Mike Galbraith
 wrote:
> On Sat, 2015-12-12 at 15:16 +0100, Jirka Hladky wrote:
>> > A bisection doesn't require any special skills, but may give busy
>> > maintainers a single change to eyeball vs the entire lot.
>>
>> They have been couple of merges which makes git revert difficult...
>
> You could try https://git-scm.com/docs/git-bisect instead.
>
> -Mike


Re: sched : performance regression 24% between 4.4rc4 and 4.3 kernel

2015-12-12 Thread Jirka Hladky
> A bisection doesn't require any special skills, but may give busy
> maintainers a single change to eyeball vs the entire lot.

There have been a couple of merges which make a git revert difficult, but I
will try to produce patch files for the kernel/sched directory only with

git diff v4.4-rc4..fe19159 -- sched

and let you know the outcome.



On Sat, Dec 12, 2015 at 8:04 AM, Mike Galbraith
<umgwanakikb...@gmail.com> wrote:
> (it's always a good idea to CC subsystem maintainers when reporting)
>
> On Fri, 2015-12-11 at 15:17 +0100, Jirka Hladky wrote:
>> Hello,
>>
>> we are doing performance testing of the new kernel scheduler (commit
>> 53528695ff6d8b77011bc818407c13e30914a946). In most cases we see
>> performance improvements compared to 4.3 kernel with the exception of
>> stream benchmark when running on 4 NUMA node server.
>>
>> When we run 4 stream benchmark processes on 4 NUMA node server and we
>> compare the total performance we see drop about 24% compared to 4.3
>> kernel. This is caused by the fact that 2 stream benchmarks are
>> running on the same NUMA node while 1 NUMA node does not run any
>> stream benchmark. With kernel 4.3, load is distributed evenly among
>> all 4 NUMA nodes. When two stream benchmarks are running on the same
>> NUMA node then the runtime is almost twice as long compared to one
>> stream bench running on one NUMA node. See log files [1] bellow.
>>
>> Please see the graph comparing stream benchmark results between
>> kernel
>> 4.3 and 4.4rc4 (for legend see [2] bellow).
>> https://jhladky.fedorapeople.org/sched_stream_kernel_4.3vs4.4rc4/Stre
>> am_benchmark_on_4_NUMA_node_server_4.3vs4.4rc4_kernel.png
>>
>> Could you please help us to identify the root cause of this
>> regression? We don't have the skills to fix the problem ourselves but
>> we will be more than happy to test any proposed patch for this issue.
>
> A bisection doesn't require any special skills, but may give busy
> maintainers a single change to eyeball vs the entire lot.
>
> -Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


sched : performance regression 24% between 4.4rc4 and 4.3 kernel

2015-12-11 Thread Jirka Hladky
Hello,

we are doing performance testing of the new kernel scheduler (commit
53528695ff6d8b77011bc818407c13e30914a946). In most cases we see
performance improvements compared to the 4.3 kernel, with the exception
of the stream benchmark when running on a 4 NUMA node server.

When we run 4 stream benchmark processes on a 4 NUMA node server and
compare the total performance, we see a drop of about 24% compared to
the 4.3 kernel. This is caused by the fact that 2 stream benchmarks are
running on the same NUMA node while 1 NUMA node does not run any stream
benchmark. With kernel 4.3, the load is distributed evenly among all 4
NUMA nodes. When two stream benchmarks run on the same NUMA node, the
runtime is almost twice as long compared to one stream benchmark running
alone on one NUMA node. See the log files [1] below.

Please see the graph comparing the stream benchmark results between
kernel 4.3 and 4.4rc4 (for the legend see [2] below).
https://jhladky.fedorapeople.org/sched_stream_kernel_4.3vs4.4rc4/Stream_benchmark_on_4_NUMA_node_server_4.3vs4.4rc4_kernel.png

Could you please help us to identify the root cause of this
regression? We don't have the skills to fix the problem ourselves but
we will be more than happy to test any proposed patch for this issue.

Thanks a lot for your help on that!
Jirka

Further details:

[1] Log files can be downloaded here:
https://jhladky.fedorapeople.org/sched_stream_kernel_4.3vs4.4rc4/4.4RC4_stream_log_files.tar.bz2

$grep "User time" *log
stream.defaultRun.004streams.loop01.instance001.log:User time:  12.370 seconds
stream.defaultRun.004streams.loop01.instance002.log:User time:  10.560 seconds
stream.defaultRun.004streams.loop01.instance003.log:User time:  19.330 seconds
stream.defaultRun.004streams.loop01.instance004.log:User time:  17.820 seconds


$grep "NUMA nodes:" *log
stream.defaultRun.004streams.loop01.instance001.log:NUMA nodes: 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2
stream.defaultRun.004streams.loop01.instance002.log:NUMA nodes: 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
stream.defaultRun.004streams.loop01.instance003.log:NUMA nodes: 3
3 3 3 3 3 3 3 3 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3
stream.defaultRun.004streams.loop01.instance004.log:NUMA nodes: 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 0 0 0 0 0 0 0 0 0 0 0 0

=> please note that NO benchmark is running on NUMA node #1 and that
instances #3 and #4 are both running on NUMA node #3. This has a huge
performance impact, as the stream instances on node #3 need 19 and 17
seconds to finish, compared to 10 and 12 seconds for the instances
running alone on one NUMA node.
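
As an illustration (not from the original logs), the per-node residency of
one instance can be summarized from its "NUMA nodes:" sample list with a
few lines of C; the node IDs below are a hypothetical, shortened series:

#include <stdio.h>

int main(void)
{
	/* Hypothetical, shortened series of per-sample node IDs for one
	 * stream instance; the real lists above have one entry per
	 * periodic sample. */
	int samples[] = { 3, 3, 3, 3, 0, 0, 3, 3, 3, 3 };
	int count[4] = { 0 };	/* one counter per NUMA node */
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		count[samples[i]]++;

	for (i = 0; i < 4; i++)
		printf("node %u: %d samples\n", i, count[i]);
	return 0;
}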

[2] Graph:
https://jhladky.fedorapeople.org/sched_stream_kernel_4.3vs4.4rc4/Stream_benchmark_on_4_NUMA_node_server_4.3vs4.4rc4_kernel.png

Graph Legend:
GREEN line => kernel 4.3
BLUE line  => kernel 4.4rc4
x-axis  => number of parallel stream instances
y-axis  => Sum [1/runtime] over all stream instances
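
For reference (not part of the original legend), the y-axis value can be
reproduced from the per-instance times; a minimal C sketch, assuming the
four "User time" values quoted in [1] are used as the runtimes:

#include <stdio.h>

int main(void)
{
	/* Assumed per-instance runtimes in seconds, taken from the
	 * "User time" lines quoted in [1]. */
	double runtime[] = { 12.370, 10.560, 19.330, 17.820 };
	double sum = 0.0;
	int i;

	for (i = 0; i < 4; i++)
		sum += 1.0 / runtime[i];

	printf("Sum [1/runtime] = %.4f 1/s\n", sum);
	return 0;
}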


Details on server: DELL PowerEdge R820, 4x E5-4607 0 @ 2.20GHz and 128GB RAM
http://ark.intel.com/products/64604
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] numa,sched: only consider less busy nodes as numa balancing destination

2015-05-11 Thread Jirka Hladky

Hi Rik,

we have results for SPECjbb2005 and Linpack benchmarks with

4.1.0-0.rc1.git0.1.el7.x86_64 (without patch)
4.1.0-0.rc2.git0.3.el7.x86_64 with your patch
4.1.0-0.rc2.git0.3.el7.x86_64 with your patch and AUTONUMA disabled

The tests have been conducted on 3 different systems with 4 NUMA nodes
and different versions of Intel processors and different amounts of RAM.



For the SPECjbb benchmark we see
- with your latest proposed patch applied
  * gains in the range of 7-15%!! for single-instance SPECjbb (tested on a
variety of systems, biggest gains on the brickland system, gains growing
with the number of threads)
  * for the multi-instance SPECjbb run (4 parallel jobs on a 4 NUMA node
system) no change in results
  * for linpack no change
  * for the stream benchmark slight improvements (but very close to the
error margin)
- with AUTONUMA disabled
  * with SPECjbb (both single and 4 parallel jobs) performance drops to
1/2 of the performance with AUTONUMA enabled
  * for linpack and stream performance drops by 30% compared with
AUTONUMA enabled


In summary:
* the proposed patch improves performance for the single-process SPECjbb
benchmark without hurting anything

* with AUTONUMA disabled, the performance drop is huge

Please let me know if you need more details.

Thanks
Jirka

On 05/06/2015 05:41 PM, Rik van Riel wrote:

On Wed, 06 May 2015 13:35:30 +0300
Artem Bityutskiy <dedeki...@gmail.com> wrote:


we observe a tremendous regression between kernel version 3.16 and 3.17
(and up), and I've bisected it to this commit:

a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node

Artem, Jirka, does this patch fix (or at least improve) the issues you
have been seeing?  Does it introduce any new regressions?

Peter, Mel, I think it may be time to stop waiting for the impedance
mismatch between the load balancer and NUMA balancing to be resolved,
and try to just avoid the issue in the NUMA balancing code...

8<

Subject: numa,sched: only consider less busy nodes as numa balancing destination

Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
preferred node") fixes an issue where workloads would never converge
on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly staying
spread out across the whole system. This leads to a reduction in available
memory bandwidth, and usable CPU cache, with predictable performance problems.

The root cause appears to be an interaction between the load balancer and
NUMA balancing, where the short term load represented by the load balancer
differs from the long term load the NUMA balancing code would like to base
its decisions on.

Simply reverting a43455a1 would re-introduce the non-convergence of
workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the
load balancer and NUMA balancing by only ever considering a lesser
loaded node as a destination for NUMA balancing, regardless of
whether the task is trying to move to the preferred node, or to another
node.

Signed-off-by: Rik van Riel <r...@redhat.com>
Reported-by: Artem Bityutski <dedeki...@gmail.com>
Reported-by: Jirka Hladky <jhla...@redhat.com>
---
  kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
  1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffeaa4105e48..480e6a35ab35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
}
  }
  
+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+   struct numa_stats *src = &env->src_stats;
+   struct numa_stats *dst = &env->dst_stats;
+
+   if (src->has_free_capacity && !dst->has_free_capacity)
+           return false;
+
+   /*
+    * Only consider a task move if the source has a higher load
+    * than the destination, corrected for CPU capacity on each node.
+    *
+    *      src->load                dst->load
+    * --------------------- vs ---------------------
+    * src->compute_capacity    dst->compute_capacity
+    */
+   if (src->load * dst->compute_capacity >
+       dst->load * src->compute_capacity)
+           return true;
+
+   return false;
+}
+
  static int task_numa_migrate(struct task_struct *p)
  {
struct task_numa_env env = {
@@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p)
  update_numa_stats(&env.dst_stats, env.dst_nid);
  
  	/* Try to find a spot on the preferred nid. */

-   task_numa_find_cpu(&env, taskimp, groupimp);
+  
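
The archived copy of the patch is cut off at this point. Judging from the
description above, the missing hunk most likely uses the new helper to skip
searching nodes that are busier than the source; a rough sketch of that
shape (an assumption, not the verbatim remainder of the diff):

   /* Inside task_numa_migrate(): only look for a destination CPU on
    * the candidate node when it is less busy than the source node
    * (sketch only -- the exact hunk is truncated in this archive). */
   if (numa_has_capacity(&env))
           task_numa_find_cpu(&env, taskimp, groupimp);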


Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

2014-08-01 Thread Jirka Hladky

On 08/02/2014 06:17 AM, Rik van Riel wrote:


On 08/01/2014 05:30 PM, Jirka Hladky wrote:


I see the regression only on this box. It has 4 "Ivy Bridge-EX"
Xeon E7-4890 v2 CPUs.

http://ark.intel.com/products/75251
http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Ivy_Bridge-EX.22_.2822_nm.29_Expandable_2



Please rerun the test on box with Ivy Bridge CPUs. It seems that
older CPU generations are not affected.

That would have been good info to know :)

I've been spending about a month trying to reproduce your issue on a
Westmere E7-4860.

Good thing I found all kinds of other scheduler issues along the way...


Hi Rik,

Until recently I saw the regression on all systems.

With the latest kernel, only the Ivy Bridge system seems to be affected.

Jirka

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

2014-08-01 Thread Jirka Hladky

On 08/01/2014 10:46 PM, Davidlohr Bueso wrote:

On Thu, 2014-07-31 at 18:16 +0200, Jirka Hladky wrote:

Peter, I'm seeing regressions for

SINGLE SPECjbb instance for number of warehouses being the same as total
number of cores in the box.

Example: 4 NUMA node box, each CPU has 6 cores => biggest regression is
for 24 warehouses.

By looking at your graph, that's around a 10% difference.

So I'm not seeing anywhere near as bad a regression on a 80-core box.
Testing single with 80 warehouses, I get:

tip/master baseline:
677476.36 bops
705826.70 bops
704870.87 bops
681741.20 bops
707014.59 bops

Avg: 695385.94 bops

tip/master + patch (NUMA_SCALE/8 variant):
698242.66 bops
693873.18 bops
707852.28 bops
691785.96 bops
747206.03 bops

Avg: 707792.022 bops

So both these are pretty similar, however, when reverting, on avg we
increase the amount of bops a mere ~4%:

tip/master + reverted:
778416.02 bops
702602.62 bops
712557.32 bops
713982.90 bops
783300.36 bops

Avg: 738171.84 bops

Are there perhaps any special specjbb options you are using?



I see the regression only on this box. It has 4 "Ivy Bridge-EX" Xeon 
E7-4890 v2 CPUs.


http://ark.intel.com/products/75251
http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#.22Ivy_Bridge-EX.22_.2822_nm.29_Expandable_2

Please rerun the test on a box with Ivy Bridge CPUs. It seems that older
CPU generations are not affected.


Thanks
Jirka


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

2014-07-31 Thread Jirka Hladky

On 07/31/2014 06:27 PM, Peter Zijlstra wrote:

On Thu, Jul 31, 2014 at 06:16:26PM +0200, Jirka Hladky wrote:

On 07/31/2014 05:57 PM, Peter Zijlstra wrote:

On Thu, Jul 31, 2014 at 12:42:41PM +0200, Peter Zijlstra wrote:

On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:

On Tue, 29 Jul 2014 13:24:05 +0800
Aaron Lu <aaron...@intel.com> wrote:


FYI, we noticed the below changes on

git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure 
task_numa_migrate() checks the preferred node")

ebe06187bf2aec1  a43455a1d572daf7b730fe12e
---------------  -------------------------
  94500 ~ 3%     +115.6%     203711 ~ 6%  ivb42/hackbench/50%-threads-pipe
  67745 ~ 4%      +64.1%         74 ~ 5%  lkp-snb01/hackbench/50%-threads-socket
 162245 ~ 3%      +94.1%     314885 ~ 6%  TOTAL proc-vmstat.numa_hint_faults_local

Hi Aaron,

Jirka Hladky has reported a regression with that changeset as
well, and I have already spent some time debugging the issue.

Let me see if I can still find my SPECjbb2005 copy to see what that
does.

Jirka, what kind of setup were you seeing SPECjbb regressions?

I'm not seeing any on 2 sockets with a single SPECjbb instance, I'll go
check one instance per socket now.



Peter, I'm seeing regressions for

SINGLE SPECjbb instance for number of warehouses being the same as total
number of cores in the box.

Example: 4 NUMA node box, each CPU has 6 cores => biggest regression is for
24 warehouses.

IVB-EP: 2 node, 10 cores, 2 thread per core:

tip/master+origin/master:

  Warehouses   Thrput
   4   196781
   8   358064
  12   511318
  16   589251
  20   656123
  24   710789
  28   765426
  32   787059
  36   777899
* 40   748568
 
Throughput  18258


  Warehouses   Thrput
   4   201598
   8   363470
  12   512968
  16   584289
  20   605299
  24   720142
  28   776066
  32   791263
  36   776965
* 40   760572
 
Throughput  18551



tip/master+origin/master-a43455a1d57

SPEC scores
  Warehouses   Thrput
   4   198667
   8   362481
  12   503344
  16   582602
  20   647688
  24   731639
  28   786135
  32   794124
  36   774567
* 40   757559
 
Throughput  18477



Given that there's fairly large variance between the two runs with the
commit in, I'm not sure I can say there's a problem here.

The one run without the patch is more or less between the two runs with
the patch.

And doing this many runs takes ages, so I'm not tempted to either make
the runs longer or do more of them.

Lemme try on a 4 node box though, who knows.


IVB-EP: 2 node, 10 cores, 2 threads per core
=> on such a system, I run only 20 warehouses as a maximum (number of
nodes * number of PHYSICAL cores).


The kernels you have tested show the following results:
656123/605299/647688


I'm doing 3 iterations (3 runs) to get some statistics. To speed up the
test significantly, please do the run with 20 warehouses only
(or in general with #warehouses == number of nodes * number of PHYSICAL
cores).


Jirka
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

2014-07-31 Thread Jirka Hladky

On 07/31/2014 05:57 PM, Peter Zijlstra wrote:

On Thu, Jul 31, 2014 at 12:42:41PM +0200, Peter Zijlstra wrote:

On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:

On Tue, 29 Jul 2014 13:24:05 +0800
Aaron Lu <aaron...@intel.com> wrote:


FYI, we noticed the below changes on

git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure 
task_numa_migrate() checks the preferred node")

ebe06187bf2aec1  a43455a1d572daf7b730fe12e
---------------  -------------------------
  94500 ~ 3%     +115.6%     203711 ~ 6%  ivb42/hackbench/50%-threads-pipe
  67745 ~ 4%      +64.1%         74 ~ 5%  lkp-snb01/hackbench/50%-threads-socket
 162245 ~ 3%      +94.1%     314885 ~ 6%  TOTAL proc-vmstat.numa_hint_faults_local

Hi Aaron,

Jirka Hladky has reported a regression with that changeset as
well, and I have already spent some time debugging the issue.

Let me see if I can still find my SPECjbb2005 copy to see what that
does.

Jirka, what kind of setup were you seeing SPECjbb regressions?

I'm not seeing any on 2 sockets with a single SPECjbb instance, I'll go
check one instance per socket now.



Peter, I'm seeing regressions for a SINGLE SPECjbb instance when the number
of warehouses is the same as the total number of cores in the box.


Example: 4 NUMA node box, each CPU has 6 cores => biggest regression is 
for 24 warehouses.


See the attached snapshot.

Jirka

