Re: Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-07-28 Thread Jirka Hladky
Hi Peter,

I have an update regarding the performance degradation after disabling
and re-enabling a core.

It turns out that lu.C.x results show quite a big variation, so the tests
have to be repeated several times and the mean value of the real time has
to be used to get reliable results.

There is NO regression on the following CPUs:

4x Xeon(R) CPU E5-4610 v2 @ 2.30GHz
4x Xeon(R) CPU E5-2690 v3 @ 2.60GHz

but there is a regression (slowdown by a factor of about 6) on

AMD Opteron(TM) Processor 6272

Kernel 4.7.0-0.rc7.git0.1.el7.x86_64

real_time to run ./lu.C.x benchmark (mean value out of 10 runs)

Right after boot: 273 seconds
After disabling and enabling a core: 1702 seconds!
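
For readers who want to reduce several runs to one number the same way, here is a minimal sketch. The function names `mean` and `slowdown` are made up for illustration and are not part of the attached script; only the two mean values at the end come from the measurements above.

```shell
#!/bin/bash
# mean: print the arithmetic mean of the numbers given as arguments.
mean() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# slowdown BEFORE AFTER: print how many times slower AFTER is than BEFORE.
slowdown() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", after / before }'
}

# Mean real times reported above for the Opteron 6272 (seconds).
slowdown 273 1702
```

The per-run real times would be passed to `mean` as separate arguments; the individual values of the 10 runs are not listed in this mail.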

So you were right that it's related to COD technology

> The Opteron 6272, which they use, is an Interlagos, that has something
> similar in that each package contains two nodes.

Lauro Venancio is now working on a fix.

Jirka


On Tue, Jul 12, 2016 at 11:04 AM, Jirka Hladky  wrote:
> Hi Peter,
>
> have you had a chance to look into this? Is there anything I can do to
> help you fix it?
>
> Thanks a lot!
> Jirka
>
>
> On Wed, Jun 29, 2016 at 11:58 AM, Peter Zijlstra  wrote:
>> On Wed, Jun 29, 2016 at 11:47:56AM +0200, Jirka Hladky wrote:
>>> Hi Peter,
>>>
>>> I think Cluster on Die technology was introduced in Haswell generation. The
>>> server I'm using is equipped with 4x Intel E5-4610 v2 (Ivy Bridge). I have
>>> double checked the BIOS and there is no cluster on die setting.
>>
>> Oh right, that's E5v3..
>>
>>> The authors of the paper have reported the issue on AMD Bulldozer CPU which
>>> also does not have COD technology.
>>
>> The Opteron 6272, which they use, is an Interlagos, that has something
>> similar in that each package contains two nodes.
>>
>> And their patch touches exactly that part of the x86 topo setup, the
>> match_die() && !same_node() condition, IOW same package, different node.
>>
>> That's not a path an Intel chip would trigger without COD support.


Re: Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-07-12 Thread Jirka Hladky
Hi Peter,

have you had a chance to look into this? Is there anything I can do to
help you fix it?

Thanks a lot!
Jirka


On Wed, Jun 29, 2016 at 11:58 AM, Peter Zijlstra  wrote:
> On Wed, Jun 29, 2016 at 11:47:56AM +0200, Jirka Hladky wrote:
>> Hi Peter,
>>
>> I think Cluster on Die technology was introduced in Haswell generation. The
>> server I'm using is equipped with 4x Intel E5-4610 v2 (Ivy Bridge). I have
>> double checked the BIOS and there is no cluster on die setting.
>
> Oh right, that's E5v3..
>
>> The authors of the paper have reported the issue on AMD Bulldozer CPU which
>> also does not have COD technology.
>
> The Opteron 6272, which they use, is an Interlagos, that has something
> similar in that each package contains two nodes.
>
> And their patch touches exactly that part of the x86 topo setup, the
> match_die() && !same_node() condition, IOW same package, different node.
>
> That's not a path an Intel chip would trigger without COD support.


Re: Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-06-29 Thread Peter Zijlstra
On Wed, Jun 29, 2016 at 11:47:56AM +0200, Jirka Hladky wrote:
> Hi Peter,
> 
> I think Cluster on Die technology was introduced in Haswell generation. The
> server I'm using is equipped with 4x Intel E5-4610 v2 (Ivy Bridge). I have
> double checked the BIOS and there is no cluster on die setting.

Oh right, that's E5v3..

> The authors of the paper have reported the issue on AMD Bulldozer CPU which
> also does not have COD technology.

The Opteron 6272, which they use, is an Interlagos, that has something
similar in that each package contains two nodes.

And their patch touches exactly that part of the x86 topo setup, the
match_die() && !same_node() condition, IOW same package, different node.

That's not a path an Intel chip would trigger without COD support.
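
Whether a given machine has the "same package, different node" layout can be checked from userspace. The following is only a sketch: it assumes the usual sysfs layout, where each /sys/devices/system/node/nodeN directory contains cpuM links and each CPU exposes topology/physical_package_id. The function name `packages_spanning_nodes` and its optional root argument are hypothetical, added so the function can be tried against a mock tree.

```shell
#!/bin/bash
# packages_spanning_nodes [SYSFS_ROOT]
# Print every physical package id that has CPUs in more than one NUMA node.
# SYSFS_ROOT defaults to /sys/devices/system.
packages_spanning_nodes() {
    local root="${1:-/sys/devices/system}"
    local node_dir cpu_dir node pkg
    for node_dir in "$root"/node/node[0-9]*; do
        [ -d "$node_dir" ] || continue
        node="${node_dir##*/node}"          # ".../node/node3" -> "3"
        for cpu_dir in "$node_dir"/cpu[0-9]*; do
            # CPUs without readable topology information are skipped.
            [ -r "$cpu_dir/topology/physical_package_id" ] || continue
            pkg=$(cat "$cpu_dir/topology/physical_package_id")
            echo "$pkg $node"
        done
    done | sort -u |
        awk '{ nodes[$1]++ } END { for (p in nodes) if (nodes[p] > 1) print p }' |
        sort -n
}
```

Called without arguments on a live system, an empty result would suggest that no package is split across nodes, so the match_die() && !same_node() case should not apply there.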


Re: Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-06-29 Thread Peter Zijlstra
On Wed, Jun 29, 2016 at 01:15:17AM +0200, Jirka Hladky wrote:
> Hello,
> 
> on NUMA enabled server equipped with 4 Intel E5-4610 v2 CPUs we
> observe following performance degradation:

Do you have cluster on die enabled on that machine? If you disable it,
does it still reproduce?


Kernel v4.7-rc5 - performance degradation upto 40% after disabling and re-enabling a core

2016-06-28 Thread Jirka Hladky
Hello,

on a NUMA-enabled server equipped with 4 Intel E5-4610 v2 CPUs we
observe the following performance degradation:

Runtime to run "lu.C.x" test from NAS Parallel Benchmarks after
booting the kernel:

real  1m57.834s
user  113m51.520s

Then we disable and re-enable one core:

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

and rerun the same test. Using all 64 cores, the runtime is now degraded
by about 40% for both user time and real (wall-clock) time, relative to
the run right after boot:

real 2m47.746s
user 160m46.109s
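
The relative slowdown can be recomputed from the `time` output quoted above. This is a small sketch with made-up helper names (`to_seconds`, `pct_increase`); it takes the run right after boot as the baseline, and the percentage would be smaller if the degraded run were used as the baseline instead.

```shell
#!/bin/bash
# to_seconds: convert a bash `time` value such as 2m47.746s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# pct_increase BEFORE AFTER: percentage by which AFTER exceeds BEFORE.
pct_increase() {
    awk -v b="$(to_seconds "$1")" -v a="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

pct_increase 1m57.834s 2m47.746s      # real time, after boot vs. after re-enable
pct_increase 113m51.520s 160m46.109s  # user time, after boot vs. after re-enable
```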

The issue was first reported in "The Linux Scheduler: a Decade of
Wasted Cores" paper
http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
https://github.com/jplozi/wastedcores/issues/1

How to reproduce the issue:

A) Get benchmark and compile it:

1) wget http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
2) tar zxvf NPB3.3.1.tar.gz
3) cd ~/NPB3.3.1/NPB3.3-OMP/config/
4) ln -sf NAS.samples/make.def.gcc_x86 make.def (assuming the gcc compiler is used)
5) ln -sf NAS.samples/suite.def.lu suite.def
6) cd ~/NPB3.3.1/NPB3.3-OMP
7) make suite
8) You should now have the lu.* benchmarks in the directory
~/NPB3.3.1/NPB3.3-OMP/bin. The binaries are alphabetically sorted by
runtime, with "lu.A.x" having the shortest runtime.

B) Reproducing the issue (see also attached script)

Remark: we have done the tests with autogroup disabled
sysctl -w kernel.sched_autogroup_enabled=0
to avoid this issue on 4.7 kernel:
https://bugzilla.kernel.org/show_bug.cgi?id=120481

The test was conducted on a NUMA server with 4 nodes, using all 64
available cores.

1) (time bin/lu.C.x) |& tee $(uname -r)_lu.C.x.log_before_reenable_kernel.sched_autogroup_enabled=0

2) disable and re-enable one core
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

3) (time bin/lu.C.x) |& tee $(uname -r)_lu.C.x.log_after_reenable_kernel.sched_autogroup_enabled=0

grep "real\|user" *lu.C*

You will see a significant difference in both real and user time.
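
Step 2 can also be wrapped in a small helper that checks the sysfs control file is writable before touching it. This is a sketch, not the attached reproduce.sh: the name `cycle_cpu` and its optional directory argument are hypothetical, and the directory argument exists only so the function can be exercised against a mock tree without root.

```shell
#!/bin/bash
# cycle_cpu CPU [CPU_SYSFS_DIR]
# Take one CPU offline, bring it back online, and print its final state.
# CPU_SYSFS_DIR defaults to /sys/devices/system/cpu.
cycle_cpu() {
    local cpu="$1"
    local dir="${2:-/sys/devices/system/cpu}"
    local ctl="$dir/cpu$cpu/online"
    if [ ! -w "$ctl" ]; then
        echo "cycle_cpu: cannot write $ctl (not root, or no hotplug control for this CPU)" >&2
        return 1
    fi
    echo 0 > "$ctl" || return 1   # take the CPU offline
    echo 1 > "$ctl" || return 1   # bring it back online
    cat "$ctl"                    # 1 means the CPU is online again
}
```

On the test machine this would be run as root as `cycle_cpu 1`, between the two benchmark runs.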

According to the authors of the paper, the root cause of the problem is
a missing call to regenerate domains inside NUMA nodes after
re-enabling a CPU. The problem was introduced in the 3.19 kernel. The
authors of the paper have proposed a patch which applies to the 4.1
kernel. Here is the link:
https://github.com/jplozi/wastedcores/blob/master/patches/missing_sched_domains_linux_4.1.patch

===For completeness, here are the results with the 4.6 kernel===

AFTER BOOT
real 1m31.639s
user 89m24.657s

AFTER core has been disabled and re-enabled
real 2m44.566s
user 157m59.814s

Please notice that with the 4.6 kernel the problem is much more visible
than with the 4.7-rc5 kernel.

At the same time, the 4.6 kernel delivers much better performance after
boot than the 4.7-rc5 kernel, which might indicate that another problem
is in play.
===

I have also tested the kernel provided by Peter Zijlstra on Friday, June
24th, which provides a fix for
https://bugzilla.kernel.org/show_bug.cgi?id=120481. It does not fix
this issue, and the kernel right after boot performs worse than the 4.6
kernel right after boot, so we may in fact face two problems here.

===Results with 4.7.0-02548776ded1185e6e16ad0a475481e982741ee9 kernel===
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
$ git rev-parse HEAD
02548776ded1185e6e16ad0a475481e982741ee9

AFTER BOOT
real 1m58.549s
user 113m31.448s

AFTER core has been disabled and re-enabled
real 2m35.930s
user 148m20.795s
===

Thanks a lot!
Jirka

PS: I have opened this BZ to track this issue
Bug 121121 - Kernel v4.7-rc5 - performance degradation upto 40% after
disabling and re-enabling a core
https://bugzilla.kernel.org/show_bug.cgi?id=121121


reproduce.sh
Description: Bourne shell script

