[PATCH 4.14 069/137] perf/x86/intel/lbr: Fix incomplete LBR call stack

2018-10-02 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Kan Liang 

[ Upstream commit 0592e57b24e7e05ec1f4c50b9666c013abff7017 ]

LBR has a limited stack size. If a task has a deeper call stack than
LBR's stack size, only the overflowed part is reported. A complete call
stack may not be reconstructed by the perf tool.

Current code doesn't access all LBR registers. It only reads the ones
below the TOS. The LBR registers above the TOS are discarded
unconditionally.

When a CALL is captured, the TOS is incremented by 1, modulo the max LBR
stack size. The LBR HW only records the call stack information to the
register which the TOS points to. It will not touch other LBR
registers. So the registers above the TOS probably still store valid
call stack information for an overflowed call stack, which needs to be
reported.
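
The wrap-around described above can be modelled in user space. The
following is only a sketch, not kernel code: SIM_LBR_NR, struct sim_lbr
and the sim_lbr_*() helpers are hypothetical names, and the depth of 8
is chosen purely to keep the example small. It shows that after more
CALLs than there are entries, a reader bounded by i < tos sees only a
few entries, while every register still holds a recorded branch.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_LBR_NR 8u	/* hypothetical depth (power of two); real HW differs */

struct sim_lbr {
	uint64_t from[SIM_LBR_NR];	/* branch source per entry, 0 = empty */
	unsigned int tos;		/* index of the most recent entry */
};

/* Model a captured CALL: advance TOS modulo the depth, write only that entry. */
static void sim_lbr_call(struct sim_lbr *lbr, uint64_t from)
{
	assert(from != 0);	/* 0 is reserved as the "invalid" marker */
	lbr->tos = (lbr->tos + 1) & (SIM_LBR_NR - 1);
	lbr->from[lbr->tos] = from;
}

/* Number of entries visited by a loop bounded by "i < tos". */
static unsigned int sim_lbr_below_tos_count(const struct sim_lbr *lbr)
{
	return lbr->tos;
}

/* Number of entries that actually hold a recorded branch. */
static unsigned int sim_lbr_nonzero_count(const struct sim_lbr *lbr)
{
	unsigned int i, n = 0;

	for (i = 0; i < SIM_LBR_NR; i++)
		if (lbr->from[i])
			n++;
	return n;
}
```

With ten simulated CALLs on this 8-entry model, the TOS ends up at 2, so
the bounded loop covers 2 entries although all 8 are populated.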

To retrieve complete call stack information, we need to start from the
TOS and read LBR registers until an invalid entry is detected.
A value of 0 can be used to detect the invalid entry, because:

 - When a RET is captured, the HW zeros the LBR register which TOS points
   to, then decreases the TOS.
 - The LBR registers are reset to 0 when adding a new LBR event or
   scheduling an existing LBR event.
 - A taken branch at IP 0 is not expected.
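
The read loop implied by the three points above can be sketched as
follows. This is an illustration on a plain array rather than on MSRs;
lbr_collect_valid() and its parameters are hypothetical names, and the
index arithmetic assumes the depth is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Walk backwards from TOS and copy entries until a zero "from" address.
 * nr must be a power of two so that "& (nr - 1)" wraps the index.
 * Returns the number of valid entries copied into out[].
 */
static unsigned int lbr_collect_valid(const uint64_t *from, unsigned int nr,
				      unsigned int tos, uint64_t *out)
{
	unsigned int mask = nr - 1;
	unsigned int i;

	assert(nr != 0 && (nr & mask) == 0);

	for (i = 0; i < nr; i++) {
		/* unsigned wrap-around plus the mask gives the modulo index */
		unsigned int idx = (tos - i) & mask;

		if (from[idx] == 0)	/* zero marks an invalid entry: stop */
			break;
		out[i] = from[idx];
	}
	return i;
}
```

The loop is bounded by nr as well as by the zero check, so a fully
populated stack terminates after one pass over all registers.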

The context switch code is also modified to save/restore all valid LBR
registers. Furthermore, the LBR registers that don't have valid call
stack information need to be reset on restore, because they may be
polluted while the task is swapped out.
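
A user-space sketch of that save/restore pairing, assuming a simulated
register array in place of the real MSR accessors: struct sim_task_ctx,
sim_lbr_save() and sim_lbr_restore() are hypothetical, and only the
field names tos, valid_lbrs and lbr_from follow the diff below.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_LBR_NR 8u	/* hypothetical depth, power of two */

struct sim_task_ctx {
	uint64_t lbr_from[SIM_LBR_NR];
	unsigned int tos;
	unsigned int valid_lbrs;	/* number of entries saved */
};

/* Save entries from TOS backwards until the first zero (invalid) entry. */
static void sim_lbr_save(struct sim_task_ctx *ctx, const uint64_t *hw,
			 unsigned int tos)
{
	unsigned int mask = SIM_LBR_NR - 1;
	unsigned int i;

	for (i = 0; i < SIM_LBR_NR; i++) {
		uint64_t from = hw[(tos - i) & mask];

		if (!from)
			break;
		ctx->lbr_from[i] = from;
	}
	ctx->valid_lbrs = i;
	ctx->tos = tos;
}

/* Restore the saved entries, then zero every remaining register. */
static void sim_lbr_restore(const struct sim_task_ctx *ctx, uint64_t *hw)
{
	unsigned int mask = SIM_LBR_NR - 1;
	unsigned int i;

	for (i = 0; i < ctx->valid_lbrs; i++)
		hw[(ctx->tos - i) & mask] = ctx->lbr_from[i];

	/* anything not saved may hold another task's branches: clear it */
	for (; i < SIM_LBR_NR; i++)
		hw[(ctx->tos - i) & mask] = 0;
}
```

Zeroing the remainder keeps the "read until 0" rule sound after a
context switch, since stale entries would otherwise look valid.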

Here is a small test program, tchain_deep.
Its call stack is deeper than 32.

 noinline void f33(void)
 {
         int i;

         for (i = 0; i < 1000;) {
                 if (i%2)
                         i++;
                 else
                         i++;
         }
 }

 noinline void f32(void)
 {
         f33();
 }

 noinline void f31(void)
 {
         f32();
 }

 ... ...

 noinline void f1(void)
 {
         f2();
 }

 int main()
 {
         f1();
 }

Here is the test result on SKX. The max stack size of SKX is 32.

Without the patch:

 $ perf record -e cycles --call-graph lbr -- ./tchain_deep
 $ perf report --stdio
 #
 # Children      Self  Command      Shared Object  Symbol
 #
    100.00%    99.99%  tchain_deep  tchain_deep    [.] f33
            |
             --99.99%--f30
                       f31
                       f32
                       f33

With the patch:

 $ perf record -e cycles --call-graph lbr -- ./tchain_deep
 $ perf report --stdio
 #
 # Children      Self  Command      Shared Object  Symbol
 #
     99.99%     0.00%  tchain_deep  tchain_deep    [.] f1
|
---f1
   f2
   f3
   f4
   f5
   f6
   f7
   f8
   f9
   f10
   f11
   f12
   f13
   f14
   f15
   f16
   f17
   f18
   f19
   f20
   f21
   f22
   f23
   f24
   f25
   f26
   f27
   f28
   f29
   f30
   f31
   f32
   f33

Signed-off-by: Kan Liang 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Peter Zijlstra 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Stephane Eranian 
Cc: Vince Weaver 
Cc: Alexander Shishkin 
Cc: Thomas Gleixner 
Cc: a...@kernel.org
Cc: eran...@google.com
Link: https://lore.kernel.org/lkml/1528213126-4312-1-git-send-email-kan.li...@linux.intel.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/events/intel/lbr.c  |   32 ++--
 arch/x86/events/perf_event.h |    1 +
 2 files changed, 27 insertions(+), 6 deletions(-)

--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -346,7 +346,7 @@ static void __intel_pmu_lbr_restore(stru
 
 	mask = x86_pmu.lbr_nr - 1;
 	tos = task_ctx->tos;
-	for (i = 0; i < tos; i++) {
+	for (i = 0; i < task_ctx->valid_lbrs; i++) {
 		lbr_idx = (tos - i) & mask;
 		wrlbr_from(lbr_idx, task_ctx->lbr_from[i]);
 		wrlbr_to  (lbr_idx, task_ctx->lbr_to[i]);
@@ -354,6 +354,15 @@ static void __intel_pmu_lbr_restore(stru
 		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
 			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
 	}
+
+	for (; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrlbr_from(lbr_idx, 0);
+		wrlbr_to(lbr_idx, 0);
+		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+			wrmsrl(MSR_LBR_INFO_0 +
