Re: [PATCH v2] x86/microcode: Handle negative microcode revisions

2018-10-25 Thread Andi Kleen
On Sat, Oct 20, 2018 at 07:41:36PM +0200, Borislav Petkov wrote:
> Dropping stable.
> 
> On Sat, Oct 20, 2018 at 07:41:58AM -0700, Andi Kleen wrote:
> > From: Andi Kleen 
> > 
> > The Intel microcode revision space is unsigned. Inside Intel there are 
> > special
> > microcodes that have the highest bit set, and they are considered to have
> > a higher revision than any microcodes that don't have this bit set.
> > 
> > The function comparing the microcode revision in the Linux driver compares
> > u32 with int, which ends up being sign extended to long on 64-bit
> > systems. This results in these highest-bit-set microcode revisions not
> > loading because their revision appears negative and smaller than that of
> > the existing microcode.
> > 
> > Change the comparison to unsigned. With that the loading works
> > as expected.
> > 
> > Cc: sta...@vger.kernel.org # Any supported stable
> > Signed-off-by: Andi Kleen 
> > --
> > v2: White space changes.
> > ---
> >  arch/x86/kernel/cpu/microcode/intel.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
> > index 16936a24795c..e54d402500d3 100644
> > --- a/arch/x86/kernel/cpu/microcode/intel.c
> > +++ b/arch/x86/kernel/cpu/microcode/intel.c
> > @@ -93,7 +93,8 @@ static int find_matching_signature(void *mc, unsigned int csig, int cpf)
> >  /*
> >   * Returns 1 if update has been found, 0 otherwise.
> >   */
> > -static int has_newer_microcode(void *mc, unsigned int csig, int cpf, int new_rev)
> > +static int has_newer_microcode(void *mc, unsigned int csig, int cpf,
> > +  unsigned new_rev)
> >  {
> > struct microcode_header_intel *mc_hdr = mc;
> >  
> > -- 
> 
> Please incorporate all review comments before sending a new version of
> your patch.

I replaced one more "microcodes" with "microcode revisions", if that is
what you meant.

-Andi


[PATCH v4 2/2] perf/x86/kvm: Avoid unnecessary work in guest filtering

2018-10-25 Thread Andi Kleen
From: Andi Kleen 

KVM added a workaround for PEBS events leaking
into guests with 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.").
This uses the VT entry/exit list to add an extra disable of the PEBS_ENABLE MSR.

Intel also added a fix for this issue to microcode updates on
Haswell/Broadwell/Skylake.

It turns out using the MSR entry/exit list makes VM exits
significantly slower. The list is only needed for disabling
PEBS, because the GLOBAL_CTRL change gets optimized by
KVM into changing the VMCS.
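
For context, this is roughly how the KVM/VMX side consumes that list on
every VM entry (a simplified sketch from memory of the 2018-era vmx.c,
not part of this patch; the GLOBAL_CTRL-to-VMCS optimization lives
inside add_atomic_switch_msr):

	static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
	{
		struct perf_guest_switch_msr *msrs;
		int i, nr_msrs;

		msrs = perf_guest_get_msrs(&nr_msrs);

		for (i = 0; i < nr_msrs; i++)
			if (msrs[i].host == msrs[i].guest)
				clear_atomic_switch_msr(vmx, msrs[i].msr);
			else
				/* ends up on the VM entry/exit MSR lists: slow */
				add_atomic_switch_msr(vmx, msrs[i].msr,
						      msrs[i].guest, msrs[i].host);
	}

Every entry that still differs between host and guest goes through the
slow MSR load lists, which is why dropping the PEBS_ENABLE entry helps.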

Check for the microcode updates that have the microcode
fix for leaking PEBS, and disable the extra entry/exit list
entry for PEBS_ENABLE. In addition we always clear the
GLOBAL_CTRL bits for the PEBS counters while running in the guest,
which is enough to make them never fire on the wrong
side of the host/guest transition.

With the patch we see the overhead of VM exits with guest
filtering active drop significantly, from 8% to 4%.

Signed-off-by: Andi Kleen 
---
v2:
Use match_ucode, not match_ucode_all
Remove cpu lock
Use INTEL_MIN_UCODE and move to header
Update Table to include skylake clients.
v3:
Use x86_min_microcode
---
 arch/x86/events/intel/core.c | 80 
 arch/x86/events/perf_event.h |  3 +-
 2 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0fb8659b20d8..89ec85c3359c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <asm/cpu_device_id.h>
 
 #include "../perf_event.h"
 
@@ -3170,16 +3171,27 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-   /*
-* If PMU counter has PEBS enabled it is not enough to disable counter
-* on a guest entry since PEBS memory write can overshoot guest entry
-* and corrupt guest memory. Disabling PEBS solves the problem.
-*/
-   arr[1].msr = MSR_IA32_PEBS_ENABLE;
-   arr[1].host = cpuc->pebs_enabled;
-   arr[1].guest = 0;
+   if (x86_pmu.flags & PMU_FL_PEBS_ALL)
+   arr[0].guest &= ~cpuc->pebs_enabled;
+   else
+   arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+   *nr = 1;
+
+   if (!x86_pmu.pebs_isolated) {
+   /*
+* If PMU counter has PEBS enabled it is not enough to
+* disable counter on a guest entry since PEBS memory
+* write can overshoot guest entry and corrupt guest
+* memory. Disabling PEBS solves the problem.
+*
+* Don't do this if the CPU already enforces it.
+*/
+   arr[1].msr = MSR_IA32_PEBS_ENABLE;
+   arr[1].host = cpuc->pebs_enabled;
+   arr[1].guest = 0;
+   *nr = 2;
+   }
 
-   *nr = 2;
return arr;
 }
 
@@ -3697,6 +3709,45 @@ static __init void intel_clovertown_quirk(void)
x86_pmu.pebs_constraints = NULL;
 }
 
+static const struct x86_ucode_id isolation_ucodes[] = {
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE, 3, 0x001f),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_ULT,  1, 0x001e),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_GT3E, 1, 0x0015),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,2, 0x0037),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,4, 0x000a),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_CORE,   4, 0x0023),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_GT3E,   1, 0x0014),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 2, 0x0010),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 3, 0x0709),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 4, 0x0f09),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 5, 0x0e02),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X,  2, 0x0b14),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,3, 0x0021),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,4, 0x),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_MOBILE,   3, 0x007c),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_DESKTOP,  3, 0x007c),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 9, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  9, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 10, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 11, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 12, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,10, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,11, 0x004e),

[PATCH v4 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-25 Thread Andi Kleen
From: Andi Kleen 

For bug workarounds or checks it is useful to match specific
microcode revisions. Add a new table format to check for steppings
with minimum microcode revisions.

This does not change the existing x86_cpu_id because it's an ABI
shared with modutils, and it also has quite different requirements:
no wildcards, everything has to be matched exactly.
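
As an illustration, a caller builds a stepping/revision table and tests
the boot CPU against it. A minimal sketch (the quirk function name here
is made up; patch 2/2 does the equivalent for x86_pmu.pebs_isolated):

	static const struct x86_ucode_id isolation_ucodes[] = {
		INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE, 3, 0x001f),
		{}	/* zero terminated */
	};

	static void intel_check_pebs_isolation(void)
	{
		/* true only if the boot CPU matches and runs at least that revision */
		x86_pmu.pebs_isolated = x86_min_microcode(isolation_ucodes);
	}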

Signed-off-by: Andi Kleen 
---
v2:
Remove all CPU match, only check boot cpu
Move INTEL_MIN_UCODE macro to header.
Minor cleanups.
Remove max ucode and driver data
v3:
Rename function
Update comments.
Document mixed stepping caveats.
Use u8 for model
Remove vendor 0 check.
Change return check
v4:
Rename to x86_min_microcode
Change return value to bool
---
 arch/x86/include/asm/cpu_device_id.h | 28 
 arch/x86/kernel/cpu/match.c  | 19 +++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h
index baeba0567126..28847d5ea1fa 100644
--- a/arch/x86/include/asm/cpu_device_id.h
+++ b/arch/x86/include/asm/cpu_device_id.h
@@ -11,4 +11,32 @@
 
 extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);
 
+/*
+ * Match specific microcode revisions.
+ *
+ * vendor/family/model/stepping must be all set.
+ *
+ * only checks against the boot cpu.  When mixed-stepping configs are
+ * valid for a CPU model, add a quirk for every valid stepping and
+ * do the fine-tuning in the quirk handler.
+ */
+
+struct x86_ucode_id {
+   u8  vendor;
+   u8  family;
+   u8  model;
+   u8  stepping;
+   u32 min_ucode;
+};
+
+#define INTEL_MIN_UCODE(mod, step, rev) {  \
+   .vendor = X86_VENDOR_INTEL, \
+   .family = 6,\
+   .model = mod,   \
+   .stepping = step,   \
+   .min_ucode = rev,   \
+}
+
+extern bool x86_min_microcode(const struct x86_ucode_id *mt);
+
 #endif
diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c
index 3fed38812eea..12db14232d62 100644
--- a/arch/x86/kernel/cpu/match.c
+++ b/arch/x86/kernel/cpu/match.c
@@ -48,3 +48,22 @@ const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
return NULL;
 }
 EXPORT_SYMBOL(x86_match_cpu);
+
+bool x86_min_microcode(const struct x86_ucode_id *mt)
+{
+   struct cpuinfo_x86 *c = &boot_cpu_data;
+   const struct x86_ucode_id *m;
+
+   for (m = mt; m->family | m->model; m++) {
+   if (c->x86_vendor != m->vendor)
+   continue;
+   if (c->x86 != m->family)
+   continue;
+   if (c->x86_model != m->model)
+   continue;
+   if (c->x86_stepping != m->stepping)
+   continue;
+   return c->microcode >= m->min_ucode;
+   }
+   return false;
+}
-- 
2.17.2



Re: [PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-25 Thread Andi Kleen
On Sun, Oct 21, 2018 at 12:20:47PM +0200, Thomas Gleixner wrote:
> Andi,
> 
> On Sat, 20 Oct 2018, Andi Kleen wrote:
> > On Sat, Oct 20, 2018 at 10:19:37AM +0200, Thomas Gleixner wrote:
> > > On Fri, 19 Oct 2018, Andi Kleen wrote:
> > > There is no point to return the pointer because it's not a compound
> > > structure. If you want to provide the possibility to use the index then
> > > return the index and an error code if it does not match.
> > 
> > It will be useful with the driver_data pointer, which you short sightedly
> > forced me to remove, and likely will need to be readded at some point
> > anyways if this gets more widely used.
> 
> It's good and established practice not to add functionality on a 'might be
> used' basis. If you'd provide at least one or two patches which demonstrate
> how that is useful then that would be convincing.
> 
> >  At least with the pointer not all callers will need to be changed then.
> 
> It doesn't need to be changed at all, when done correctly.

Thanks.

I opted for the simpler method of returning a boolean now.

-Andi


Re: [v3 03/12] x86/fsgsbase/64: Add intrinsics/macros for FSGSBASE instructions

2018-10-24 Thread Andi Kleen
On Wed, Oct 24, 2018 at 11:53:54AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 23, 2018 at 11:43 AM Chang S. Bae  
> wrote:
> >
> > From: Andi Kleen 
> >
> > Add C intrinsics and assembler macros for the new FSBASE and GSBASE
> > instructions.
> >
> > Very straight forward. Used in followon patches.
> >
> > [ luto: Rename the variables from FS and GS to FSBASE and GSBASE and
> >   make  safe to include on 32-bit kernels. ]
> >
> > v2: Use __always_inline
> >
> > > [ chang: Revise the changelog. Place them in <asm/fsgsbase.h>. Replace
> >   the macros with GAS-compatible ones. ]
> >
> > If GCC supports it, we can add -mfsgsbase to CFLAGS and use the builtins
> > here for extra performance.
> 
> Reviewed-by: Andy Lutomirski  # C parts only
> 
> With the caveat that I'm not convinced that the memory clobbers are
> needed.  The __force_order trick in special_insns.h would probably be
> more appropriate.
> 
> I don't feel qualified to review the asm part without some research.
> Whereas hpa or Boris could probably review it with their eyes closed.

BTW the other option would be to update the minimum binutils requirement
to 2.21 (currently it is 2.20) and then write the instructions directly
without .byte encodings. I believe 2.21 added support for these instructions.

(It's only a binutils requirement, don't need gcc support)
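
For illustration, a minimal sketch of what the C intrinsics could look
like with the mnemonics written out (assumes binutils >= 2.21 as
discussed, and keeps the memory clobber whose necessity is questioned
above; not the actual patch):

	static __always_inline unsigned long rdfsbase(void)
	{
		unsigned long fsbase;

		asm volatile("rdfsbase %0" : "=r" (fsbase) : : "memory");
		return fsbase;
	}

	static __always_inline void wrfsbase(unsigned long fsbase)
	{
		asm volatile("wrfsbase %0" : : "r" (fsbase) : "memory");
	}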

-Andi


Re: [PATCH 1/2] perf: Add munmap callback

2018-10-24 Thread Andi Kleen
> > void perf_event_mmap(struct vm_area_struct *vma)
> > {
> >  struct perf_mmap_event mmap_event;
> > 
> >  if (!atomic_read(&nr_mmap_events))
> >  return;
> > 
> > }
> > 
> 
> Thanks. I'll add the nr_mmap_events check in V2.

No, that's the wrong check here. The PEBS flush is independent of mmap
events being requested.

If anything, it would need to check for any active PEBS events, which
would need a new counter.  I think the easiest is to just check if
this_cpu_ptr(&sched_cb_list) is empty, which should be good enough.

-Andi


Re: [PATCH 1/2] perf: Add munmap callback

2018-10-24 Thread Andi Kleen
> +void perf_event_munmap(void)
> +{
> + struct perf_cpu_context *cpuctx;
> + unsigned long flags;
> + struct pmu *pmu;
> +
> + local_irq_save(flags);
> + list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {

Would be good to have a fast path here that checks for the list being empty
without disabling interrupts. munmap can be somewhat hot. I think
it's ok to make it slower with perf running, but we shouldn't impact
it without perf.
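
Something along these lines, as a rough sketch based on the quoted
patch (illustrative only):

	void perf_event_munmap(void)
	{
		struct perf_cpu_context *cpuctx;
		unsigned long flags;

		/* fast path: no PMU registered a sched callback on this CPU */
		if (list_empty(this_cpu_ptr(&sched_cb_list)))
			return;

		local_irq_save(flags);
		list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list),
				    sched_cb_entry) {
			/* ... flush PEBS buffers for this context ... */
		}
		local_irq_restore(flags);
	}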

-Andi


Re: Broken dwarf unwinding - wrong stack pointer register value?

2018-10-24 Thread Andi Kleen
> 
> Can someone at least confirm whether unwinding from a function prologue via 
> .eh_frame (but without .debug_frame) should actually be possible?

Yes it should be possible. Asynchronous unwind tables should work
from any instruction.

-Andi


Re: Broken dwarf unwinding - wrong stack pointer register value?

2018-10-22 Thread Andi Kleen
> So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We 

It's the default (64-bit since always, and 32-bit now too), unless someone disabled it.

However, libm might be partially written in assembler, and hand-written assembler
often has problems with unwind tables because the programmer has to get them
correct explicitly.

-Andi


Re: Broken dwarf unwinding - wrong stack pointer register value?

2018-10-22 Thread Andi Kleen
Milian Wolff  writes:
>
> After more digging, it turns out that I've apparently chased a red herring. 
> I'm running archlinux which isn't shipping debug symbols for libm.

64-bit executables normally have unwind information even when stripped,
unless someone forcefully stripped that too.

You can check with objdump --sections.

-Andi


[PATCH v2] x86/microcode: Handle negative microcode revisions

2018-10-20 Thread Andi Kleen
From: Andi Kleen 

The Intel microcode revision space is unsigned. Inside Intel there are special
microcodes that have the highest bit set, and they are considered to have
a higher revision than any microcodes that don't have this bit set.

The function comparing the microcode revision in the Linux driver compares
u32 with int, which ends up being sign extended to long on 64-bit
systems. This results in these highest-bit-set microcode revisions not
loading because their revision appears negative and smaller than that of
the existing microcode.

Change the comparison to unsigned. With that the loading works
as expected.
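
A small illustration of the difference, with an existing revision of
0x14 and a "special" revision with the high bit set (simplified, not
the actual driver code):

	unsigned int special_rev = 0x80000001;	/* high bit set */
	int cur_rev = 0x14;			/* currently loaded */

	/* signed compare: special_rev becomes negative, never "newer" */
	bool newer_signed   = (int)special_rev > cur_rev;		/* false */

	/* unsigned compare: 0x80000001 > 0x14, loads as intended */
	bool newer_unsigned = special_rev > (unsigned int)cur_rev;	/* true */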

Cc: sta...@vger.kernel.org # Any supported stable
Signed-off-by: Andi Kleen 
--
v2: White space changes.
---
 arch/x86/kernel/cpu/microcode/intel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 16936a24795c..e54d402500d3 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -93,7 +93,8 @@ static int find_matching_signature(void *mc, unsigned int csig, int cpf)
 /*
  * Returns 1 if update has been found, 0 otherwise.
  */
-static int has_newer_microcode(void *mc, unsigned int csig, int cpf, int new_rev)
+static int has_newer_microcode(void *mc, unsigned int csig, int cpf,
+  unsigned new_rev)
 {
struct microcode_header_intel *mc_hdr = mc;
 
-- 
2.17.1



Re: [PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-20 Thread Andi Kleen
On Sat, Oct 20, 2018 at 10:19:37AM +0200, Thomas Gleixner wrote:
> On Fri, 19 Oct 2018, Andi Kleen wrote:
> 
> > > > +   u32 min_ucode;
> > > > +};
> > > > +
> > > > +const struct x86_ucode_id *x86_match_ucode(const struct x86_ucode_id 
> > > > *match)
> > > 
> > > What's the point of returning the struct pointer? Shouldn't it be enough 
> > > to
> > > make it return bool? Also the function name really should reflect that 
> > > this
> > > checks whether the minimal required microcode revision is active.
> > 
> > This allows the user to find the table entry to tie something to it
> > (e.g. use the index to match some other table)
> > 
> > Same pattern as pci discovery etc. use.
> > 
> > Given the current caller doesn't need it, but we still follow standard
> > conventions.
> 
> There is no point to return the pointer because it's not a compound
> structure. If you want to provide the possibility to use the index then
> return the index and an error code if it does not match.

It will be useful with the driver_data pointer, which you short-sightedly
forced me to remove, and which likely will need to be re-added at some point
anyway if this gets more widely used. At least with the pointer not
all callers will need to be changed then.

Also it's symmetric with how the PCI and USB and the existing x86 match
discovery interfaces work.

> > > VENDOR_INTEL = 0, so this check is obscure to begin with. Either you chose
> > > a explicit condition to put at the end of the table, e.g. vendor = U8_MAX
> > > or you hand in the array size to the function.
> > 
That would both be awkward. It's the same as match_cpu, and 0 terminators
are standard practice in practically all similar code. I removed
the OR with the family.
> 
> That's debatable because it's more easy to miss the terminator than getting
> the ARRAY_SIZE() argument wrong. But it doesn't matter much.

Ok then please apply it. 

-Andi



Re: [PATCH v1] x86/microcode: Handle negative microcode revisions

2018-10-20 Thread Andi Kleen
On Sat, Oct 20, 2018 at 03:42:05PM +0200, Thomas Gleixner wrote:
> Andi,
> 
> On Fri, 19 Oct 2018, Andi Kleen wrote:
> > Change the comparison to unsigned. With that the loading works
> > as expected.
> >
> 
> I assume that wants a fixes tag and needs to be backported to stable,
> right?

I think the bug has been there since the original microcode loader; not
sure I can dig out the right commit id for that (it might be
in linux-historic).

Yes please cc stable for this.

-Andi


[PATCH v3 2/2] perf/x86/kvm: Avoid unnecessary work in guest filtering

2018-10-19 Thread Andi Kleen
From: Andi Kleen 

KVM added a workaround for PEBS events leaking
into guests with 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.").
This uses the VT entry/exit list to add an extra disable of the PEBS_ENABLE MSR.

Intel also added a fix for this issue to microcode updates on
Haswell/Broadwell/Skylake.

It turns out using the MSR entry/exit list makes VM exits
significantly slower. The list is only needed for disabling
PEBS, because the GLOBAL_CTRL change gets optimized by
KVM into changing the VMCS.

Check for the microcode updates that have the microcode
fix for leaking PEBS, and disable the extra entry/exit list
entry for PEBS_ENABLE. In addition we always clear the
GLOBAL_CTRL bits for the PEBS counters while running in the guest,
which is enough to make them never fire on the wrong
side of the host/guest transition.

With the patch we see the overhead of VM exits with guest
filtering active drop significantly, from 8% to 4%.

Signed-off-by: Andi Kleen 
---
v2:
Use match_ucode, not match_ucode_all
Remove cpu lock
Use INTEL_MIN_UCODE and move to header
Update Table to include skylake clients.
---
 arch/x86/events/intel/core.c | 80 
 arch/x86/events/perf_event.h |  3 +-
 2 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0fb8659b20d8..5c45535c60b4 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <asm/cpu_device_id.h>
 
 #include "../perf_event.h"
 
@@ -3170,16 +3171,27 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-   /*
-* If PMU counter has PEBS enabled it is not enough to disable counter
-* on a guest entry since PEBS memory write can overshoot guest entry
-* and corrupt guest memory. Disabling PEBS solves the problem.
-*/
-   arr[1].msr = MSR_IA32_PEBS_ENABLE;
-   arr[1].host = cpuc->pebs_enabled;
-   arr[1].guest = 0;
+   if (x86_pmu.flags & PMU_FL_PEBS_ALL)
+   arr[0].guest &= ~cpuc->pebs_enabled;
+   else
+   arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+   *nr = 1;
+
+   if (!x86_pmu.pebs_isolated) {
+   /*
+* If PMU counter has PEBS enabled it is not enough to
+* disable counter on a guest entry since PEBS memory
+* write can overshoot guest entry and corrupt guest
+* memory. Disabling PEBS solves the problem.
+*
+* Don't do this if the CPU already enforces it.
+*/
+   arr[1].msr = MSR_IA32_PEBS_ENABLE;
+   arr[1].host = cpuc->pebs_enabled;
+   arr[1].guest = 0;
+   *nr = 2;
+   }
 
-   *nr = 2;
return arr;
 }
 
@@ -3697,6 +3709,45 @@ static __init void intel_clovertown_quirk(void)
x86_pmu.pebs_constraints = NULL;
 }
 
+static const struct x86_ucode_id isolation_ucodes[] = {
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE, 3, 0x001f),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_ULT,  1, 0x001e),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_GT3E, 1, 0x0015),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,2, 0x0037),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,4, 0x000a),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_CORE,   4, 0x0023),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_GT3E,   1, 0x0014),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 2, 0x0010),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 3, 0x0709),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 4, 0x0f09),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 5, 0x0e02),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X,  2, 0x0b14),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,3, 0x0021),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,4, 0x),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_MOBILE,   3, 0x007c),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_DESKTOP,  3, 0x007c),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 9, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  9, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 10, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 11, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 12, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,10, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,11, 0x004e),
+   

[PATCH v3 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-19 Thread Andi Kleen
From: Andi Kleen 

For bug workarounds or checks it is useful to match specific
microcode revisions. Add a new table format to check for steppings
with minimum microcode revisions.

This does not change the existing x86_cpu_id because it's an ABI
shared with modutils, and it also has quite different requirements:
no wildcards, everything has to be matched exactly.

Signed-off-by: Andi Kleen 
---
v2:
Remove all CPU match, only check boot cpu
Move INTEL_MIN_UCODE macro to header.
Minor cleanups.
Remove max ucode and driver data
v3:
Rename function
Update comments.
Document mixed stepping caveats.
Use u8 for model
Remove vendor 0 check.
Change return check
---
 arch/x86/include/asm/cpu_device_id.h | 29 
 arch/x86/kernel/cpu/match.c  | 19 ++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h
index baeba0567126..97d645910ec2 100644
--- a/arch/x86/include/asm/cpu_device_id.h
+++ b/arch/x86/include/asm/cpu_device_id.h
@@ -11,4 +11,33 @@
 
 extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);
 
+/*
+ * Match specific microcode revisions.
+ *
+ * vendor/family/model/stepping must be all set.
+ *
+ * only checks against the boot cpu.  When mixed-stepping configs are
+ * valid for a CPU model, add a quirk for every valid stepping and
+ * do the fine-tuning in the quirk handler.
+ */
+
+struct x86_ucode_id {
+   u8  vendor;
+   u8  family;
+   u8  model;
+   u8  stepping;
+   u32 min_ucode;
+};
+
+#define INTEL_MIN_UCODE(mod, step, rev) {  \
+   .vendor = X86_VENDOR_INTEL, \
+   .family = 6,\
+   .model = mod,   \
+   .stepping = step,   \
+   .min_ucode = rev,   \
+}
+
+extern const struct x86_ucode_id *
+x86_match_microcode(const struct x86_ucode_id *match);
+
 #endif
diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c
index 3fed38812eea..d44c8cfd7b29 100644
--- a/arch/x86/kernel/cpu/match.c
+++ b/arch/x86/kernel/cpu/match.c
@@ -48,3 +48,22 @@ const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
return NULL;
 }
 EXPORT_SYMBOL(x86_match_cpu);
+
+const struct x86_ucode_id *x86_match_microcode(const struct x86_ucode_id *match)
+{
+   struct cpuinfo_x86 *c = &boot_cpu_data;
+   const struct x86_ucode_id *m;
+
+   for (m = match; m->family | m->model; m++) {
+   if (c->x86_vendor != m->vendor)
+   continue;
+   if (c->x86 != m->family)
+   continue;
+   if (c->x86_model != m->model)
+   continue;
+   if (c->x86_stepping != m->stepping)
+   continue;
+   return c->microcode >= m->min_ucode ? m : NULL;
+   }
+   return NULL;
+}
-- 
2.17.1



[PATCH v1] x86/microcode: Handle negative microcode revisions

2018-10-19 Thread Andi Kleen
From: Andi Kleen 

The Intel ucode revision space is unsigned. Inside Intel there are special
microcodes that have the highest bit set, and they are considered to have
a higher revision than any microcodes that don't have this bit set.

The function comparing the microcodes in the Linux driver compares
u32 with int, which ends up being sign extended to long on 64-bit
systems. This results in these highest-bit-set microcodes not loading
because their revision appears negative and smaller than that of the
existing microcode.

Change the comparison to unsigned. With that the loading works
as expected.

Signed-off-by: Andi Kleen 
---
 arch/x86/kernel/cpu/microcode/intel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 16936a24795c..e95cebdd5993 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -93,7 +93,7 @@ static int find_matching_signature(void *mc, unsigned int csig, int cpf)
 /*
  * Returns 1 if update has been found, 0 otherwise.
  */
-static int has_newer_microcode(void *mc, unsigned int csig, int cpf, int new_rev)
+static int has_newer_microcode(void *mc, unsigned int csig, int cpf, unsigned new_rev)
 {
struct microcode_header_intel *mc_hdr = mc;
 
-- 
2.17.1



Re: [PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-19 Thread Andi Kleen
> > +   u32 min_ucode;
> > +};
> > +
> > +const struct x86_ucode_id *x86_match_ucode(const struct x86_ucode_id 
> > *match)
> 
> What's the point of returning the struct pointer? Shouldn't it be enough to
> make it return bool? Also the function name really should reflect that this
> checks whether the minimal required microcode revision is active.

This allows the user to find the table entry to tie something to it
(e.g. use the index to match some other table)

Same pattern as pci discovery etc. use.

Granted, the current caller doesn't need it, but we still follow standard
conventions.
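
For instance, a hypothetical caller could use the returned entry's
position to index a parallel table (illustrative only, not from any
posted patch):

	static const struct x86_ucode_id quirk_ids[] = {
		INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE, 3, 0x001f),
		INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X,  2, 0x0b14),
		{}
	};

	/* parallel table, indexed by the position of the matching entry */
	static const unsigned int quirk_flags[] = { 0x1, 0x2 };

	static unsigned int get_ucode_quirk_flags(void)
	{
		const struct x86_ucode_id *m = x86_match_ucode(quirk_ids);

		return m ? quirk_flags[m - quirk_ids] : 0;
	}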

> 
> > +{
> > +   struct cpuinfo_x86 *c = _cpu_data;
> > +   const struct x86_ucode_id *m;
> > +
> > +   for (m = match; m->vendor | m->family | m->model; m++) {
> 
> VENDOR_INTEL = 0, so this check is obscure to begin with. Either you chose
> a explicit condition to put at the end of the table, e.g. vendor = U8_MAX
> or you hand in the array size to the function.

That would both be awkward. It's the same as match_cpu, and 0 terminators
are standard practice in practically all similar code. I removed
the OR with the family.

-Andi


Re: l1tf: Kernel suggests I throw away third of my memory. I'd rather not

2018-10-17 Thread Andi Kleen
On Wed, Oct 17, 2018 at 12:56:10PM +0200, Pavel Machek wrote:
> Hi!
> 
> 6a012288 suggests I throw away 1GB of RAM. On a 3GB system that is not
> going to be pleasant.

Just rebuild your kernel with PAE? I assume your CPU supports it.

This will also give you NX, which, if you're really worried
about security, is far more important than L1TF.

If you don't worry about security, just ignore it.

-Andi


Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-12 Thread Andi Kleen
> 4. Results
> - Without this optimization, the guest pmi handling time is
>   ~450 ns, and the max sampling rate is reduced to 250.
> - With this optimization, the guest pmi handling time is ~9000 ns
>   (i.e. 1 / 500 of the non-optimization case), and the max sampling
>   rate remains at the original 10.

Impressive performance improvement!

It's not clear to me why you're special casing PMIs here. The optimization
should work generically, right?

perf will enable/disable the PMU even outside PMIs, e.g. on context
switches, which is a very important path too.

> @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, 
> struct msr_data *msr_info)
>   default:
>   if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>   (pmc = get_fixed_pmc(pmu, msr))) {
> - if (!msr_info->host_initiated)
> - data = (s64)(s32)data;
> - pmc->counter += data - pmc_read_counter(pmc);
> + if (pmu->in_pmi) {
> + /*
> +  * Since we are not re-allocating a perf event
> +  * to reconfigure the sampling time when the
> +  * guest pmu is in PMI, just set the value to
> +  * the hardware perf counter. Counting will
> +  * continue after the guest enables the
> +  * counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
> +  */
> + struct hw_perf_event *hwc =
> + &pmc->perf_event->hw;
> + wrmsrl(hwc->event_base, data);

Is that guaranteed to be always called on the right CPU that will run the vcpu?

AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure
it won't handle that.

May need to be delayed to entry time.

-Andi


[PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-10 Thread Andi Kleen
From: Andi Kleen 

For bug workarounds or checks it is useful to check for specific
microcode versions. Add a new table format to check for steppings
with min microcode revisions.

This does not change the existing x86_cpu_id because it's an ABI
shared with modutils, and also has quite different requirements,
as in no wildcards, but everything has to be matched exactly.

Signed-off-by: Andi Kleen 
---
v2:
Remove all CPU match, only check boot cpu
Move INTEL_MIN_UCODE macro to header.
Minor cleanups.
Remove max ucode and driver data
---
 arch/x86/include/asm/cpu_device_id.h | 26 ++
 arch/x86/kernel/cpu/match.c  | 21 +
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/cpu_device_id.h 
b/arch/x86/include/asm/cpu_device_id.h
index baeba0567126..1b90bd1d0b95 100644
--- a/arch/x86/include/asm/cpu_device_id.h
+++ b/arch/x86/include/asm/cpu_device_id.h
@@ -11,4 +11,30 @@
 
 extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);
 
+/*
+ * Match specific microcodes
+ *
+ * vendor/family/model/stepping must be all set.
+ * min_ucode is optional and can be 0.
+ */
+
+struct x86_ucode_id {
+   u8 vendor;
+   u8 family;
+   u16 model;
+   u16 stepping;
+   u32 min_ucode;
+};
+
+#define INTEL_MIN_UCODE(mod, step, rev) {  \
+   .vendor = X86_VENDOR_INTEL, \
+   .family = 6,\
+   .model = mod,   \
+   .stepping = step,   \
+   .min_ucode = rev,   \
+}
+
+extern const struct x86_ucode_id *
+x86_match_ucode(const struct x86_ucode_id *match);
+
 #endif
diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c
index 3fed38812eea..ec8ee31699cd 100644
--- a/arch/x86/kernel/cpu/match.c
+++ b/arch/x86/kernel/cpu/match.c
@@ -48,3 +48,24 @@ const struct x86_cpu_id *x86_match_cpu(const struct 
x86_cpu_id *match)
return NULL;
 }
 EXPORT_SYMBOL(x86_match_cpu);
+
+const struct x86_ucode_id *x86_match_ucode(const struct x86_ucode_id *match)
+{
+   struct cpuinfo_x86 *c = &boot_cpu_data;
+   const struct x86_ucode_id *m;
+
+   for (m = match; m->vendor | m->family | m->model; m++) {
+   if (c->x86_vendor != m->vendor)
+   continue;
+   if (c->x86 != m->family)
+   continue;
+   if (c->x86_model != m->model)
+   continue;
+   if (c->x86_stepping != m->stepping)
+   continue;
+   if (c->microcode < m->min_ucode)
+   continue;
+   return m;
+   }
+   return NULL;
+}
-- 
2.17.1



[PATCH v2 2/2] perf/x86/kvm: Avoid unnecessary work in guest filtering

2018-10-10 Thread Andi Kleen
From: Andi Kleen 

KVM added a workaround for PEBS events leaking
into guests with 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.")
This uses the VT entry/exit list to add an extra disable of the PEBS_ENABLE MSR.

Intel also added a fix for this issue to microcode updates on
Haswell/Broadwell/Skylake.

It turns out using the MSR entry/exit list makes VM exits
significantly slower. The list is only needed for disabling
PEBS, because the GLOBAL_CTRL change gets optimized by
KVM into changing the VMCS.

Check for the microcode updates that have the microcode
fix for leaking PEBS, and disable the extra entry/exit list
entry for PEBS_ENABLE. In addition we always clear the
GLOBAL_CTRL for the PEBS counter while running in the guest,
which is enough to make them never fire at the wrong
side of the host/guest transition.

We see significantly reduced overhead for VM exits with the
filtering active with the patch from 8% to 4%.

Signed-off-by: Andi Kleen 
---
v2:
Use match_ucode, not match_ucode_all
Remove cpu lock
Use INTEL_MIN_UCODE and move to header
Update Table to include skylake clients.
---
 arch/x86/events/intel/core.c | 80 
 arch/x86/events/perf_event.h |  3 +-
 2 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ab01ef9ddd77..5e8e76753eea 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <asm/cpu_device_id.h>
 
 #include "../perf_event.h"
 
@@ -3166,16 +3167,27 @@ static struct perf_guest_switch_msr 
*intel_guest_get_msrs(int *nr)
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-   /*
-* If PMU counter has PEBS enabled it is not enough to disable counter
-* on a guest entry since PEBS memory write can overshoot guest entry
-* and corrupt guest memory. Disabling PEBS solves the problem.
-*/
-   arr[1].msr = MSR_IA32_PEBS_ENABLE;
-   arr[1].host = cpuc->pebs_enabled;
-   arr[1].guest = 0;
+   if (x86_pmu.flags & PMU_FL_PEBS_ALL)
+   arr[0].guest &= ~cpuc->pebs_enabled;
+   else
+   arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+   *nr = 1;
+
+   if (!x86_pmu.pebs_isolated) {
+   /*
+* If PMU counter has PEBS enabled it is not enough to
+* disable counter on a guest entry since PEBS memory
+* write can overshoot guest entry and corrupt guest
+* memory. Disabling PEBS solves the problem.
+*
+* Don't do this if the CPU already enforces it.
+*/
+   arr[1].msr = MSR_IA32_PEBS_ENABLE;
+   arr[1].host = cpuc->pebs_enabled;
+   arr[1].guest = 0;
+   *nr = 2;
+   }
 
-   *nr = 2;
return arr;
 }
 
@@ -3693,6 +3705,45 @@ static __init void intel_clovertown_quirk(void)
x86_pmu.pebs_constraints = NULL;
 }
 
+static const struct x86_ucode_id isolation_ucodes[] = {
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE, 3, 0x001f),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_ULT,  1, 0x001e),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_GT3E, 1, 0x0015),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,2, 0x0037),
+   INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,4, 0x000a),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_CORE,   4, 0x0023),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_GT3E,   1, 0x0014),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 2, 0x0010),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 3, 0x0709),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 4, 0x0f09),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 5, 0x0e02),
+   INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X,  2, 0x0b14),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,3, 0x0021),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,4, 0x0000),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_MOBILE,   3, 0x007c),
+   INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_DESKTOP,  3, 0x007c),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 9, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  9, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 10, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 11, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 12, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,10, 0x004e),
+   INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,11, 0x004e),
+   
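
(The archived posting is cut off in the middle of the table. As a rough
sketch of how the remainder presumably wires it up, reusing only
x86_match_ucode() and the pebs_isolated flag that appear above; the function
name is a guess:)

static void intel_check_pebs_isolation(void)
{
	x86_pmu.pebs_isolated = x86_match_ucode(isolation_ucodes) != NULL;
}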

Re: [RFC] perf tools: Wrong filter_band* values in json calculation"

2018-10-09 Thread Andi Kleen
On Tue, Oct 09, 2018 at 12:01:44PM +0200, Jiri Olsa wrote:
> On Wed, Oct 03, 2018 at 07:45:50AM -0700, Andi Kleen wrote:
> > > note there's couple of changes that actually changed
> > > the number completely, like:
> > > 
> > > -"Filter": "edge=1,filter_band2=4000",
> > > +"Filter": "edge=1,filter_band2=30",
> > 
> > Thanks. Looks good. I'll fix the scripts to generate the uncore events.
> 
> hi,
> any idea when you could post an update for this?

Can just use your patch for the existing event lists. You can add

Acked-by: Andi Kleen 

I was mainly worried about future updates, but it doesn't seem to be a problem.

-Andi


Re: [PATCH v6 1/5] tools, perf, script: Add --insn-trace for instruction decoding

2018-10-08 Thread Andi Kleen
> > +> git clone https://github.com/intelxed/mbuild.git mbuild  
> >   
> > +> git clone https://github.com/intelxed/xed
> >   
> > +> cd xed   
> >   
> > +> mkdir obj
> >   
> > +> cd obj   
> >   
> > +> ../mfile.py  
> >   
> > +> sudo ../mfile.py --prefix=/usr/local install


Need 

../mfile.py examples
sudo ../mfile.py --prefix=/usr/local examples install


-Andi



Re: [PATCH v1 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-06 Thread Andi Kleen
On Sat, Oct 06, 2018 at 04:14:54PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Oct 2018, Andi Kleen wrote:
> > +/*
> > + * Match specific microcodes or steppings.
> 
> What means microcodes or steppings? If you mean microcode revisions then
> please spell it out and use it all over the place. steppings is confusing
> at best as its associated to the CPU stepping.

The matcher can be used to match specific hardware steppings by setting
the min/max_ucode to 0, or to match specific microcode revisions
(which are associated with steppings).

> > +const struct x86_ucode_id *x86_match_ucode_all(const struct x86_ucode_id 
> > *match)
> 
> Can you please name that so it's obvious that this checks all cpus, but
> aside of that, why would a system ever end up with different microcode
> revisions at all? The changelog is not mentioning any reason for this
> function and "/* Check all CPUs */" is not helpful either.

We still support the old microcode interface that allows updates
per CPU, and also it could happen during CPU hotplug.

> 
> > +   int cpu;
> > +   const struct x86_ucode_id *all_m = NULL;
> > +   bool first = true;
> > +
> > +   for_each_online_cpu(cpu) {
> 
> What guarantees that CPUs cannot be plugged? You either need to have
> cpus_read_lock() in this function or a lockdep_assert_cpus_held().

In my case the caller does, but yes, it should be documented.

-Andi
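
For reference, a sketch of what that documentation could look like in the
code itself, reusing x86_match_ucode_all()/x86_match_ucode_cpu() from the
patch below plus the standard hotplug lock helpers; nothing here beyond the
lockdep annotation and the caller-side locking is new:

#include <linux/cpu.h>	/* cpus_read_lock(), lockdep_assert_cpus_held() */

const struct x86_ucode_id *x86_match_ucode_all(const struct x86_ucode_id *match)
{
	const struct x86_ucode_id *all_m = NULL;
	bool first = true;
	int cpu;

	/* caller must hold the CPU hotplug lock */
	lockdep_assert_cpus_held();

	for_each_online_cpu(cpu) {
		const struct x86_ucode_id *m = x86_match_ucode_cpu(cpu, match);

		if (first)
			all_m = m;
		else if (m != all_m)
			return NULL;
		first = false;
	}
	return all_m;
}

and on the caller side:

	cpus_read_lock();
	m = x86_match_ucode_all(my_table);
	cpus_read_unlock();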


[PATCH v1 1/2] x86/cpufeature: Add facility to match microcode revisions

2018-10-05 Thread Andi Kleen
From: Andi Kleen 

For bug workarounds or checks it is useful to check for specific
microcode versions. Add a new table format to check for steppings
with min/max microcode revisions.

This does not change the existing x86_cpu_id because it's an ABI
shared with modutils, and also has quite different requirements,
as in no wildcards, but everything has to be matched exactly.

Signed-off-by: Andi Kleen 
---
 arch/x86/include/asm/cpu_device_id.h | 22 ++
 arch/x86/kernel/cpu/match.c  | 43 
 2 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/cpu_device_id.h 
b/arch/x86/include/asm/cpu_device_id.h
index baeba0567126..bfd5438c 100644
--- a/arch/x86/include/asm/cpu_device_id.h
+++ b/arch/x86/include/asm/cpu_device_id.h
@@ -11,4 +11,26 @@
 
 extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);
 
+/*
+ * Match specific microcodes or steppings.
+ *
+ * vendor/family/model/stepping must be all set.
+ * min_ucode/max_ucode/driver_data are optional and can be 0.
+ */
+
+struct x86_ucode_id {
+   __u16 vendor;
+   __u16 family;
+   __u16 model;
+   __u16 stepping;
+   __u32 min_ucode;
+   __u32 max_ucode;
+   kernel_ulong_t driver_data;
+};
+
+extern const struct x86_ucode_id *
+x86_match_ucode_cpu(int cpu, const struct x86_ucode_id *match);
+extern const struct x86_ucode_id *
+x86_match_ucode_all(const struct x86_ucode_id *match);
+
 #endif
diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c
index 3fed38812eea..f29a21b2809c 100644
--- a/arch/x86/kernel/cpu/match.c
+++ b/arch/x86/kernel/cpu/match.c
@@ -48,3 +48,46 @@ const struct x86_cpu_id *x86_match_cpu(const struct 
x86_cpu_id *match)
return NULL;
 }
 EXPORT_SYMBOL(x86_match_cpu);
+
+const struct x86_ucode_id *x86_match_ucode_cpu(int cpu,
+  const struct x86_ucode_id *match)
+{
+   const struct x86_ucode_id *m;
+   struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+   for (m = match; m->vendor | m->family | m->model; m++) {
+   if (c->x86_vendor != m->vendor)
+   continue;
+   if (c->x86 != m->family)
+   continue;
+   if (c->x86_model != m->model)
+   continue;
+   if (c->x86_stepping != m->stepping)
+   continue;
+   if (m->min_ucode && c->microcode < m->min_ucode)
+   continue;
+   if (m->max_ucode && c->microcode > m->max_ucode)
+   continue;
+   return m;
+   }
+   return NULL;
+}
+
+/* Check all CPUs */
+const struct x86_ucode_id *x86_match_ucode_all(const struct x86_ucode_id 
*match)
+{
+   int cpu;
+   const struct x86_ucode_id *all_m = NULL;
+   bool first = true;
+
+   for_each_online_cpu(cpu) {
+   const struct x86_ucode_id *m = x86_match_ucode_cpu(cpu, match);
+
+   if (first)
+   all_m = m;
+   else if (m != all_m)
+   return NULL;
+   first = false;
+   }
+   return all_m;
+}
-- 
2.17.1



Re: [RFC] perf tools: Wrong filter_band* values in json calculation"

2018-10-03 Thread Andi Kleen
> note there's couple of changes that actually changed
> the number completely, like:
> 
> -"Filter": "edge=1,filter_band2=4000",
> +"Filter": "edge=1,filter_band2=30",

Thanks. Looks good. I'll fix the scripts to generate the uncore events.

-Andi


Re: [PATCH] perf/x86/intel: Add counter freezing quirk for Goldmont

2018-10-03 Thread Andi Kleen
> There is another variant of model/stepping micro code verification code in
> intel_snb_pebs_broken(). Can we please make this table based and use a
> common function? That's certainly not the last quirk we're going to have.

I have a patch to add a table driven microcode matcher for another fix.
Will post shortly.

-Andi


[tip:perf/core] perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler

2018-10-02 Thread tip-bot for Andi Kleen
Commit-ID:  af3bdb991a5cb57c189d34aadbd3aa88995e0d9f
Gitweb: https://git.kernel.org/tip/af3bdb991a5cb57c189d34aadbd3aa88995e0d9f
Author: Andi Kleen 
AuthorDate: Wed, 8 Aug 2018 00:12:07 -0700
Committer:  Ingo Molnar 
CommitDate: Tue, 2 Oct 2018 10:14:31 +0200

perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler

Implements counter freezing for Arch Perfmon v4 (Skylake and
newer). This allows speeding up the PMI handler by avoiding
unnecessary MSR writes and makes it more accurate.

The Arch Perfmon v4 PMI handler is substantially different than
the older PMI handler.

Differences to the old handler:

- It relies on counter freezing, which eliminates several MSR
  writes from the PMI handler and lowers the overhead significantly.

  It makes the PMI handler more accurate, as all counters get
  frozen atomically as soon as any counter overflows. So there is
  much less counting of the PMI handler itself.

  With the freezing we don't need to disable or enable counters or
  PEBS. Only BTS which does not support auto-freezing still needs to
  be explicitly managed.

- The PMU acking is done at the end, not the beginning.
  This makes it possible to avoid manual enabling/disabling
  of the PMU, instead we just rely on the freezing/acking.

- The APIC is acked before reenabling the PMU, which avoids
  problems with LBRs occasionally not getting unfreezed on Skylake.

- Looping is only needed to work around a corner case in which several PMIs
  are very close to each other. For common cases, the counters are frozen
  during the PMI handler, so no re-check is needed.

This patch:

- Adds code to enable v4 counter freezing
- Fork <=v3 and >=v4 PMI handlers into separate functions.
- Add kernel parameter to disable counter freezing. It took some time to
  debug counter freezing, so in case there are new problems we added an
  option to turn it off. Would not expect this to be used until there
  are new bugs.
- Only for big core. The patch for small core will be posted later
  separately.

Performance:

When profiling a kernel build on Kabylake with different perf options,
measuring the length of all NMI handlers using the nmi handler
trace point:

V3 is without counter freezing.
V4 is with counter freezing.
The value is the average cost of the PMI handler.
(lower is better)

perf options                V3(ns)  V4(ns)  delta
-c 100000                   1088    894     -18%
-g -c 100000                1862    1646    -12%
--call-graph lbr -c 100000  3649    3367    -8%
--c.g. dwarf -c 100000      2248    1982    -12%

Signed-off-by: Andi Kleen 
Signed-off-by: Kan Liang 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: a...@kernel.org
Link: 
http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.li...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 Documentation/admin-guide/kernel-parameters.txt |   5 ++
 arch/x86/events/intel/core.c| 113 
 arch/x86/events/perf_event.h|   4 +-
 arch/x86/include/asm/msr-index.h|   1 +
 4 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 92eb1f42240d..6795dedcbd1e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -856,6 +856,11 @@
causing system reset or hang due to sending
INIT from AP to BSP.
 
+   disable_counter_freezing [HW]
+   Disable Intel PMU counter freezing feature.
+   The feature only exists starting from
+   Arch Perfmon v4 (Skylake and newer).
+
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 9b320a51f82f..bd3b8f3600b2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1995,6 +1995,18 @@ static void intel_pmu_nhm_enable_all(int added)
intel_pmu_enable_all(added);
 }
 
+static void enable_counter_freeze(void)
+{
+   update_debugctlmsr(get_debugctlmsr() |
+   DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
+}
+
+static void disable_counter_freeze(void)
+{
+   update_debugctlmsr(get_debugctlmsr() &
+   ~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
+}
+
 static inline u64 intel_pmu_get_status(void)
 {
u64 status;
@@ -2290,6 +2302,91 @@ static int handle_pmi_common(struct pt_regs *regs, u64 
status)
return handled;
 }
 
+static bool disable_counter_freezing;
+static int __init intel_perf_counter_freezing_setup(char *s)
+{
+   disable_counte
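
(The archive cuts the commit text off mid-identifier here. The handler being
defined continues along these lines, matching the disable_counter_freezing
parameter documented above; treat this as a sketch rather than a verbatim
quote of the commit:)

static int __init intel_perf_counter_freezing_setup(char *s)
{
	disable_counter_freezing = true;
	pr_info("Intel PMU Counter freezing feature disabled\n");
	return 1;
}
__setup("disable_counter_freezing", intel_perf_counter_freezing_setup);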

[PATCH v1 1/2] perf, tools: Move perf_evsel__reset_weak_group into evlist

2018-10-01 Thread Andi Kleen
From: Andi Kleen 

- Move the function from builtin-stat to evlist for reuse
- Rename to evlist to match purpose better
- Pass the evlist as first argument.
- No functional changes

Signed-off-by: Andi Kleen 
---
 tools/perf/builtin-stat.c | 29 ++---
 tools/perf/util/evlist.c  | 27 +++
 tools/perf/util/evlist.h  |  3 +++
 3 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 08a1e64078ca..e60f6991dbb8 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -567,32 +567,6 @@ static bool perf_evsel__should_store_id(struct perf_evsel 
*counter)
return STAT_RECORD || counter->attr.read_format & PERF_FORMAT_ID;
 }
 
-static struct perf_evsel *perf_evsel__reset_weak_group(struct perf_evsel 
*evsel)
-{
-   struct perf_evsel *c2, *leader;
-   bool is_open = true;
-
-   leader = evsel->leader;
-   pr_debug("Weak group for %s/%d failed\n",
-   leader->name, leader->nr_members);
-
-   /*
-* for_each_group_member doesn't work here because it doesn't
-* include the first entry.
-*/
-   evlist__for_each_entry(evsel_list, c2) {
-   if (c2 == evsel)
-   is_open = false;
-   if (c2->leader == leader) {
-   if (is_open)
-   perf_evsel__close(c2);
-   c2->leader = c2;
-   c2->nr_members = 0;
-   }
-   }
-   return leader;
-}
-
 static int __run_perf_stat(int argc, const char **argv, int run_idx)
 {
int interval = stat_config.interval;
@@ -639,7 +613,8 @@ static int __run_perf_stat(int argc, const char **argv, int 
run_idx)
if ((errno == EINVAL || errno == EBADF) &&
counter->leader != counter &&
counter->weak_group) {
-   counter = perf_evsel__reset_weak_group(counter);
+   counter = 
perf_evlist__reset_weak_group(evsel_list,
+   
counter);
goto try_again;
}
 
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index e7a4b31a84fb..de90f9097e97 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1810,3 +1810,30 @@ void perf_evlist__force_leader(struct perf_evlist 
*evlist)
leader->forced_leader = true;
}
 }
+
+struct perf_evsel *perf_evlist__reset_weak_group(struct perf_evlist 
*evsel_list,
+struct perf_evsel *evsel)
+{
+   struct perf_evsel *c2, *leader;
+   bool is_open = true;
+
+   leader = evsel->leader;
+   pr_debug("Weak group for %s/%d failed\n",
+   leader->name, leader->nr_members);
+
+   /*
+* for_each_group_member doesn't work here because it doesn't
+* include the first entry.
+*/
+   evlist__for_each_entry(evsel_list, c2) {
+   if (c2 == evsel)
+   is_open = false;
+   if (c2->leader == leader) {
+   if (is_open)
+   perf_evsel__close(c2);
+   c2->leader = c2;
+   c2->nr_members = 0;
+   }
+   }
+   return leader;
+}
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..9919eed6d15b 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -312,4 +312,7 @@ bool perf_evlist__exclude_kernel(struct perf_evlist 
*evlist);
 
 void perf_evlist__force_leader(struct perf_evlist *evlist);
 
+struct perf_evsel *perf_evlist__reset_weak_group(struct perf_evlist *evlist,
+struct perf_evsel *evsel);
+
 #endif /* __PERF_EVLIST_H */
-- 
2.17.1



[PATCH v1 2/2] perf record: Support weak groups

2018-10-01 Thread Andi Kleen
From: Andi Kleen 

Implement a weak group fallback for perf record, similar to the existing perf 
stat support.
This allows to use groups that might be longer than the available counters 
without
failing.

Before:

$ perf record  -e 
'{cycles,cache-misses,cache-references,cpu_clk_unhalted.thread,cycles,cycles,cycles}'
 -a sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event 
(cycles).
/bin/dmesg | grep -i perf may provide additional information.

After:

$ ./perf record  -e 
'{cycles,cache-misses,cache-references,cpu_clk_unhalted.thread,cycles,cycles,cycles}:W'
 -a sleep 1
WARNING: No sample_id_all support, falling back to unordered processing
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 8.136 MB perf.data (134069 samples) ]

Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-list.txt | 1 -
 tools/perf/builtin-record.c| 7 ++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-list.txt 
b/tools/perf/Documentation/perf-list.txt
index 11300dbe35c5..c50ed1177984 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -49,7 +49,6 @@ counted. The following modifiers exist:
  S - read sample value (PERF_SAMPLE_READ)
  D - pin the event to the PMU
  W - group is weak and will fallback to non-group if not schedulable,
- only supported in 'perf stat' for now.
 
 The 'p' modifier can be used for specifying how precise the instruction
 address should be. The 'p' modifier can be specified multiple times:
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 22ebeb92ac51..504d89d67b3a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -388,7 +388,12 @@ static int record__open(struct record *rec)
ui__warning("%s\n", msg);
goto try_again;
}
-
+   if ((errno == EINVAL || errno == EBADF) &&
+   pos->leader != pos &&
+   pos->weak_group) {
+   pos = perf_evlist__reset_weak_group(evlist, 
pos);
+   goto try_again;
+   }
rc = -errno;
perf_evsel__open_strerror(pos, >target,
  errno, msg, sizeof(msg));
-- 
2.17.1



Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-28 Thread Andi Kleen
On Fri, Sep 28, 2018 at 11:22:37PM +0200, Jann Horn wrote:
> On Fri, Sep 28, 2018 at 10:59 PM Andi Kleen  wrote:
> > > > This new file descriptor argument doesn't exist today so it would
> > > > need to create a new system call with more arguments
> > >
> > > Is that true? The first argument is a pointer to a struct that
> > > contains its own size, so it can be expanded without an ABI break. I
> > > don't see any reason why you couldn't cram more stuff in there.
> >
> > You're right we could put the fd into the perf_event, but the following is
> > still true:
> >
> > > > Obviously we would need to keep the old system call around
> > > > for compability, so you would need to worry about this
> > > > interaction in any case!
> 
> 
> Is that true? IIRC if you want to use the perf tools after a kernel
> update, you have to install a new version of perf anyway, no?

Not at all. perf is fully ABI compatible.

Yes Ubuntu/Debian make you do it, but there is no reason for it other
than their ignorance. Other sane distributions don't.

Usually the first step when I'm forced to use one of those machines is to
remove the useless wrapper and call the perf binary directly.

-Andi


Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-28 Thread Andi Kleen
> > This new file descriptor argument doesn't exist today so it would
> > need to create a new system call with more arguments
> 
> Is that true? The first argument is a pointer to a struct that
> contains its own size, so it can be expanded without an ABI break. I
> don't see any reason why you couldn't cram more stuff in there.

You're right we could put the fd into the perf_event, but the following is
still true:

> > Obviously we would need to keep the old system call around
> > for compability, so you would need to worry about this
> > interaction in any case!
> >
> > So tying it together doesn't make any sense, because
> > the problem has to be solved separately anyways.


-Andi
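
For reference, this is the mechanism being referred to: userspace fills in
perf_event_attr.size with the size it was compiled against, and the kernel
uses that to accept older or newer layouts, so new fields can be appended
without a new syscall. A minimal sketch:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_cycles_counter(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.size = sizeof(attr);	/* lets the kernel handle old/new layouts */
	attr.disabled = 1;

	/* pid = 0 (self), cpu = -1 (any), no group leader, no flags */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}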


Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-28 Thread Andi Kleen
On Fri, Sep 28, 2018 at 06:40:17PM +0100, Mark Rutland wrote:
> On Fri, Sep 28, 2018 at 10:23:40AM -0700, Andi Kleen wrote:
> > > There's also been prior discussion on these feature in other contexts
> > > (e.g. android expoits resulting from out-of-tree drivers). It would be
> > > nice to see those considered.
> > > 
> > > IIRC The conclusion from prior discussions (e.g. [1]) was that we wanted
> > > finer granularity of control such that we could limit PMU access to
> > > specific users -- e.g. disallow arbitrary android apps from poking *any*
> > > PMU, while allowing some more trusted apps/users to uses *some* specific
> > > PMUs.
> > > 
> > > e.g. we could add /sys/bus/event_source/devices/${PMU}/device, protect
> > > this via the usual fs ACLs, and pass the fd to perf_event_open()
> > > somehow. A valid fd would act as a capability, taking precedence over
> > > perf_event_paranoid.
> > 
> > That sounds like an orthogonal feature. I don't think the original
> > patchkit would need to be hold up for this. It would be something
> > in addition.
> 
> I have to say that I disagree -- these controls will have to interact
> somehow, and the fewer of them we have, the less complexity we'll have
> to deal with longer-term.

You're proposing to completely redesign perf_event_open.

This new file descriptor argument doesn't exist today so it would
need to create a new system call with more arguments

(and BTW it would be more than the normal 6 argument limit
we have, so actually it couldn't even be a standard syscall)

Obviously we would need to keep the old system call around
for compatibility, so you would need to worry about this
interaction in any case!

So tying it together doesn't make any sense, because
the problem has to be solved separately anyways.

> 
> > BTW can't you already do that with the syscall filter? I assume
> > the Android sandboxes already use that. Just forbid perf_event_open
> > for the apps.
> 
> Note that this was about providing access to *some* PMUs in some cases.
> 
> IIUC, if that can be done today via a syscall filter, the same is true
> of per-pmu paranoid settings.

The difference is that the Android sandboxes are likely already doing this
and have all the infrastructure, so it's just another rule.

Requiring syscall filters just to use the PMU on a system
that otherwise doesn't need them would be very odd.

-Andi
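
For completeness, the kind of rule being talked about: with libseccomp a
sandbox can simply deny the syscall (a sketch; Android's sandbox uses its own
filter machinery, this is just the generic shape):

#include <errno.h>
#include <seccomp.h>

static int deny_perf_event_open(void)
{
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
	int ret;

	if (!ctx)
		return -ENOMEM;

	/* make perf_event_open() fail with EACCES from now on */
	ret = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EACCES),
			       SCMP_SYS(perf_event_open), 0);
	if (!ret)
		ret = seccomp_load(ctx);

	seccomp_release(ctx);
	return ret;
}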


Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-28 Thread Andi Kleen
> Right now we have a single knob, which is poorly documented and that should
> be fixed first. But some googling gives you the information that allowing
> unprivileged access is a security risk. So the security focussed sysadmin

Ah, if only google could simply answer all our questions!

> will deny access to the PMUs no matter what.

It's not as if there simply is or isn't a security risk; you can't
say it is or isn't one in a global way.

Essentially these are channels of information. The channels always exist
in form of timing variances for any shared resource (like shared caches
or shared memory/IO/interconnect bandwidth) that can be measured.

Perfmon counters make the channels generally less noisy, but they do not cause
them.

To really close them completely you would need to avoid sharing
anything, or disallow measuring time, neither of which is practical
short of an air gap.
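
As a concrete illustration (a minimal sketch, not a practical attack): even
with perf_event_paranoid fully locked down, nothing more than a clock is
needed to observe such a channel.

	#include <stdio.h>
	#include <time.h>

	/* time a single access to a shared resource (here: one cache line) */
	static long access_ns(volatile char *p)
	{
		struct timespec a, b;

		clock_gettime(CLOCK_MONOTONIC, &a);
		(void)*p;
		clock_gettime(CLOCK_MONOTONIC, &b);
		return (b.tv_sec - a.tv_sec) * 1000000000L +
		       (b.tv_nsec - a.tv_nsec);
	}

	int main(void)
	{
		static char buf[64];

		/* first access is likely a miss, the second likely a hit;
		   the (noisy) delta is the timing channel */
		printf("1st: %ld ns\n", access_ns(buf));
		printf("2nd: %ld ns\n", access_ns(buf));
		return 0;
	}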

There are reasonable assessments you can make either way, and the answers
will be different based on your requirements. There isn't a single
answer that works for everyone. 

There are cases where it isn't a problem at all.

If you don't have multiple users on the system your tolerance
should be extremely high.

For systems that have multiple users there can be different tradeoffs.

So there isn't a single answer, and that is why it is important
that this is configurable.

-Andi



Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-28 Thread Andi Kleen
> There's also been prior discussion on these features in other contexts
> (e.g. android exploits resulting from out-of-tree drivers). It would be
> nice to see those considered.
> 
> IIRC The conclusion from prior discussions (e.g. [1]) was that we wanted
> finer granularity of control such that we could limit PMU access to
> specific users -- e.g. disallow arbitrary android apps from poking *any*
> PMU, while allowing some more trusted apps/users to use *some* specific
> PMUs.
> 
> e.g. we could add /sys/bus/event_source/devices/${PMU}/device, protect
> this via the usual fs ACLs, and pass the fd to perf_event_open()
> somehow. A valid fd would act as a capability, taking precedence over
> perf_event_paranoid.

That sounds like an orthogonal feature. I don't think the original
patchkit would need to be held up for this. It would be something
in addition.

BTW can't you already do that with the syscall filter? I assume
the Android sandboxes already use that. Just forbid perf_event_open
for the apps.

-Andi



Re: [PATCH v6 3/5] tools, perf, script: Add --call-trace and --call-ret-trace

2018-09-28 Thread Andi Kleen
> Seems to me, these two features are _NOT_ only a benefit for intel_pt,
> other hardware tracing (e.g. Arm CoreSight) can enable these features
> as well.  This patch is to document only for intel_pt, later if we
> enable this feature on Arm platform we need to change the doc;
> alternatively we can use more general description for these two options
> in the first place.  What do you think about this?

Likely it already works for CoreSight.

I specified intel_pt, because if we just say "traces" the users won't
know what PMU to specify for record. Being too abstract is
often not helpful.

If someone successfully tests it on CoreSight they could submit
a patch to the documentation to add "or " to these
two cases. That would make it then clear for those users too.

-Andi


Re: [RFC 3/5] perf: Allow per PMU access control

2018-09-27 Thread Andi Kleen
> +	mutex_lock(&pmus_lock);
> +	list_for_each_entry(pmu, &pmus, entry)
> +		pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
> +	mutex_unlock(&pmus_lock);

What happens to pmus that got added later?
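
One way to cover that, as a sketch only (not something in the posted RFC),
would be to also apply the current default when a PMU registers:

	/* e.g. in perf_pmu_register(), before the pmu becomes visible */
	pmu->perf_event_paranoid = sysctl_perf_event_paranoid;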

The rest looks good.

Can you post a non RFC version?

-Andi


Re: perf segmentation fault from NULL dereference

2018-09-26 Thread Andi Kleen
> Please let me know if this is a valid issue so we can get a fix in.

If it crashes it must be a valid issue of course.

But I'm not sure about your bisect. Hard to see how my patch
could cause this. Sometimes bisects go wrong. 
Did you verify by just reverting the patch?

First thing I would also try is to run with valgrind or ASan and see if it
reports anything.

-Andi


[tip:perf/core] perf script: Print DSO for callindent

2018-09-26 Thread tip-bot for Andi Kleen
Commit-ID:  a78cdee6fbb136694334ade4cedb331a9d0b4d5e
Gitweb: https://git.kernel.org/tip/a78cdee6fbb136694334ade4cedb331a9d0b4d5e
Author: Andi Kleen 
AuthorDate: Tue, 18 Sep 2018 05:32:10 -0700
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Wed, 19 Sep 2018 15:25:51 -0300

perf script: Print DSO for callindent

Now that we don't need to print the IP/ADDR for callindent the DSO is
also not printed. It's useful for some cases, so add a separate DSO printout
for callindent for the case when IP/ADDR is not enabled.

Before:

% perf script --itrace=cr -F +callindent,-ip,-sym,-symoff,-addr
 swapper 0 [000]  3377.917072:  1   branches: pt_config
 swapper 0 [000]  3377.917072:  1   branches: pt_config
 swapper 0 [000]  3377.917072:  1   branches: pt_event_add
 swapper 0 [000]  3377.917072:  1   branches: perf_pmu_enable
 swapper 0 [000]  3377.917072:  1   branches: perf_pmu_nop_void
 swapper 0 [000]  3377.917072:  1   branches: event_sched_in.isra.107
 swapper 0 [000]  3377.917072:  1   branches: __x86_indirect_thunk_rax
 swapper 0 [000]  3377.917072:  1   branches: perf_pmu_nop_int
 swapper 0 [000]  3377.917072:  1   branches: group_sched_in
 swapper 0 [000]  3377.917072:  1   branches: event_filter_match
 swapper 0 [000]  3377.917072:  1   branches: event_filter_match
 swapper 0 [000]  3377.917072:  1   branches: group_sched_in

After:

 swapper 0 [000]  3377.917072:  1   branches: ([unknown])   pt_config
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   pt_config
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   pt_event_add
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   perf_pmu_enable
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   perf_pmu_nop_void
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   event_sched_in.isra.107
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   __x86_indirect_thunk_rax
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   perf_pmu_nop_int
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   group_sched_in
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   event_filter_match
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   event_filter_match
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   group_sched_in
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   __x86_indirect_thunk_rax
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   perf_pmu_nop_txn
 swapper 0 [000]  3377.917072:  1   branches: ([kernel.kallsyms])   event_sched_in.isra.107

(in the kernel case of course it's not very useful, but it's important
with user programs where symbols are not unique)

Signed-off-by: Andi Kleen 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Adrian Hunter 
Cc: Jiri Olsa 
Cc: Kim Phillips 
Link: http://lkml.kernel.org/r/20180918123214.26728-6-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/builtin-script.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 7aa59696e97a..7732346bd9dd 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1115,6 +1115,7 @@ static int perf_sample__fprintf_callindent(struct 
perf_sample *sample,
const char *name = NULL;
static int spacing;
int len = 0;
+   int dlen = 0;
u64 ip = 0;
 
/*
@@ -1141,6 +1142,12 @@ static int perf_sample__fprintf_callindent(struct 
perf_sample *sample,
ip = sample->ip;
}
 
+   if (PRINT_FIELD(DSO) && !(PRINT_FIELD(IP) || PRINT_FIELD(ADDR))) {
+   dlen += fprintf(fp, "(");
+   dlen += map__fprintf_dsoname(al->map, fp);
+   dlen += fprintf(fp, ")\t");
+   }
+
if (name)
len = fprintf(fp, "%*s%s", (int)depth * 4, "", name);
else if (ip)
@@ -1159,7 +1166,7 @@ static int perf_sample__fprintf_callindent(struct 
perf_sample *sample,
if (len < spacing)
len += fprintf(fp, "%*s", spacing - len, "");
 
-   return len;
+   return len + dlen;
 }
 
 static int perf_sample__fprintf_insn(struct perf_sample *sample,


[tip:perf/core] tools lib subcmd: Support overwriting the pager

2018-09-26 Thread tip-bot for Andi Kleen
Commit-ID:  03a1f49f26482cf832e7f0157ae5da716d927701
Gitweb: https://git.kernel.org/tip/03a1f49f26482cf832e7f0157ae5da716d927701
Author: Andi Kleen 
AuthorDate: Tue, 18 Sep 2018 05:32:07 -0700
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Wed, 19 Sep 2018 15:16:19 -0300

tools lib subcmd: Support overwriting the pager

Add an interface to the auto pager code that allows callers to overwrite
the pager.

Signed-off-by: Andi Kleen 
Cc: Adrian Hunter 
Cc: Jiri Olsa 
Cc: Josh Poimboeuf 
Cc: Kim Phillips 
Cc: Namhyung Kim 
Link: http://lkml.kernel.org/r/20180918123214.26728-3-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/lib/subcmd/pager.c | 11 ++-
 tools/lib/subcmd/pager.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/lib/subcmd/pager.c b/tools/lib/subcmd/pager.c
index 9997a8805a82..e3d47b59b14d 100644
--- a/tools/lib/subcmd/pager.c
+++ b/tools/lib/subcmd/pager.c
@@ -23,6 +23,13 @@ void pager_init(const char *pager_env)
subcmd_config.pager_env = pager_env;
 }
 
+static const char *forced_pager;
+
+void force_pager(const char *pager)
+{
+   forced_pager = pager;
+}
+
 static void pager_preexec(void)
 {
/*
@@ -66,7 +73,9 @@ void setup_pager(void)
const char *pager = getenv(subcmd_config.pager_env);
struct winsize sz;
 
-   if (!isatty(1))
+   if (forced_pager)
+   pager = forced_pager;
+   if (!isatty(1) && !forced_pager)
return;
if (ioctl(1, TIOCGWINSZ, ) == 0)
pager_columns = sz.ws_col;
diff --git a/tools/lib/subcmd/pager.h b/tools/lib/subcmd/pager.h
index f1a53cf29880..a818964693ab 100644
--- a/tools/lib/subcmd/pager.h
+++ b/tools/lib/subcmd/pager.h
@@ -7,5 +7,6 @@ extern void pager_init(const char *pager_env);
 extern void setup_pager(void);
 extern int pager_in_use(void);
 extern int pager_get_columns(void);
+extern void force_pager(const char *);
 
 #endif /* __SUBCMD_PAGER_H */
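
A minimal usage sketch for the new hook (hypothetical caller; the exact
include path depends on how the tool links against libsubcmd):

	#include <subcmd/pager.h>

	int main(void)
	{
		/* force paging through "less" even when stdout is not a tty */
		force_pager("less");
		setup_pager();

		/* ... emit long output on stdout ... */
		return 0;
	}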


[tip:perf/core] perf script: Allow sym and dso without ip, addr

2018-09-26 Thread tip-bot for Andi Kleen
Commit-ID:  37fed3de555199733805a2d3e03aee7727c09ea4
Gitweb: https://git.kernel.org/tip/37fed3de555199733805a2d3e03aee7727c09ea4
Author: Andi Kleen 
AuthorDate: Tue, 18 Sep 2018 05:32:09 -0700
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Wed, 19 Sep 2018 15:20:03 -0300

perf script: Allow sym and dso without ip, addr

Currently sym and dso require printing ip and addr because the print
function is tied to those outputs. With callindent it makes sense to
print the symbol or dso without numerical IP or ADDR. So change the
dependency check to only check the underlying attribute.

Also the branch target output relies on the user_set flag to determine
if the branch target should be implicitly printed. When modifying the
fields with + or - also set user_set, so that ADDR can be removed. We
also need to set wildcard_set to make the initial sanity check pass.

This allows removing a lot of noise in callindent output by dropping
the numerical addresses, which are not all that useful.

Before

% perf script --itrace=cr -F +callindent
 swapper 0 [000] 156546.354971:  1   branches: pt_config
   0 [unknown] ([unknown]) => 81010486 
pt_config ([kernel.kallsyms])
 swapper 0 [000] 156546.354971:  1   branches: 
pt_config81010499 pt_config ([kernel.kallsyms]) => 
8101063e pt_event_add ([kernel.kallsyms])
 swapper 0 [000] 156546.354971:  1   branches: 
pt_event_add 81010635 pt_event_add ([kernel.kallsyms]) 
=> 8115e687 event_sched_in.isra.107 ([kernel.kallsyms])
 swapper 0 [000] 156546.354971:  1   branches: 
perf_pmu_enable  8115e726 event_sched_in.isra.107 
([kernel.kallsyms]) => 811579b0 perf_pmu_enable ([kernel.kallsyms])
 swapper 0 [000] 156546.354971:  1   branches: 
perf_pmu_nop_void81151730 perf_pmu_nop_void 
([kernel.kallsyms]) => 8115e72b event_sched_in.isra.107 
([kernel.kallsyms])
 swapper 0 [000] 156546.354971:  1   branches: 
event_sched_in.isra.107  8115e737 event_sched_in.isra.107 
([kernel.kallsyms]) => 8115e7a5 group_sched_in ([kernel.kallsyms])
 swapper 0 [000] 156546.354971:  1   branches: 
__x86_indirect_thunk_rax 8115e7f6 group_sched_in 
([kernel.kallsyms]) => 81a03000 __x86_indirect_thunk_rax 
([kernel.kallsyms])

After

% perf script --itrace=cr -F +callindent,-ip,-sym,-symoff
   swapper 0 [000] 156546.354971:  1   branches:  pt_config
 swapper 0 [000] 156546.354971:  1   branches:  
pt_config
 swapper 0 [000] 156546.354971:  1   branches:  
pt_event_add
 swapper 0 [000] 156546.354971:  1   branches:   
perf_pmu_enable
 swapper 0 [000] 156546.354971:  1   branches:   
perf_pmu_nop_void
 swapper 0 [000] 156546.354971:  1   branches:   
event_sched_in.isra.107
 swapper 0 [000] 156546.354971:  1   branches:   
__x86_indirect_thunk_rax
 swapper 0 [000] 156546.354971:  1   branches:   
perf_pmu_nop_int
 swapper 0 [000] 156546.354971:  1   branches:   
group_sched_in
 swapper 0 [000] 156546.354971:  1   branches:   
event_filter_match
 swapper 0 [000] 156546.354971:  1   branches:   
event_filter_match
 swapper 0 [000] 156546.354971:  1   branches:   
group_sched_in

Signed-off-by: Andi Kleen 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Adrian Hunter 
Cc: Jiri Olsa 
Cc: Kim Phillips 
Link: http://lkml.kernel.org/r/20180918123214.26728-5-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/builtin-script.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 66699dd351e0..7aa59696e97a 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -406,9 +406,10 @@ static int perf_evsel__check_attr(struct perf_evsel *evsel,
PERF_OUTPUT_WEIGHT))
return -EINVAL;
 
-   if (PRINT_FIELD(SYM) && !PRINT_FIELD(IP) && !PRINT_FIELD(ADDR)) {
+   if (PRINT_FIELD(SYM) &&
+   !(evsel->attr.sample_type & (PERF_SAMPLE_IP|PERF_SAMPLE_ADDR))) 
{
pr_err("Display of symbols requested but neither sample IP nor "
-  "sample address\nis selected. Hence, no addresses to 
convert "
+  "sample address\navailable. Hence, no addresses to 
convert "
   "to symbols.\n");
return -EINVA

[tip:perf/core] perf tools: Report itrace options in help

2018-09-26 Thread tip-bot for Andi Kleen
Commit-ID:  c12e039d1233f24ab2726945f883037f47b26f1d
Gitweb: https://git.kernel.org/tip/c12e039d1233f24ab2726945f883037f47b26f1d
Author: Andi Kleen 
AuthorDate: Thu, 13 Sep 2018 20:10:31 -0700
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Wed, 19 Sep 2018 15:06:59 -0300

perf tools: Report itrace options in help

I often forget all the options that --itrace accepts. Instead of burying
them in the man page, also report them in the normal command line help
to make them more easily accessible.

v2: Align

Signed-off-by: Andi Kleen 
Cc: Jiri Olsa 
Cc: Kim Phillips 
Link: http://lkml.kernel.org/r/20180914031038.4160-2-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/builtin-inject.c |  3 ++-
 tools/perf/builtin-report.c |  2 +-
 tools/perf/builtin-script.c |  2 +-
 tools/perf/util/auxtrace.h  | 19 +++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index b4a29f435b06..eda41673c4f3 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -798,7 +798,8 @@ int cmd_inject(int argc, const char **argv)
   "kallsyms pathname"),
OPT_BOOLEAN('f', "force", , "don't complain, do it"),
OPT_CALLBACK_OPTARG(0, "itrace", _synth_opts,
-   NULL, "opts", "Instruction Tracing options",
+   NULL, "opts", "Instruction Tracing 
options\n"
+   ITRACE_HELP,
itrace_parse_synth_opts),
OPT_BOOLEAN(0, "strip", ,
"strip non-synthesized events (use with --itrace)"),
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 7507e4d6dce1..c0703979c51d 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1104,7 +1104,7 @@ int cmd_report(int argc, const char **argv)
OPT_CALLBACK(0, "percentage", NULL, "relative|absolute",
 "how to display percentage of filtered entries", 
parse_filter_percentage),
OPT_CALLBACK_OPTARG(0, "itrace", _synth_opts, NULL, "opts",
-   "Instruction Tracing options",
+   "Instruction Tracing options\n" ITRACE_HELP,
itrace_parse_synth_opts),
OPT_BOOLEAN(0, "full-source-path", _full_filename,
"Show full source file name path for source lines"),
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 765391b6c88c..66699dd351e0 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3193,7 +3193,7 @@ int cmd_script(int argc, const char **argv)
OPT_BOOLEAN(0, "ns", ,
"Use 9 decimal places when displaying time"),
OPT_CALLBACK_OPTARG(0, "itrace", _synth_opts, NULL, "opts",
-   "Instruction Tracing options",
+   "Instruction Tracing options\n" ITRACE_HELP,
itrace_parse_synth_opts),
OPT_BOOLEAN(0, "full-source-path", _full_filename,
"Show full source file name path for source lines"),
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index a86b7eab6673..0a6ce9c4fc11 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -576,6 +576,23 @@ static inline void auxtrace__free(struct perf_session 
*session)
return session->auxtrace->free(session);
 }
 
+#define ITRACE_HELP \
+"  i:  synthesize instructions 
events\n"   \
+"  b:  synthesize branches 
events\n"   \
+"  c:  synthesize branches 
events (calls only)\n"  \
+"  r:  synthesize branches 
events (returns only)\n" \
+"  x:  synthesize transactions 
events\n"   \
+"  w:  synthesize ptwrite 
events\n"\
+"  p:  synthesize power 
events\n"  \
+"  e:  synthesize error 
events\n"  \
+"  d:  create a debug log\n"   
\
+"  g[len]: synthesize a call chain 
(use wi

[PATCH v6 5/5] perf, tools, script: Support total cycles count

2018-09-24 Thread Andi Kleen
[Updated patch with a bug fix. Please use this version.]

For perf script brstackinsn also print a running cycles count.
This makes it easier to calculate cycle deltas for code sections
measured with LBRs.

% perf record -b -a sleep 1
% perf script -F +brstackinsn
...
7f73ecc41083	insn: 74 06                     # PRED 9 cycles [17] 1.11 IPC
7f73ecc4108b	insn: a8 10
7f73ecc4108d	insn: 74 71                     # PRED 1 cycles [18] 1.00 IPC
7f73ecc41100	insn: 48 8b 46 10
7f73ecc41104	insn: 4c 8b 38
7f73ecc41107	insn: 4d 85 ff
7f73ecc4110a	insn: 0f 84 b0 00 00 00
7f73ecc41110	insn: 83 43 58 01
7f73ecc41114	insn: 48 89 df
7f73ecc41117	insn: e8 94 73 04 00            # PRED 6 cycles [24] 1.00 IPC
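
(Reading the running count in the example: between the first and the last
branch shown the counter goes from [17] to [24], so that stretch of code
cost 24 - 17 = 7 cycles, i.e. the 1 and 6 cycle blocks in between.)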

Signed-off-by: Andi Kleen 

---

v2: reflow line
v3: Print cycles in correct column

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 6232658c6f31..dbb0a780225c 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -913,7 +913,7 @@ static int grab_bb(u8 *buffer, u64 start, u64 end,
 
 static int ip__fprintf_jump(uint64_t ip, struct branch_entry *en,
struct perf_insn *x, u8 *inbuf, int len,
-   int insn, FILE *fp)
+   int insn, FILE *fp, int *total_cycles)
 {
int printed = fprintf(fp, "\t%016" PRIx64 "\t%-30s\t#%s%s%s%s", ip,
  dump_insn(x, ip, inbuf, len, NULL),
@@ -922,7 +922,8 @@ static int ip__fprintf_jump(uint64_t ip, struct 
branch_entry *en,
  en->flags.in_tx ? " INTX" : "",
  en->flags.abort ? " ABORT" : "");
if (en->flags.cycles) {
-   printed += fprintf(fp, " %d cycles", en->flags.cycles);
+   *total_cycles += en->flags.cycles;
+   printed += fprintf(fp, " %d cycles [%d]", en->flags.cycles, 
*total_cycles);
if (insn)
printed += fprintf(fp, " %.2f IPC", (float)insn / 
en->flags.cycles);
}
@@ -979,6 +980,7 @@ static int perf_sample__fprintf_brstackinsn(struct 
perf_sample *sample,
u8 buffer[MAXBB];
unsigned off;
struct symbol *lastsym = NULL;
+   int total_cycles = 0;
 
if (!(br && br->nr))
return 0;
@@ -999,7 +1001,7 @@ static int perf_sample__fprintf_brstackinsn(struct 
perf_sample *sample,
printed += ip__fprintf_sym(br->entries[nr - 1].from, thread,
   x.cpumode, x.cpu, , attr, 
fp);
printed += ip__fprintf_jump(br->entries[nr - 1].from, 
>entries[nr - 1],
-   , buffer, len, 0, fp);
+   , buffer, len, 0, fp, 
_cycles);
}
 
/* Print all blocks */
@@ -1027,7 +1029,8 @@ static int perf_sample__fprintf_brstackinsn(struct 
perf_sample *sample,
 
printed += ip__fprintf_sym(ip, thread, x.cpumode, 
x.cpu, , attr, fp);
if (ip == end) {
-   printed += ip__fprintf_jump(ip, 
>entries[i], , buffer + off, len - off, insn, fp);
+   printed += ip__fprintf_jump(ip, 
>entries[i], , buffer + off, len - off, insn, fp,
+   _cycles);
break;
} else {
printed += fprintf(fp, "\t%016" PRIx64 
"\t%s\n", ip,


Re: [PATCH] perf tools: Do not zero sample_id_all for group members

2018-09-23 Thread Andi Kleen
On Sun, Sep 23, 2018 at 05:04:20PM +0200, Jiri Olsa wrote:
> Andi reported following malfunction:
> 
>   # perf record -e '{ref-cycles,cycles}:S' -a sleep 1
>   # perf script
>   non matching sample_id_all
> 
> That's because we disable sample_id_all bit for non-sampling
> group members. We can't do that, because it needs to be the
> same over the whole event list. This patch keeps it untouched
> again.
> 
> Fixes: e9add8bac6c6 ("perf evsel: Disable write_backward for leader sampling 
> group events")
> Reported-by: Andi Kleen 
> Link: http://lkml.kernel.org/n/tip-y7hc1i6en4saxqxuwr2g2...@git.kernel.org
> Signed-off-by: Jiri Olsa 

Tested-by: Andi Kleen 

Should also be cc'ed to stable please

-Andi


[PATCH v6 1/5] tools, perf, script: Add --insn-trace for instruction decoding

2018-09-20 Thread Andi Kleen
From: Andi Kleen 

Add a --insn-trace shorthand option for decoding and disassembling
instruction streams for intel_pt. This automatically pipes the
output into the xed disassembler to generate disassembled instructions.
This just makes this use model much nicer to use.

Before

% perf record -e intel_pt// ...
% perf script --itrace=i0ns --ns -F +insn,-event,-period | xed -F insn: -A -64
   swapper 0 [000] 117276.429606186:  81010486 pt_config 
([kernel.kallsyms])  nopl  %eax, (%rax,%rax,1)
   swapper 0 [000] 117276.429606186:  8101048b pt_config 
([kernel.kallsyms])  add $0x10, %rsp
   swapper 0 [000] 117276.429606186:  8101048f pt_config 
([kernel.kallsyms])  popq  %rbx
   swapper 0 [000] 117276.429606186:  81010490 pt_config 
([kernel.kallsyms])  popq  %rbp
   swapper 0 [000] 117276.429606186:  81010491 pt_config 
([kernel.kallsyms])  popq  %r12
   swapper 0 [000] 117276.429606186:  81010493 pt_config 
([kernel.kallsyms])  popq  %r13
   swapper 0 [000] 117276.429606186:  81010495 pt_config 
([kernel.kallsyms])  popq  %r14
   swapper 0 [000] 117276.429606186:  81010497 pt_config 
([kernel.kallsyms])  popq  %r15
   swapper 0 [000] 117276.429606186:  81010499 pt_config 
([kernel.kallsyms])  retq
   swapper 0 [000] 117276.429606186:  8101063e pt_event_add 
([kernel.kallsyms])   cmpl  $0x1, 0x1b0(%rbx)
   swapper 0 [000] 117276.429606186:  81010645 pt_event_add 
([kernel.kallsyms])   mov $0xffea, %eax
   swapper 0 [000] 117276.429606186:  8101064a pt_event_add 
([kernel.kallsyms])   mov $0x0, %edx
   swapper 0 [000] 117276.429606186:  8101064f pt_event_add 
([kernel.kallsyms])   popq  %rbx
   swapper 0 [000] 117276.429606186:  81010650 pt_event_add 
([kernel.kallsyms])   cmovnz %edx, %eax
   swapper 0 [000] 117276.429606186:  81010653 pt_event_add 
([kernel.kallsyms])   jmp 0x81010635
   swapper 0 [000] 117276.429606186:  81010635 pt_event_add 
([kernel.kallsyms])   retq
   swapper 0 [000] 117276.429606186:  8115e687 
event_sched_in.isra.107 ([kernel.kallsyms])test %eax, %eax

Now

% perf record -e intel_pt// ...
% perf script --insn-trace --xed
... same output ...

XED needs to be installed with:

> git clone https://github.com/intelxed/mbuild.git mbuild
> git clone https://github.com/intelxed/xed
> cd xed
> mkdir obj
> cd obj
> ../mfile.py
> sudo ../mfile.py --prefix=/usr/local install

Signed-off-by: Andi Kleen 

--

v2: Add separate --xed option
v3: Add xed build documentation and update commit
---
 tools/perf/Documentation/build-xed.txt   | 11 +++
 tools/perf/Documentation/perf-script.txt |  7 +++
 tools/perf/builtin-script.c  | 23 +++
 3 files changed, 41 insertions(+)
 create mode 100644 tools/perf/Documentation/build-xed.txt

diff --git a/tools/perf/Documentation/build-xed.txt 
b/tools/perf/Documentation/build-xed.txt
new file mode 100644
index ..8da3028e6dca
--- /dev/null
+++ b/tools/perf/Documentation/build-xed.txt
@@ -0,0 +1,11 @@
+
+For --xed the xed tool is needed. Here is how to install it:
+
+> git clone https://github.com/intelxed/mbuild.git mbuild  
  
+> git clone https://github.com/intelxed/xed
  
+> cd xed   
  
+> mkdir obj
  
+> cd obj   
  
+> ../mfile.py  
  
+> sudo ../mfile.py --prefix=/usr/local install
+
diff --git a/tools/perf/Documentation/perf-script.txt 
b/tools/perf/Documentation/perf-script.txt
index afdafe2110a1..00c655ab4968 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -383,6 +383,13 @@ include::itrace.txt[]
will be printed. Each entry has function name and file/line. Enabled by
default, disable with --no-inline.
 
+--insn-trace::
+   Show instruction stream for intel_pt traces. Combine with --xed to
+   show disassembly.
+
+--xed::
+   Run xed disassembler on output. Requires installing the xed 
disassembler.
+
 

[PATCH v6 2/5] perf, tools, script: Make itrace script default to all calls

2018-09-20 Thread Andi Kleen
From: Andi Kleen 

By default perf script for itrace outputs sampled instructions or
branches. In my experience this is confusing to users because it's
hard to correlate with real program behavior. The sampling makes
sense for tools like report that actually sample to reduce
the run time, but run time is normally not a problem for perf script.
It's better to give an accurate representation of the program flow.

Default perf script to output all calls for itrace. That's a much saner
default. The old behavior can still be requested with
perf script --itrace=ibxwpe10

v2: Fix ETM build failure
v3: Really fix ETM build failure (Kim Phillips)
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/itrace.txt |  7 ---
 tools/perf/builtin-script.c |  5 -
 tools/perf/util/auxtrace.c  | 17 -
 tools/perf/util/auxtrace.h  |  5 -
 tools/perf/util/cs-etm.c|  3 ++-
 tools/perf/util/intel-bts.c |  3 ++-
 tools/perf/util/intel-pt.c  |  3 ++-
 7 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/itrace.txt 
b/tools/perf/Documentation/itrace.txt
index a3abe04c779d..c2182cbabde3 100644
--- a/tools/perf/Documentation/itrace.txt
+++ b/tools/perf/Documentation/itrace.txt
@@ -11,10 +11,11 @@
l   synthesize last branch entries (use with i or x)
s   skip initial number of events
 
-   The default is all events i.e. the same as --itrace=ibxwpe
+   The default is all events i.e. the same as --itrace=ibxwpe,
+   except for perf script where it is --itrace=ce
 
-   In addition, the period (default 10) for instructions events
-   can be specified in units of:
+   In addition, the period (default 10, except for perf script where 
it is 1)
+   for instructions events can be specified in units of:
 
i   instructions
t   ticks
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 311f5b53dd83..519ebb5a1f96 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3128,7 +3128,10 @@ int cmd_script(int argc, const char **argv)
char *rec_script_path = NULL;
char *rep_script_path = NULL;
struct perf_session *session;
-   struct itrace_synth_opts itrace_synth_opts = { .set = false, };
+   struct itrace_synth_opts itrace_synth_opts = {
+   .set = false,
+   .default_no_sample = true,
+   };
char *script_path = NULL;
const char **__argv;
int i, j, err = 0;
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index c4617bcfd521..72d5ba2479bf 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -962,16 +962,23 @@ s64 perf_event__process_auxtrace(struct perf_session 
*session,
 #define PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ 64
 #define PERF_ITRACE_MAX_LAST_BRANCH_SZ 1024
 
-void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts)
+void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts,
+   bool no_sample)
 {
-   synth_opts->instructions = true;
synth_opts->branches = true;
synth_opts->transactions = true;
synth_opts->ptwrites = true;
synth_opts->pwr_events = true;
synth_opts->errors = true;
-   synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
-   synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+   if (no_sample) {
+   synth_opts->period_type = PERF_ITRACE_PERIOD_INSTRUCTIONS;
+   synth_opts->period = 1;
+   synth_opts->calls = true;
+   } else {
+   synth_opts->instructions = true;
+   synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
+   synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+   }
synth_opts->callchain_sz = PERF_ITRACE_DEFAULT_CALLCHAIN_SZ;
synth_opts->last_branch_sz = PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ;
synth_opts->initial_skip = 0;
@@ -999,7 +1006,7 @@ int itrace_parse_synth_opts(const struct option *opt, 
const char *str,
}
 
if (!str) {
-   itrace_synth_opts__set_default(synth_opts);
+   itrace_synth_opts__set_default(synth_opts, false);
return 0;
}
 
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 0a6ce9c4fc11..f6df30187e1c 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -57,6 +57,7 @@ enum itrace_period_type {
 /**
  * struct itrace_synth_opts - AUX area tracing synthesis options.
  * @set: indicates whether or not options have been set
+ * @default_no_sample: Default to no sampling.
  * @inject: indicates the event (not just the sample) must be fully synthesized
  *  because 'perf inject' will wr

[PATCH v6 2/5] perf, tools, script: Make itrace script default to all calls

2018-09-20 Thread Andi Kleen
From: Andi Kleen 

By default perf script for itrace outputs sampled instructions or
branches. In my experience this is confusing to users because it's
hard to correlate with real program behavior. The sampling makes
sense for tools like report that actually sample to reduce
the run time, but run time is normally not a problem for perf script.
It's better to give an accurate representation of the program flow.

Default perf script to output all calls for itrace. That's a much saner
default. The old behavior can be still requested with
perf script --itrace=ibxwpe10

v2: Fix ETM build failure
v3: Really fix ETM build failure (Kim Phillips)
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/itrace.txt |  7 ---
 tools/perf/builtin-script.c |  5 -
 tools/perf/util/auxtrace.c  | 17 -
 tools/perf/util/auxtrace.h  |  5 -
 tools/perf/util/cs-etm.c|  3 ++-
 tools/perf/util/intel-bts.c |  3 ++-
 tools/perf/util/intel-pt.c  |  3 ++-
 7 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/itrace.txt 
b/tools/perf/Documentation/itrace.txt
index a3abe04c779d..c2182cbabde3 100644
--- a/tools/perf/Documentation/itrace.txt
+++ b/tools/perf/Documentation/itrace.txt
@@ -11,10 +11,11 @@
l   synthesize last branch entries (use with i or x)
s   skip initial number of events
 
-   The default is all events i.e. the same as --itrace=ibxwpe
+   The default is all events i.e. the same as --itrace=ibxwpe,
+   except for perf script where it is --itrace=ce
 
-   In addition, the period (default 10) for instructions events
-   can be specified in units of:
+   In addition, the period (default 10, except for perf script where it is 1)
+   for instructions events can be specified in units of:
 
i   instructions
t   ticks
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 311f5b53dd83..519ebb5a1f96 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3128,7 +3128,10 @@ int cmd_script(int argc, const char **argv)
char *rec_script_path = NULL;
char *rep_script_path = NULL;
struct perf_session *session;
-   struct itrace_synth_opts itrace_synth_opts = { .set = false, };
+   struct itrace_synth_opts itrace_synth_opts = {
+   .set = false,
+   .default_no_sample = true,
+   };
char *script_path = NULL;
const char **__argv;
int i, j, err = 0;
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index c4617bcfd521..72d5ba2479bf 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -962,16 +962,23 @@ s64 perf_event__process_auxtrace(struct perf_session 
*session,
 #define PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ 64
 #define PERF_ITRACE_MAX_LAST_BRANCH_SZ 1024
 
-void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts)
+void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts,
+   bool no_sample)
 {
-   synth_opts->instructions = true;
synth_opts->branches = true;
synth_opts->transactions = true;
synth_opts->ptwrites = true;
synth_opts->pwr_events = true;
synth_opts->errors = true;
-   synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
-   synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+   if (no_sample) {
+   synth_opts->period_type = PERF_ITRACE_PERIOD_INSTRUCTIONS;
+   synth_opts->period = 1;
+   synth_opts->calls = true;
+   } else {
+   synth_opts->instructions = true;
+   synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
+   synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+   }
synth_opts->callchain_sz = PERF_ITRACE_DEFAULT_CALLCHAIN_SZ;
synth_opts->last_branch_sz = PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ;
synth_opts->initial_skip = 0;
@@ -999,7 +1006,7 @@ int itrace_parse_synth_opts(const struct option *opt, 
const char *str,
}
 
if (!str) {
-   itrace_synth_opts__set_default(synth_opts);
+   itrace_synth_opts__set_default(synth_opts, false);
return 0;
}
 
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 0a6ce9c4fc11..f6df30187e1c 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -57,6 +57,7 @@ enum itrace_period_type {
 /**
  * struct itrace_synth_opts - AUX area tracing synthesis options.
  * @set: indicates whether or not options have been set
+ * @default_no_sample: Default to no sampling.
  * @inject: indicates the event (not just the sample) must be fully synthesized
  *  because 'perf inject' will wr

[PATCH v6 5/5] perf, tools, script: Support total cycles count

2018-09-20 Thread Andi Kleen
From: Andi Kleen 

For perf script brstackinsn, also print a running cycles count.
This makes it easier to calculate cycle deltas for code sections
measured with LBRs.

% perf record -b -a sleep 1
% perf script -F +brstackinsn
...
_dl_sysdep_start+330:
	7eff9f20583a	insn: 75 c4		# PRED 24 cycles [24]
	7eff9f205800	insn: 48 83 e8 03
	7eff9f205804	insn: 48 83 f8 1e
	7eff9f205808	insn: 77 26
	7eff9f20580a	insn: 48 63 04 81
	7eff9f20580e	insn: 48 01 c8
	7eff9f205811	insn: ff e0		# MISPRED 31 cycles [7] 0.71 IPC
	7eff9f2059c0	insn: 44 8b 62 08
	7eff9f2059c4	insn: e9 67 fe ff ff	# PRED 55 cycles [24] 0.04 IPC
	7eff9f205830	insn: 48 83 c2 10
	7eff9f205834	insn: 48 8b 02
	7eff9f205837	insn: 48 85 c0
	7eff9f20583a	insn: 75 c4		# PRED 68 cycles [13] 0.23 IPC
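
(The number printed before "cycles" is the new running total and the
bracketed number is the cycle count of just this block, so the cycles
elapsed between two points are simply the difference of the totals,
e.g. 31 - 24 = 7 cycles for the second annotated block above.)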

Signed-off-by: Andi Kleen 

---

v2: reflow line
---
 tools/perf/builtin-script.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 6232658c6f31..53dc27e63d98 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -913,7 +913,7 @@ static int grab_bb(u8 *buffer, u64 start, u64 end,
 
 static int ip__fprintf_jump(uint64_t ip, struct branch_entry *en,
 			    struct perf_insn *x, u8 *inbuf, int len,
-			    int insn, FILE *fp)
+			    int insn, FILE *fp, int *total_cycles)
 {
 	int printed = fprintf(fp, "\t%016" PRIx64 "\t%-30s\t#%s%s%s%s", ip,
 			      dump_insn(x, ip, inbuf, len, NULL),
@@ -922,7 +922,8 @@ static int ip__fprintf_jump(uint64_t ip, struct branch_entry *en,
 			      en->flags.in_tx ? " INTX" : "",
 			      en->flags.abort ? " ABORT" : "");
 	if (en->flags.cycles) {
-		printed += fprintf(fp, " %d cycles", en->flags.cycles);
+		*total_cycles += en->flags.cycles;
+		printed += fprintf(fp, " %d cycles [%d]", *total_cycles, en->flags.cycles);
 		if (insn)
 			printed += fprintf(fp, " %.2f IPC", (float)insn / en->flags.cycles);
 	}
@@ -979,6 +980,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 	u8 buffer[MAXBB];
 	unsigned off;
 	struct symbol *lastsym = NULL;
+	int total_cycles = 0;
 
 	if (!(br && br->nr))
 		return 0;
@@ -999,7 +1001,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 		printed += ip__fprintf_sym(br->entries[nr - 1].from, thread,
 					   x.cpumode, x.cpu, &lastsym, attr, fp);
 		printed += ip__fprintf_jump(br->entries[nr - 1].from, &br->entries[nr - 1],
-					    &x, buffer, len, 0, fp);
+					    &x, buffer, len, 0, fp, &total_cycles);
 	}
 
 	/* Print all blocks */
@@ -1027,7 +1029,8 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 
 			printed += ip__fprintf_sym(ip, thread, x.cpumode, x.cpu, &lastsym, attr, fp);
 			if (ip == end) {
-				printed += ip__fprintf_jump(ip, &br->entries[i], &x, buffer + off, len - off, insn, fp);
+				printed += ip__fprintf_jump(ip, &br->entries[i], &x, buffer + off, len - off, insn, fp,
+							    &total_cycles);
 				break;
 			} else {
 				printed += fprintf(fp, "\t%016" PRIx64 "\t%s\n", ip,
-- 
2.17.1



Make perf script easier to use for itrace

2018-09-20 Thread Andi Kleen
Implement a range of improvements to make it easier to look
at itrace traces with perf script. Nothing here couldn't be done
before with some additional scripting, but this adds simple,
high-level options to make it easier to use.

% perf record -e intel_pt//k -a sleep 1

Show function calls:

% perf script --call-trace
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_pmu_enable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
event_filter_match
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
group_sched_in
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
event_sched_in.isra.107
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_set_state.part.71
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_update_time
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_pmu_disable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_log_itrace_start
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_update_userpage

Show function calls and returns:

% perf script --call-ret-trace
perf   900 [000] 194167.205652203:   tr strt ([unknown])
pt_config
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])pt_config
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])pt_event_add
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])perf_pmu_enable
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])perf_pmu_nop_void
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])event_sched_in.isra.107
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])perf_pmu_nop_int
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])group_sched_in
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])event_filter_match
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])event_filter_match
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])group_sched_in


Show instruction traces (using XED):

  % perf script --insn-trace --xed
   swapper 0 [000] 117276.429606186:  81010486 pt_config 
([kernel.kallsyms])  nopl  %eax, (%rax,%rax,1)
   swapper 0 [000] 117276.429606186:  8101048b pt_config 
([kernel.kallsyms])  add $0x10, %rsp
   swapper 0 [000] 117276.429606186:  8101048f pt_config 
([kernel.kallsyms])  popq  %rbx
   swapper 0 [000] 117276.429606186:  81010490 pt_config 
([kernel.kallsyms])  popq  %rbp
   swapper 0 [000] 117276.429606186:  81010491 pt_config 
([kernel.kallsyms])  popq  %r12
   swapper 0 [000] 117276.429606186:  81010493 pt_config 
([kernel.kallsyms])  popq  %r13
   swapper 0 [000] 117276.429606186:  81010495 pt_config 
([kernel.kallsyms])  popq  %r14
   swapper 0 [000] 117276.429606186:  81010497 pt_config 
([kernel.kallsyms])  popq  %r15
   swapper 0 [000] 117276.429606186:  81010499 pt_config 
([kernel.kallsyms])  retq


Filter by a ftrace style graph function:

 % perf script --graph-function group_sched_in --call-trace
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
group_sched_in
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
event_sched_in.isra.107
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_set_state.part.71
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_update_time
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_pmu_disable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
  

[PATCH v6 4/5] tools, perf, script: Implement --graph-function

2018-09-20 Thread Andi Kleen
From: Andi Kleen 

Add an ftrace-style --graph-function argument to perf script that allows
printing itrace function calls only below a given function. This
makes it easier to find the code of interest in a large trace.

% perf record -e intel_pt//k -a sleep 1
% perf script --graph-function group_sched_in --call-trace
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
group_sched_in
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
event_sched_in.isra.107
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_set_state.part.71
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_update_time
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_pmu_disable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_log_itrace_start
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_update_userpage
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
calc_timer_values
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
sched_clock_cpu
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
arch_perf_update_userpage
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__fentry__
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
using_native_sched_clock
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
sched_clock_stable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_pmu_enable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
group_sched_in
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
event_sched_in.isra.107
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
perf_event_set_state.part.71
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
perf_event_update_time
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
perf_pmu_disable
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
perf_log_itrace_start
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
perf_event_update_userpage
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
calc_timer_values
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
sched_clock_cpu
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
arch_perf_update_userpage
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
__fentry__
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
using_native_sched_clock
 swapper 0 [001] 194167.205660693: ([kernel.kallsyms])  
sched_clock_stable
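
The filtering itself is the usual ftrace-style depth tracking, kept
per thread. A minimal sketch of the idea (not the code in this patch;
symbol_in_graph_list() is a hypothetical stand-in for the lookup into
the comma-separated --graph-function list):

	/* depth < 0 means we are outside any function of interest */
	static bool graph_filter_allows(int *depth, const char *sym,
					bool is_call, bool is_return)
	{
		if (*depth < 0) {
			if (is_call && sym && symbol_in_graph_list(sym)) {
				*depth = 0;	/* entered a listed function */
				return true;
			}
			return false;		/* suppress output */
		}
		if (is_call)
			(*depth)++;
		else if (is_return)
			(*depth)--;		/* drops below 0 when leaving the listed function */
		return *depth >= 0;
	}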

v2: Remove debug printout
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-script.txt |  4 +
 tools/perf/builtin-script.c  | 96 +++-
 tools/perf/util/symbol.h |  3 +-
 tools/perf/util/thread.h |  2 +
 4 files changed, 86 insertions(+), 19 deletions(-)

diff --git a/tools/perf/Documentation/perf-script.txt 
b/tools/perf/Documentation/perf-script.txt
index 805baabd238e..a2b37ce48094 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -397,6 +397,10 @@ include::itrace.txt[]
 --call-ret-trace::
Show call and return stream for intel_pt traces.
 
+--graph-function::
+   For itrace, only show the specified functions and their callees.
+   Multiple functions can be separated by commas.
+
 SEE ALSO
 
 linkperf:perf-record[1], linkperf:perf-script-perl[1],
diff --git

[PATCH v6 3/5] tools, perf, script: Add --call-trace and --call-ret-trace

2018-09-20 Thread Andi Kleen
From: Andi Kleen 

Add shortcut options to print a PT call trace (--call-trace) and a
call-and-return trace (--call-ret-trace). These roughly correspond to
the ftrace function tracer and function graph tracer.

This just makes these common use cases nicer to use.

% perf record -a -e intel_pt// sleep 1
% perf script --call-trace
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_pmu_enable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
event_filter_match
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
group_sched_in
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
event_sched_in.isra.107
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_set_state.part.71
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_update_time
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_pmu_disable
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_log_itrace_start
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203: ([kernel.kallsyms])  
perf_event_update_userpage

% perf script --call-ret-trace
perf   900 [000] 194167.205652203:   tr strt ([unknown])
pt_config
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])pt_config
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])pt_event_add
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])perf_pmu_enable
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])perf_pmu_nop_void
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])event_sched_in.isra.107
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])perf_pmu_nop_int
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])group_sched_in
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])event_filter_match
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])event_filter_match
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])group_sched_in
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])__x86_indirect_thunk_rax
perf   900 [000] 194167.205652203:   return  
([kernel.kallsyms])perf_pmu_nop_txn
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])event_sched_in.isra.107
perf   900 [000] 194167.205652203:   call
([kernel.kallsyms])perf_event_set_state.part.71

Signed-off-by: Andi Kleen 
---
v2: Print errors, power, ptwrite too
---
 tools/perf/Documentation/perf-script.txt |  7 +++
 tools/perf/builtin-script.c  | 24 
 2 files changed, 31 insertions(+)

diff --git a/tools/perf/Documentation/perf-script.txt 
b/tools/perf/Documentation/perf-script.txt
index 00c655ab4968..805baabd238e 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -390,6 +390,13 @@ include::itrace.txt[]
 --xed::
Run xed disassembler on output. Requires installing the xed 
disassembler.
 
+--call-trace::
+   Show call stream for intel_pt traces. The CPUs are interleaved, but
+   can be filtered with -C.
+
+--call-ret-trace::
+   Show call and return stream for intel_pt traces.
+
 SEE ALSO
 
 linkperf:perf-record[1], linkperf:perf-script-perl[1],
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 519ebb5a1f96..6c4562973983 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3119,6 +3119,26 @@ static int parse_xed(const struct option *opt 
__maybe_unused,
return 0;
 }
 
+static int parse_call_trace(const struct option *opt __maybe_unused,
+   const char *str __maybe_unused,
+   int unset __maybe_unused)
+{
+   parse_output_fields(NULL, "-ip,-addr,-event,-period,+callindent", 0);
+   itrace_parse_synth_opts(opt, "cewp", 0);
+   nanosecs = true;
+   return 0;
+}
+
+static int parse_callret_trace(const struct option
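
For the call/return variant one would expect something mirroring
parse_call_trace() above; the following is only a sketch, and the exact
field list and itrace option string are assumptions, not the patch text:

	static int parse_callret_trace(const struct option *opt __maybe_unused,
				       const char *str __maybe_unused,
				       int unset __maybe_unused)
	{
		parse_output_fields(NULL, "-ip,-addr,-event,-period,+callindent,+flags", 0);
		itrace_parse_synth_opts(opt, "crewp", 0);
		nanosecs = true;
		return 0;
	}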

Re: [PATCH v3 4/5] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

2018-09-20 Thread Andi Kleen
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct perf_event *event;
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_RAW,
> + .size = sizeof(attr),
> + .pinned = true,
> + .exclude_host = true,
> + .sample_type = PERF_SAMPLE_BRANCH_STACK,
> + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> +   PERF_SAMPLE_BRANCH_USER |
> +   PERF_SAMPLE_BRANCH_KERNEL,

I think that will allocate an extra perfmon counter, right?

So if the guest wants to use all perfmon counters we would start to
multiplex, which would be bad.

We would need to fix the perf core to not allocate a perfmon counter in
this case, i.e. when no period and no event count is requested.
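
As an illustration of where that allocation happens (not code from the
patch under review; "lbr_event" is just a placeholder name), an attr
like the one above would typically be handed to the core like this:

	struct perf_event *event;

	event = perf_event_create_kernel_counter(&attr, -1 /* any CPU */,
						 current, NULL, NULL);
	if (IS_ERR(event))
		return PTR_ERR(event);
	vcpu->arch.lbr_event = event;	/* placeholder bookkeeping */

Creating the event goes through the normal event scheduling path, so a
hardware counter slot still ends up reserved even though no samples or
counts are wanted here.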

-Andi

