Re: [PATCH v2] x86/microcode: Handle negative microcode revisions
On Sat, Oct 20, 2018 at 07:41:36PM +0200, Borislav Petkov wrote:
> Dropping stable.
>
> On Sat, Oct 20, 2018 at 07:41:58AM -0700, Andi Kleen wrote:
> > From: Andi Kleen
> >
> > The Intel microcode revision space is unsigned. Inside Intel there are
> > special microcodes that have the highest bit set, and they are
> > considered to have a higher revision than any microcodes that don't
> > have this bit set.
> >
> > The function comparing the microcode revision in the Linux driver
> > compares u32 with int, which ends up being sign-extended to long on
> > 64-bit systems. This results in these highest-bit-set microcode
> > revisions not loading, because their revision appears negative and
> > smaller than the existing microcode.
> >
> > Change the comparison to unsigned. With that the loading works
> > as expected.
> >
> > Cc: sta...@vger.kernel.org # Any supported stable
> > Signed-off-by: Andi Kleen
> > --
> > v2: White space changes.
> > ---
> >  arch/x86/kernel/cpu/microcode/intel.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
> > index 16936a24795c..e54d402500d3 100644
> > --- a/arch/x86/kernel/cpu/microcode/intel.c
> > +++ b/arch/x86/kernel/cpu/microcode/intel.c
> > @@ -93,7 +93,8 @@ static int find_matching_signature(void *mc, unsigned int csig, int cpf)
> >  /*
> >   * Returns 1 if update has been found, 0 otherwise.
> >   */
> > -static int has_newer_microcode(void *mc, unsigned int csig, int cpf, int new_rev)
> > +static int has_newer_microcode(void *mc, unsigned int csig, int cpf,
> > +			       unsigned new_rev)
> >  {
> >  	struct microcode_header_intel *mc_hdr = mc;
> > --
>
> Please incorporate all review comments before sending a new version of
> your patch.

I replaced one more "microcodes" with "microcode revisions", if that is
what you meant.

-Andi
[PATCH v4 2/2] perf/x86/kvm: Avoid unnecessary work in guest filtering
From: Andi Kleen

KVM added a workaround for PEBS events leaking into guests with
26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.") This uses the
VT entry/exit list to add an extra disable of the PEBS_ENABLE MSR.

Intel also added a fix for this issue to microcode updates on
Haswell/Broadwell/Skylake.

It turns out using the MSR entry/exit list makes VM exits significantly
slower. The list is only needed for disabling PEBS, because the
GLOBAL_CTRL change gets optimized by KVM into changing the VMCS.

Check for the microcode updates that have the microcode fix for leaking
PEBS, and disable the extra entry/exit list entry for PEBS_ENABLE. In
addition we always clear the GLOBAL_CTRL for the PEBS counter while
running in the guest, which is enough to make them never fire at the
wrong side of the host/guest transition.

We see significantly reduced overhead for VM exits with the filtering
active with the patch, from 8% to 4%.

Signed-off-by: Andi Kleen
---
v2: Use match_ucode, not match_ucode_all
    Remove cpu lock
    Use INTEL_MIN_UCODE and move to header
    Update table to include Skylake clients.
v3: Use x86_min_microcode
---
 arch/x86/events/intel/core.c | 80
 arch/x86/events/perf_event.h | 3 +-
 2 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0fb8659b20d8..89ec85c3359c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -18,6 +18,7 @@
 #include #include #include
+#include
 #include "../perf_event.h"
@@ -3170,16 +3171,27 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
 	arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
 	arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
 	arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-	/*
-	 * If PMU counter has PEBS enabled it is not enough to disable counter
-	 * on a guest entry since PEBS memory write can overshoot guest entry
-	 * and corrupt guest memory. Disabling PEBS solves the problem.
-	 */
-	arr[1].msr = MSR_IA32_PEBS_ENABLE;
-	arr[1].host = cpuc->pebs_enabled;
-	arr[1].guest = 0;
+	if (x86_pmu.flags & PMU_FL_PEBS_ALL)
+		arr[0].guest &= ~cpuc->pebs_enabled;
+	else
+		arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+	*nr = 1;
+
+	if (!x86_pmu.pebs_isolated) {
+		/*
+		 * If PMU counter has PEBS enabled it is not enough to
+		 * disable counter on a guest entry since PEBS memory
+		 * write can overshoot guest entry and corrupt guest
+		 * memory. Disabling PEBS solves the problem.
+		 *
+		 * Don't do this if the CPU already enforces it.
+		 */
+		arr[1].msr = MSR_IA32_PEBS_ENABLE;
+		arr[1].host = cpuc->pebs_enabled;
+		arr[1].guest = 0;
+		*nr = 2;
+	}
-	*nr = 2;
 	return arr;
 }
@@ -3697,6 +3709,45 @@ static __init void intel_clovertown_quirk(void)
 	x86_pmu.pebs_constraints = NULL;
 }
+static const struct x86_ucode_id isolation_ucodes[] = {
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE,      3, 0x001f),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_ULT,       1, 0x001e),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_GT3E,      1, 0x0015),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,         2, 0x0037),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,         4, 0x000a),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_CORE,    4, 0x0023),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_GT3E,    1, 0x0014),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  2, 0x0010),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  3, 0x0709),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  4, 0x0f09),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  5, 0x0e02),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X,       2, 0x0b14),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,         3, 0x0021),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,         4, 0x),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_MOBILE,    3, 0x007c),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_DESKTOP,   3, 0x007c),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,  9, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,   9, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  10, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  11, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  12, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 10, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 11, 0x004
[PATCH v4 1/2] x86/cpufeature: Add facility to match microcode revisions
From: Andi Kleen

For bug workarounds or checks it is useful to check for specific
microcode revisions. Add a new table format to check for steppings
with minimum microcode revisions.

This does not change the existing x86_cpu_id because it's an ABI
shared with modutils, and also has quite different requirements, as
in no wildcards, but everything has to be matched exactly.

Signed-off-by: Andi Kleen
---
v2: Remove all-CPU match, only check boot CPU
    Move INTEL_MIN_UCODE macro to header. Minor cleanups.
    Remove max ucode and driver data
v3: Rename function
    Update comments. Document mixed-stepping caveats.
    Use u8 for model
    Remove vendor 0 check. Change return check
v4: Rename to x86_min_microcode
    Change return value to bool
---
 arch/x86/include/asm/cpu_device_id.h | 28
 arch/x86/kernel/cpu/match.c          | 19 +++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h
index baeba0567126..28847d5ea1fa 100644
--- a/arch/x86/include/asm/cpu_device_id.h
+++ b/arch/x86/include/asm/cpu_device_id.h
@@ -11,4 +11,32 @@
 extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);
+/*
+ * Match specific microcode revisions.
+ *
+ * vendor/family/model/stepping must all be set.
+ *
+ * Only checks against the boot CPU. When mixed-stepping configs are
+ * valid for a CPU model, add a quirk for every valid stepping and
+ * do the fine-tuning in the quirk handler.
+ */
+
+struct x86_ucode_id {
+	u8 vendor;
+	u8 family;
+	u8 model;
+	u8 stepping;
+	u32 min_ucode;
+};
+
+#define INTEL_MIN_UCODE(mod, step, rev) {	\
+	.vendor = X86_VENDOR_INTEL,		\
+	.family = 6,				\
+	.model = mod,				\
+	.stepping = step,			\
+	.min_ucode = rev,			\
+}
+
+extern bool x86_min_microcode(const struct x86_ucode_id *mt);
+
 #endif
diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c
index 3fed38812eea..12db14232d62 100644
--- a/arch/x86/kernel/cpu/match.c
+++ b/arch/x86/kernel/cpu/match.c
@@ -48,3 +48,22 @@ const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
 	return NULL;
 }
 EXPORT_SYMBOL(x86_match_cpu);
+
+bool x86_min_microcode(const struct x86_ucode_id *mt)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+	const struct x86_ucode_id *m;
+
+	for (m = mt; m->family | m->model; m++) {
+		if (c->x86_vendor != m->vendor)
+			continue;
+		if (c->x86 != m->family)
+			continue;
+		if (c->x86_model != m->model)
+			continue;
+		if (c->x86_stepping != m->stepping)
+			continue;
+		return c->microcode >= m->min_ucode;
+	}
+	return false;
+}
--
2.17.2
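The matching loop above can be exercised outside the kernel. Below is a
hypothetical userspace mock — not the kernel code — with a fake CPU struct
standing in for boot_cpu_data and made-up model/revision values; it keeps the
same semantics: walk a zero-terminated table, and on the first full
vendor/family/model/stepping match report whether the running microcode is at
least the required revision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock of struct x86_ucode_id (names and values illustrative). */
struct ucode_id {
	uint8_t vendor, family, model, stepping;
	uint32_t min_ucode;
};

/* Stand-in for boot_cpu_data. */
struct fake_cpu {
	uint8_t vendor, family, model, stepping;
	uint32_t microcode;
};

/* Same shape as x86_min_microcode(): the table is terminated by an
 * all-zero entry (family | model == 0), and a CPU that matches no
 * entry yields false. */
static bool min_microcode(const struct fake_cpu *c, const struct ucode_id *mt)
{
	const struct ucode_id *m;

	for (m = mt; m->family | m->model; m++) {
		if (c->vendor != m->vendor || c->family != m->family ||
		    c->model != m->model || c->stepping != m->stepping)
			continue;
		return c->microcode >= m->min_ucode;
	}
	return false;	/* no table entry for this CPU */
}
```

Note the design point debated in the thread: the zero terminator mirrors
x86_match_cpu() rather than passing ARRAY_SIZE() explicitly.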
Re: [PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions
On Sun, Oct 21, 2018 at 12:20:47PM +0200, Thomas Gleixner wrote:
> Andi,
>
> On Sat, 20 Oct 2018, Andi Kleen wrote:
> > On Sat, Oct 20, 2018 at 10:19:37AM +0200, Thomas Gleixner wrote:
> > > On Fri, 19 Oct 2018, Andi Kleen wrote:
> > > There is no point to return the pointer because it's not a compound
> > > structure. If you want to provide the possibility to use the index
> > > then return the index and an error code if it does not match.
> >
> > It will be useful with the driver_data pointer, which you
> > short-sightedly forced me to remove, and likely will need to be
> > re-added at some point anyway if this gets more widely used.
>
> It's good and established practice not to add functionality on a 'might
> be used' basis. If you'd provide at least one or two patches which
> demonstrate how that is useful then that would be convincing.
>
> > At least with the pointer not all callers will need to be changed then.
>
> It doesn't need to be changed at all, when done correctly.

Thanks. I opted for the simpler method of returning a boolean now.

-Andi
Re: [v3 03/12] x86/fsgsbase/64: Add intrinsics/macros for FSGSBASE instructions
On Wed, Oct 24, 2018 at 11:53:54AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 23, 2018 at 11:43 AM Chang S. Bae wrote:
> >
> > From: Andi Kleen
> >
> > Add C intrinsics and assembler macros for the new FSBASE and GSBASE
> > instructions.
> >
> > Very straightforward. Used in follow-on patches.
> >
> > [ luto: Rename the variables from FS and GS to FSBASE and GSBASE and
> >   make safe to include on 32-bit kernels. ]
> >
> > v2: Use __always_inline
> >
> > [ chang: Revise the changelog. Place them in . Replace
> >   the macros with GAS-compatible ones. ]
> >
> > If GCC supports it, we can add -mfsgsbase to CFLAGS and use the
> > builtins here for extra performance.
>
> Reviewed-by: Andy Lutomirski # C parts only
>
> With the caveat that I'm not convinced that the memory clobbers are
> needed. The __force_order trick in special_insns.h would probably be
> more appropriate.
>
> I don't feel qualified to review the asm part without some research.
> Whereas hpa or Boris could probably review it with their eyes closed.

BTW the other option would be to update the min-binutils requirement to
2.21 (currently it is 2.20) and then write it directly without .byte.
I believe 2.21 added support for these instructions. (It's only a
binutils requirement; we don't need gcc support.)

-Andi
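For context on why the kernel must gate these instructions at all: FSGSBASE
support is enumerated in CPUID leaf 7, subleaf 0, EBX bit 0. A small
userspace sketch of that check — an illustration using GCC/clang's
`<cpuid.h>` helper, not code from the patch series (the kernel gates on
X86_FEATURE_FSGSBASE instead):

```c
#include <assert.h>
#include <cpuid.h>
#include <stdbool.h>

/* Query CPUID.(EAX=7,ECX=0):EBX bit 0, the FSGSBASE feature flag.
 * __get_cpuid_count() returns 0 if the leaf is not supported. */
static bool cpu_has_fsgsbase(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;	/* CPUID leaf 7 not available */
	return ebx & 1;		/* EBX bit 0 = FSGSBASE */
}
```

Note that even when the CPU supports the instructions, userspace can only use
them if the kernel has set CR4.FSGSBASE, which is what the patch series is
about.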
Re: [PATCH 1/2] perf: Add munmap callback
> > void perf_event_mmap(struct vm_area_struct *vma)
> > {
> > 	struct perf_mmap_event mmap_event;
> >
> > 	if (!atomic_read(&nr_mmap_events))
> > 		return;
> > }
>
> Thanks. I'll add the nr_mmap_events check in V2.

No, that's the wrong check here. The PEBS flush is independent of mmap
events being requested.

If anything it would need to check for any PEBS events active, which
would need a new counter. I think the easiest is to just check if
this_cpu_ptr(&sched_cb_list) is empty, which should be good enough.

-Andi
Re: [PATCH 1/2] perf: Add munmap callback
> +void perf_event_munmap(void)
> +{
> +	struct perf_cpu_context *cpuctx;
> +	unsigned long flags;
> +	struct pmu *pmu;
> +
> +	local_irq_save(flags);
> +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list),
> +			    sched_cb_entry) {

Would be good to have a fast path here that checks for the list being
empty without disabling the interrupts. munmap can be somewhat hot.

I think it's ok to make it slower with perf running, but we shouldn't
impact it without perf.

-Andi
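The suggested fast path can be sketched in userspace. This is a hypothetical
mock, not kernel code: a plain doubly linked list stands in for the per-CPU
sched_cb_list, and a counter stands in for local_irq_save(), so we can
observe that an empty list never pays the slow-path cost.

```c
#include <assert.h>

/* Minimal circular doubly linked list, kernel-style. */
struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static struct list_head sched_cb_list = LIST_HEAD_INIT(sched_cb_list);
static int irq_disable_count;	/* how often we paid for the slow path */

static void perf_event_munmap_mock(void)
{
	/* Fast path: no registered callbacks, return before touching
	 * interrupt state. This is the cheap check munmap would hit
	 * whenever perf is not active. */
	if (list_empty(&sched_cb_list))
		return;

	irq_disable_count++;	/* stands in for local_irq_save() */
	/* ... walk the list and invoke the flush callbacks here ... */
}
```

The design point: the empty check is racy by nature, but that is acceptable
here because a concurrently-added callback has nothing to flush yet.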
Re: Broken dwarf unwinding - wrong stack pointer register value?
> > Can someone at least confirm whether unwinding from a function
> > prologue via .eh_frame (but without .debug_frame) should actually be
> > possible?

Yes, it should be possible. Asynchronous unwind tables should work from
any instruction.

-Andi
Re: Broken dwarf unwinding - wrong stack pointer register value?
> So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We

It's the default (on 64-bit since always, and on 32-bit now too), unless
someone disabled it.

However libm might be partially written in assembler, and hand-written
assembler often has problems with unwind tables because the programmer
has to get them correct explicitly.

-Andi
Re: Broken dwarf unwinding - wrong stack pointer register value?
Milian Wolff writes:
>
> After more digging, it turns out that I've apparently chased a red
> herring. I'm running archlinux which isn't shipping debug symbols for
> libm.

64-bit executables normally have unwind information even when stripped,
unless someone forcefully stripped that too. You can check with
objdump --section-headers.

-Andi
[PATCH v2] x86/microcode: Handle negative microcode revisions
From: Andi Kleen

The Intel microcode revision space is unsigned. Inside Intel there are
special microcodes that have the highest bit set, and they are
considered to have a higher revision than any microcodes that don't
have this bit set.

The function comparing the microcode revision in the Linux driver
compares u32 with int, which ends up being sign-extended to long on
64-bit systems. This results in these highest-bit-set microcode
revisions not loading, because their revision appears negative and
smaller than the existing microcode.

Change the comparison to unsigned. With that the loading works
as expected.

Cc: sta...@vger.kernel.org # Any supported stable
Signed-off-by: Andi Kleen
--
v2: White space changes.
---
 arch/x86/kernel/cpu/microcode/intel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index 16936a24795c..e54d402500d3 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -93,7 +93,8 @@ static int find_matching_signature(void *mc, unsigned int csig, int cpf)
 /*
  * Returns 1 if update has been found, 0 otherwise.
  */
-static int has_newer_microcode(void *mc, unsigned int csig, int cpf, int new_rev)
+static int has_newer_microcode(void *mc, unsigned int csig, int cpf,
+			       unsigned new_rev)
 {
 	struct microcode_header_intel *mc_hdr = mc;
--
2.17.1
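The bug class can be reproduced in a few lines of standalone userspace C.
This is an illustration of the signed/unsigned pitfall the commit message
describes, not the kernel code itself: a "special" revision with bit 31 set,
once carried in a signed int, sign-extends to a negative long on 64-bit and
compares below every ordinary revision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buggy variant: revisions held in signed ints, as with the old
 * 'int new_rev' parameter. On 64-bit both sides sign-extend to long,
 * so 0x80000001 becomes a large negative number. */
static bool has_newer_signed(int file_rev, int cur_rev)
{
	return (long)file_rev > (long)cur_rev;
}

/* Fixed variant: unsigned comparison ranks top-bit-set revisions
 * highest, matching the intended ordering of the revision space. */
static bool has_newer_unsigned(uint32_t file_rev, uint32_t cur_rev)
{
	return file_rev > cur_rev;
}
```

With a special revision 0x80000001 in the microcode file and a plain
revision 0x5 loaded, the signed variant refuses the update while the
unsigned one accepts it.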
Re: [PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions
On Sat, Oct 20, 2018 at 10:19:37AM +0200, Thomas Gleixner wrote:
> On Fri, 19 Oct 2018, Andi Kleen wrote:
> > > > +	u32 min_ucode;
> > > > +};
> > > > +
> > > > +const struct x86_ucode_id *x86_match_ucode(const struct x86_ucode_id *match)
> > >
> > > What's the point of returning the struct pointer? Shouldn't it be
> > > enough to make it return bool? Also the function name really should
> > > reflect that this checks whether the minimal required microcode
> > > revision is active.
> >
> > This allows the user to find the table entry to tie something to it
> > (e.g. use the index to match some other table)
> >
> > Same pattern as pci discovery etc. use.
> >
> > Given the current caller doesn't need it, but we still follow standard
> > conventions.
>
> There is no point to return the pointer because it's not a compound
> structure. If you want to provide the possibility to use the index then
> return the index and an error code if it does not match.

It will be useful with the driver_data pointer, which you short-sightedly
forced me to remove, and likely will need to be re-added at some point
anyway if this gets more widely used.

At least with the pointer not all callers will need to be changed then.

Also it's symmetric with how the PCI and USB and the existing x86 match
discovery interfaces work.

> > > VENDOR_INTEL = 0, so this check is obscure to begin with. Either you
> > > choose an explicit condition to put at the end of the table, e.g.
> > > vendor = U8_MAX, or you hand in the array size to the function.
> >
> > That would both be awkward. It's the same as match_cpu, and 0
> > terminators are standard practice in practically all similar code.
> > I removed the or with the family.
>
> That's debatable because it's more easy to miss the terminator than
> getting the ARRAY_SIZE() argument wrong. But it doesn't matter much.

Ok then please apply it.

-Andi
Re: [PATCH v1] x86/microcode: Handle negative microcode revisions
On Sat, Oct 20, 2018 at 03:42:05PM +0200, Thomas Gleixner wrote:
> Andi,
>
> On Fri, 19 Oct 2018, Andi Kleen wrote:
> > Change the comparison to unsigned. With that the loading works
> > as expected.
>
> I assume that wants a fixes tag and needs to be backported to stable,
> right?

I think the bug was there since the original microcode loader; not sure
I can dig out the right commit id for that (it might be in
linux-historic).

Yes, please cc stable for this.

-Andi
[PATCH v3 2/2] perf/x86/kvm: Avoid unnecessary work in guest filtering
From: Andi Kleen KVM added a workaround for PEBS events leaking into guests with 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.") This uses the VT entry/exit list to add an extra disable of the PEBS_ENABLE MSR. Intel also added a fix for this issue to microcode updates on Haswell/Broadwell/Skylake. It turns out using the MSR entry/exit list makes VM exits significantly slower. The list is only needed for disabling PEBS, because the GLOBAL_CTRL change gets optimized by KVM into changing the VMCS. Check for the microcode updates that have the microcode fix for leaking PEBS, and disable the extra entry/exit list entry for PEBS_ENABLE. In addition we always clear the GLOBAL_CTRL for the PEBS counter while running in the guest, which is enough to make them never fire at the wrong side of the host/guest transition. We see significantly reduced overhead for VM exits with the filtering active with the patch from 8% to 4%. Signed-off-by: Andi Kleen --- v2: Use match_ucode, not match_ucode_all Remove cpu lock Use INTEL_MIN_UCODE and move to header Update Table to include skylake clients. --- arch/x86/events/intel/core.c | 80 arch/x86/events/perf_event.h | 3 +- 2 files changed, 73 insertions(+), 10 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 0fb8659b20d8..5c45535c60b4 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "../perf_event.h" @@ -3170,16 +3171,27 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr) arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask; - /* -* If PMU counter has PEBS enabled it is not enough to disable counter -* on a guest entry since PEBS memory write can overshoot guest entry -* and corrupt guest memory. Disabling PEBS solves the problem. 
-*/ - arr[1].msr = MSR_IA32_PEBS_ENABLE; - arr[1].host = cpuc->pebs_enabled; - arr[1].guest = 0; + if (x86_pmu.flags & PMU_FL_PEBS_ALL) + arr[0].guest &= ~cpuc->pebs_enabled; + else + arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); + *nr = 1; + + if (!x86_pmu.pebs_isolated) { + /* +* If PMU counter has PEBS enabled it is not enough to +* disable counter on a guest entry since PEBS memory +* write can overshoot guest entry and corrupt guest +* memory. Disabling PEBS solves the problem. +* +* Don't do this if the CPU already enforces it. +*/ + arr[1].msr = MSR_IA32_PEBS_ENABLE; + arr[1].host = cpuc->pebs_enabled; + arr[1].guest = 0; + *nr = 2; + } - *nr = 2; return arr; } @@ -3697,6 +3709,45 @@ static __init void intel_clovertown_quirk(void) x86_pmu.pebs_constraints = NULL; } +static const struct x86_ucode_id isolation_ucodes[] = { + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE, 3, 0x001f), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_ULT, 1, 0x001e), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_GT3E, 1, 0x0015), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,2, 0x0037), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,4, 0x000a), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_CORE, 4, 0x0023), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_GT3E, 1, 0x0014), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 2, 0x0010), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 3, 0x0709), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 4, 0x0f09), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 5, 0x0e02), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X, 2, 0x0b14), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,3, 0x0021), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,4, 0x), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_MOBILE, 3, 0x007c), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_DESKTOP, 3, 0x007c), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 9, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 9, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 10, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 11, 0x004e), + 
INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 12, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,10, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,11, 0x004e), +
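The control flow the commit message describes — always mask the PEBS counters out of the guest GLOBAL_CTRL value, and only emit the extra PEBS_ENABLE entry for the VT entry/exit list when the microcode lacks the isolation fix — can be sketched in plain C. Types and arguments below are simplified stand-ins for the kernel's, not the actual implementation:

```c
#include <assert.h>

struct switch_msr {
	unsigned int msr;
	unsigned long long host, guest;
};

#define MSR_CORE_PERF_GLOBAL_CTRL 0x38f
#define MSR_IA32_PEBS_ENABLE      0x3f1

/*
 * Sketch of the patched intel_guest_get_msrs() shape: GLOBAL_CTRL always
 * clears the PEBS counters from the guest value; the PEBS_ENABLE entry is
 * emitted only when the microcode does not already isolate PEBS across
 * the host/guest transition.
 */
static int fill_guest_msrs(struct switch_msr *arr, int pebs_isolated,
			   unsigned long long intel_ctrl,
			   unsigned long long pebs_enabled)
{
	int nr = 0;

	arr[nr].msr = MSR_CORE_PERF_GLOBAL_CTRL;
	arr[nr].host = intel_ctrl;
	arr[nr].guest = intel_ctrl & ~pebs_enabled; /* PEBS counters off in guest */
	nr++;

	if (!pebs_isolated) {
		/* Old, slow path: extra entry on the VT entry/exit list. */
		arr[nr].msr = MSR_IA32_PEBS_ENABLE;
		arr[nr].host = pebs_enabled;
		arr[nr].guest = 0;
		nr++;
	}
	return nr;
}
```

On fixed microcode the switch list shrinks to a single GLOBAL_CTRL entry, which is what makes the VM exits cheaper.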
[PATCH v3 1/2] x86/cpufeature: Add facility to match microcode revisions
From: Andi Kleen For bug workarounds or checks it is useful to check for specific microcode revisions. Add a new table format to check for steppings with min microcode revisions. This does not change the existing x86_cpu_id because it's an ABI shared with modutils, and also has quite different requirements, as in no wildcards, but everything has to be matched exactly. Signed-off-by: Andi Kleen --- v2: Remove all CPU match, only check boot cpu Move INTEL_MIN_UCODE macro to header. Minor cleanups. Remove max ucode and driver data v3: Rename function Update comments. Document mixed stepping caveats. Use u8 for model Remove vendor 0 check. Change return check --- arch/x86/include/asm/cpu_device_id.h | 29 arch/x86/kernel/cpu/match.c | 19 ++ 2 files changed, 48 insertions(+) diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h index baeba0567126..97d645910ec2 100644 --- a/arch/x86/include/asm/cpu_device_id.h +++ b/arch/x86/include/asm/cpu_device_id.h @@ -11,4 +11,33 @@ extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match); +/* + * Match specific microcode revisions. + * + * vendor/family/model/stepping must be all set. + * + * only checks against the boot cpu. When mixed-stepping configs are + * valid for a CPU model, add a quirk for every valid stepping and + * do the fine-tuning in the quirk handler. 
+ */ + +struct x86_ucode_id { + u8 vendor; + u8 family; + u8 model; + u8 stepping; + u32 min_ucode; +}; + +#define INTEL_MIN_UCODE(mod, step, rev) { \ + .vendor = X86_VENDOR_INTEL, \ + .family = 6,\ + .model = mod, \ + .stepping = step, \ + .min_ucode = rev, \ +} + +extern const struct x86_ucode_id * +x86_match_microcode(const struct x86_ucode_id *match); + #endif diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c index 3fed38812eea..d44c8cfd7b29 100644 --- a/arch/x86/kernel/cpu/match.c +++ b/arch/x86/kernel/cpu/match.c @@ -48,3 +48,22 @@ const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match) return NULL; } EXPORT_SYMBOL(x86_match_cpu); + +const struct x86_ucode_id *x86_match_microcode(const struct x86_ucode_id *match) +{ + struct cpuinfo_x86 *c = &boot_cpu_data; + const struct x86_ucode_id *m; + + for (m = match; m->family | m->model; m++) { + if (c->x86_vendor != m->vendor) + continue; + if (c->x86 != m->family) + continue; + if (c->x86_model != m->model) + continue; + if (c->x86_stepping != m->stepping) + continue; + return c->microcode >= m->min_ucode ? m : NULL; + } + return NULL; +} -- 2.17.1
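The zero-terminated table walk in x86_match_microcode() can be exercised standalone with the cpuinfo replaced by a hypothetical struct; the table values below are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct x86_ucode_id and cpuinfo_x86. */
struct ucode_id {
	unsigned char family, model, stepping;
	unsigned int min_ucode;
};

struct cpu {
	unsigned char family, model, stepping;
	unsigned int microcode;
};

/*
 * Zero-terminated table walk mirroring x86_match_microcode(): the loop
 * stops at the all-zero sentinel entry (family | model == 0), and a
 * matched stepping whose microcode is too old returns NULL immediately.
 */
static const struct ucode_id *match_microcode(const struct ucode_id *tbl,
					      const struct cpu *c)
{
	const struct ucode_id *m;

	for (m = tbl; m->family | m->model; m++) {
		if (c->family != m->family || c->model != m->model ||
		    c->stepping != m->stepping)
			continue;
		return c->microcode >= m->min_ucode ? m : NULL;
	}
	return NULL;
}
```

This is the pattern the review thread debates: the sentinel entry replaces an explicit ARRAY_SIZE() argument, the same convention match_cpu uses.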
[PATCH v1] x86/microcode: Handle negative microcode revisions
From: Andi Kleen The Intel ucode revision space is unsigned. Inside Intel there are special microcodes that have the highest bit set, and they are considered to have a higher revision than any microcodes that don't have this bit set. The function comparing the microcodes in the Linux driver compares u32 with int, which ends up being sign-extended to long on 64-bit systems. This results in these highest-bit-set microcodes not loading because their revision appears negative and smaller than the existing microcode. Change the comparison to unsigned. With that the loading works as expected. Signed-off-by: Andi Kleen --- arch/x86/kernel/cpu/microcode/intel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c index 16936a24795c..e95cebdd5993 100644 --- a/arch/x86/kernel/cpu/microcode/intel.c +++ b/arch/x86/kernel/cpu/microcode/intel.c @@ -93,7 +93,7 @@ static int find_matching_signature(void *mc, unsigned int csig, int cpf) /* * Returns 1 if update has been found, 0 otherwise. */ -static int has_newer_microcode(void *mc, unsigned int csig, int cpf, int new_rev) +static int has_newer_microcode(void *mc, unsigned int csig, int cpf, unsigned new_rev) { struct microcode_header_intel *mc_hdr = mc; -- 2.17.1
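The underlying pitfall is easy to reproduce outside the kernel. The revision values below are made up; only the top bit matters:

```c
#include <assert.h>

/* Made-up revisions: only the top bit matters for the bug. */
#define DEBUG_REV 0x80000012   /* Intel-internal style, highest bit set */
#define PROD_REV  0x00000010

/* Buggy shape: with two signed operands the debug revision is negative,
 * so it never looks "newer" than the production revision. */
static int has_newer_buggy(int mc_rev, int new_rev)
{
	return mc_rev > new_rev;
}

/* Patched shape: making new_rev unsigned pulls the whole comparison into
 * unsigned arithmetic (usual arithmetic conversions), matching the
 * unsigned revision numbering the hardware actually uses. */
static int has_newer_fixed(int mc_rev, unsigned int new_rev)
{
	return (unsigned int)mc_rev > new_rev;
}
```

With the buggy signature the 0x80000012 revision compares as a large negative number and the update is silently skipped; with the fixed one it compares as the larger value it is.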
Re: [PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions
> > + u32 min_ucode; > > +}; > > + > > +const struct x86_ucode_id *x86_match_ucode(const struct x86_ucode_id > > *match) > > What's the point of returning the struct pointer? Shouldn't it be enough to > make it return bool? Also the function name really should reflect that this > checks whether the minimal required microcode revision is active. This allows the user to find the table entry to tie something to it (e.g. use the index to match some other table) Same pattern as pci discovery etc. use. Given the current caller doesn't need it, but we still follow standard conventions. > > > +{ > > + struct cpuinfo_x86 *c = &boot_cpu_data; > > + const struct x86_ucode_id *m; > > + > > + for (m = match; m->vendor | m->family | m->model; m++) { > > VENDOR_INTEL = 0, so this check is obscure to begin with. Either you chose > a explicit condition to put at the end of the table, e.g. vendor = U8_MAX > or you hand in the array size to the function. That would both be awkward. It's the same as match_cpu, and 0 terminators are standard practice in practically all similar code. I removed the or with the family. -Andi
Re: l1tf: Kernel suggests I throw away third of my memory. I'd rather not
On Wed, Oct 17, 2018 at 12:56:10PM +0200, Pavel Machek wrote: > Hi! > > 6a012288 suggests I throw away 1GB on RAM. On 3GB system.. that is not > going to be pleasant. Just rebuild your kernel with PAE? I assume your CPU supports it. This will also give you NX, which if you're really worried about security is far more important than L1TF. If you don't worry about security just ignore. -Andi
Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization
> 4. Results > - Without this optimization, the guest pmi handling time is > ~4500000 ns, and the max sampling rate is reduced to 250. > - With this optimization, the guest pmi handling time is ~9000 ns > (i.e. 1/500 of the non-optimization case), and the max sampling > rate remains at the original 100000. Impressive performance improvement! It's not clear to me why you're special-casing PMIs here. The optimization should work generically, right? perf will enable/disable the PMU even outside PMIs, e.g. on context switches, which is a very important path too. > @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, > struct msr_data *msr_info) > default: > if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || > (pmc = get_fixed_pmc(pmu, msr))) { > - if (!msr_info->host_initiated) > - data = (s64)(s32)data; > - pmc->counter += data - pmc_read_counter(pmc); > + if (pmu->in_pmi) { > + /* > + * Since we are not re-allocating a perf event > + * to reconfigure the sampling time when the > + * guest pmu is in PMI, just set the value to > + * the hardware perf counter. Counting will > + * continue after the guest enables the > + * counter bit in MSR_CORE_PERF_GLOBAL_CTRL. > + */ > + struct hw_perf_event *hwc = > + &pmc->perf_event->hw; > + wrmsrl(hwc->event_base, data); Is that guaranteed to be always called on the right CPU that will run the vcpu? AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure it won't handle that. May need to be delayed to entry time. -Andi
[PATCH v2 1/2] x86/cpufeature: Add facility to match microcode revisions
From: Andi Kleen For bug workarounds or checks it is useful to check for specific microcode versions. Add a new table format to check for steppings with min microcode revisions. This does not change the existing x86_cpu_id because it's an ABI shared with modutils, and also has quite different requirements, as in no wildcards, but everything has to be matched exactly. Signed-off-by: Andi Kleen --- v2: Remove all CPU match, only check boot cpu Move INTEL_MIN_UCODE macro to header. Minor cleanups. Remove max ucode and driver data --- arch/x86/include/asm/cpu_device_id.h | 26 ++ arch/x86/kernel/cpu/match.c | 21 + 2 files changed, 47 insertions(+) diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h index baeba0567126..1b90bd1d0b95 100644 --- a/arch/x86/include/asm/cpu_device_id.h +++ b/arch/x86/include/asm/cpu_device_id.h @@ -11,4 +11,30 @@ extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match); +/* + * Match specific microcodes + * + * vendor/family/model/stepping must be all set. + * min_ucode is optional and can be 0. 
+ */ + +struct x86_ucode_id { + u8 vendor; + u8 family; + u16 model; + u16 stepping; + u32 min_ucode; +}; + +#define INTEL_MIN_UCODE(mod, step, rev) { \ + .vendor = X86_VENDOR_INTEL, \ + .family = 6,\ + .model = mod, \ + .stepping = step, \ + .min_ucode = rev, \ +} + +extern const struct x86_ucode_id * +x86_match_ucode(const struct x86_ucode_id *match); + #endif diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c index 3fed38812eea..ec8ee31699cd 100644 --- a/arch/x86/kernel/cpu/match.c +++ b/arch/x86/kernel/cpu/match.c @@ -48,3 +48,24 @@ const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match) return NULL; } EXPORT_SYMBOL(x86_match_cpu); + +const struct x86_ucode_id *x86_match_ucode(const struct x86_ucode_id *match) +{ + struct cpuinfo_x86 *c = &boot_cpu_data; + const struct x86_ucode_id *m; + + for (m = match; m->vendor | m->family | m->model; m++) { + if (c->x86_vendor != m->vendor) + continue; + if (c->x86 != m->family) + continue; + if (c->x86_model != m->model) + continue; + if (c->x86_stepping != m->stepping) + continue; + if (c->microcode < m->min_ucode) + continue; + return m; + } + return NULL; +} -- 2.17.1
[PATCH v2 2/2] perf/x86/kvm: Avoid unnecessary work in guest filtering
From: Andi Kleen KVM added a workaround for PEBS events leaking into guests with 26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry.") This uses the VT entry/exit list to add an extra disable of the PEBS_ENABLE MSR. Intel also added a fix for this issue to microcode updates on Haswell/Broadwell/Skylake. It turns out using the MSR entry/exit list makes VM exits significantly slower. The list is only needed for disabling PEBS, because the GLOBAL_CTRL change gets optimized by KVM into changing the VMCS. Check for the microcode updates that have the microcode fix for leaking PEBS, and disable the extra entry/exit list entry for PEBS_ENABLE. In addition we always clear the GLOBAL_CTRL for the PEBS counter while running in the guest, which is enough to make them never fire at the wrong side of the host/guest transition. We see significantly reduced overhead for VM exits with the filtering active with the patch from 8% to 4%. Signed-off-by: Andi Kleen --- v2: Use match_ucode, not match_ucode_all Remove cpu lock Use INTEL_MIN_UCODE and move to header Update Table to include skylake clients. --- arch/x86/events/intel/core.c | 80 arch/x86/events/perf_event.h | 3 +- 2 files changed, 73 insertions(+), 10 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index ab01ef9ddd77..5e8e76753eea 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "../perf_event.h" @@ -3166,16 +3167,27 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr) arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask; - /* -* If PMU counter has PEBS enabled it is not enough to disable counter -* on a guest entry since PEBS memory write can overshoot guest entry -* and corrupt guest memory. Disabling PEBS solves the problem. 
-*/ - arr[1].msr = MSR_IA32_PEBS_ENABLE; - arr[1].host = cpuc->pebs_enabled; - arr[1].guest = 0; + if (x86_pmu.flags & PMU_FL_PEBS_ALL) + arr[0].guest &= ~cpuc->pebs_enabled; + else + arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); + *nr = 1; + + if (!x86_pmu.pebs_isolated) { + /* +* If PMU counter has PEBS enabled it is not enough to +* disable counter on a guest entry since PEBS memory +* write can overshoot guest entry and corrupt guest +* memory. Disabling PEBS solves the problem. +* +* Don't do this if the CPU already enforces it. +*/ + arr[1].msr = MSR_IA32_PEBS_ENABLE; + arr[1].host = cpuc->pebs_enabled; + arr[1].guest = 0; + *nr = 2; + } - *nr = 2; return arr; } @@ -3693,6 +3705,45 @@ static __init void intel_clovertown_quirk(void) x86_pmu.pebs_constraints = NULL; } +static const struct x86_ucode_id isolation_ucodes[] = { + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE, 3, 0x001f), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_ULT, 1, 0x001e), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_GT3E, 1, 0x0015), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,2, 0x0037), + INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,4, 0x000a), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_CORE, 4, 0x0023), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_GT3E, 1, 0x0014), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 2, 0x0010), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 3, 0x0709), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 4, 0x0f09), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D, 5, 0x0e02), + INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X, 2, 0x0b14), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,3, 0x0021), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,4, 0x), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_MOBILE, 3, 0x007c), + INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_DESKTOP, 3, 0x007c), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 9, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 9, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 10, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 11, 0x004e), + 
INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE, 12, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,10, 0x004e), + INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,11, 0x004e), +
[PATCH v2 2/2] perf/x86/kvm: Avoid unnecessary work in guest filtering
From: Andi Kleen

KVM added a workaround for PEBS events leaking into guests with
26a4f3c08de4 ("perf/x86: disable PEBS on a guest entry."). This uses the
VT entry/exit list to add an extra disable of the PEBS_ENABLE MSR.

Intel also added a fix for this issue to microcode updates on
Haswell/Broadwell/Skylake.

It turns out using the MSR entry/exit list makes VM exits significantly
slower. The list is only needed for disabling PEBS, because the
GLOBAL_CTRL change gets optimized by KVM into changing the VMCS.

Check for the microcode updates that have the microcode fix for leaking
PEBS, and disable the extra entry/exit list entry for PEBS_ENABLE. In
addition we always clear the GLOBAL_CTRL for the PEBS counters while
running in the guest, which is enough to make them never fire on the
wrong side of the host/guest transition.

With the patch, the VM exit overhead with the filtering active drops
significantly, from 8% to 4%.

Signed-off-by: Andi Kleen
---
v2: Use match_ucode, not match_ucode_all
    Remove cpu lock
    Use INTEL_MIN_UCODE and move to header
    Update table to include Skylake clients.
---
 arch/x86/events/intel/core.c | 80 
 arch/x86/events/perf_event.h |  3 +-
 2 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ab01ef9ddd77..5e8e76753eea 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include

 #include "../perf_event.h"

@@ -3166,16 +3167,27 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
 	arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
 	arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
 	arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-	/*
-	 * If PMU counter has PEBS enabled it is not enough to disable counter
-	 * on a guest entry since PEBS memory write can overshoot guest entry
-	 * and corrupt guest memory. Disabling PEBS solves the problem.
-	 */
-	arr[1].msr = MSR_IA32_PEBS_ENABLE;
-	arr[1].host = cpuc->pebs_enabled;
-	arr[1].guest = 0;
+	if (x86_pmu.flags & PMU_FL_PEBS_ALL)
+		arr[0].guest &= ~cpuc->pebs_enabled;
+	else
+		arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+	*nr = 1;
+
+	if (!x86_pmu.pebs_isolated) {
+		/*
+		 * If PMU counter has PEBS enabled it is not enough to
+		 * disable counter on a guest entry since PEBS memory
+		 * write can overshoot guest entry and corrupt guest
+		 * memory. Disabling PEBS solves the problem.
+		 *
+		 * Don't do this if the CPU already enforces it.
+		 */
+		arr[1].msr = MSR_IA32_PEBS_ENABLE;
+		arr[1].host = cpuc->pebs_enabled;
+		arr[1].guest = 0;
+		*nr = 2;
+	}
-	*nr = 2;
 	return arr;
 }

@@ -3693,6 +3705,45 @@ static __init void intel_clovertown_quirk(void)
 	x86_pmu.pebs_constraints = NULL;
 }

+static const struct x86_ucode_id isolation_ucodes[] = {
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_CORE,      3, 0x001f),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_ULT,       1, 0x001e),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_GT3E,      1, 0x0015),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,         2, 0x0037),
+	INTEL_MIN_UCODE(INTEL_FAM6_HASWELL_X,         4, 0x000a),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_CORE,    4, 0x0023),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_GT3E,    1, 0x0014),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  2, 0x0010),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  3, 0x0709),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  4, 0x0f09),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_XEON_D,  5, 0x0e02),
+	INTEL_MIN_UCODE(INTEL_FAM6_BROADWELL_X,       2, 0x0b14),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,         3, 0x0021),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_X,         4, 0x),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_MOBILE,    3, 0x007c),
+	INTEL_MIN_UCODE(INTEL_FAM6_SKYLAKE_DESKTOP,   3, 0x007c),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP,  9, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,   9, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  10, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  11, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_MOBILE,  12, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 10, 0x004e),
+	INTEL_MIN_UCODE(INTEL_FAM6_KABYLAKE_DESKTOP, 11, 0x004e),
+
Re: [RFC] perf tools: Wrong filter_band* values in json calculation
On Tue, Oct 09, 2018 at 12:01:44PM +0200, Jiri Olsa wrote:
> On Wed, Oct 03, 2018 at 07:45:50AM -0700, Andi Kleen wrote:
> > > note there's couple of changes that actually changed
> > > the number completely, like:
> > >
> > >  -"Filter": "edge=1,filter_band2=4000",
> > >  +"Filter": "edge=1,filter_band2=30",
> >
> > Thanks. Looks good. I'll fix the scripts to generate the uncore events.
>
> hi,
> any idea when you could post an update for this?

Can just use your patch for the existing event lists. You can add

Acked-by: Andi Kleen

I was mainly worried about future updates, but it doesn't seem to be a
problem.

-Andi
Re: [PATCH v6 1/5] tools, perf, script: Add --insn-trace for instruction decoding
> > +> git clone https://github.com/intelxed/mbuild.git mbuild
> > +> git clone https://github.com/intelxed/xed
> > +> cd xed
> > +> mkdir obj
> > +> cd obj
> > +> ../mfile.py
> > +> sudo ../mfile.py --prefix=/usr/local install

Need

../mfile.py examples
sudo ../mfile.py --prefix=/usr/local examples install

-Andi
Re: [PATCH v1 1/2] x86/cpufeature: Add facility to match microcode revisions
On Sat, Oct 06, 2018 at 04:14:54PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Oct 2018, Andi Kleen wrote:
> > +/*
> > + * Match specific microcodes or steppings.
>
> What means microcodes or steppings? If you mean microcode revisions then
> please spell it out and use it all over the place. steppings is confusing
> at best as its associated to the CPU stepping.

The matcher can be used to match specific hardware steppings by setting
min/max_ucode to 0, or specific microcode revisions (which are associated
with steppings).

> > +const struct x86_ucode_id *x86_match_ucode_all(const struct x86_ucode_id *match)
>
> Can you please name that so it's obvious that this checks all cpus, but
> aside of that, why would a system ever end up with different microcode
> revisions at all? The changelog is not mentioning any reason for this
> function and "/* Check all CPUs */" is not helpful either.

We still support the old microcode interface that allows updates per CPU,
and also it could happen during CPU hotplug.

> > + int cpu;
> > + const struct x86_ucode_id *all_m = NULL;
> > + bool first = true;
> > +
> > + for_each_online_cpu(cpu) {
>
> What guarantees that CPUs cannot be plugged? You either need to have
> cpus_read_lock() in this function or a lockdep_assert_cpus_held().

In my case the caller does, but yes, it should be documented.

-Andi
[PATCH v1 1/2] x86/cpufeature: Add facility to match microcode revisions
From: Andi Kleen

For bug workarounds or checks it is useful to check for specific
microcode revisions.

Add a new table format to check for steppings with min/max microcode
revisions.

This does not change the existing x86_cpu_id because it's an ABI shared
with modutils, and also has quite different requirements, as in no
wildcards, but everything has to be matched exactly.

Signed-off-by: Andi Kleen
---
 arch/x86/include/asm/cpu_device_id.h | 22 ++
 arch/x86/kernel/cpu/match.c          | 43 
 2 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/cpu_device_id.h b/arch/x86/include/asm/cpu_device_id.h
index baeba0567126..bfd5438c 100644
--- a/arch/x86/include/asm/cpu_device_id.h
+++ b/arch/x86/include/asm/cpu_device_id.h
@@ -11,4 +11,26 @@
 extern const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match);

+/*
+ * Match specific microcodes or steppings.
+ *
+ * vendor/family/model/stepping must be all set.
+ * min_ucode/max_ucode/driver_data are optional and can be 0.
+ */
+
+struct x86_ucode_id {
+	__u16 vendor;
+	__u16 family;
+	__u16 model;
+	__u16 stepping;
+	__u32 min_ucode;
+	__u32 max_ucode;
+	kernel_ulong_t driver_data;
+};
+
+extern const struct x86_ucode_id *
+x86_match_ucode_cpu(int cpu, const struct x86_ucode_id *match);
+extern const struct x86_ucode_id *
+x86_match_ucode_all(const struct x86_ucode_id *match);
+
 #endif
diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c
index 3fed38812eea..f29a21b2809c 100644
--- a/arch/x86/kernel/cpu/match.c
+++ b/arch/x86/kernel/cpu/match.c
@@ -48,3 +48,46 @@ const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
 	return NULL;
 }
 EXPORT_SYMBOL(x86_match_cpu);
+
+const struct x86_ucode_id *x86_match_ucode_cpu(int cpu,
+					       const struct x86_ucode_id *match)
+{
+	const struct x86_ucode_id *m;
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+	for (m = match; m->vendor | m->family | m->model; m++) {
+		if (c->x86_vendor != m->vendor)
+			continue;
+		if (c->x86 != m->family)
+			continue;
+		if (c->x86_model != m->model)
+			continue;
+		if (c->x86_stepping != m->stepping)
+			continue;
+		if (m->min_ucode && c->microcode < m->min_ucode)
+			continue;
+		if (m->max_ucode && c->microcode > m->max_ucode)
+			continue;
+		return m;
+	}
+	return NULL;
+}
+
+/* Check all CPUs */
+const struct x86_ucode_id *x86_match_ucode_all(const struct x86_ucode_id *match)
+{
+	int cpu;
+	const struct x86_ucode_id *all_m = NULL;
+	bool first = true;
+
+	for_each_online_cpu(cpu) {
+		const struct x86_ucode_id *m = x86_match_ucode_cpu(cpu, match);
+
+		if (first)
+			all_m = m;
+		else if (m != all_m)
+			return NULL;
+		first = false;
+	}
+	return all_m;
+}
-- 
2.17.1
Re: [RFC] perf tools: Wrong filter_band* values in json calculation
> note there's couple of changes that actually changed
> the number completely, like:
>
>  -"Filter": "edge=1,filter_band2=4000",
>  +"Filter": "edge=1,filter_band2=30",

Thanks. Looks good. I'll fix the scripts to generate the uncore events.

-Andi
Re: [PATCH] perf/x86/intel: Add counter freezing quirk for Goldmont
> There is another variant of model/stepping micro code verification code in
> intel_snb_pebs_broken(). Can we please make this table based and use a
> common function? That's certainly not the last quirk we're going to have.

I have a patch to add a table driven microcode matcher for another fix.
Will post shortly.

-Andi
[tip:perf/core] perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler
Commit-ID:  af3bdb991a5cb57c189d34aadbd3aa88995e0d9f
Gitweb:     https://git.kernel.org/tip/af3bdb991a5cb57c189d34aadbd3aa88995e0d9f
Author:     Andi Kleen
AuthorDate: Wed, 8 Aug 2018 00:12:07 -0700
Committer:  Ingo Molnar
CommitDate: Tue, 2 Oct 2018 10:14:31 +0200

perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler

Implements counter freezing for Arch Perfmon v4 (Skylake and newer).
This allows to speed up the PMI handler by avoiding unnecessary MSR
writes, and makes it more accurate.

The Arch Perfmon v4 PMI handler is substantially different from the
older PMI handler. Differences to the old handler:

- It relies on counter freezing, which eliminates several MSR writes
  from the PMI handler and lowers the overhead significantly.

  It makes the PMI handler more accurate, as all counters get frozen
  atomically as soon as any counter overflows. So there is much less
  counting of the PMI handler itself.

  With the freezing we don't need to disable or enable counters or
  PEBS. Only BTS, which does not support auto-freezing, still needs to
  be explicitly managed.

- The PMU acking is done at the end, not the beginning. This makes it
  possible to avoid manual enabling/disabling of the PMU; instead we
  just rely on the freezing/acking.

- The APIC is acked before reenabling the PMU, which avoids problems
  with LBRs occasionally not getting unfrozen on Skylake.

- Looping is only needed to work around a corner case in which several
  PMIs are very close to each other. For common cases, the counters are
  frozen during the PMI handler, so no re-check is needed.

This patch:

- Adds code to enable v4 counter freezing.
- Forks the <=v3 and >=v4 PMI handlers into separate functions.
- Adds a kernel parameter to disable counter freezing. It took some
  time to debug counter freezing, so in case there are new problems we
  added an option to turn it off. Would not expect this to be used
  until there are new bugs.
- Only for big core. The patch for small core will be posted later
  separately.

Performance:

When profiling a kernel build on Kabylake with different perf options,
measuring the length of all NMI handlers using the nmi handler trace
point:

V3 is without counter freezing.
V4 is with counter freezing.
The value is the average cost of the PMI handler.
(lower is better)

perf options            V3(ns)  V4(ns)  delta
-c 10                   1088    894     -18%
-g -c 10                1862    1646    -12%
--call-graph lbr -c 10  3649    3367    -8%
--c.g. dwarf -c 10      2248    1982    -12%

Signed-off-by: Andi Kleen
Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: a...@kernel.org
Link: http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.li...@linux.intel.com
Signed-off-by: Ingo Molnar
---
 Documentation/admin-guide/kernel-parameters.txt |   5 ++
 arch/x86/events/intel/core.c                    | 113 
 arch/x86/events/perf_event.h                    |   4 +-
 arch/x86/include/asm/msr-index.h                |   1 +
 4 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 92eb1f42240d..6795dedcbd1e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -856,6 +856,11 @@
 			causing system reset or hang due to sending
 			INIT from AP to BSP.

+	disable_counter_freezing [HW]
+			Disable Intel PMU counter freezing feature.
+			The feature only exists starting from
+			Arch Perfmon v4 (Skylake and newer).
+
 	disable_ddw	[PPC/PSERIES]
 			Disable Dynamic DMA Window support. Use this if
 			to workaround buggy firmware.

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 9b320a51f82f..bd3b8f3600b2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1995,6 +1995,18 @@ static void intel_pmu_nhm_enable_all(int added)
 	intel_pmu_enable_all(added);
 }

+static void enable_counter_freeze(void)
+{
+	update_debugctlmsr(get_debugctlmsr() |
+			DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
+}
+
+static void disable_counter_freeze(void)
+{
+	update_debugctlmsr(get_debugctlmsr() &
+			~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
+}
+
 static inline u64 intel_pmu_get_status(void)
 {
 	u64 status;

@@ -2290,6 +2302,91 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	return handled;
 }

+static bool disable_counter_freezing;
+static int __init intel_perf_counter_freezing_setup(char *s)
+{
+	disable_counte
[tip:perf/core] perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler
Commit-ID: af3bdb991a5cb57c189d34aadbd3aa88995e0d9f Gitweb: https://git.kernel.org/tip/af3bdb991a5cb57c189d34aadbd3aa88995e0d9f Author: Andi Kleen AuthorDate: Wed, 8 Aug 2018 00:12:07 -0700 Committer: Ingo Molnar CommitDate: Tue, 2 Oct 2018 10:14:31 +0200 perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler Implements counter freezing for Arch Perfmon v4 (Skylake and newer). This allows to speed up the PMI handler by avoiding unnecessary MSR writes and make it more accurate. The Arch Perfmon v4 PMI handler is substantially different than the older PMI handler. Differences to the old handler: - It relies on counter freezing, which eliminates several MSR writes from the PMI handler and lowers the overhead significantly. It makes the PMI handler more accurate, as all counters get frozen atomically as soon as any counter overflows. So there is much less counting of the PMI handler itself. With the freezing we don't need to disable or enable counters or PEBS. Only BTS which does not support auto-freezing still needs to be explicitly managed. - The PMU acking is done at the end, not the beginning. This makes it possible to avoid manual enabling/disabling of the PMU, instead we just rely on the freezing/acking. - The APIC is acked before reenabling the PMU, which avoids problems with LBRs occasionally not getting unfreezed on Skylake. - Looping is only needed to workaround a corner case which several PMIs are very close to each other. For common cases, the counters are freezed during PMI handler. It doesn't need to do re-check. This patch: - Adds code to enable v4 counter freezing - Fork <=v3 and >=v4 PMI handlers into separate functions. - Add kernel parameter to disable counter freezing. It took some time to debug counter freezing, so in case there are new problems we added an option to turn it off. Would not expect this to be used until there are new bugs. - Only for big core. The patch for small core will be posted later separately. 
Performance: When profiling a kernel build on Kabylake with different perf options, measuring the length of all NMI handlers using the nmi handler trace point: V3 is without counter freezing. V4 is with counter freezing. The value is the average cost of the PMI handler. (lower is better) perf options` V3(ns) V4(ns) delta -c 10 1088 894 -18% -g -c 101862 1646-12% --call-graph lbr -c 10 3649 3367-8% --c.g. dwarf -c 10 2248 1982-12% Signed-off-by: Andi Kleen Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Cc: a...@kernel.org Link: http://lkml.kernel.org/r/1533712328-2834-2-git-send-email-kan.li...@linux.intel.com Signed-off-by: Ingo Molnar --- Documentation/admin-guide/kernel-parameters.txt | 5 ++ arch/x86/events/intel/core.c| 113 arch/x86/events/perf_event.h| 4 +- arch/x86/include/asm/msr-index.h| 1 + 4 files changed, 122 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 92eb1f42240d..6795dedcbd1e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -856,6 +856,11 @@ causing system reset or hang due to sending INIT from AP to BSP. + disable_counter_freezing [HW] + Disable Intel PMU counter freezing feature. + The feature only exists starting from + Arch Perfmon v4 (Skylake and newer). + disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this if to workaround buggy firmware. 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 9b320a51f82f..bd3b8f3600b2 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -1995,6 +1995,18 @@ static void intel_pmu_nhm_enable_all(int added) intel_pmu_enable_all(added); } +static void enable_counter_freeze(void) +{ + update_debugctlmsr(get_debugctlmsr() | + DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI); +} + +static void disable_counter_freeze(void) +{ + update_debugctlmsr(get_debugctlmsr() & + ~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI); +} + static inline u64 intel_pmu_get_status(void) { u64 status; @@ -2290,6 +2302,91 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) return handled; } +static bool disable_counter_freezing; +static int __init intel_perf_counter_freezing_setup(char *s) +{ + disable_counte
[PATCH v1 1/2] perf, tools: Move perf_evsel__reset_weak_group into evlist
From: Andi Kleen

- Move the function from builtin-stat to evlist for reuse
- Rename to evlist to match purpose better
- Pass the evlist as first argument.
- No functional changes

Signed-off-by: Andi Kleen
---
 tools/perf/builtin-stat.c | 29 ++---
 tools/perf/util/evlist.c  | 27 +++
 tools/perf/util/evlist.h  |  3 +++
 3 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 08a1e64078ca..e60f6991dbb8 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -567,32 +567,6 @@ static bool perf_evsel__should_store_id(struct perf_evsel *counter)
 	return STAT_RECORD || counter->attr.read_format & PERF_FORMAT_ID;
 }

-static struct perf_evsel *perf_evsel__reset_weak_group(struct perf_evsel *evsel)
-{
-	struct perf_evsel *c2, *leader;
-	bool is_open = true;
-
-	leader = evsel->leader;
-	pr_debug("Weak group for %s/%d failed\n",
-			leader->name, leader->nr_members);
-
-	/*
-	 * for_each_group_member doesn't work here because it doesn't
-	 * include the first entry.
-	 */
-	evlist__for_each_entry(evsel_list, c2) {
-		if (c2 == evsel)
-			is_open = false;
-		if (c2->leader == leader) {
-			if (is_open)
-				perf_evsel__close(c2);
-			c2->leader = c2;
-			c2->nr_members = 0;
-		}
-	}
-	return leader;
-}
-
 static int __run_perf_stat(int argc, const char **argv, int run_idx)
 {
 	int interval = stat_config.interval;
@@ -639,7 +613,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		if ((errno == EINVAL || errno == EBADF) &&
 		    counter->leader != counter &&
 		    counter->weak_group) {
-			counter = perf_evsel__reset_weak_group(counter);
+			counter = perf_evlist__reset_weak_group(evsel_list,
+								counter);
 			goto try_again;
 		}

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index e7a4b31a84fb..de90f9097e97 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1810,3 +1810,30 @@ void perf_evlist__force_leader(struct perf_evlist *evlist)
 		leader->forced_leader = true;
 	}
 }
+
+struct perf_evsel *perf_evlist__reset_weak_group(struct perf_evlist *evsel_list,
+						 struct perf_evsel *evsel)
+{
+	struct perf_evsel *c2, *leader;
+	bool is_open = true;
+
+	leader = evsel->leader;
+	pr_debug("Weak group for %s/%d failed\n",
+			leader->name, leader->nr_members);
+
+	/*
+	 * for_each_group_member doesn't work here because it doesn't
+	 * include the first entry.
+	 */
+	evlist__for_each_entry(evsel_list, c2) {
+		if (c2 == evsel)
+			is_open = false;
+		if (c2->leader == leader) {
+			if (is_open)
+				perf_evsel__close(c2);
+			c2->leader = c2;
+			c2->nr_members = 0;
+		}
+	}
+	return leader;
+}
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..9919eed6d15b 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -312,4 +312,7 @@ bool perf_evlist__exclude_kernel(struct perf_evlist *evlist);

 void perf_evlist__force_leader(struct perf_evlist *evlist);

+struct perf_evsel *perf_evlist__reset_weak_group(struct perf_evlist *evlist,
+						 struct perf_evsel *evsel);
+
 #endif /* __PERF_EVLIST_H */
-- 
2.17.1
[PATCH v1 2/2] perf record: Support weak groups
From: Andi Kleen

Implement a weak group fallback for perf record, similar to the existing
perf stat support. This allows using groups that might be longer than
the available counters without failing.

Before:

$ perf record -e '{cycles,cache-misses,cache-references,cpu_clk_unhalted.thread,cycles,cycles,cycles}' -a sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cycles).
/bin/dmesg | grep -i perf may provide additional information.

After:

$ ./perf record -e '{cycles,cache-misses,cache-references,cpu_clk_unhalted.thread,cycles,cycles,cycles}:W' -a sleep 1
WARNING: No sample_id_all support, falling back to unordered processing
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 8.136 MB perf.data (134069 samples) ]

Signed-off-by: Andi Kleen
---
 tools/perf/Documentation/perf-list.txt | 1 -
 tools/perf/builtin-record.c            | 7 ++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
index 11300dbe35c5..c50ed1177984 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -49,7 +49,6 @@ counted. The following modifiers exist:
  S - read sample value (PERF_SAMPLE_READ)
  D - pin the event to the PMU
  W - group is weak and will fallback to non-group if not schedulable,
-     only supported in 'perf stat' for now.

 The 'p' modifier can be used for specifying how precise the instruction
 address should be. The 'p' modifier can be specified multiple times:

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 22ebeb92ac51..504d89d67b3a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -388,7 +388,12 @@ static int record__open(struct record *rec)
 				ui__warning("%s\n", msg);
 				goto try_again;
 			}
-
+			if ((errno == EINVAL || errno == EBADF) &&
+			    pos->leader != pos &&
+			    pos->weak_group) {
+				pos = perf_evlist__reset_weak_group(evlist, pos);
+				goto try_again;
+			}
 			rc = -errno;
 			perf_evsel__open_strerror(pos, >target,
 						  errno, msg, sizeof(msg));
-- 
2.17.1
Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)
On Fri, Sep 28, 2018 at 11:22:37PM +0200, Jann Horn wrote:
> On Fri, Sep 28, 2018 at 10:59 PM Andi Kleen wrote:
> > > > This new file descriptor argument doesn't exist today so it would
> > > > need to create a new system call with more arguments
> > >
> > > Is that true? The first argument is a pointer to a struct that
> > > contains its own size, so it can be expanded without an ABI break. I
> > > don't see any reason why you couldn't cram more stuff in there.
> >
> > You're right, we could put the fd into the perf_event, but the following is
> > still true:
> >
> > > > Obviously we would need to keep the old system call around
> > > > for compatibility, so you would need to worry about this
> > > > interaction in any case!
>
> Is that true? IIRC if you want to use the perf tools after a kernel
> update, you have to install a new version of perf anyway, no?

Not at all. perf is fully ABI compatible.

Yes, Ubuntu/Debian make you do it, but there is no reason for it other
than their ignorance. Other sane distributions don't.

Usually the first step when I'm forced to use one of those machines is
to remove the useless wrapper and call the perf binary directly.

-Andi
Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)
> > This new file descriptor argument doesn't exist today so it would
> > need to create a new system call with more arguments
>
> Is that true? The first argument is a pointer to a struct that
> contains its own size, so it can be expanded without an ABI break. I
> don't see any reason why you couldn't cram more stuff in there.

You're right, we could put the fd into the perf_event, but the following
is still true:

> > Obviously we would need to keep the old system call around
> > for compatibility, so you would need to worry about this
> > interaction in any case!
> >
> > So tying it together doesn't make any sense, because
> > the problem has to be solved separately anyways.

-Andi
Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)
On Fri, Sep 28, 2018 at 06:40:17PM +0100, Mark Rutland wrote:
> On Fri, Sep 28, 2018 at 10:23:40AM -0700, Andi Kleen wrote:
> > > There's also been prior discussion on these features in other contexts
> > > (e.g. android exploits resulting from out-of-tree drivers). It would be
> > > nice to see those considered.
> > >
> > > IIRC the conclusion from prior discussions (e.g. [1]) was that we wanted
> > > finer granularity of control such that we could limit PMU access to
> > > specific users -- e.g. disallow arbitrary android apps from poking *any*
> > > PMU, while allowing some more trusted apps/users to use *some* specific
> > > PMUs.
> > >
> > > e.g. we could add /sys/bus/event_source/devices/${PMU}/device, protect
> > > this via the usual fs ACLs, and pass the fd to perf_event_open()
> > > somehow. A valid fd would act as a capability, taking precedence over
> > > perf_event_paranoid.
> >
> > That sounds like an orthogonal feature. I don't think the original
> > patchkit would need to be held up for this. It would be something
> > in addition.
>
> I have to say that I disagree -- these controls will have to interact
> somehow, and the fewer of them we have, the less complexity we'll have
> to deal with longer-term.

You're proposing to completely redesign perf_event_open. This new file
descriptor argument doesn't exist today, so you would need to create a
new system call with more arguments (and BTW it would be more than the
normal 6-argument limit we have, so it couldn't even be a standard
syscall).

Obviously we would need to keep the old system call around for
compatibility, so you would need to worry about this interaction in any
case! So tying it together doesn't make any sense, because the problem
has to be solved separately anyways.

> > BTW can't you already do that with the syscall filter? I assume
> > the Android sandboxes already use that. Just forbid perf_event_open
> > for the apps.
> > Note that this was about providing access to *some* PMUs in some cases.
>
> IIUC, if that can be done today via a syscall filter, the same is true
> of per-pmu paranoid settings.

The difference is that the Android sandboxes are likely already doing
this and have all the infrastructure, and it's just another rule.

Requiring syscall filters just to use the PMU on a system that otherwise
doesn't need them would be very odd.

-Andi
Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)
> Right now we have a single knob, which is poorly documented and that should
> be fixed first. But some googling gives you the information that allowing
> unprivileged access is a security risk. So the security-focused sysadmin

Ah, if only Google could simply answer all our questions!

> will deny access to the PMUs no matter what.

Whether there is a security risk is not something that can be stated in
a global way.

Essentially these are channels of information. The channels always exist
in the form of timing variances for any shared resource (like shared
caches or shared memory/IO/interconnect bandwidth) that can be measured.
Perfmon counters make the channels generally less noisy, but they do not
cause them. To really close them completely you would need to avoid
sharing anything, or not allow measuring time, neither of which is
practical short of an air gap.

There are reasonable assessments you can make either way, and the
answers will be different based on your requirements. There isn't a
single answer that works for everyone.

There are cases where it isn't a problem at all. If you don't have
multiple users on the system your tolerance should be extremely high.
For systems with multiple users there can be different tradeoffs.

So there isn't a single answer, and that is why it is important that
this is configurable.

-Andi
Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)
> There's also been prior discussion on these features in other contexts
> (e.g. android exploits resulting from out-of-tree drivers). It would be
> nice to see those considered.
>
> IIRC the conclusion from prior discussions (e.g. [1]) was that we wanted
> finer granularity of control such that we could limit PMU access to
> specific users -- e.g. disallow arbitrary android apps from poking *any*
> PMU, while allowing some more trusted apps/users to use *some* specific
> PMUs.
>
> e.g. we could add /sys/bus/event_source/devices/${PMU}/device, protect
> this via the usual fs ACLs, and pass the fd to perf_event_open()
> somehow. A valid fd would act as a capability, taking precedence over
> perf_event_paranoid.

That sounds like an orthogonal feature. I don't think the original
patchkit would need to be held up for this. It would be something in
addition.

BTW can't you already do that with the syscall filter? I assume the
Android sandboxes already use that. Just forbid perf_event_open for the
apps.

-Andi
Re: [PATCH v6 3/5] tools, perf, script: Add --call-trace and --call-ret-trace
> Seems to me, these two features are _NOT_ only a benefit for intel_pt;
> other hardware tracing (e.g. Arm CoreSight) can enable these features
> as well. This patch documents them only for intel_pt, so later, if we
> enable this feature on an Arm platform, we need to change the doc;
> alternatively we can use a more general description for these two
> options in the first place. What do you think about this?

Likely it already works for CoreSight.

I specified intel_pt because if we just say "traces" the users won't
know what PMU to specify for record. Being too abstract is often not
helpful.

If someone successfully tests it on CoreSight they could submit a patch
to the documentation to add "or " to these two cases. That would make it
clear for those users too.

-Andi
Re: [RFC 3/5] perf: Allow per PMU access control
> +	mutex_lock(&pmus_lock);
> +	list_for_each_entry(pmu, &pmus, entry)
> +		pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
> +	mutex_unlock(&pmus_lock);

What happens to PMUs that get added later?

The rest looks good. Can you post a non-RFC version?

-Andi
Re: perf segmentation fault from NULL dereference
> Please let me know if this is a valid issue so we can get a fix in.

If it crashes it must be a valid issue, of course. But I'm not sure
about your bisect. It's hard to see how my patch could cause this.
Sometimes bisects go wrong. Did you verify by just reverting the patch?

The first thing I would try is to run with valgrind or ASan and see if
it reports anything.

-Andi
[tip:perf/core] perf script: Print DSO for callindent
Commit-ID: a78cdee6fbb136694334ade4cedb331a9d0b4d5e Gitweb: https://git.kernel.org/tip/a78cdee6fbb136694334ade4cedb331a9d0b4d5e Author: Andi Kleen AuthorDate: Tue, 18 Sep 2018 05:32:10 -0700 Committer: Arnaldo Carvalho de Melo CommitDate: Wed, 19 Sep 2018 15:25:51 -0300 perf script: Print DSO for callindent Now that we don't need to print the IP/ADDR for callindent the DSO is also not printed. It's useful for some cases, so add an own DSO printout for callindent for the case when IP/ADDR is not enabled. Before: % perf script --itrace=cr -F +callindent,-ip,-sym,-symoff,-addr swapper 0 [000] 3377.917072: 1 branches: pt_config swapper 0 [000] 3377.917072: 1 branches: pt_config swapper 0 [000] 3377.917072: 1 branches: pt_event_add swapper 0 [000] 3377.917072: 1 branches: perf_pmu_enable swapper 0 [000] 3377.917072: 1 branches: perf_pmu_nop_void swapper 0 [000] 3377.917072: 1 branches: event_sched_in.isra.107 swapper 0 [000] 3377.917072: 1 branches: __x86_indirect_thunk_rax swapper 0 [000] 3377.917072: 1 branches: perf_pmu_nop_int swapper 0 [000] 3377.917072: 1 branches: group_sched_in swapper 0 [000] 3377.917072: 1 branches: event_filter_match swapper 0 [000] 3377.917072: 1 branches: event_filter_match swapper 0 [000] 3377.917072: 1 branches: group_sched_in After: swapper 0 [000] 3377.917072: 1 branches: ([unknown]) pt_config swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) pt_config swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) pt_event_add swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) perf_pmu_enable swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) perf_pmu_nop_void swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) event_sched_in.isra.107 swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) __x86_indirect_thunk_rax swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) perf_pmu_nop_int swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) group_sched_in swapper 0 [000] 3377.917072: 
1 branches: ([kernel.kallsyms]) event_filter_match
swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) event_filter_match
swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) group_sched_in
swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) __x86_indirect_thunk_rax
swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) perf_pmu_nop_txn
swapper 0 [000] 3377.917072: 1 branches: ([kernel.kallsyms]) event_sched_in.isra.107

(in the kernel case of course it's not very useful, but it's important
with user programs where symbols are not unique)

Signed-off-by: Andi Kleen
Tested-by: Arnaldo Carvalho de Melo
Cc: Adrian Hunter
Cc: Jiri Olsa
Cc: Kim Phillips
Link: http://lkml.kernel.org/r/20180918123214.26728-6-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/builtin-script.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 7aa59696e97a..7732346bd9dd 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1115,6 +1115,7 @@ static int perf_sample__fprintf_callindent(struct perf_sample *sample,
 	const char *name = NULL;
 	static int spacing;
 	int len = 0;
+	int dlen = 0;
 	u64 ip = 0;
 
 	/*
@@ -1141,6 +1142,12 @@ static int perf_sample__fprintf_callindent(struct perf_sample *sample,
 		ip = sample->ip;
 	}
 
+	if (PRINT_FIELD(DSO) && !(PRINT_FIELD(IP) || PRINT_FIELD(ADDR))) {
+		dlen += fprintf(fp, "(");
+		dlen += map__fprintf_dsoname(al->map, fp);
+		dlen += fprintf(fp, ")\t");
+	}
+
 	if (name)
 		len = fprintf(fp, "%*s%s", (int)depth * 4, "", name);
 	else if (ip)
@@ -1159,7 +1166,7 @@ static int perf_sample__fprintf_callindent(struct perf_sample *sample,
 	if (len < spacing)
 		len += fprintf(fp, "%*s", spacing - len, "");
 
-	return len;
+	return len + dlen;
 }
 
 static int perf_sample__fprintf_insn(struct perf_sample *sample,
[tip:perf/core] tools lib subcmd: Support overwriting the pager
Commit-ID:  03a1f49f26482cf832e7f0157ae5da716d927701
Gitweb:     https://git.kernel.org/tip/03a1f49f26482cf832e7f0157ae5da716d927701
Author:     Andi Kleen
AuthorDate: Tue, 18 Sep 2018 05:32:07 -0700
Committer:  Arnaldo Carvalho de Melo
CommitDate: Wed, 19 Sep 2018 15:16:19 -0300

tools lib subcmd: Support overwriting the pager

Add an interface to the auto pager code that allows callers to overwrite
the pager.

Signed-off-by: Andi Kleen
Cc: Adrian Hunter
Cc: Jiri Olsa
Cc: Josh Poimboeuf
Cc: Kim Phillips
Cc: Namhyung Kim
Link: http://lkml.kernel.org/r/20180918123214.26728-3-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/lib/subcmd/pager.c | 11 ++++++++++-
 tools/lib/subcmd/pager.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/lib/subcmd/pager.c b/tools/lib/subcmd/pager.c
index 9997a8805a82..e3d47b59b14d 100644
--- a/tools/lib/subcmd/pager.c
+++ b/tools/lib/subcmd/pager.c
@@ -23,6 +23,13 @@ void pager_init(const char *pager_env)
 	subcmd_config.pager_env = pager_env;
 }
 
+static const char *forced_pager;
+
+void force_pager(const char *pager)
+{
+	forced_pager = pager;
+}
+
 static void pager_preexec(void)
 {
 	/*
@@ -66,7 +73,9 @@ void setup_pager(void)
 	const char *pager = getenv(subcmd_config.pager_env);
 	struct winsize sz;
 
-	if (!isatty(1))
+	if (forced_pager)
+		pager = forced_pager;
+	if (!isatty(1) && !forced_pager)
 		return;
 	if (ioctl(1, TIOCGWINSZ, &sz) == 0)
 		pager_columns = sz.ws_col;
diff --git a/tools/lib/subcmd/pager.h b/tools/lib/subcmd/pager.h
index f1a53cf29880..a818964693ab 100644
--- a/tools/lib/subcmd/pager.h
+++ b/tools/lib/subcmd/pager.h
@@ -7,5 +7,6 @@ extern void pager_init(const char *pager_env);
 extern void setup_pager(void);
 extern int pager_in_use(void);
 extern int pager_get_columns(void);
+extern void force_pager(const char *);
 
 #endif /* __SUBCMD_PAGER_H */
[tip:perf/core] perf script: Allow sym and dso without ip, addr
Commit-ID:  37fed3de555199733805a2d3e03aee7727c09ea4
Gitweb:     https://git.kernel.org/tip/37fed3de555199733805a2d3e03aee7727c09ea4
Author:     Andi Kleen
AuthorDate: Tue, 18 Sep 2018 05:32:09 -0700
Committer:  Arnaldo Carvalho de Melo
CommitDate: Wed, 19 Sep 2018 15:20:03 -0300

perf script: Allow sym and dso without ip, addr

Currently sym and dso require printing ip and addr because the print
function is tied to those outputs. With callindent it makes sense to
print the symbol or dso without the numerical IP or ADDR, so change the
dependency check to only check the underlying sample attribute.

Also, the branch target output relies on the user_set flag to determine
if the branch target should be implicitly printed. When modifying the
fields with + or -, also set user_set so that ADDR can be removed. We
also need to set wildcard_set to make the initial sanity check pass.

This allows removing a lot of noise in callindent output by dropping the
numerical addresses, which are not all that useful.

Before:

% perf script --itrace=cr -F +callindent
swapper 0 [000] 156546.354971: 1 branches: pt_config                 0 [unknown] ([unknown]) => 81010486 pt_config ([kernel.kallsyms])
swapper 0 [000] 156546.354971: 1 branches: pt_config                 81010499 pt_config ([kernel.kallsyms]) => 8101063e pt_event_add ([kernel.kallsyms])
swapper 0 [000] 156546.354971: 1 branches: pt_event_add              81010635 pt_event_add ([kernel.kallsyms]) => 8115e687 event_sched_in.isra.107 ([kernel.kallsyms])
swapper 0 [000] 156546.354971: 1 branches: perf_pmu_enable           8115e726 event_sched_in.isra.107 ([kernel.kallsyms]) => 811579b0 perf_pmu_enable ([kernel.kallsyms])
swapper 0 [000] 156546.354971: 1 branches: perf_pmu_nop_void         81151730 perf_pmu_nop_void ([kernel.kallsyms]) => 8115e72b event_sched_in.isra.107 ([kernel.kallsyms])
swapper 0 [000] 156546.354971: 1 branches: event_sched_in.isra.107   8115e737 event_sched_in.isra.107 ([kernel.kallsyms]) => 8115e7a5 group_sched_in ([kernel.kallsyms])
swapper 0 [000] 156546.354971: 1 branches: __x86_indirect_thunk_rax  8115e7f6 group_sched_in ([kernel.kallsyms]) => 81a03000 __x86_indirect_thunk_rax ([kernel.kallsyms])

After:

% perf script --itrace=cr -F +callindent,-ip,-sym,-symoff
swapper 0 [000] 156546.354971: 1 branches: pt_config
swapper 0 [000] 156546.354971: 1 branches: pt_config
swapper 0 [000] 156546.354971: 1 branches: pt_event_add
swapper 0 [000] 156546.354971: 1 branches: perf_pmu_enable
swapper 0 [000] 156546.354971: 1 branches: perf_pmu_nop_void
swapper 0 [000] 156546.354971: 1 branches: event_sched_in.isra.107
swapper 0 [000] 156546.354971: 1 branches: __x86_indirect_thunk_rax
swapper 0 [000] 156546.354971: 1 branches: perf_pmu_nop_int
swapper 0 [000] 156546.354971: 1 branches: group_sched_in
swapper 0 [000] 156546.354971: 1 branches: event_filter_match
swapper 0 [000] 156546.354971: 1 branches: event_filter_match
swapper 0 [000] 156546.354971: 1 branches: group_sched_in

Signed-off-by: Andi Kleen
Tested-by: Arnaldo Carvalho de Melo
Cc: Adrian Hunter
Cc: Jiri Olsa
Cc: Kim Phillips
Link: http://lkml.kernel.org/r/20180918123214.26728-5-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/builtin-script.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 66699dd351e0..7aa59696e97a 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -406,9 +406,10 @@ static int perf_evsel__check_attr(struct perf_evsel *evsel,
 		      PERF_OUTPUT_WEIGHT))
 		return -EINVAL;
-	if (PRINT_FIELD(SYM) && !PRINT_FIELD(IP) && !PRINT_FIELD(ADDR)) {
+	if (PRINT_FIELD(SYM) &&
+	    !(evsel->attr.sample_type & (PERF_SAMPLE_IP|PERF_SAMPLE_ADDR))) {
 		pr_err("Display of symbols requested but neither sample IP nor "
-		       "sample address\nis selected. Hence, no addresses to convert "
+		       "sample address\navailable. Hence, no addresses to convert "
 		       "to symbols.\n");
 		return -EINVA
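The essence of the changed check can be sketched language-neutrally (Python here): symbol display is gated on the sample carrying an IP or ADDR, not on whether the ip/addr *output fields* are selected. The constant values are taken from uapi/linux/perf_event.h and should be treated as an assumption of this sketch.

```python
# Sketch of the new dependency check from the patch above: symbols can be
# resolved whenever the sample itself contains an address, regardless of
# which output fields the user selected with -F.
PERF_SAMPLE_IP = 1 << 0    # assumed values, as in uapi/linux/perf_event.h
PERF_SAMPLE_ADDR = 1 << 3

def can_resolve_symbols(sample_type: int) -> bool:
    """True if there is any address in the sample to convert to a symbol."""
    return bool(sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_ADDR))
```

This is why `-F +callindent,-ip,-sym,-symoff` now passes the sanity check: the sample_type still records IP/ADDR even when their printing is disabled.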
[tip:perf/core] perf tools: Report itrace options in help
Commit-ID:  c12e039d1233f24ab2726945f883037f47b26f1d
Gitweb:     https://git.kernel.org/tip/c12e039d1233f24ab2726945f883037f47b26f1d
Author:     Andi Kleen
AuthorDate: Thu, 13 Sep 2018 20:10:31 -0700
Committer:  Arnaldo Carvalho de Melo
CommitDate: Wed, 19 Sep 2018 15:06:59 -0300

perf tools: Report itrace options in help

I often forget all the options that --itrace accepts. Instead of burying
them only in the man page, also report them in the normal command line
help to make them more easily accessible.

v2: Align

Signed-off-by: Andi Kleen
Cc: Jiri Olsa
Cc: Kim Phillips
Link: http://lkml.kernel.org/r/20180914031038.4160-2-a...@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/builtin-inject.c |  3 ++-
 tools/perf/builtin-report.c |  2 +-
 tools/perf/builtin-script.c |  2 +-
 tools/perf/util/auxtrace.h  | 19 +++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index b4a29f435b06..eda41673c4f3 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -798,7 +798,8 @@ int cmd_inject(int argc, const char **argv)
 		   "kallsyms pathname"),
 	OPT_BOOLEAN('f', "force", , "don't complain, do it"),
 	OPT_CALLBACK_OPTARG(0, "itrace", &itrace_synth_opts,
-			    NULL, "opts", "Instruction Tracing options",
+			    NULL, "opts", "Instruction Tracing options\n"
+			    ITRACE_HELP,
 			    itrace_parse_synth_opts),
 	OPT_BOOLEAN(0, "strip", ,
 		    "strip non-synthesized events (use with --itrace)"),
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 7507e4d6dce1..c0703979c51d 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -1104,7 +1104,7 @@ int cmd_report(int argc, const char **argv)
 	OPT_CALLBACK(0, "percentage", NULL, "relative|absolute",
 		     "how to display percentage of filtered entries", parse_filter_percentage),
 	OPT_CALLBACK_OPTARG(0, "itrace", &itrace_synth_opts, NULL, "opts",
-			    "Instruction Tracing options",
+			    "Instruction Tracing options\n" ITRACE_HELP,
			    itrace_parse_synth_opts),
 	OPT_BOOLEAN(0, "full-source-path", _full_filename,
 		    "Show full source file name path for source lines"),
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 765391b6c88c..66699dd351e0 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3193,7 +3193,7 @@ int cmd_script(int argc, const char **argv)
 	OPT_BOOLEAN(0, "ns", ,
 		    "Use 9 decimal places when displaying time"),
 	OPT_CALLBACK_OPTARG(0, "itrace", &itrace_synth_opts, NULL, "opts",
-			    "Instruction Tracing options",
+			    "Instruction Tracing options\n" ITRACE_HELP,
 			    itrace_parse_synth_opts),
 	OPT_BOOLEAN(0, "full-source-path", _full_filename,
 		    "Show full source file name path for source lines"),
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index a86b7eab6673..0a6ce9c4fc11 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -576,6 +576,23 @@ static inline void auxtrace__free(struct perf_session *session)
 	return session->auxtrace->free(session);
 }
+#define ITRACE_HELP \
+"	i:	synthesize instructions events\n" \
+"	b:	synthesize branches events\n" \
+"	c:	synthesize branches events (calls only)\n" \
+"	r:	synthesize branches events (returns only)\n" \
+"	x:	synthesize transactions events\n" \
+"	w:	synthesize ptwrite events\n" \
+"	p:	synthesize power events\n" \
+"	e:	synthesize error events\n" \
+"	d:	create a debug log\n" \
+"	g[len]:	synthesize a call chain (use wi
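The mechanism in the patch above is a single shared help string appended to each command's `--itrace` option description (in C this happens via adjacent string-literal concatenation of `ITRACE_HELP`). A minimal language-neutral sketch of the same idea, abridged and in Python:

```python
# One shared help constant reused by every command that takes --itrace,
# so the option list stays consistent across inject/report/script.
# Abridged: the real ITRACE_HELP macro lists more flags.
ITRACE_HELP = (
    " i: synthesize instructions events\n"
    " b: synthesize branches events\n"
    " e: synthesize error events\n"
)

def itrace_option_help() -> str:
    # mirrors "Instruction Tracing options\n" ITRACE_HELP in the C code
    return "Instruction Tracing options\n" + ITRACE_HELP
```

Defining the text once and concatenating it into each option's help is what keeps `perf inject --help`, `perf report --help`, and `perf script --help` in sync.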
[PATCH v6 5/5] perf, tools, script: Support total cycles count
[Updated patch with a bug fix. Please use this version.]

For perf script brstackinsn, also print a running cycles count. This
makes it easier to calculate cycle deltas for code sections measured
with LBRs.

% perf record -b -a sleep 1
% perf script -F +brstackinsn
...
	7f73ecc41083	insn: 74 06			# PRED 9 cycles [17] 1.11 IPC
	7f73ecc4108b	insn: a8 10
	7f73ecc4108d	insn: 74 71			# PRED 1 cycles [18] 1.00 IPC
	7f73ecc41100	insn: 48 8b 46 10
	7f73ecc41104	insn: 4c 8b 38
	7f73ecc41107	insn: 4d 85 ff
	7f73ecc4110a	insn: 0f 84 b0 00 00 00
	7f73ecc41110	insn: 83 43 58 01
	7f73ecc41114	insn: 48 89 df
	7f73ecc41117	insn: e8 94 73 04 00		# PRED 6 cycles [24] 1.00 IPC

Signed-off-by: Andi Kleen
---
v2: reflow line
v3: Print cycles in correct column

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 6232658c6f31..dbb0a780225c 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -913,7 +913,7 @@ static int grab_bb(u8 *buffer, u64 start, u64 end,
 static int ip__fprintf_jump(uint64_t ip, struct branch_entry *en,
 			    struct perf_insn *x, u8 *inbuf, int len,
-			    int insn, FILE *fp)
+			    int insn, FILE *fp, int *total_cycles)
 {
 	int printed = fprintf(fp, "\t%016" PRIx64 "\t%-30s\t#%s%s%s%s",
 			      ip,
 			      dump_insn(x, ip, inbuf, len, NULL),
@@ -922,7 +922,8 @@ static int ip__fprintf_jump(uint64_t ip, struct branch_entry *en,
 			      en->flags.in_tx ? " INTX" : "",
 			      en->flags.abort ? " ABORT" : "");
 	if (en->flags.cycles) {
-		printed += fprintf(fp, " %d cycles", en->flags.cycles);
+		*total_cycles += en->flags.cycles;
+		printed += fprintf(fp, " %d cycles [%d]", en->flags.cycles, *total_cycles);
 		if (insn)
 			printed += fprintf(fp, " %.2f IPC", (float)insn / en->flags.cycles);
 	}
@@ -979,6 +980,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 	u8 buffer[MAXBB];
 	unsigned off;
 	struct symbol *lastsym = NULL;
+	int total_cycles = 0;
 	if (!(br && br->nr))
 		return 0;
@@ -999,7 +1001,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 		printed += ip__fprintf_sym(br->entries[nr - 1].from, thread,
 					   x.cpumode, x.cpu, &x, attr, fp);
 		printed += ip__fprintf_jump(br->entries[nr - 1].from, &br->entries[nr - 1],
-					    &x, buffer, len, 0, fp);
+					    &x, buffer, len, 0, fp, &total_cycles);
 	}
 	/* Print all blocks */
@@ -1027,7 +1029,8 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 			printed += ip__fprintf_sym(ip, thread, x.cpumode, x.cpu, &x, attr, fp);
 			if (ip == end) {
-				printed += ip__fprintf_jump(ip, &br->entries[i], &x, buffer + off, len - off, insn, fp);
+				printed += ip__fprintf_jump(ip, &br->entries[i], &x, buffer + off, len - off, insn, fp,
+							    &total_cycles);
 				break;
 			} else {
 				printed += fprintf(fp, "\t%016" PRIx64 "\t%s\n", ip,
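The accumulation the patch adds to ip__fprintf_jump() can be sketched in a few lines of Python: each taken branch's cycle count is folded into a running total, which is printed in brackets so deltas between two points in the trace can be read off directly. The instruction count of 10 in the test below is a hypothetical value chosen to reproduce the "9 cycles [17] 1.11 IPC" line from the example output.

```python
def annotate_branch(cycles: int, insn_cnt: int, total_cycles: int):
    """Mirror of the ip__fprintf_jump() change: add this branch's cycles
    to the running total and format both, plus IPC when the number of
    instructions in the block is known. Returns (annotation, new_total)."""
    total_cycles += cycles
    out = " %d cycles [%d]" % (cycles, total_cycles)
    if insn_cnt:
        # same formula as the C code: (float)insn / en->flags.cycles
        out += " %.2f IPC" % (insn_cnt / cycles)
    return out, total_cycles
```

Subtracting the bracketed totals at two points ([24] - [17] = 7 cycles, say) is exactly the "cycle delta for a code section" use case the commit message describes.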
Re: [PATCH] perf tools: Do not zero sample_id_all for group members
On Sun, Sep 23, 2018 at 05:04:20PM +0200, Jiri Olsa wrote:
> Andi reported the following malfunction:
>
>   # perf record -e '{ref-cycles,cycles}:S' -a sleep 1
>   # perf script
>   non matching sample_id_all
>
> That's because we disable the sample_id_all bit for non-sampling
> group members. We can't do that, because it needs to be the
> same over the whole event list. This patch keeps it untouched
> again.
>
> Fixes: e9add8bac6c6 ("perf evsel: Disable write_backward for leader sampling group events")
> Reported-by: Andi Kleen
> Link: http://lkml.kernel.org/n/tip-y7hc1i6en4saxqxuwr2g2...@git.kernel.org
> Signed-off-by: Jiri Olsa

Tested-by: Andi Kleen

This should also be Cc:'ed to stable, please.

-Andi
[PATCH v6 1/5] tools, perf, script: Add --insn-trace for instruction decoding
From: Andi Kleen

Add a --insn-trace shorthand option for decoding and disassembling
instruction streams for intel_pt. This automatically pipes the output
into the xed disassembler to generate disassembled instructions. This
just makes this use model much nicer to use.

Before:

% perf record -e intel_pt// ...
% perf script --itrace=i0ns --ns -F +insn,-event,-period | xed -F insn: -A -64
swapper 0 [000] 117276.429606186: 81010486 pt_config ([kernel.kallsyms]) nopl %eax, (%rax,%rax,1)
swapper 0 [000] 117276.429606186: 8101048b pt_config ([kernel.kallsyms]) add $0x10, %rsp
swapper 0 [000] 117276.429606186: 8101048f pt_config ([kernel.kallsyms]) popq %rbx
swapper 0 [000] 117276.429606186: 81010490 pt_config ([kernel.kallsyms]) popq %rbp
swapper 0 [000] 117276.429606186: 81010491 pt_config ([kernel.kallsyms]) popq %r12
swapper 0 [000] 117276.429606186: 81010493 pt_config ([kernel.kallsyms]) popq %r13
swapper 0 [000] 117276.429606186: 81010495 pt_config ([kernel.kallsyms]) popq %r14
swapper 0 [000] 117276.429606186: 81010497 pt_config ([kernel.kallsyms]) popq %r15
swapper 0 [000] 117276.429606186: 81010499 pt_config ([kernel.kallsyms]) retq
swapper 0 [000] 117276.429606186: 8101063e pt_event_add ([kernel.kallsyms]) cmpl $0x1, 0x1b0(%rbx)
swapper 0 [000] 117276.429606186: 81010645 pt_event_add ([kernel.kallsyms]) mov $0xffea, %eax
swapper 0 [000] 117276.429606186: 8101064a pt_event_add ([kernel.kallsyms]) mov $0x0, %edx
swapper 0 [000] 117276.429606186: 8101064f pt_event_add ([kernel.kallsyms]) popq %rbx
swapper 0 [000] 117276.429606186: 81010650 pt_event_add ([kernel.kallsyms]) cmovnz %edx, %eax
swapper 0 [000] 117276.429606186: 81010653 pt_event_add ([kernel.kallsyms]) jmp 0x81010635
swapper 0 [000] 117276.429606186: 81010635 pt_event_add ([kernel.kallsyms]) retq
swapper 0 [000] 117276.429606186: 8115e687 event_sched_in.isra.107 ([kernel.kallsyms]) test %eax, %eax

Now:

% perf record -e intel_pt// ...
% perf script --insn-trace --xed
... same output ...

XED needs to be installed with:

> git clone https://github.com/intelxed/mbuild.git mbuild
> git clone https://github.com/intelxed/xed
> cd xed
> mkdir obj
> cd obj
> ../mfile.py
> sudo ../mfile.py --prefix=/usr/local install

Signed-off-by: Andi Kleen
--
v2: Add separate --xed option
v3: Add xed build documentation and update commit
---
 tools/perf/Documentation/build-xed.txt   | 11 +++
 tools/perf/Documentation/perf-script.txt |  7 +++
 tools/perf/builtin-script.c              | 23 +++
 3 files changed, 41 insertions(+)
 create mode 100644 tools/perf/Documentation/build-xed.txt

diff --git a/tools/perf/Documentation/build-xed.txt b/tools/perf/Documentation/build-xed.txt
new file mode 100644
index ..8da3028e6dca
--- /dev/null
+++ b/tools/perf/Documentation/build-xed.txt
@@ -0,0 +1,11 @@
+
+For --xed the xed tool is needed. Here is how to install it:
+
+> git clone https://github.com/intelxed/mbuild.git mbuild
+> git clone https://github.com/intelxed/xed
+> cd xed
+> mkdir obj
+> cd obj
+> ../mfile.py
+> sudo ../mfile.py --prefix=/usr/local install
+
diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index afdafe2110a1..00c655ab4968 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -383,6 +383,13 @@ include::itrace.txt[]
 	will be printed. Each entry has function name and file/line. Enabled by
 	default, disable with --no-inline.
+--insn-trace::
+	Show instruction stream for intel_pt traces. Combine with --xed to
+	show disassembly.
+
+--xed::
+	Run xed disassembler on output. Requires installing the xed disassembler.
+
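The shorthand relationship the commit describes can be captured in a tiny helper: `--insn-trace` stands for the long-form itrace/field options, and `--xed` additionally pipes through the xed disassembler. This is a hypothetical illustration built only from the two command lines in the commit message above, not perf's actual implementation.

```python
def insn_trace_command(xed: bool = False) -> str:
    """Return the long-form perf script invocation that the --insn-trace
    (and optional --xed) shorthand abbreviates, per the commit message."""
    # i0ns = synthesize instruction events with period 0 ns, i.e. every
    # instruction; the -F tweaks show insn bytes and hide event/period.
    cmd = "perf script --itrace=i0ns --ns -F +insn,-event,-period"
    if xed:
        # --xed forces the output through the xed disassembler
        cmd += " | xed -F insn: -A -64"
    return cmd
```

Comparing `insn_trace_command(True)` with the "Before" command line shows the two are the same pipeline, which is the whole point of the shorthand.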
[PATCH v6 2/5] perf, tools, script: Make itrace script default to all calls
From: Andi Kleen

By default, perf script for itrace outputs sampled instructions or
branches. In my experience this is confusing to users because it's hard
to correlate with real program behavior. The sampling makes sense for
tools like report that actually sample to reduce the run time, but run
time is normally not a problem for perf script. It's better to give an
accurate representation of the program flow.

Default perf script to output all calls for itrace. That's a much saner
default. The old behavior can still be requested with

perf script --itrace=ibxwpe10

v2: Fix ETM build failure
v3: Really fix ETM build failure (Kim Phillips)

Signed-off-by: Andi Kleen
---
 tools/perf/Documentation/itrace.txt |  7 ---
 tools/perf/builtin-script.c         |  5 -
 tools/perf/util/auxtrace.c          | 17 -
 tools/perf/util/auxtrace.h          |  5 -
 tools/perf/util/cs-etm.c            |  3 ++-
 tools/perf/util/intel-bts.c         |  3 ++-
 tools/perf/util/intel-pt.c          |  3 ++-
 7 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/itrace.txt b/tools/perf/Documentation/itrace.txt
index a3abe04c779d..c2182cbabde3 100644
--- a/tools/perf/Documentation/itrace.txt
+++ b/tools/perf/Documentation/itrace.txt
@@ -11,10 +11,11 @@
 	l	synthesize last branch entries (use with i or x)
 	s	skip initial number of events
-	The default is all events i.e. the same as --itrace=ibxwpe
+	The default is all events i.e. the same as --itrace=ibxwpe,
+	except for perf script where it is --itrace=ce
-	In addition, the period (default 10) for instructions events
-	can be specified in units of:
+	In addition, the period (default 10, except for perf script where it is 1)
+	for instructions events can be specified in units of:
 	i	instructions
 	t	ticks
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 311f5b53dd83..519ebb5a1f96 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3128,7 +3128,10 @@ int cmd_script(int argc, const char **argv)
 	char *rec_script_path = NULL;
 	char *rep_script_path = NULL;
 	struct perf_session *session;
-	struct itrace_synth_opts itrace_synth_opts = { .set = false, };
+	struct itrace_synth_opts itrace_synth_opts = {
+		.set = false,
+		.default_no_sample = true,
+	};
 	char *script_path = NULL;
 	const char **__argv;
 	int i, j, err = 0;
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index c4617bcfd521..72d5ba2479bf 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -962,16 +962,23 @@ s64 perf_event__process_auxtrace(struct perf_session *session,
 #define PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ	64
 #define PERF_ITRACE_MAX_LAST_BRANCH_SZ		1024
-void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts)
+void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts,
+				    bool no_sample)
 {
-	synth_opts->instructions = true;
 	synth_opts->branches = true;
 	synth_opts->transactions = true;
 	synth_opts->ptwrites = true;
 	synth_opts->pwr_events = true;
 	synth_opts->errors = true;
-	synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
-	synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+	if (no_sample) {
+		synth_opts->period_type = PERF_ITRACE_PERIOD_INSTRUCTIONS;
+		synth_opts->period = 1;
+		synth_opts->calls = true;
+	} else {
+		synth_opts->instructions = true;
+		synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
+		synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+	}
 	synth_opts->callchain_sz = PERF_ITRACE_DEFAULT_CALLCHAIN_SZ;
 	synth_opts->last_branch_sz = PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ;
 	synth_opts->initial_skip = 0;
@@ -999,7 +1006,7 @@ int itrace_parse_synth_opts(const struct option *opt, const char *str,
 	}
 	if (!str) {
-		itrace_synth_opts__set_default(synth_opts);
+		itrace_synth_opts__set_default(synth_opts, false);
 		return 0;
 	}
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 0a6ce9c4fc11..f6df30187e1c 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -57,6 +57,7 @@ enum itrace_period_type {
 /**
  * struct itrace_synth_opts - AUX area tracing synthesis options.
  * @set: indicates whether or not options have been set
+ * @default_no_sample: Default to no sampling.
 * @inject: indicates the event (not just the sample) must be fully synthesized
 *          because 'perf inject' will wr
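The branch added to itrace_synth_opts__set_default() can be summarized as follows (a Python sketch of the C logic above; the placeholder strings for the sampled-mode period stand in for the PERF_ITRACE_DEFAULT_* constants, whose values this sketch does not assume):

```python
def itrace_default_opts(no_sample: bool) -> dict:
    """Sketch of itrace_synth_opts__set_default(): perf script passes
    no_sample=True, so an empty --itrace means "synthesize all calls,
    one event per instruction" instead of sampled instruction events."""
    opts = {"branches": True, "transactions": True, "ptwrites": True,
            "pwr_events": True, "errors": True}
    if no_sample:
        # perf script default: exact program flow, all calls
        opts.update(period_type="instructions", period=1, calls=True)
    else:
        # sampling tools keep the old default
        opts.update(instructions=True,
                    period_type="PERF_ITRACE_DEFAULT_PERIOD_TYPE",
                    period="PERF_ITRACE_DEFAULT_PERIOD")
    return opts
```

Either way the common events (branches, transactions, ptwrites, power, errors) stay on; only the instruction-sampling behavior differs between the two modes.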
[PATCH v6 2/5] perf, tools, script: Make itrace script default to all calls
From: Andi Kleen

By default perf script for itrace outputs sampled instructions or
branches. In my experience this is confusing to users because it's hard
to correlate with real program behavior. The sampling makes sense for
tools like report that actually sample to reduce the run time, but run
time is normally not a problem for perf script. It's better to give an
accurate representation of the program flow.

Default perf script to output all calls for itrace. That's a much saner
default. The old behavior can still be requested with

  perf script --itrace=ibxwpe100000

v2: Fix ETM build failure
v3: Really fix ETM build failure (Kim Phillips)

Signed-off-by: Andi Kleen
---
 tools/perf/Documentation/itrace.txt |  7 ---
 tools/perf/builtin-script.c         |  5 -
 tools/perf/util/auxtrace.c          | 17 -
 tools/perf/util/auxtrace.h          |  5 -
 tools/perf/util/cs-etm.c            |  3 ++-
 tools/perf/util/intel-bts.c         |  3 ++-
 tools/perf/util/intel-pt.c          |  3 ++-
 7 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/itrace.txt b/tools/perf/Documentation/itrace.txt
index a3abe04c779d..c2182cbabde3 100644
--- a/tools/perf/Documentation/itrace.txt
+++ b/tools/perf/Documentation/itrace.txt
@@ -11,10 +11,11 @@
 		l	synthesize last branch entries (use with i or x)
 		s	skip initial number of events

-	The default is all events i.e. the same as --itrace=ibxwpe
+	The default is all events i.e. the same as --itrace=ibxwpe,
+	except for perf script where it is --itrace=ce

-	In addition, the period (default 100000) for instructions events
-	can be specified in units of:
+	In addition, the period (default 100000, except for perf script where it is 1)
+	for instructions events can be specified in units of:

 		i	instructions
 		t	ticks

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 311f5b53dd83..519ebb5a1f96 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3128,7 +3128,10 @@ int cmd_script(int argc, const char **argv)
 	char *rec_script_path = NULL;
 	char *rep_script_path = NULL;
 	struct perf_session *session;
-	struct itrace_synth_opts itrace_synth_opts = { .set = false, };
+	struct itrace_synth_opts itrace_synth_opts = {
+		.set = false,
+		.default_no_sample = true,
+	};
 	char *script_path = NULL;
 	const char **__argv;
 	int i, j, err = 0;

diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index c4617bcfd521..72d5ba2479bf 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -962,16 +962,23 @@ s64 perf_event__process_auxtrace(struct perf_session *session,
 #define PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ	64
 #define PERF_ITRACE_MAX_LAST_BRANCH_SZ		1024

-void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts)
+void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts,
+				    bool no_sample)
 {
-	synth_opts->instructions = true;
 	synth_opts->branches = true;
 	synth_opts->transactions = true;
 	synth_opts->ptwrites = true;
 	synth_opts->pwr_events = true;
 	synth_opts->errors = true;
-	synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
-	synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+	if (no_sample) {
+		synth_opts->period_type = PERF_ITRACE_PERIOD_INSTRUCTIONS;
+		synth_opts->period = 1;
+		synth_opts->calls = true;
+	} else {
+		synth_opts->instructions = true;
+		synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
+		synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
+	}
 	synth_opts->callchain_sz = PERF_ITRACE_DEFAULT_CALLCHAIN_SZ;
 	synth_opts->last_branch_sz = PERF_ITRACE_DEFAULT_LAST_BRANCH_SZ;
 	synth_opts->initial_skip = 0;
@@ -999,7 +1006,7 @@ int itrace_parse_synth_opts(const struct option *opt, const char *str,
 	}

 	if (!str) {
-		itrace_synth_opts__set_default(synth_opts);
+		itrace_synth_opts__set_default(synth_opts, false);
 		return 0;
 	}

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 0a6ce9c4fc11..f6df30187e1c 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -57,6 +57,7 @@ enum itrace_period_type {
 /**
  * struct itrace_synth_opts - AUX area tracing synthesis options.
  * @set: indicates whether or not options have been set
+ * @default_no_sample: Default to no sampling.
  * @inject: indicates the event (not just the sample) must be fully synthesized
  *          because 'perf inject' will write it out
[PATCH v6 5/5] perf, tools, script: Support total cycles count
From: Andi Kleen

For perf script brstackinsn also print a running cycles count. This
makes it easier to calculate cycle deltas for code sections measured
with LBRs.

% perf record -b -a sleep 1
% perf script -F +brstackinsn
...
	_dl_sysdep_start+330:
	7eff9f20583a	insn: 75 c4			# PRED 24 cycles [24]
	7eff9f205800	insn: 48 83 e8 03
	7eff9f205804	insn: 48 83 f8 1e
	7eff9f205808	insn: 77 26
	7eff9f20580a	insn: 48 63 04 81
	7eff9f20580e	insn: 48 01 c8
	7eff9f205811	insn: ff e0			# MISPRED 31 cycles [7] 0.71 IPC
	7eff9f2059c0	insn: 44 8b 62 08
	7eff9f2059c4	insn: e9 67 fe ff ff		# PRED 55 cycles [24] 0.04 IPC
	7eff9f205830	insn: 48 83 c2 10
	7eff9f205834	insn: 48 8b 02
	7eff9f205837	insn: 48 85 c0
	7eff9f20583a	insn: 75 c4			# PRED 68 cycles [13] 0.23 IPC

Signed-off-by: Andi Kleen
---
v2: reflow line
---
 tools/perf/builtin-script.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 6232658c6f31..53dc27e63d98 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -913,7 +913,7 @@ static int grab_bb(u8 *buffer, u64 start, u64 end,

 static int ip__fprintf_jump(uint64_t ip, struct branch_entry *en,
 			    struct perf_insn *x, u8 *inbuf, int len,
-			    int insn, FILE *fp)
+			    int insn, FILE *fp, int *total_cycles)
 {
 	int printed = fprintf(fp, "\t%016" PRIx64 "\t%-30s\t#%s%s%s%s",
 			      ip,
 			      dump_insn(x, ip, inbuf, len, NULL),
@@ -922,7 +922,8 @@ static int ip__fprintf_jump(uint64_t ip, struct branch_entry *en,
 			      en->flags.in_tx ? " INTX" : "",
 			      en->flags.abort ? " ABORT" : "");
 	if (en->flags.cycles) {
-		printed += fprintf(fp, " %d cycles", en->flags.cycles);
+		*total_cycles += en->flags.cycles;
+		printed += fprintf(fp, " %d cycles [%d]", *total_cycles, en->flags.cycles);
 		if (insn)
 			printed += fprintf(fp, " %.2f IPC", (float)insn / en->flags.cycles);
 	}
@@ -979,6 +980,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 	u8 buffer[MAXBB];
 	unsigned off;
 	struct symbol *lastsym = NULL;
+	int total_cycles = 0;

 	if (!(br && br->nr))
 		return 0;
@@ -999,7 +1001,7 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 		printed += ip__fprintf_sym(br->entries[nr - 1].from, thread,
 					   x.cpumode, x.cpu, &lastsym, attr, fp);
 		printed += ip__fprintf_jump(br->entries[nr - 1].from, &br->entries[nr - 1],
-					    &x, buffer, len, 0, fp);
+					    &x, buffer, len, 0, fp, &total_cycles);
 	}

 	/* Print all blocks */
@@ -1027,7 +1029,8 @@ static int perf_sample__fprintf_brstackinsn(struct perf_sample *sample,
 			printed += ip__fprintf_sym(ip, thread, x.cpumode, x.cpu, &lastsym, attr, fp);
 			if (ip == end) {
-				printed += ip__fprintf_jump(ip, &br->entries[i], &x, buffer + off, len - off, insn, fp);
+				printed += ip__fprintf_jump(ip, &br->entries[i], &x, buffer + off, len - off, insn, fp,
+							    &total_cycles);
 				break;
 			} else {
 				printed += fprintf(fp, "\t%016" PRIx64 "\t%s\n", ip,
--
2.17.1
Make perf script easier to use for itrace
Implement a range of improvements to make it easier to look at itrace
traces with perf script. Nothing here couldn't be done before with some
additional scripting, but add simple high level options to make it
easier to use.

% perf record -e intel_pt//k -a sleep 1

Show function calls:

% perf script --call-trace
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_pmu_enable
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) event_filter_match
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) group_sched_in
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) event_sched_in.isra.107
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_set_state.part.71
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_update_time
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_pmu_disable
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_log_itrace_start
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_update_userpage

Show function calls and returns:

% perf script --call-ret-trace
	perf   900 [000] 194167.205652203: tr strt ([unknown]) pt_config
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) pt_config
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) pt_event_add
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) perf_pmu_enable
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) perf_pmu_nop_void
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) event_sched_in.isra.107
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) perf_pmu_nop_int
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) group_sched_in
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) event_filter_match
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) event_filter_match
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) group_sched_in

Show instruction traces (using XED):

% perf script --insn-trace --xed
	swapper  0 [000] 117276.429606186: 81010486 pt_config ([kernel.kallsyms]) nopl %eax, (%rax,%rax,1)
	swapper  0 [000] 117276.429606186: 8101048b pt_config ([kernel.kallsyms]) add $0x10, %rsp
	swapper  0 [000] 117276.429606186: 8101048f pt_config ([kernel.kallsyms]) popq %rbx
	swapper  0 [000] 117276.429606186: 81010490 pt_config ([kernel.kallsyms]) popq %rbp
	swapper  0 [000] 117276.429606186: 81010491 pt_config ([kernel.kallsyms]) popq %r12
	swapper  0 [000] 117276.429606186: 81010493 pt_config ([kernel.kallsyms]) popq %r13
	swapper  0 [000] 117276.429606186: 81010495 pt_config ([kernel.kallsyms]) popq %r14
	swapper  0 [000] 117276.429606186: 81010497 pt_config ([kernel.kallsyms]) popq %r15
	swapper  0 [000] 117276.429606186: 81010499 pt_config ([kernel.kallsyms]) retq

Filter by a ftrace style graph function:

% perf script --graph-function group_sched_in --call-trace
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) group_sched_in
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) event_sched_in.isra.107
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_set_state.part.71
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_update_time
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_pmu_disable
	perf   900 [000] 194167.205652203: ([kernel.kallsyms])
[PATCH v6 4/5] tools, perf, script: Implement --graph-function
From: Andi Kleen

Add a ftrace style --graph-function argument to perf script that allows
to print itrace function calls only below a given function. This makes
it easier to find the code of interest in a large trace.

% perf record -e intel_pt//k -a sleep 1
% perf script --graph-function group_sched_in --call-trace
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) group_sched_in
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) event_sched_in.isra.107
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_set_state.part.71
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_update_time
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) perf_pmu_disable
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) perf_log_itrace_start
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_update_userpage
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) calc_timer_values
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) sched_clock_cpu
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) arch_perf_update_userpage
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) __fentry__
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) using_native_sched_clock
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) sched_clock_stable
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) perf_pmu_enable
	perf    900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) group_sched_in
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) event_sched_in.isra.107
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) perf_event_set_state.part.71
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) perf_event_update_time
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) perf_pmu_disable
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) perf_log_itrace_start
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) perf_event_update_userpage
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) calc_timer_values
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) sched_clock_cpu
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) arch_perf_update_userpage
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) __fentry__
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) using_native_sched_clock
	swapper   0 [001] 194167.205660693: ([kernel.kallsyms]) sched_clock_stable

v2: Remove debug printout

Signed-off-by: Andi Kleen
---
 tools/perf/Documentation/perf-script.txt |  4 +
 tools/perf/builtin-script.c              | 96 +++-
 tools/perf/util/symbol.h                 |  3 +-
 tools/perf/util/thread.h                 |  2 +
 4 files changed, 86 insertions(+), 19 deletions(-)

diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index 805baabd238e..a2b37ce48094 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -397,6 +397,10 @@ include::itrace.txt[]
 --call-ret-trace::
 	Show call and return stream for intel_pt traces.

+--graph-function::
+	For itrace only show specified functions and their callees for
+	itrace. Multiple functions can be separated by comma.
+
 SEE ALSO
 linkperf:perf-record[1], linkperf:perf-script-perl[1],

diff --git
[PATCH v6 3/5] tools, perf, script: Add --call-trace and --call-ret-trace
From: Andi Kleen

Add short cut options to print PT call trace and call-ret-trace, for
calls and call and returns. Roughly corresponds to ftrace function
tracer and function graph tracer. Just makes these common use cases
nicer to use.

% perf record -a -e intel_pt// sleep 1
% perf script --call-trace
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_pmu_enable
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) event_filter_match
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) group_sched_in
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) event_sched_in.isra.107
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_set_state.part.71
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_update_time
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_pmu_disable
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_log_itrace_start
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: ([kernel.kallsyms]) perf_event_update_userpage

% perf script --call-ret-trace
	perf   900 [000] 194167.205652203: tr strt ([unknown]) pt_config
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) pt_config
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) pt_event_add
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) perf_pmu_enable
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) perf_pmu_nop_void
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) event_sched_in.isra.107
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) perf_pmu_nop_int
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) group_sched_in
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) event_filter_match
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) event_filter_match
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) group_sched_in
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) __x86_indirect_thunk_rax
	perf   900 [000] 194167.205652203: return  ([kernel.kallsyms]) perf_pmu_nop_txn
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) event_sched_in.isra.107
	perf   900 [000] 194167.205652203: call    ([kernel.kallsyms]) perf_event_set_state.part.71

Signed-off-by: Andi Kleen
---
v2: Print errors, power, ptwrite too
---
 tools/perf/Documentation/perf-script.txt |  7 +++
 tools/perf/builtin-script.c              | 24
 2 files changed, 31 insertions(+)

diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index 00c655ab4968..805baabd238e 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -390,6 +390,13 @@ include::itrace.txt[]
 --xed::
 	Run xed disassembler on output. Requires installing the xed disassembler.

+--call-trace::
+	Show call stream for intel_pt traces. The CPUs are interleaved, but
+	can be filtered with -C.
+
+--call-ret-trace::
+	Show call and return stream for intel_pt traces.
+
 SEE ALSO
 linkperf:perf-record[1], linkperf:perf-script-perl[1],

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 519ebb5a1f96..6c4562973983 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3119,6 +3119,26 @@ static int parse_xed(const struct option *opt __maybe_unused,
 	return 0;
 }

+static int parse_call_trace(const struct option *opt __maybe_unused,
+			    const char *str __maybe_unused,
+			    int unset __maybe_unused)
+{
+	parse_output_fields(NULL, "-ip,-addr,-event,-period,+callindent", 0);
+	itrace_parse_synth_opts(opt, "cewp", 0);
+	nanosecs = true;
+	return 0;
+}
+
+static int parse_callret_trace(const struct option
Re: [PATCH v3 4/5] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct perf_event *event;
> +	struct perf_event_attr attr = {
> +		.type = PERF_TYPE_RAW,
> +		.size = sizeof(attr),
> +		.pinned = true,
> +		.exclude_host = true,
> +		.sample_type = PERF_SAMPLE_BRANCH_STACK,
> +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> +				      PERF_SAMPLE_BRANCH_USER |
> +				      PERF_SAMPLE_BRANCH_KERNEL,

I think that will allocate an extra perfmon counter, right?

So if the guest wants to use all perfmon counters we would start to
multiplex, which would be bad.

Would need to fix perf core to not allocate a perfmon counter in this
case, if no period and no event count is requested.

-Andi