Re: [tip: perf/core] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On 4/16/2021 12:45 PM, Peter Zijlstra wrote: On Fri, Apr 16, 2021 at 03:01:48PM -, tip-bot2 for Kan Liang wrote: @@ -2331,6 +2367,9 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm) if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED)) return; + if (x86_pmu.sched_task && event->hw.target) + perf_sched_cb_dec(event->ctx->pmu); + if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed)) on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1); } 'perf test' on a kernel with CONFIG_DEBUG_PREEMPT=y gives: [ 244.439538] BUG: using smp_processor_id() in preemptible [] code: perf/1771 If it's a preemptible environment, I think we should disable interrupts and preemption to protect the sched_cb_list. It seems we don't need perf_ctx_lock() here. I don't think we touch the area in NMI. I think disabling interrupts should be good enough to protect the cpuctx. How about the below patch? diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index e34eb72..45630beed 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2333,6 +2333,8 @@ static void x86_pmu_clear_dirty_counters(void) static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm) { + unsigned long flags; + if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED)) return; @@ -2341,8 +2343,10 @@ static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm) * and clear the existing dirty counters. 
*/ if (x86_pmu.sched_task && event->hw.target) { + local_irq_save(flags); perf_sched_cb_inc(event->ctx->pmu); x86_pmu_clear_dirty_counters(); + local_irq_restore(flags); } /* @@ -2363,12 +2367,16 @@ static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm) static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm) { + unsigned long flags; if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED)) return; - if (x86_pmu.sched_task && event->hw.target) + if (x86_pmu.sched_task && event->hw.target) { + local_irq_save(flags); perf_sched_cb_dec(event->ctx->pmu); + local_irq_restore(flags); + } if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed)) on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1); Thanks, Kan
Re: [PATCH] perf/x86/intel/uncore: Avoid null dereferences (uncore_extra_pci_dev)
On 4/15/2021 5:19 PM, Liang, Kan wrote: Hi Steve, On 4/15/2021 4:37 PM, Steve Wahl wrote: If an uncore has no pci_init routine, or that routine fails, uncore_pci_init is not called, and memory is not allocated for uncore_extra_pci_dev. So check to make sure uncore_extra_pci_dev is not NULL before use. I thought about this more after yesterday's discussion. There may be a better solution than this. Actually, we don't have to probe all the PCU devices and store them into the uncore_extra_pci_dev for the cpu_init(). We just need to pick up the first PCU device and check the existence of the SBOX once. I will send out a patch shortly. Here is the patch I mentioned. https://lore.kernel.org/lkml/1618521764-100923-1-git-send-email-kan.li...@linux.intel.com Thanks, Kan And fix the case that led us to discover the null derefs; don't fail snbep_pci2phy_map_init if BIOS doesn't supply pcibus_to_node information. Fixes: 9a7832ce3d92 ("perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info") In theory, the pci_read_config_dword() may fail as well. It is possible that the issue could still be observed before the "> 8 nodes" patch. I think the Fixes tag should be 5306c31c5733 ("perf/x86/uncore/hsw-ep: Handle systems with only two SBOXes") Thanks, Kan Signed-off-by: Steve Wahl --- arch/x86/events/intel/uncore_snbep.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index b79951d0707c..14c24356a2fa 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -1373,11 +1373,11 @@ static int snbep_pci2phy_map_init(int devid, int nodeid_loc, int idmap_loc, bool /* * The nodeid and idmap registers only contain enough * information to handle 8 nodes. 
On systems with more - * than 8 nodes, we need to rely on NUMA information, + * than 8 nodes, if available we rely on NUMA information, * filled in from BIOS supplied information, to determine * the topology. */ - if (nr_node_ids <= 8) { + if ((nr_node_ids <= 8) || (pcibus_to_node(ubox_dev->bus) == -1)) { /* get the Node ID of the local register */ err = pci_read_config_dword(ubox_dev, nodeid_loc, &config); if (err) @@ -2865,7 +2865,9 @@ void hswep_uncore_cpu_init(void) hswep_uncore_cbox.num_boxes = boot_cpu_data.x86_max_cores; /* Detect 6-8 core systems with only two SBOXes */ - if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) { + if (!uncore_extra_pci_dev) + hswep_uncore_sbox.num_boxes = 2; + else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) { u32 capid4; pci_read_config_dword(uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3], @@ -3243,6 +3245,8 @@ void bdx_uncore_cpu_init(void) if (boot_cpu_data.x86_model == 86) { uncore_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL; /* Detect systems with no SBOXes */ + } else if (!uncore_extra_pci_dev) { + bdx_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL; } else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) { struct pci_dev *pdev; u32 capid4;
Re: [PATCH] perf/x86/intel/uncore: Avoid null dereferences (uncore_extra_pci_dev)
Hi Steve, On 4/15/2021 4:37 PM, Steve Wahl wrote: If an uncore has no pci_init routine, or that routine fails, uncore_pci_init is not called, and memory is not allocated for uncore_extra_pci_dev. So check to make sure uncore_extra_pci_dev is not NULL before use. I thought about this more after yesterday's discussion. There may be a better solution than this. Actually, we don't have to probe all the PCU devices and store them into the uncore_extra_pci_dev for the cpu_init(). We just need to pick up the first PCU device and check the existence of the SBOX once. I will send out a patch shortly. And fix the case that led us to discover the null derefs; don't fail snbep_pci2phy_map_init if BIOS doesn't supply pcibus_to_node information. Fixes: 9a7832ce3d92 ("perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info") In theory, the pci_read_config_dword() may fail as well. It is possible that the issue could still be observed before the "> 8 nodes" patch. I think the Fixes tag should be 5306c31c5733 ("perf/x86/uncore/hsw-ep: Handle systems with only two SBOXes") Thanks, Kan Signed-off-by: Steve Wahl --- arch/x86/events/intel/uncore_snbep.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c index b79951d0707c..14c24356a2fa 100644 --- a/arch/x86/events/intel/uncore_snbep.c +++ b/arch/x86/events/intel/uncore_snbep.c @@ -1373,11 +1373,11 @@ static int snbep_pci2phy_map_init(int devid, int nodeid_loc, int idmap_loc, bool /* * The nodeid and idmap registers only contain enough * information to handle 8 nodes. On systems with more - * than 8 nodes, we need to rely on NUMA information, + * than 8 nodes, if available we rely on NUMA information, * filled in from BIOS supplied information, to determine * the topology. 
*/ - if (nr_node_ids <= 8) { + if ((nr_node_ids <= 8) || (pcibus_to_node(ubox_dev->bus) == -1)) { /* get the Node ID of the local register */ err = pci_read_config_dword(ubox_dev, nodeid_loc, &config); if (err) @@ -2865,7 +2865,9 @@ void hswep_uncore_cpu_init(void) hswep_uncore_cbox.num_boxes = boot_cpu_data.x86_max_cores; /* Detect 6-8 core systems with only two SBOXes */ - if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) { + if (!uncore_extra_pci_dev) + hswep_uncore_sbox.num_boxes = 2; + else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) { u32 capid4; pci_read_config_dword(uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3], @@ -3243,6 +3245,8 @@ void bdx_uncore_cpu_init(void) if (boot_cpu_data.x86_model == 86) { uncore_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL; /* Detect systems with no SBOXes */ + } else if (!uncore_extra_pci_dev) { + bdx_msr_uncores[BDX_MSR_UNCORE_SBOX] = NULL; } else if (uncore_extra_pci_dev[pkg].dev[HSWEP_PCI_PCU_3]) { struct pci_dev *pdev; u32 capid4;
Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On 4/14/2021 9:51 AM, Namhyung Kim wrote: Hi Kan, On Wed, Apr 14, 2021 at 4:04 AM wrote: diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index dd9f3c2..0d4a1a3 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1585,6 +1585,8 @@ static void x86_pmu_del(struct perf_event *event, int flags) if (cpuc->txn_flags & PERF_PMU_TXN_ADD) goto do_del; + __set_bit(event->hw.idx, cpuc->dirty); + /* * Not a TXN, therefore cleanup properly. */ @@ -2304,12 +2306,46 @@ static int x86_pmu_event_init(struct perf_event *event) return err; } +void x86_pmu_clear_dirty_counters(void) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + int i; + + if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX)) + return; Maybe you can check it after clearing assigned counters. It should be very likely that the cpuc->dirty is non-empty. Moving it after the clearing can skip the for_each_set_bit() and bitmap_zero(). OK. I will change it in V4. Thanks, Kan
Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On 4/13/2021 4:33 PM, kernel test robot wrote: Hi, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on tip/perf/core] [also build test WARNING on tip/master linux/master linus/master v5.12-rc7 next-20210413] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/kan-liang-linux-intel-com/perf-x86-Move-cpuc-running-into-P4-specific-code/20210414-030649 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git cface0326a6c2ae5c8f47bd466f07624b3e348a7 config: i386-tinyconfig (attached as .config) compiler: gcc-9 (Debian 9.3.0-22) 9.3.0 reproduce (this is a W=1 build): # https://github.com/0day-ci/linux/commit/83f02393e1b5a2723294d8697f4fd5473d70602c git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review kan-liang-linux-intel-com/perf-x86-Move-cpuc-running-into-P4-specific-code/20210414-030649 git checkout 83f02393e1b5a2723294d8697f4fd5473d70602c # save the attached .config to linux build tree make W=1 ARCH=i386 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All warnings (new ones prefixed by >>): arch/x86/events/core.c:2309:6: warning: no previous prototype for 'x86_pmu_clear_dirty_counters' [-Wmissing-prototypes] 2309 | void x86_pmu_clear_dirty_counters(void) | ^~~~ vim +/x86_pmu_clear_dirty_counters +2309 arch/x86/events/core.c 2308 2309 void x86_pmu_clear_dirty_counters(void) Should be "static void x86_pmu_clear_dirty_counters(void)". I will send V4 shortly to fix it. Thanks, Kan 2310 { 2311 struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); 2312 int i; 2313 2314 if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX)) 2315 return; 2316 2317 /* Don't need to clear the assigned counter. 
*/ 2318 for (i = 0; i < cpuc->n_events; i++) 2319 __clear_bit(cpuc->assign[i], cpuc->dirty); 2320 2321 for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) { 2322 /* Metrics and fake events don't have corresponding HW counters. */ 2323 if (is_metric_idx(i) || (i == INTEL_PMC_IDX_FIXED_VLBR)) 2324 continue; 2325 else if (i >= INTEL_PMC_IDX_FIXED) 2326 wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - INTEL_PMC_IDX_FIXED), 0); 2327 else 2328 wrmsrl(x86_pmu_event_addr(i), 0); 2329 } 2330 2331 bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX); 2332 } 2333 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [PATCH V3 2/2] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On 4/13/2021 8:34 PM, Andy Lutomirski wrote: On Tue, Apr 13, 2021 at 12:05 PM wrote: From: Kan Liang The counter value of a perf task may leak to another RDPMC task. For example, a perf stat task as below is running on CPU 0. perf stat -e 'branches,cycles' -- taskset -c 0 ./workload I assume this doesn't fix the leak if the sensitive counter is systemwide? Right. Could Intel please add proper security and ideally virtualization for this? Ideally RDPMC permission would be a bitmask for all RDPMC-able counters, not just a single on/off switch. Yes, we are working on it. For now, I think this patch is what we can do so far. Thanks, Kan
Re: [PATCH V5 16/25] perf/x86: Register hybrid PMUs
On 4/9/2021 11:45 AM, Peter Zijlstra wrote: On Fri, Apr 09, 2021 at 09:50:20AM -0400, Liang, Kan wrote: On 4/9/2021 2:58 AM, Peter Zijlstra wrote: On Mon, Apr 05, 2021 at 08:10:58AM -0700, kan.li...@linux.intel.com wrote: @@ -2089,9 +2119,46 @@ static int __init init_hw_perf_events(void) if (err) goto out1; - err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); - if (err) - goto out2; + if (!is_hybrid()) { + err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); + if (err) + goto out2; + } else { + u8 cpu_type = get_this_hybrid_cpu_type(); + struct x86_hybrid_pmu *hybrid_pmu; + bool registered = false; + int i; + + if (!cpu_type && x86_pmu.get_hybrid_cpu_type) + cpu_type = x86_pmu.get_hybrid_cpu_type(); + + for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) { + hybrid_pmu = &x86_pmu.hybrid_pmu[i]; + + hybrid_pmu->pmu = pmu; + hybrid_pmu->pmu.type = -1; + hybrid_pmu->pmu.attr_update = x86_pmu.attr_update; + hybrid_pmu->pmu.capabilities |= PERF_PMU_CAP_HETEROGENEOUS_CPUS; + + err = perf_pmu_register(&hybrid_pmu->pmu, hybrid_pmu->name, + (hybrid_pmu->cpu_type == hybrid_big) ? PERF_TYPE_RAW : -1); + if (err) + continue; + + if (cpu_type == hybrid_pmu->cpu_type) + x86_pmu_update_cpu_context(&hybrid_pmu->pmu, raw_smp_processor_id()); + + registered = true; + } + + if (!registered) { + pr_warn("Failed to register hybrid PMUs\n"); + kfree(x86_pmu.hybrid_pmu); + x86_pmu.hybrid_pmu = NULL; + x86_pmu.num_hybrid_pmus = 0; + goto out2; + } I don't think this is quite right. registered will be true even if one fails, while I think you meant to only have it true when all (both) types registered correctly. No, I mean that perf errors out only when all types fail to be registered. All or nothing seems a better approach to me. There really isn't a good reason for any one of them to fail. Sure. I will change it in V6. Thanks, Kan
Re: [PATCH V5 21/25] perf: Introduce PERF_TYPE_HARDWARE_PMU and PERF_TYPE_HW_CACHE_PMU
On 4/9/2021 5:21 AM, Peter Zijlstra wrote: On Mon, Apr 05, 2021 at 08:11:03AM -0700, kan.li...@linux.intel.com wrote: From: Kan Liang Current Hardware events and Hardware cache events have special perf types, PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE. The two types don't pass the PMU type in the user interface. For a hybrid system, the perf subsystem doesn't know which PMU the events belong to. The first capable PMU will always be assigned to the events. The events never get a chance to run on the other capable PMUs. Add a PMU aware version PERF_TYPE_HARDWARE_PMU and PERF_TYPE_HW_CACHE_PMU. The PMU type ID is stored at attr.config[40:32]. Support the new types for X86. Obviously ARM would need the same, but also, I don't think I see the need to introduce new types. AFAICT there is nothing that stops this scheme from working for the existing types. Also, pmu type is 32bit, not 8bit. So how about something like this? --- diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 3f7f89ea5e51..074c7687d466 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -260,15 +260,16 @@ struct perf_event; /** * pmu::capabilities flags */ -#define PERF_PMU_CAP_NO_INTERRUPT 0x01 -#define PERF_PMU_CAP_NO_NMI 0x02 -#define PERF_PMU_CAP_AUX_NO_SG 0x04 -#define PERF_PMU_CAP_EXTENDED_REGS 0x08 -#define PERF_PMU_CAP_EXCLUSIVE 0x10 -#define PERF_PMU_CAP_ITRACE 0x20 -#define PERF_PMU_CAP_HETEROGENEOUS_CPUS 0x40 -#define PERF_PMU_CAP_NO_EXCLUDE 0x80 -#define PERF_PMU_CAP_AUX_OUTPUT 0x100 +#define PERF_PMU_CAP_NO_INTERRUPT 0x0001 +#define PERF_PMU_CAP_NO_NMI 0x0002 +#define PERF_PMU_CAP_AUX_NO_SG 0x0004 +#define PERF_PMU_CAP_EXTENDED_REGS 0x0008 +#define PERF_PMU_CAP_EXCLUSIVE 0x0010 +#define PERF_PMU_CAP_ITRACE 0x0020 +#define PERF_PMU_CAP_HETEROGENEOUS_CPUS 0x0040 +#define PERF_PMU_CAP_NO_EXCLUDE 0x0080 +#define PERF_PMU_CAP_AUX_OUTPUT 0x0100 +#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0200 struct perf_output_handle; diff --git a/kernel/events/core.c 
b/kernel/events/core.c index f07943183041..910a0666ebfe 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3,14 +3,21 @@ static struct pmu *perf_init_event(struct perf_event *event) * are often aliases for PERF_TYPE_RAW. */ type = event->attr.type; - if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) - type = PERF_TYPE_RAW; + if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) { + type = event->attr.config >> 32; + if (!type) + type = PERF_TYPE_RAW; + } For the old tool, the default PMU will be the big core. I think it's OK for X86. Since only the low 32 bits of event->attr.config contain the 'real' config value, I think all the ARCHs will do event->attr.config &= 0xffffffff. Maybe we should move it to the generic code. + if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE) { + type = event->attr.config >> 32; + if (!type) + type = PERF_TYPE_RAW; + else + event->attr.config &= 0xffffffff; again: rcu_read_lock(); pmu = idr_find(&pmu_idr, type); rcu_read_unlock(); if (pmu) { + if (event->attr.type != type && type != PERF_TYPE_RAW && + !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE)) + goto fail; + ret = perf_try_init_event(pmu, event); if (ret == -ENOENT && event->attr.type != type) { type = event->attr.type; I don't think we want to go through all available PMUs again if users already specify a PMU. I updated the patch a little bit. (Not tested yet. I will do some tests.) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 3158cbc..4f5f9a9 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2175,6 +2175,7 @@ static int __init init_hw_perf_events(void) hybrid_pmu->pmu.type = -1; hybrid_pmu->pmu.attr_update = x86_pmu.attr_update; hybrid_pmu->pmu.capabilities |= PERF_PMU_CAP_HETEROGENEOUS_CPUS; + hybrid_pmu->pmu.capabilities |= PERF_PMU_CAP_EXTENDED_HW_TYPE; err = perf_pmu_register(&hybrid_pmu->pmu, hybrid_pmu->name, (hybrid_pmu->cpu_type == hybrid_big) ? 
PERF_TYPE_RAW : -1); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index b832e09..391bfb7 100644 ---
Re: [PATCH V5 23/25] perf/x86/msr: Add Alder Lake CPU support
On 4/9/2021 5:24 AM, Peter Zijlstra wrote: On Mon, Apr 05, 2021 at 08:11:05AM -0700, kan.li...@linux.intel.com wrote: From: Kan Liang PPERF and SMI_COUNT MSRs are also supported on Alder Lake. The External Design Specification (EDS) is not published yet. It comes from an authoritative internal source. The patch has been tested on real hardware. Reviewed-by: Andi Kleen Signed-off-by: Kan Liang --- arch/x86/events/msr.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c index 680404c..c853b28 100644 --- a/arch/x86/events/msr.c +++ b/arch/x86/events/msr.c @@ -100,6 +100,8 @@ static bool test_intel(int idx, void *data) case INTEL_FAM6_TIGERLAKE_L: case INTEL_FAM6_TIGERLAKE: case INTEL_FAM6_ROCKETLAKE: + case INTEL_FAM6_ALDERLAKE: + case INTEL_FAM6_ALDERLAKE_L: if (idx == PERF_MSR_SMI || idx == PERF_MSR_PPERF) return true; break; If only it would be sanely enumerated... What about sapphire rapids? SPR also supports the MSRs. I will submit a patch separately to support SPR. Thanks, Kan
Re: [PATCH V5 16/25] perf/x86: Register hybrid PMUs
On 4/9/2021 2:58 AM, Peter Zijlstra wrote: On Mon, Apr 05, 2021 at 08:10:58AM -0700, kan.li...@linux.intel.com wrote: @@ -2089,9 +2119,46 @@ static int __init init_hw_perf_events(void) if (err) goto out1; - err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); - if (err) - goto out2; + if (!is_hybrid()) { + err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); + if (err) + goto out2; + } else { + u8 cpu_type = get_this_hybrid_cpu_type(); + struct x86_hybrid_pmu *hybrid_pmu; + bool registered = false; + int i; + + if (!cpu_type && x86_pmu.get_hybrid_cpu_type) + cpu_type = x86_pmu.get_hybrid_cpu_type(); + + for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) { + hybrid_pmu = &x86_pmu.hybrid_pmu[i]; + + hybrid_pmu->pmu = pmu; + hybrid_pmu->pmu.type = -1; + hybrid_pmu->pmu.attr_update = x86_pmu.attr_update; + hybrid_pmu->pmu.capabilities |= PERF_PMU_CAP_HETEROGENEOUS_CPUS; + + err = perf_pmu_register(&hybrid_pmu->pmu, hybrid_pmu->name, + (hybrid_pmu->cpu_type == hybrid_big) ? PERF_TYPE_RAW : -1); + if (err) + continue; + + if (cpu_type == hybrid_pmu->cpu_type) + x86_pmu_update_cpu_context(&hybrid_pmu->pmu, raw_smp_processor_id()); + + registered = true; + } + + if (!registered) { + pr_warn("Failed to register hybrid PMUs\n"); + kfree(x86_pmu.hybrid_pmu); + x86_pmu.hybrid_pmu = NULL; + x86_pmu.num_hybrid_pmus = 0; + goto out2; + } I don't think this is quite right. registered will be true even if one fails, while I think you meant to only have it true when all (both) types registered correctly. No, I mean that perf errors out only when all types fail to be registered. For the case (1 failure, 1 success), users can still access the registered PMU. When a CPU that belongs to the unregistered PMU comes online, a warning will be displayed. Because in init_hybrid_pmu(), we will check the PMU type before updating the CPU mask. if (WARN_ON_ONCE(!pmu || (pmu->pmu.type == -1))) { cpuc->pmu = NULL; return false; } Thanks, Kan
Re: [PATCH V5 04/25] perf/x86/intel: Hybrid PMU support for perf capabilities
On 4/8/2021 9:40 AM, Peter Zijlstra wrote: @@ -4330,7 +4347,7 @@ static int intel_pmu_check_period(struct perf_event *event, u64 value) static int intel_pmu_aux_output_match(struct perf_event *event) { - if (!x86_pmu.intel_cap.pebs_output_pt_available) + if (!intel_pmu_has_cap(event, PERF_CAP_PT_IDX)) return 0; return is_intel_pt_event(event); these sites can have !event->pmu ? I don't think the event->pmu can be NULL, but it could be pt_pmu.pmu. If so, it would be a problem. I think I will still use the x86_pmu.intel_cap.pebs_output_pt_available here in V6. The worst case is that we lose the PEBS via PT support on the small core for now. I guess Alexander may provide a separate patch later to enable the PEBS via PT support on the ADL small core. Thanks, Kan
Re: [PATCH V5 04/25] perf/x86/intel: Hybrid PMU support for perf capabilities
On 4/8/2021 1:00 PM, Peter Zijlstra wrote: On Mon, Apr 05, 2021 at 08:10:46AM -0700, kan.li...@linux.intel.com wrote: +#define is_hybrid() (!!x86_pmu.num_hybrid_pmus) Given this is sprinkled all over the place, can you make this a static_key_false + static_branch_unlikely() such that the hybrid case is out-of-line? Sure, I will add a new static_key_false "perf_is_hybrid" to indicate a hybrid system as below (not tested yet). diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index f8d1222..bd6412e 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -54,6 +54,7 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = { DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key); DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key); +DEFINE_STATIC_KEY_FALSE(perf_is_hybrid); /* * This here uses DEFINE_STATIC_CALL_NULL() to get a static_call defined diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 2b553d9..7cef3cd 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -6119,6 +6119,7 @@ __init int intel_pmu_init(void) GFP_KERNEL); if (!x86_pmu.hybrid_pmu) return -ENOMEM; + static_branch_enable(&perf_is_hybrid); x86_pmu.num_hybrid_pmus = X86_HYBRID_NUM_PMUS; x86_pmu.late_ack = true; diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index bfbecde..d6383d1 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -663,8 +663,8 @@ static __always_inline struct x86_hybrid_pmu *hybrid_pmu(struct pmu *pmu) return container_of(pmu, struct x86_hybrid_pmu, pmu); } -/* The number of hybrid PMUs implies whether it's a hybrid system */ -#define is_hybrid() (!!x86_pmu.num_hybrid_pmus) +extern struct static_key_false perf_is_hybrid; +#define is_hybrid() static_branch_unlikely(&perf_is_hybrid) #define hybrid(_pmu, _field) \ ({ \ Thanks, Kan
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On 3/23/2021 5:41 PM, Peter Zijlstra wrote: On Mon, Mar 22, 2021 at 02:06:33PM +0800, Like Xu wrote: diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 18df17129695..a4ce669cc78d 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event) return x86_pmu_extra_regs(val, event); } -int x86_reserve_hardware(void) +int x86_reserve_hardware(struct perf_event *event) { int err = 0; @@ -382,8 +382,10 @@ int x86_reserve_hardware(void) if (atomic_read(&pmc_refcount) == 0) { if (!reserve_pmc_hardware()) err = -EBUSY; - else + else { reserve_ds_buffers(); + reserve_lbr_buffers(event); + } } if (!err) atomic_inc(&pmc_refcount); This makes absolutely no sense what so ever. This is only executed for the first event that gets here. The LBR xsave buffer is a per-CPU buffer, not a per-event buffer. I think we only need to allocate the buffer once when initializing the first event. Thanks, Kan
Re: [PATCH v4 RESEND 3/5] perf/x86/lbr: Move cpuc->lbr_xsave allocation out of sleeping region
On 3/22/2021 2:06 AM, Like Xu wrote: If the kernel is compiled with the CONFIG_LOCKDEP option, the conditional might_sleep_if() deep in kmem_cache_alloc() will generate the following trace, and potentially cause a deadlock when another LBR event is added: [ 243.115549] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 [ 243.117576] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 839, name: perf [ 243.119326] INFO: lockdep is turned off. [ 243.120249] irq event stamp: 0 [ 243.120967] hardirqs last enabled at (0): [<>] 0x0 [ 243.122415] hardirqs last disabled at (0): [] copy_process+0xa45/0x1dc0 [ 243.124302] softirqs last enabled at (0): [] copy_process+0xa45/0x1dc0 [ 243.126255] softirqs last disabled at (0): [<>] 0x0 [ 243.128119] CPU: 0 PID: 839 Comm: perf Not tainted 5.11.0-rc4-guest+ #8 [ 243.129654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 [ 243.131520] Call Trace: [ 243.132112] dump_stack+0x8d/0xb5 [ 243.132896] ___might_sleep.cold.106+0xb3/0xc3 [ 243.133984] slab_pre_alloc_hook.constprop.85+0x96/0xd0 [ 243.135208] ? intel_pmu_lbr_add+0x152/0x170 [ 243.136207] kmem_cache_alloc+0x36/0x250 [ 243.137126] intel_pmu_lbr_add+0x152/0x170 [ 243.138088] x86_pmu_add+0x83/0xd0 [ 243.138889] ? lock_acquire+0x158/0x350 [ 243.139791] ? lock_acquire+0x158/0x350 [ 243.140694] ? lock_acquire+0x158/0x350 [ 243.141625] ? lock_acquired+0x1e3/0x360 [ 243.142544] ? lock_release+0x1bf/0x340 [ 243.143726] ? trace_hardirqs_on+0x1a/0xd0 [ 243.144823] ? lock_acquired+0x1e3/0x360 [ 243.145742] ? lock_release+0x1bf/0x340 [ 243.147107] ? __slab_free+0x49/0x540 [ 243.147966] ? trace_hardirqs_on+0x1a/0xd0 [ 243.148924] event_sched_in.isra.129+0xf8/0x2a0 [ 243.149989] merge_sched_in+0x261/0x3e0 [ 243.150889] ? trace_hardirqs_on+0x1a/0xd0 [ 243.151869] visit_groups_merge.constprop.135+0x130/0x4a0 [ 243.153122] ? 
sched_clock_cpu+0xc/0xb0 [ 243.154023] ctx_sched_in+0x101/0x210 [ 243.154884] ctx_resched+0x6f/0xc0 [ 243.155686] perf_event_exec+0x21e/0x2e0 [ 243.156641] begin_new_exec+0x5e5/0xbd0 [ 243.157540] load_elf_binary+0x6af/0x1770 [ 243.158478] ? __kernel_read+0x19d/0x2b0 [ 243.159977] ? lock_acquire+0x158/0x350 [ 243.160876] ? __kernel_read+0x19d/0x2b0 [ 243.161796] bprm_execve+0x3c8/0x840 [ 243.162638] do_execveat_common.isra.38+0x1a5/0x1c0 [ 243.163776] __x64_sys_execve+0x32/0x40 [ 243.164676] do_syscall_64+0x33/0x40 [ 243.165514] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 243.166746] RIP: 0033:0x7f6180a26feb [ 243.167590] Code: Unable to access opcode bytes at RIP 0x7f6180a26fc1. [ 243.169097] RSP: 002b:7ffc6558ce18 EFLAGS: 0202 ORIG_RAX: 003b [ 243.170844] RAX: ffda RBX: 7ffc65592d30 RCX: 7f6180a26feb [ 243.172514] RDX: 55657f408dc0 RSI: 7ffc65592410 RDI: 7ffc65592d30 [ 243.174162] RBP: 7ffc6558ce80 R08: 7ffc6558cde0 R09: [ 243.176042] R10: 0008 R11: 0202 R12: 7ffc65592410 [ 243.177696] R13: 55657f408dc0 R14: 0001 R15: 7ffc65592410 One of the solutions is to use GFP_ATOMIC, but it will make the code less reliable under memory pressure. Let's move the memory allocation out of the sleeping region and put it into the x86_reserve_hardware(). The disadvantage of this fix is that the cpuc->lbr_xsave memory will be allocated for each cpu like the legacy ds_buffer. Fixes: c085fb8774 ("perf/x86/intel/lbr: Support XSAVES for arch LBR read") Suggested-by: Kan Liang Signed-off-by: Like Xu I observed the same issue when I did an LBR test on an ADL machine. This patch fixes the issue. 
Tested-by: Kan Liang Thanks, Kan --- arch/x86/events/core.c | 8 +--- arch/x86/events/intel/bts.c | 2 +- arch/x86/events/intel/lbr.c | 22 -- arch/x86/events/perf_event.h | 8 +++- 4 files changed, 29 insertions(+), 11 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 18df17129695..a4ce669cc78d 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -373,7 +373,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event) return x86_pmu_extra_regs(val, event); } -int x86_reserve_hardware(void) +int x86_reserve_hardware(struct perf_event *event) { int err = 0; @@ -382,8 +382,10 @@ int x86_reserve_hardware(void) if (atomic_read(&pmc_refcount) == 0) { if (!reserve_pmc_hardware()) err = -EBUSY; - else + else { reserve_ds_buffers(); + reserve_lbr_buffers(event); + } } if (!err) atomic_inc(&pmc_refcount); @@ -634,7 +636,7 @@ static int
Re: [PATCH V2 1/5] perf/x86/intel/uncore: Parse uncore discovery tables
On 3/18/2021 9:10 PM, Namhyung Kim wrote: Hi Kan, On Thu, Mar 18, 2021 at 3:05 AM wrote: From: Kan Liang A self-describing mechanism for the uncore PerfMon hardware has been introduced with the latest Intel platforms. By reading through an MMIO page worth of information, perf can 'discover' all the standard uncore PerfMon registers in a machine. The discovery mechanism relies on BIOS's support. With a proper BIOS, a PCI device with the unique capability ID 0x23 can be found on each die. Perf can retrieve the information of all available uncore PerfMons from the device via MMIO. The information is composed of one global discovery table and several unit discovery tables. - The global discovery table includes global uncore information of the die, e.g., the address of the global control register, the offset of the global status register, the number of uncore units, the offset of unit discovery tables, etc. - The unit discovery table includes generic uncore unit information, e.g., the access type, the counter width, the address of counters, the address of the counter control, the unit ID, the unit type, etc. The unit is also called "box" in the code. Perf can provide basic uncore support based on this information with the following patches. To locate the PCI device with the discovery tables, check the generic PCI ID first. If it doesn't match, go through the entire PCI device tree and locate the device with the unique capability ID. The uncore information is similar among dies. To save parsing time and space, only completely parse and store the discovery tables on the first die and the first box of each die. The parsed information is stored in an RB tree structure, intel_uncore_discovery_type. The size of the stored discovery tables varies among platforms. It's around 4KB for a Sapphire Rapids server. If a BIOS doesn't support the 'discovery' mechanism, the uncore driver will exit with -ENODEV. There is nothing changed. 
Add a module parameter to disable the discovery feature. If a BIOS gets the discovery tables wrong, users can have an option to disable the feature. For the current patchset, the uncore driver will exit with -ENODEV. In the future, it may fall back to the hardcode uncore driver on a known platform. Signed-off-by: Kan Liang --- arch/x86/events/intel/Makefile | 2 +- arch/x86/events/intel/uncore.c | 31 ++- arch/x86/events/intel/uncore_discovery.c | 318 +++ arch/x86/events/intel/uncore_discovery.h | 105 ++ 4 files changed, 448 insertions(+), 8 deletions(-) create mode 100644 arch/x86/events/intel/uncore_discovery.c create mode 100644 arch/x86/events/intel/uncore_discovery.h diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile index e67a588..10bde6c 100644 --- a/arch/x86/events/intel/Makefile +++ b/arch/x86/events/intel/Makefile @@ -3,6 +3,6 @@ obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o obj-$(CONFIG_CPU_SUP_INTEL)+= ds.o knc.o obj-$(CONFIG_CPU_SUP_INTEL)+= lbr.o p4.o p6.o pt.o obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += intel-uncore.o -intel-uncore-objs := uncore.o uncore_nhmex.o uncore_snb.o uncore_snbep.o +intel-uncore-objs := uncore.o uncore_nhmex.o uncore_snb.o uncore_snbep.o uncore_discovery.o obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o intel-cstate-objs := cstate.o diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 33c8180..d111370 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -4,7 +4,12 @@ #include #include #include "uncore.h" +#include "uncore_discovery.h" +static bool uncore_no_discover; +module_param(uncore_no_discover, bool, 0); Wouldn't it be better to use a positive form like 'uncore_discover = true'? To disable, the module param can be set to 'uncore_discover = false'. I'd like the feature is enabled by default. The default value of a static is 0. So I use the current name. It's just a personal preference. 
+MODULE_PARM_DESC(uncore_no_discover, "Don't enable the Intel uncore PerfMon discovery mechanism " +"(default: enable the discovery mechanism)."); static struct intel_uncore_type *empty_uncore[] = { NULL, }; struct intel_uncore_type **uncore_msr_uncores = empty_uncore; struct intel_uncore_type **uncore_pci_uncores = empty_uncore; [SNIP] +enum uncore_access_type { + UNCORE_ACCESS_MSR = 0, + UNCORE_ACCESS_MMIO, + UNCORE_ACCESS_PCI, + + UNCORE_ACCESS_MAX, +}; + +struct uncore_global_discovery { + union { + u64 table1; + struct { + u64 type : 8, + stride : 8, + max_units : 10, + __reserved_1 : 36, +
Re: [PATCH v2 11/27] perf parse-events: Support hardware events inside PMU
On 3/18/2021 9:21 AM, Arnaldo Carvalho de Melo wrote: Em Thu, Mar 18, 2021 at 01:16:37PM +0100, Jiri Olsa escreveu: On Wed, Mar 17, 2021 at 10:42:45AM -0300, Arnaldo Carvalho de Melo wrote: Em Wed, Mar 17, 2021 at 08:17:52PM +0800, Jin, Yao escreveu: I'm OK to only support 'cpu_core/cpu-cycles/' or 'cpu_atom/cpu-cycles/'. But what would we do for cache events? 'perf stat -e LLC-loads' is OK, but 'perf stat -e cpu/LLC-loads/' is not supported currently. On a hybrid platform, a user may want to enable LLC-loads only on the core CPUs or only on the atom CPUs. That's reasonable. If we don't support the pmu-style event, how do we satisfy this requirement? If we can support the pmu-style event, we can also use the same way for cpu_core/cycles/. At least it's not a bad thing, right? :) While we're discussing, do we really want to use the "core" and "atom" terms here? I thought cpu/cycles/ would be ok for the main (Big) CPU and that we should come up with some short name for the "little" CPUs. Won't we have the same situation with ARM where we want to know the number of cycles spent on a BIG core and also on a little one? Perhaps 'cycles' should mean all cycles, and then we use 'big/cycles/' and 'little/cycles/'? do arm servers already export multiple pmus like this? I did not notice I haven't checked, but AFAIK this BIG/Little kind of arch started there, Mark? Here is the cover letter of the ARM big.little patch set. ARM also exports multiple PMUs, e.g., armv7_cortex_a15 and armv7_cortex_a7. https://lore.kernel.org/lkml/1431533549-27715-1-git-send-email-mark.rutl...@arm.com/ We follow a similar approach for the Intel hybrid PMUs. The naming rule is also similar, "cpu_" + CPU type. We don't use the old name "cpu" for the main CPU type, because we want to make sure all software is updated for the hybrid architecture. Otherwise, an old script using "cpu//" could still run on a hybrid architecture, and users would not notice that the monitored scope has implicitly changed.
The results may not be what they want. Thanks, Kan - Arnaldo it'd definitely be great to have some unified way for this, so far we have the hybrid pmu detection and support in hw events like cycles/instructions.. which should be easy to follow on arm there's also support to have these events on specific pmu pmu/cycles/ , which I still need to check on
Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"
On 3/16/2021 2:34 PM, Stephane Eranian wrote: On Tue, Mar 16, 2021 at 5:28 AM Liang, Kan wrote: On 3/16/2021 3:22 AM, Namhyung Kim wrote: Hi Peter and Kan, On Thu, Mar 4, 2021 at 5:22 AM Peter Zijlstra wrote: On Wed, Mar 03, 2021 at 02:53:00PM -0500, Liang, Kan wrote: On 3/3/2021 1:59 PM, Peter Zijlstra wrote: On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com wrote: +++ b/arch/x86/events/intel/ds.c @@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d continue; } - /* - * On some CPUs the PEBS status can be zero when PEBS is - * racing with clearing of GLOBAL_STATUS. - * - * Normally we would drop that record, but in the - * case when there is only a single active PEBS event - * we can assume it's for that event. - */ - if (!pebs_status && cpuc->pebs_enabled && - !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1))) - pebs_status = cpuc->pebs_enabled; Wouldn't something like: pebs_status = p->status = cpus->pebs_enabled; I didn't consider it as a potential solution in this patch because I don't think it's a proper way that SW modifies the buffer, which is supposed to be manipulated by the HW. Right, but then HW was supposed to write sane values and it doesn't do that either ;-) It's just a personal preference. I don't see any issue here. We may try it. So I mostly agree with you, but I think it's a shame to unsupport such chips, HSW is still a plenty useable chip today. I got a similar issue on ivybridge machines which caused kernel crash. My case it's related to the branch stack with PEBS events but I think it's the same issue. And I can confirm that the above approach of updating p->status fixed the problem. I've talked to Stephane about this, and he wants to make it more robust when we see stale (or invalid) PEBS records. I'll send the patch soon. Hi Namhyung, In case you didn't see it, I've already submitted a patch to fix the issue last Friday. 
https://lore.kernel.org/lkml/161298-140216-1-git-send-email-kan.li...@linux.intel.com/ But if you have a more robust proposal, please feel free to submit it. This fixes the problem on the older systems. The other problem we identified related to the PEBS sample processing code is that you can end up with uninitialized perf_sample_data struct passed to perf_event_overflow(): setup_pebs_fixed_sample_data(pebs, data) { if (!pebs) return; perf_sample_data_init(data); <<< must be moved before the if (!pebs) ... } __intel_pmu_pebs_event(pebs, data) { setup_sample(pebs, data) perf_event_overflow(data); ... } If there is any other reason to get a pebs = NULL in fix_sample_data() or adaptive_sample_data(), then you must call perf_sample_data_init(data) BEFORE you return otherwise you end up in perf_event_overflow() with uninitialized data and you may die as follows: [] ? perf_output_copy+0x4d/0xb0 [] perf_output_sample+0x561/0xab0 [] ? __perf_event_header__init_id+0x112/0x130 [] ? perf_prepare_sample+0x1b1/0x730 [] perf_event_output_forward+0x59/0x80 [] ? perf_event_update_userpage+0xf4/0x110 [] perf_event_overflow+0x88/0xe0 [] __intel_pmu_pebs_event+0x328/0x380 This all stems from get_next_pebs_record_by_bit() potentially returning NULL but the NULL is not handled correctly by the callers. This is what I'd like to see cleaned up in __intel_pmu_pebs_event() to avoid future problems. Got it. Yes, we need another patch to correctly handle the potentially returning NULL. Thanks for pointing it out. Thanks, Kan
Re: [PATCH 1/5] perf/x86/intel/uncore: Parse uncore discovery tables
On 3/16/2021 10:05 AM, Peter Zijlstra wrote: On Tue, Mar 16, 2021 at 08:42:25AM -0400, Liang, Kan wrote: On 3/16/2021 7:43 AM, Peter Zijlstra wrote: On Fri, Mar 12, 2021 at 08:34:34AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang A self-describing mechanism for the uncore PerfMon hardware has been introduced with the latest Intel platforms. By reading through an MMIO page worth of information, perf can 'discover' all the standard uncore PerfMon registers in a machine. The discovery mechanism relies on BIOS support. With a proper BIOS, a PCI device with the unique capability ID 0x23 can be found on each die. Perf can retrieve the information of all available uncore PerfMons from the device via MMIO. The information is composed of one global discovery table and several unit discovery tables. If a BIOS doesn't support the 'discovery' mechanism, nothing is changed. What if the BIOS got it wrong? Will the driver still get it correct if it is a known platform? Yes, I will submit a platform-specific patch to fix this case. Do we need a chicken flag to kill the discovery? uncore_no_discover? Yes, I plan to introduce a .use_discovery_tables flag to indicate whether to use the discovery tables for the known platform. The code below is part of the upcoming SPR uncore patches. The first SPR uncore patch will still rely on the BIOS discovery tables, because some uncore block information hasn't been published yet. We have to retrieve the information from the tables. Once all the information is published, we can kill the discovery by removing the ".use_discovery_tables = true". I was thinking of a module parameter, such that we can tell it to skip discovery at module load time etc. Sure, I will add a module parameter, uncore_no_discover. If users don't want the discovery feature, they can set uncore_no_discover=true. Thanks, Kan
Re: [PATCH 1/5] perf/x86/intel/uncore: Parse uncore discovery tables
On 3/16/2021 7:43 AM, Peter Zijlstra wrote: On Fri, Mar 12, 2021 at 08:34:34AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang A self-describing mechanism for the uncore PerfMon hardware has been introduced with the latest Intel platforms. By reading through an MMIO page worth of information, perf can 'discover' all the standard uncore PerfMon registers in a machine. The discovery mechanism relies on BIOS support. With a proper BIOS, a PCI device with the unique capability ID 0x23 can be found on each die. Perf can retrieve the information of all available uncore PerfMons from the device via MMIO. The information is composed of one global discovery table and several unit discovery tables. If a BIOS doesn't support the 'discovery' mechanism, nothing is changed. What if the BIOS got it wrong? Will the driver still get it correct if it is a known platform? Yes, I will submit a platform-specific patch to fix this case. Do we need a chicken flag to kill the discovery? uncore_no_discover? Yes, I plan to introduce a .use_discovery_tables flag to indicate whether to use the discovery tables for the known platform. The code below is part of the upcoming SPR uncore patches. The first SPR uncore patch will still rely on the BIOS discovery tables, because some uncore block information hasn't been published yet. We have to retrieve the information from the tables. Once all the information is published, we can kill the discovery by removing the ".use_discovery_tables = true". +static const struct intel_uncore_init_fun spr_uncore_init __initconst = { + .cpu_init = spr_uncore_cpu_init, + .pci_init = spr_uncore_pci_init, + .mmio_init = spr_uncore_mmio_init, + .use_discovery_tables = true, +}; Thanks, Kan
Re: [PATCH 1/5] perf/x86/intel/uncore: Parse uncore discovery tables
On 3/16/2021 7:40 AM, Peter Zijlstra wrote: On Fri, Mar 12, 2021 at 08:34:34AM -0800, kan.li...@linux.intel.com wrote: +static struct intel_uncore_discovery_type * +search_uncore_discovery_type(u16 type_id) +{ + struct rb_node *node = discovery_tables.rb_node; + struct intel_uncore_discovery_type *type; + + while (node) { + type = rb_entry(node, struct intel_uncore_discovery_type, node); + + if (type->type > type_id) + node = node->rb_left; + else if (type->type < type_id) + node = node->rb_right; + else + return type; + } + + return NULL; +} + +static struct intel_uncore_discovery_type * +add_uncore_discovery_type(struct uncore_unit_discovery *unit) +{ + struct intel_uncore_discovery_type *type, *cur; + struct rb_node **node = _tables.rb_node; + struct rb_node *parent = *node; + + if (unit->access_type >= UNCORE_ACCESS_MAX) { + pr_warn("Unsupported access type %d\n", unit->access_type); + return NULL; + } + + type = kzalloc(sizeof(struct intel_uncore_discovery_type), GFP_KERNEL); + if (!type) + return NULL; + + type->box_ctrl_die = kcalloc(__uncore_max_dies, sizeof(u64), GFP_KERNEL); + if (!type->box_ctrl_die) + goto free_type; + + type->access_type = unit->access_type; + num_discovered_types[type->access_type]++; + type->type = unit->box_type; + + while (*node) { + parent = *node; + cur = rb_entry(parent, struct intel_uncore_discovery_type, node); + + if (cur->type > type->type) + node = >rb_left; + else + node = >rb_right; + } + + rb_link_node(>node, parent, node); + rb_insert_color(>node, _tables); + + return type; + +free_type: + kfree(type); + + return NULL; + +} I'm thinking this can use some of this: 2d24dd5798d0 ("rbtree: Add generic add and find helpers") Sure, I will use the generic rbtree framework in V2. Thanks, Kan
Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"
On 3/16/2021 3:22 AM, Namhyung Kim wrote: Hi Peter and Kan, On Thu, Mar 4, 2021 at 5:22 AM Peter Zijlstra wrote: On Wed, Mar 03, 2021 at 02:53:00PM -0500, Liang, Kan wrote: On 3/3/2021 1:59 PM, Peter Zijlstra wrote: On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com wrote: +++ b/arch/x86/events/intel/ds.c @@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d continue; } - /* - * On some CPUs the PEBS status can be zero when PEBS is - * racing with clearing of GLOBAL_STATUS. - * - * Normally we would drop that record, but in the - * case when there is only a single active PEBS event - * we can assume it's for that event. - */ - if (!pebs_status && cpuc->pebs_enabled && - !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1))) - pebs_status = cpuc->pebs_enabled; Wouldn't something like: pebs_status = p->status = cpus->pebs_enabled; I didn't consider it as a potential solution in this patch because I don't think it's a proper way that SW modifies the buffer, which is supposed to be manipulated by the HW. Right, but then HW was supposed to write sane values and it doesn't do that either ;-) It's just a personal preference. I don't see any issue here. We may try it. So I mostly agree with you, but I think it's a shame to unsupport such chips, HSW is still a plenty useable chip today. I got a similar issue on ivybridge machines which caused kernel crash. My case it's related to the branch stack with PEBS events but I think it's the same issue. And I can confirm that the above approach of updating p->status fixed the problem. I've talked to Stephane about this, and he wants to make it more robust when we see stale (or invalid) PEBS records. I'll send the patch soon. Hi Namhyung, In case you didn't see it, I've already submitted a patch to fix the issue last Friday. 
https://lore.kernel.org/lkml/161298-140216-1-git-send-email-kan.li...@linux.intel.com/ But if you have a more robust proposal, please feel free to submit it. BTW: The patch set from last Friday also fixed another bug found by the perf_fuzzer test. You may be interested. Thanks, Kan
Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support
On 3/11/2021 2:58 PM, Peter Zijlstra wrote: On Thu, Mar 11, 2021 at 11:53:35AM -0500, Liang, Kan wrote: The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without PMEM. The "cpu_atom" PMU is similar to Tremont, but with different event_constraints, extra_regs and number of counters. So do these things use the same event lists as SPR and TNT? No, there will be two new event lists on ADL. One is for Atom core, and the other is for big core. They are different to SPR and TNT. *sigh* how different? The core PMU event list should be similar between SPR and the big core of ADL, because they both utilize the Golden Cove core. But the uncore PMU event list is totally different for client and server. The Atom core of ADL utilizes the Gracemont core, which is a successor to Tremont. It introduces many new events. We cannot use the Tremont event list instead. My desktop has: cpu/caps/pmu_name and that gives "skylake", do we want the above to have cpu_core/caps/pmu_name give "sapphire_rapids" etc.. ? I think current implementation should be good enough. $ cat /sys/devices/cpu_atom/caps/pmu_name alderlake_hybrid "alderlake_hybrid" tells the perf tool that it's Alder Lake Hybrid system. "cpu_atom" tells the perf tool that it's for Atom core. Yeah, but then I have to ask Google wth those atoms and cores actually are. Why not tell me upfront? Since we're now working on it, we all know, but in 6 months time nobody will remember and then we'll constantly have to look it up and curse ourselves for not doing better. I think the "sapphire_rapids" is the code name for the server platform. Maybe we should use the code name of core? $ cat /sys/devices/cpu_atom/caps/pmu_name gracemont $ cat /sys/devices/cpu_core/caps/pmu_name golden_cove Thanks, Kan
Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support
On 3/11/2021 11:32 AM, Peter Zijlstra wrote: On Thu, Mar 11, 2021 at 05:09:25PM +0100, Peter Zijlstra wrote: On Wed, Mar 10, 2021 at 08:37:56AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang Alder Lake Hybrid system has two different types of core, Golden Cove core and Gracemont core. The Golden Cove core is registered to "cpu_core" PMU. The Gracemont core is registered to "cpu_atom" PMU. The difference between the two PMUs include: - Number of GP and fixed counters - Events - The "cpu_core" PMU supports Topdown metrics. The "cpu_atom" PMU supports PEBS-via-PT. The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without PMEM. The "cpu_atom" PMU is similar to Tremont, but with different event_constraints, extra_regs and number of counters. + /* Initialize big core specific PerfMon capabilities.*/ + pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX]; + pmu->name = "cpu_core"; + /* Initialize Atom core specific PerfMon capabilities.*/ + pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_ATOM_IDX]; + pmu->name = "cpu_atom"; So do these things use the same event lists as SPR and TNT? Is there any way to discover that, because AFAICT /proc/cpuinfo will say every CPU is 'Alderlake', and the above also doesn't give any clue. FWIW, ARM big.LITTLE does discriminate in its /proc/cpuinfo, but I'm not entirely sure it's really useful. Mark said perf userspace uses somethink akin to our CPUID, except exposed through sysfs, to find the event lists. My desktop has: cpu/caps/pmu_name and that gives "skylake", do we want the above to have cpu_core/caps/pmu_name give "sapphire_rapids" etc.. ? FWIW, "Tremont" is the only pmu_name with a capital :-( I don't suppose we can still fix that? The cpu/caps/pmu_name is not used for the event list. I don't think perf tool checks the "Tremont". I think it should be OK to fix it. Let me post a patch. Thanks, Kan
Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support
On 3/11/2021 11:53 AM, Liang, Kan wrote: On 3/11/2021 11:09 AM, Peter Zijlstra wrote: On Wed, Mar 10, 2021 at 08:37:56AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang Alder Lake Hybrid system has two different types of core, Golden Cove core and Gracemont core. The Golden Cove core is registered to "cpu_core" PMU. The Gracemont core is registered to "cpu_atom" PMU. The difference between the two PMUs include: - Number of GP and fixed counters - Events - The "cpu_core" PMU supports Topdown metrics. The "cpu_atom" PMU supports PEBS-via-PT. The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without PMEM. The "cpu_atom" PMU is similar to Tremont, but with different event_constraints, extra_regs and number of counters. + /* Initialize big core specific PerfMon capabilities.*/ + pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX]; + pmu->name = "cpu_core"; + /* Initialize Atom core specific PerfMon capabilities.*/ + pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_ATOM_IDX]; + pmu->name = "cpu_atom"; So do these things use the same event lists as SPR and TNT? No, there will be two new event lists on ADL. One is for Atom core, and the other is for big core. They are different to SPR and TNT. Is there any way to discover that, because AFAICT /proc/cpuinfo will say every CPU is 'Alderlake', and the above also doesn't give any clue. Ricardo once submitted a patch to expose the CPU type under /sys/devices/system/cpu, but I don't know the latest status. https://lore.kernel.org/lkml/20201003011745.7768-5-ricardo.neri-calde...@linux.intel.com/ FWIW, ARM big.LITTLE does discriminate in its /proc/cpuinfo, but I'm not entirely sure it's really useful. Mark said perf userspace uses somethink akin to our CPUID, except exposed through sysfs, to find the event lists. Ah, I guess I misunderstood the concern. Let me try again. Here is how perf tool find a event name via event list. To get the correct event list file, yes, perf tool relies on the CPUID. 
It will search a CPUID table in the tools/perf/pmu-events/arch/x86/mapfile.csv. GenuineIntel-6-97,v1,alderlake,core Now perf tool knows the event list file "alderlake" is for the CPUID 97. In the event list file for the Alder Lake (CPUID 0x97), we add a new field "Unit" to distinguish the type of PMU. "Unit": "cpu_core" "Unit": "cpu_atom" So perf can search the event name for a certain type of PMU via PMU name "cpu_core" or "cpu_atom". Perf tool doesn't use the "cpu_core/caps/pmu_name" for the event list. Thanks, Kan
Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support
On 3/11/2021 11:09 AM, Peter Zijlstra wrote: On Wed, Mar 10, 2021 at 08:37:56AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang Alder Lake Hybrid system has two different types of core, Golden Cove core and Gracemont core. The Golden Cove core is registered to "cpu_core" PMU. The Gracemont core is registered to "cpu_atom" PMU. The difference between the two PMUs include: - Number of GP and fixed counters - Events - The "cpu_core" PMU supports Topdown metrics. The "cpu_atom" PMU supports PEBS-via-PT. The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without PMEM. The "cpu_atom" PMU is similar to Tremont, but with different event_constraints, extra_regs and number of counters. + /* Initialize big core specific PerfMon capabilities.*/ + pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_CORE_IDX]; + pmu->name = "cpu_core"; + /* Initialize Atom core specific PerfMon capabilities.*/ + pmu = _pmu.hybrid_pmu[X86_HYBRID_PMU_ATOM_IDX]; + pmu->name = "cpu_atom"; So do these things use the same event lists as SPR and TNT? No, there will be two new event lists on ADL. One is for Atom core, and the other is for big core. They are different to SPR and TNT. Is there any way to discover that, because AFAICT /proc/cpuinfo will say every CPU is 'Alderlake', and the above also doesn't give any clue. Ricardo once submitted a patch to expose the CPU type under /sys/devices/system/cpu, but I don't know the latest status. https://lore.kernel.org/lkml/20201003011745.7768-5-ricardo.neri-calde...@linux.intel.com/ FWIW, ARM big.LITTLE does discriminate in its /proc/cpuinfo, but I'm not entirely sure it's really useful. Mark said perf userspace uses somethink akin to our CPUID, except exposed through sysfs, to find the event lists. Perf tool can use the pmu name. For perf stat -e cpu_atom/EVENT_NAME/, perf will apply the event list for atom. For perf stat -e cpu_core/EVENT_NAME/, perf will apply the event list for the big core. 
For perf stat -e EVENT_NAME, perf tool will check if the EVENT_NAME exists. If it's available on both event lists, perf will automatically create two events: perf stat -e cpu_atom/EVENT_NAME/,cpu_core/EVENT_NAME/. If the event name is only available on a certain type, e.g., atom, the perf tool will only apply the corresponding event, e.g., perf stat -e cpu_atom/EVENT_NAME/ My desktop has: cpu/caps/pmu_name and that gives "skylake", do we want the above to have cpu_core/caps/pmu_name give "sapphire_rapids" etc.. ? I think the current implementation should be good enough. $ cat /sys/devices/cpu_atom/caps/pmu_name alderlake_hybrid "alderlake_hybrid" tells the perf tool that it's an Alder Lake hybrid system. "cpu_atom" tells the perf tool that it's for the Atom core. Thanks, Kan
Re: [PATCH V2 16/25] perf/x86: Register hybrid PMUs
On 3/11/2021 7:34 AM, Peter Zijlstra wrote: On Wed, Mar 10, 2021 at 08:37:52AM -0800, kan.li...@linux.intel.com wrote: @@ -2092,9 +2105,37 @@ static int __init init_hw_perf_events(void) if (err) goto out1; - err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); - if (err) - goto out2; + if (!is_hybrid()) { + err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW); + if (err) + goto out2; + } else { + u8 cpu_type = get_hybrid_cpu_type(smp_processor_id()); + struct x86_hybrid_pmu *hybrid_pmu; + int i; + + for (i = 0; i < x86_pmu.num_hybrid_pmus; i++) { + hybrid_pmu = &x86_pmu.hybrid_pmu[i]; + + hybrid_pmu->pmu = pmu; + hybrid_pmu->pmu.type = -1; + hybrid_pmu->pmu.attr_update = x86_pmu.attr_update; + hybrid_pmu->pmu.capabilities |= PERF_PMU_CAP_HETEROGENEOUS_CPUS; + + /* Only register the PMU for the boot CPU */ Why?! AFAICT we could register them all here. That instantly fixes that CPU_STARTING / CPU_DEAD fail elsewhere in this patch. It's possible that all CPUs of a certain type are offline, but I cannot know that here, because the boot CPU is the only online CPU. I don't know the status of the other CPUs. If we unconditionally register all PMUs, users may see a PMU in /sys/devices, but they cannot use it, because there is no available CPU. Is it acceptable to register an empty PMU? Thanks, Kan
Re: [PATCH V2 08/25] perf/x86: Hybrid PMU support for hardware cache event
On 3/11/2021 6:07 AM, Peter Zijlstra wrote: On Wed, Mar 10, 2021 at 08:37:44AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang The hardware cache events are different among hybrid PMUs. Each hybrid PMU should have its own hw cache event table. Reviewed-by: Andi Kleen Signed-off-by: Kan Liang --- arch/x86/events/core.c | 11 +-- arch/x86/events/perf_event.h | 9 + 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 039a851..1db4a67 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -352,6 +352,7 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event) { struct perf_event_attr *attr = >attr; unsigned int cache_type, cache_op, cache_result; + struct x86_hybrid_pmu *pmu = is_hybrid() ? hybrid_pmu(event->pmu) : NULL; u64 config, val; config = attr->config; @@ -371,7 +372,10 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event) return -EINVAL; cache_result = array_index_nospec(cache_result, PERF_COUNT_HW_CACHE_RESULT_MAX); - val = hw_cache_event_ids[cache_type][cache_op][cache_result]; + if (pmu) + val = pmu->hw_cache_event_ids[cache_type][cache_op][cache_result]; + else + val = hw_cache_event_ids[cache_type][cache_op][cache_result]; if (val == 0) return -ENOENT; @@ -380,7 +384,10 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event *event) return -EINVAL; hwc->config |= val; - attr->config1 = hw_cache_extra_regs[cache_type][cache_op][cache_result]; + if (pmu) + attr->config1 = pmu->hw_cache_extra_regs[cache_type][cache_op][cache_result]; + else + attr->config1 = hw_cache_extra_regs[cache_type][cache_op][cache_result]; Why not: attr->config1 = hybrid(event->pmu, hw_cache_extra_regs)[cache_type][cache_op][cache_result]; ? Because hw_cache_extra_regs is not part of the struct x86_pmu. Thanks, Kan
Re: [PATCH V2 1/25] x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit
On 3/10/2021 5:25 PM, Ricardo Neri wrote: On Wed, Mar 10, 2021 at 09:01:47PM +0100, Borislav Petkov wrote: On Wed, Mar 10, 2021 at 11:46:44AM -0800, Ricardo Neri wrote: But this series provides the use case, right? Kan's patches handle PMU counters that may differ cross types of CPUs. In patch 2, get_hybrid_params() needs to check first if X86_FEATURE_HYBRID_CPU is enabled before querying the hybrid parameters. Otherwise, we would need to rely on the maximum level of CPUID, which may not be reliable. On Wed, Mar 10, 2021 at 11:33:54AM -0800, Srinivas Pandruvada wrote: We are working on changes to P-State driver for hybrid CPUs using this define. They are still work in progress. But this patch can be submitted later with our set of changes. Answering to both with a single mail: I don't have a problem with X86_FEATURE_HYBRID_CPU - I simply don't want to show "hybrid_cpu" in /proc/cpuinfo unless there's a valid use case for userspace to know that it is running on a hybrid CPU. Ah, I get your point now. You would like to see #define X86_FEATURE_HYBRID_CPU (18*32+15) /* "" This part has CPUs of more than one type */ Right? Now your first comment makes sense. Srinivas, Kan, I don't think we need to expose "hybrid_cpu" in /proc/cpuinfo, do we? Right, Perf doesn't use the "hybrid_cpu" in /proc/cpuinfo. Thanks, Kan
Re: [PATCH V2 16/25] perf/x86: Register hybrid PMUs
On 3/10/2021 11:50 AM, Dave Hansen wrote: On 3/10/21 8:37 AM, kan.li...@linux.intel.com wrote: - err = perf_pmu_register(, "cpu", PERF_TYPE_RAW); - if (err) - goto out2; + if (!is_hybrid()) { + err = perf_pmu_register(, "cpu", PERF_TYPE_RAW); + if (err) + goto out2; + } else { + u8 cpu_type = get_hybrid_cpu_type(smp_processor_id()); + struct x86_hybrid_pmu *hybrid_pmu; + int i; Where's the preempt_disable()? +static void init_hybrid_pmu(int cpu) +{ + unsigned int fixed_mask, unused_eax, unused_ebx, unused_edx; + struct cpu_hw_events *cpuc = _cpu(cpu_hw_events, cpu); + u8 cpu_type = get_hybrid_cpu_type(cpu); + struct x86_hybrid_pmu *pmu = NULL; + struct perf_cpu_context *cpuctx; + int i; Ditto. Are we really sure the IPIs are worth the trouble? Why don't we just cache the leaf when we bring the CPU up like just about every other thing we read from CPUID? Simple answer: You are right. We don't need it. A simple get_this_hybrid_cpu_type() should be fine for perf. Here is the complete story. I need the CPU type of the dead CPU in the cpu_dead. In the previous patch set, we can read it from the cached CPU type in the struct cpuinfo_x86. In the V2 patch, I tried to do a similar thing (but I have to read it at runtime). Since the CPU is offlined, I asked Ricardo to provide the function get_hybrid_cpu_type() which can read the CPU type of a specific CPU. But I'm wrong. We cannot retrieve the valid CPU type from an offlined CPU. So I dropped the method and used another method to retrieve the information, but I didn't let Ricardo update the function. My bad. Now, we only need to read the CPU type of the current CPU. A get_this_hybrid_cpu_type() function is enough for me. I think we can get rid of the IPIs trouble with the new get_this_hybrid_cpu_type() in the next version. We shouldn't need the preempt_disable() either. Thanks for pointing that out. Kan
Re: [PATCH 00/49] Add Alder Lake support for perf
On 3/5/2021 6:14 AM, Peter Zijlstra wrote: On Thu, Mar 04, 2021 at 06:50:00PM +0100, Peter Zijlstra wrote: On Thu, Mar 04, 2021 at 10:50:45AM -0500, Liang, Kan wrote: Hi Peter, Could you please take a look at the perf kernel patches (3-25)? By now, we have got some comments regarding the generic hybrid feature enumeration code and perf tool patches. I would appreciate it very much if you could comment on the perf kernel patches. Yeah, I started staring at it yesterday.. I'll try and respond tomorrow. OK, so STYLE IS HORRIBLE, please surrender your CAPS-LOCK key until further notice. The below is a completely unfinished 'diff' against 3-20. It has various notes interspersed. Please rework. Thanks for the detailed comments. I will rework the code accordingly. Thanks, Kan --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -44,11 +44,13 @@ #include "perf_event.h" +static struct pmu pmu; + struct x86_pmu x86_pmu __read_mostly; DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = { .enabled = 1, - .hybrid_pmu_idx = X86_NON_HYBRID_PMU, + .pmu = &pmu, }; DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key); @@ -184,7 +186,7 @@ static inline int get_possible_num_count { int bit, num_counters = 0; - if (!IS_X86_HYBRID) + if (!is_hybrid()) return x86_pmu.num_counters; for_each_set_bit(bit, &x86_pmu.hybrid_pmu_bitmap, X86_HYBRID_PMU_MAX_INDEX) @@ -270,6 +272,9 @@ bool check_hw_exists(int num_counters, i if (ret) goto msr_fail; for (i = 0; i < num_counters_fixed; i++) { + /* +* XXX comment that explains why/how NULL below works. +*/ if (fixed_counter_disabled(i, NULL)) continue; if (val & (0x03 << i*4)) { @@ -352,7 +357,6 @@ set_ext_hw_attr(struct hw_perf_event *hw { struct perf_event_attr *attr = &event->attr; unsigned int cache_type, cache_op, cache_result; - struct x86_hybrid_pmu *pmu = IS_X86_HYBRID ?
container_of(event->pmu, struct x86_hybrid_pmu, pmu) : NULL; u64 config, val; config = attr->config; @@ -372,10 +376,7 @@ set_ext_hw_attr(struct hw_perf_event *hw return -EINVAL; cache_result = array_index_nospec(cache_result, PERF_COUNT_HW_CACHE_RESULT_MAX); - if (pmu) - val = pmu->hw_cache_event_ids[cache_type][cache_op][cache_result]; - else - val = hw_cache_event_ids[cache_type][cache_op][cache_result]; + val = hybrid(event->pmu, hw_cache_event_ids)[cache_type][cache_op][cache_result]; if (val == 0) return -ENOENT; @@ -384,10 +385,7 @@ set_ext_hw_attr(struct hw_perf_event *hw return -EINVAL; hwc->config |= val; - if (pmu) - attr->config1 = pmu->hw_cache_extra_regs[cache_type][cache_op][cache_result]; - else - attr->config1 = hw_cache_extra_regs[cache_type][cache_op][cache_result]; + attr->config1 = hybrid(event->pmu, hw_cache_extra_regs)[cache_type][cache_op][cache_result]; return x86_pmu_extra_regs(val, event); } @@ -742,13 +740,11 @@ void x86_pmu_enable_all(int added) } } -static struct pmu pmu; - static inline int is_x86_event(struct perf_event *event) { int bit; - if (!IS_X86_HYBRID) + if (!is_hybrid()) return event->pmu == &pmu; for_each_set_bit(bit, &x86_pmu.hybrid_pmu_bitmap, X86_HYBRID_PMU_MAX_INDEX) { @@ -760,6 +756,7 @@ static inline int is_x86_event(struct pe struct pmu *x86_get_pmu(void) { + /* borken */ return &pmu; } /* @@ -963,7 +960,7 @@ EXPORT_SYMBOL_GPL(perf_assign_events); int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) { - int num_counters = X86_HYBRID_READ_FROM_CPUC(num_counters, cpuc); + int num_counters = hybrid(cpuc->pmu, num_counters); struct event_constraint *c; struct perf_event *e; int n0, i, wmin, wmax, unsched = 0; @@ -1124,7 +1121,7 @@ static void del_nr_metric_event(struct c static int collect_event(struct cpu_hw_events *cpuc, struct perf_event *event, int max_count, int n) { - union perf_capabilities intel_cap = X86_HYBRID_READ_FROM_CPUC(intel_cap, cpuc); + union perf_capabilities intel_cap = hybrid(cpuc->pmu, intel_cap); if (intel_cap.perf_metrics && add_nr_metric_event(cpuc, event)) return -EINVAL; @@ -1147,8 +1144,8 @@ static int collect_event(struct cpu_hw_e */ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader, bool dogrp) { - int num_counters = X86_HYBRID_READ_FROM_CPUC(num_counters, cpuc); - int num_counters_fixed = X86_HYBRID_REA
Re: [perf] perf_fuzzer causes unchecked MSR access error
On 3/3/2021 3:22 PM, Vince Weaver wrote: On Wed, 3 Mar 2021, Liang, Kan wrote: We never use bit 58. It should be a new issue. Actually, KVM uses it. They create a fake event called VLBR_EVENT, which uses bit 58. It was introduced in commit 097e4311cda9 ("perf/x86: Add constraint to create guest LBR event without hw counter"). Since it's a fake event, it doesn't support PEBS. Perf should reject such an event if precise_ip is set. The below patch should fix the MSR access error. diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 5bac48d..1ea3c67 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3659,6 +3659,10 @@ static int intel_pmu_hw_config(struct perf_event *event) return ret; if (event->attr.precise_ip) { + + if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT) + return -EINVAL; + if (!(event->attr.freq || (event->attr.wakeup_events && !event->attr.watermark))) { event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; if (!(event->attr.sample_type & Thanks, Kan
Re: [PATCH 00/49] Add Alder Lake support for perf
Hi Peter, Could you please take a look at the perf kernel patches (3-25)? By now, we have got some comments regarding the generic hybrid feature enumeration code and perf tool patches. I would appreciate it very much if you could comment on the perf kernel patches. Thanks, Kan On 2/8/2021 10:24 AM, kan.li...@linux.intel.com wrote: From: Kan Liang (The V1 patchset is a complete patchset for the Alder Lake support on Linux perf. It includes both kernel patches (1-25) and the user space patches (26-49). It tries to give the maintainers/reviewers an overall picture of the ADL enabling patches. The number of patches is huge. Sorry for that. For future versions, the patchset will be divided into the kernel patch series and the userspace patch series. They can be reviewed separately.) Alder Lake uses a hybrid architecture utilizing Golden Cove cores and Gracemont cores. On such architectures, all CPUs support the same, homogeneous and symmetric, instruction set. Also, CPUID enumerates the same features for all CPUs. There may be model-specific differences, such as those addressed in this patchset. The first two patches enumerate the hybrid CPU feature bit and save the CPU type in a new field x86_cpu_type in struct cpuinfo_x86 for the following patches. They were posted previously[1] but not merged. Compared with the initial submission, they address the below two concerns[2][3], - Provide a good use case, PMU. - Clarify what Intel Hybrid Technology is and is not. The PMU capabilities for the Golden Cove core and the Gracemont core are not the same. The key differences include the number of counters, events, the perf metrics feature, and the PEBS-via-PT feature. A dedicated hybrid PMU has to be registered for each of them. However, the current perf x86 code assumes that there is only one CPU PMU. To handle the hybrid PMUs, the patchset - Introduce a new struct x86_hybrid_pmu to save the unique capabilities from different PMUs. It's part of the global x86_pmu.
The architecture capabilities, which are available for all PMUs, are still saved in the global x86_pmu. I once considered dynamically creating dedicated x86_pmu and pmu structures for each hybrid PMU. If so, they would have to be changed to pointers. Since they are used everywhere, the changes could be huge and complex. Also, most of the PMU capabilities are the same between hybrid PMUs, so data in the big x86_pmu structure would be duplicated many times. So the dynamic way was dropped. - The hybrid PMU registration has been moved to cpu_starting(), because only the boot CPU is available when invoking init_hw_perf_events(). - Hybrid PMUs have different events and formats. Add new structures and helpers for the events attribute and format attribute which take the PMU type into account. - Add PMU-aware types, PERF_TYPE_HARDWARE_PMU and PERF_TYPE_HW_CACHE_PMU, to facilitate user-space tools. The uncore, MSR and cstate PMUs are the same between hybrid CPUs, so there is no need to register hybrid PMUs for them. The generic code kernel/events/core.c is not hybrid friendly either, especially for per-task monitoring. Peter once proposed a patchset[4], but it hasn't been merged. This patchset doesn't intend to improve the generic code (which can be improved later separately). It still uses the capability PERF_PMU_CAP_HETEROGENEOUS_CPUS for each hybrid PMU. For per-task and system-wide monitoring, user-space tools have to create events on all available hybrid PMUs. Events from different hybrid PMUs cannot be included in the same group. [1]. https://lore.kernel.org/lkml/20201002201931.2826-1-ricardo.neri-calde...@linux.intel.com/ [2]. https://lore.kernel.org/lkml/20201002203452.ge17...@zn.tnic/ [3]. https://lore.kernel.org/lkml/87r1qgccku@nanos.tec.linutronix.de/ [4].
https://lkml.kernel.org/r/20181010104559.go5...@hirez.programming.kicks-ass.net/ Jin Yao (24): perf jevents: Support unit value "cpu_core" and "cpu_atom" perf util: Save pmu name to struct perf_pmu_alias perf pmu: Save detected hybrid pmus to a global pmu list perf pmu: Add hybrid helper functions perf list: Support --cputype option to list hybrid pmu events perf stat: Hybrid evsel uses its own cpus perf header: Support HYBRID_TOPOLOGY feature perf header: Support hybrid CPU_PMU_CAPS tools headers uapi: Update tools's copy of linux/perf_event.h perf parse-events: Create two hybrid hardware events perf parse-events: Create two hybrid cache events perf parse-events: Support hardware events inside PMU perf list: Display pmu prefix for partially supported hybrid cache events perf parse-events: Support hybrid raw events perf stat: Support --cputype option for hybrid events perf stat: Support metrics with hybrid events perf evlist: Create two hybrid 'cycles' events by default perf stat: Add default hybrid events perf stat: Uniquify hybrid event name perf stat: Merge event counts from all hybrid PMUs perf
Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"
On 3/3/2021 1:59 PM, Peter Zijlstra wrote: On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.li...@linux.intel.com wrote: For some old CPUs (HSW and earlier), the PEBS status in a PEBS record may be mistakenly set to 0. To minimize the impact of the defect, the commit was introduced to try to avoid dropping the PEBS record in some cases. It adds a check in intel_pmu_drain_pebs_nhm() and updates the local pebs_status accordingly. However, it doesn't correct the PEBS status in the PEBS record, which may trigger the crash, especially for large PEBS. It's possible that all the PEBS records in a large PEBS have PEBS status 0. If so, the first get_next_pebs_record_by_bit() in __intel_pmu_pebs_event() returns NULL, i.e. at = NULL. Since it's a large PEBS, the 'count' parameter must be > 1. The second get_next_pebs_record_by_bit() will crash. Two solutions were considered to fix the crash. - Keep the SW workaround and add extra checks in get_next_pebs_record_by_bit() to work around the issue. get_next_pebs_record_by_bit() is a critical path. The extra checks would bring extra overhead for the latest CPUs which don't have the defect. Also, the defect can only be observed on some old CPUs (for example, the issue can be reproduced on an HSW client, but I didn't observe the issue on my Haswell server machine). The impact of the defect should be limited. This solution is dropped. - Drop the SW workaround and revert the commit. It seems that the commit never worked, because the PEBS status in the PEBS record is never changed. get_next_pebs_record_by_bit() only checks the PEBS status in the PEBS record. The record is dropped eventually. Reverting the commit should not change the current behavior. +++ b/arch/x86/events/intel/ds.c @@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d continue; } - /* -* On some CPUs the PEBS status can be zero when PEBS is -* racing with clearing of GLOBAL_STATUS.
-* -* Normally we would drop that record, but in the -* case when there is only a single active PEBS event -* we can assume it's for that event. -*/ - if (!pebs_status && cpuc->pebs_enabled && - !(cpuc->pebs_enabled & (cpuc->pebs_enabled-1))) - pebs_status = cpuc->pebs_enabled; Wouldn't something like: pebs_status = p->status = cpuc->pebs_enabled; actually fix things without adding overhead? I didn't consider it as a potential solution in this patch because I don't think it's proper for SW to modify the buffer, which is supposed to be manipulated by the HW. But that's just a personal preference. I don't see any issue here. We may try it. Vince, could you please help check whether Peter's suggestion fixes the crash? Thanks, Kan
Re: [perf] perf_fuzzer causes unchecked MSR access error
On 3/3/2021 2:28 PM, Stephane Eranian wrote: On Wed, Mar 3, 2021 at 10:16 AM Vince Weaver wrote: Hello on my Haswell machine the perf_fuzzer managed to trigger this message: [117248.075892] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0400) at rIP: 0x8106e4f4 (native_write_msr+0x4/0x20) [117248.089957] Call Trace: [117248.092685] intel_pmu_pebs_enable_all+0x31/0x40 [117248.097737] intel_pmu_enable_all+0xa/0x10 [117248.102210] __perf_event_task_sched_in+0x2df/0x2f0 [117248.107511] finish_task_switch.isra.0+0x15f/0x280 [117248.112765] schedule_tail+0xc/0x40 [117248.116562] ret_from_fork+0x8/0x30 that shouldn't be possible, should it? MSR 0x3f1 is MSR_IA32_PEBS_ENABLE Not possible, bit 58 is not defined in PEBS_ENABLE, AFAIK. this is on recent-git with the patch causing the pebs-related crash reverted. We never use bit 58. It should be a new issue. Is it repeatable? Thanks, Kan
Re: [PATCH] perf test: Test case 27 fails on s390 and non-x86 platforms
On 3/2/2021 12:08 PM, Thomas Richter wrote: On 3/2/21 4:23 PM, Liang, Kan wrote: On 3/2/2021 9:48 AM, Thomas Richter wrote: On 3/2/21 3:03 PM, Liang, Kan wrote: + Athira Rajeev On 3/2/2021 8:31 AM, Thomas Richter wrote: Executing perf test 27 fails on s390: [root@t35lp46 perf]# ./perf test -Fv 27 27: Sample parsing --- start --- end Sample parsing: FAILED! [root@t35lp46 perf]# The root cause is commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT") This commit introduced a test case for PERF_SAMPLE_WEIGHT_STRUCT but does not adjust non-x86 weak linkage functions. The error is in test__sample_parsing() --> do_test() Function do_test() defines two structures of type struct perf_sample named sample and sample_out. The first sets member sample.ins_lat = 117 Structure sample_out is constructed dynamically using functions perf_event__synthesize_sample() and evsel__parse_sample(). Both functions have an x86 specific function version which sets member ins_lat. The weak common functions do not set member ins_lat. I don't think Power supports the instruction latency. As a request from Athira Rajeev, I moved the PERF_SAMPLE_WEIGHT_STRUCT to the X86 specific codes. https://lore.kernel.org/lkml/d97fef4f-dd88-4760-885e-9a6161a9b...@linux.vnet.ibm.com/ https://lore.kernel.org/lkml/1612540912-6562-1-git-send-email-kan.li...@linux.intel.com/ I don't think we want to add the ins_lat back in the weak common functions. Could you please update the perf test and don't apply the PERF_SAMPLE_WEIGHT_STRUCT for the non-X86 platform? I used offical linux git tree [root@t35lp46 perf]# git tag | fgrep 5.12 v5.12-rc1 [root@t35lp46 perf]# So this change is in the pipe. I do not plan to revert individual patches. No, we shouldn't revert the patch. I mean can you fix the issue in perf test? Don't test ins_lat or PERF_SAMPLE_WEIGHT_STRUCT for a non-X86 platform. That would be very ugly code. 
We would end up in conditional compiles like #ifdef __s390x__ #endif, and other architectures like ARM/POWER etc. come along. This is something I want to avoid. The ins_lat is a model-specific variable. Maybe we should move it to the arch-specific test. And this fix only touches perf, not the kernel. The patch changes the behavior of PERF_SAMPLE_WEIGHT. The high 32 bits will be dropped, which could cause problems if they contain valid information. Later in function samples_same() both data in variable sample and sample_out are compared. The comparison fails because sample.ins_lat is 117 and samples_out.ins_lat is 0; the weak functions never set member ins_lat. Output after: [root@t35lp46 perf]# ./perf test -Fv 27 27: Sample parsing --- start --- end Sample parsing: Ok [root@t35lp46 perf]# Fixes: commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT") I think the regression should start from commit fbefe9c2f87f ("perf tools: Support arch specific PERF_SAMPLE_WEIGHT_STRUCT processing") Thanks, Kan Kan, I do not follow you. Your commit c7444297fd3769d10c7ffb52c81d71503b3e268f adds this line @@ -242,6 +245,7 @@ static int do_test(u64 sample_type, u64 sample_regs, u64 read_format) .cgroup = 114, .data_page_size = 115, .code_page_size = 116, + .ins_lat = 117, And this assignment 117 breaks the test. As mentioned before, member ins_lat is never touched by the weak functions. Here is the timeline for the patches. 1. The commit c7444297fd3769 and other SPR patches were merged on 2021-02-08. At that time, I don't think we had this issue; perf test should have worked well. Nope, that line above 'ins_lat = 117.' breaks the test. Comment it out and it works well!!! If you revert the commit fbefe9c2f87f, perf test should work well too. The root cause of the issue is that commit fbefe9c2f87f changed the ins_lat to a model-specific variable, but perf test still verifies the variable in the generic test.
The below patch moves the PERF_SAMPLE_WEIGHT test into a X86 specific test. Does it work for you? --- tools/perf/arch/x86/include/arch-tests.h | 1 + tools/perf/arch/x86/tests/Build| 1 + tools/perf/arch/x86/tests/arch-tests.c | 4 + tools/perf/arch/x86/tests/sample-parsing.c | 125 + tools/perf/tests/sample-parsing.c | 4 - 5 files changed, 131 insertions(+), 4 deletions(-) create mode 100644 tools/perf/arch/x86/tests/sample-parsing.c diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h index 6a54b94..0e20f3d 100644 --- a/tools/perf/arch/x86/include/arch-tests.h +++ b/tools/perf/arch/x86/include/arch-tests.h @@ -10,6 +10,7 @@ int test__rdpmc(struct test *test __maybe_unused, int subtest); int test__insn_x86(struct test *test __maybe_unused, int subtest)
Re: [PATCH] perf test: Test case 27 fails on s390 and non-x86 platforms
On 3/2/2021 9:48 AM, Thomas Richter wrote: On 3/2/21 3:03 PM, Liang, Kan wrote: + Athira Rajeev On 3/2/2021 8:31 AM, Thomas Richter wrote: Executing perf test 27 fails on s390: [root@t35lp46 perf]# ./perf test -Fv 27 27: Sample parsing --- start --- end Sample parsing: FAILED! [root@t35lp46 perf]# The root cause is commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT") This commit introduced a test case for PERF_SAMPLE_WEIGHT_STRUCT but does not adjust non-x86 weak linkage functions. The error is in test__sample_parsing() --> do_test() Function do_test() defines two structures of type struct perf_sample named sample and sample_out. The first sets member sample.ins_lat = 117 Structure sample_out is constructed dynamically using functions perf_event__synthesize_sample() and evsel__parse_sample(). Both functions have an x86 specific function version which sets member ins_lat. The weak common functions do not set member ins_lat. I don't think Power supports the instruction latency. As a request from Athira Rajeev, I moved the PERF_SAMPLE_WEIGHT_STRUCT to the X86 specific codes. https://lore.kernel.org/lkml/d97fef4f-dd88-4760-885e-9a6161a9b...@linux.vnet.ibm.com/ https://lore.kernel.org/lkml/1612540912-6562-1-git-send-email-kan.li...@linux.intel.com/ I don't think we want to add the ins_lat back in the weak common functions. Could you please update the perf test and don't apply the PERF_SAMPLE_WEIGHT_STRUCT for the non-X86 platform? I used offical linux git tree [root@t35lp46 perf]# git tag | fgrep 5.12 v5.12-rc1 [root@t35lp46 perf]# So this change is in the pipe. I do not plan to revert individual patches. No, we shouldn't revert the patch. I mean can you fix the issue in perf test? Don't test ins_lat or PERF_SAMPLE_WEIGHT_STRUCT for a non-X86 platform. Later in function samples_same() both data in variable sample and sample_out are compared. 
The comparison fails because sample.ins_lat is 117 and samples_out.ins_lat is 0; the weak functions never set member ins_lat. Output after: [root@t35lp46 perf]# ./perf test -Fv 27 27: Sample parsing --- start --- end Sample parsing: Ok [root@t35lp46 perf]# Fixes: commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT") I think the regression should start from commit fbefe9c2f87f ("perf tools: Support arch specific PERF_SAMPLE_WEIGHT_STRUCT processing") Thanks, Kan Kan, I do not follow you. Your commit c7444297fd3769d10c7ffb52c81d71503b3e268f adds this line @@ -242,6 +245,7 @@ static int do_test(u64 sample_type, u64 sample_regs, u64 read_format) .cgroup = 114, .data_page_size = 115, .code_page_size = 116, + .ins_lat = 117, And this assignment 117 breaks the test. As mentioned before, member ins_lat is never touched by the weak functions. Here is the timeline for the patches. 1. The commit c7444297fd3769 and other SPR patches were merged on 2021-02-08. At that time, I don't think we had this issue; perf test should have worked well. 2. Athira Rajeev told me that Power doesn't support instruction latency. So I submitted a patch which creates weak functions and moves the ins_lat handling into x86-specific code. 3. The patch (commit fbefe9c2f87f) was finally merged on 2021-02-18. The perf test failure should be observed starting from this point. To my understanding, "Fixes" should log the commit where the issue starts to be observed. If someone tries to backport the fix later, they have an idea where to put it. Thanks, Kan
Re: [PATCH] perf test: Test case 27 fails on s390 and non-x86 platforms
+ Athira Rajeev On 3/2/2021 8:31 AM, Thomas Richter wrote: Executing perf test 27 fails on s390: [root@t35lp46 perf]# ./perf test -Fv 27 27: Sample parsing --- start --- end Sample parsing: FAILED! [root@t35lp46 perf]# The root cause is commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT") This commit introduced a test case for PERF_SAMPLE_WEIGHT_STRUCT but does not adjust non-x86 weak linkage functions. The error is in test__sample_parsing() --> do_test() Function do_test() defines two structures of type struct perf_sample named sample and sample_out. The first sets member sample.ins_lat = 117 Structure sample_out is constructed dynamically using functions perf_event__synthesize_sample() and evsel__parse_sample(). Both functions have an x86 specific function version which sets member ins_lat. The weak common functions do not set member ins_lat. I don't think Power supports the instruction latency. As a request from Athira Rajeev, I moved the PERF_SAMPLE_WEIGHT_STRUCT to the X86 specific codes. https://lore.kernel.org/lkml/d97fef4f-dd88-4760-885e-9a6161a9b...@linux.vnet.ibm.com/ https://lore.kernel.org/lkml/1612540912-6562-1-git-send-email-kan.li...@linux.intel.com/ I don't think we want to add the ins_lat back in the weak common functions. Could you please update the perf test and don't apply the PERF_SAMPLE_WEIGHT_STRUCT for the non-X86 platform? Later in function samples_same() both data in variable sample and sample_out are compared. The comparison fails because sample.ins_lat is 117 and samples_out.ins_lat is 0, the weak functions never set member ins_lat. 
Output after: [root@t35lp46 perf]# ./perf test -Fv 27 27: Sample parsing --- start --- end Sample parsing: Ok [root@t35lp46 perf]# Fixes: commit c7444297fd3769 ("perf test: Support PERF_SAMPLE_WEIGHT_STRUCT") I think the regression should start from commit fbefe9c2f87f ("perf tools: Support arch specific PERF_SAMPLE_WEIGHT_STRUCT processing") Thanks, Kan Signed-off-by: Thomas Richter --- tools/perf/util/evsel.c | 8 +--- tools/perf/util/synthetic-events.c | 6 +- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index 1bf76864c4f2..c9efed5c173d 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -2106,10 +2106,12 @@ perf_event__check_size(union perf_event *event, unsigned int sample_size) } void __weak arch_perf_parse_sample_weight(struct perf_sample *data, - const __u64 *array, - u64 type __maybe_unused) + const __u64 *array, u64 type) { - data->weight = *array; + if (type & PERF_SAMPLE_WEIGHT) + data->weight = *array & 0xffffffff; + if (type & PERF_SAMPLE_WEIGHT_STRUCT) + data->ins_lat = (*array >> 32) & 0xffff; } int evsel__parse_sample(struct evsel *evsel, union perf_event *event, diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c index b698046ec2db..af7ecbc57cbe 100644 --- a/tools/perf/util/synthetic-events.c +++ b/tools/perf/util/synthetic-events.c @@ -1507,9 +1507,13 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type, } void __weak arch_perf_synthesize_sample_weight(const struct perf_sample *data, - __u64 *array, u64 type __maybe_unused) + __u64 *array, u64 type) { *array = data->weight; + if (type & PERF_SAMPLE_WEIGHT_STRUCT) { + *array &= 0xffffffff; + *array |= ((u64)data->ins_lat << 32); + } } int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_format,
Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()
On 2/11/2021 9:53 AM, Peter Zijlstra wrote: Kan, do you have time to look at this? On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote: On Thu, 28 Jan 2021, Vince Weaver wrote: the perf_fuzzer has turned up a repeatable crash on my haswell system. addr2line is not being very helpful, it points to DECLARE_PER_CPU_FIRST. I'll investigate more when I have the chance. so I poked around some more. This seems to be caused in __intel_pmu_pebs_event() get_next_pebs_record_by_bit() ds.c line 1639 get_pebs_status(at) ds.c line 1317 return ((struct pebs_record_nhm *)n)->status; where "n" has the value of 0xc0 rather than a proper pointer. I think I find the suspicious patch. The commt id 01330d7288e00 ("perf/x86: Allow zero PEBS status with only single active event") https://lore.kernel.org/lkml/tip-01330d7288e0050c5aaabc558059ff91589e6...@git.kernel.org/ The patch is an SW workaround for some old CPUs (HSW and earlier), which may set 0 to the PEBS status. It adds a check in the intel_pmu_drain_pebs_nhm(). It tries to minimize the impact of the defect by avoiding dropping the PEBS records which have PEBS status 0. But, it doesn't correct the PEBS status, which may bring problems, especially for the large PEBS. It's possible that all the PEBS records in a large PEBS have the PEBS status 0. If so, the first get_next_pebs_record_by_bit() in the __intel_pmu_pebs_event() returns NULL. The at = NULL. Since it's a large PEBS, the 'count' parameter must > 1. The second get_next_pebs_record_by_bit() will crash. Could you please revert the patch and check whether it fixes your issue? Thanks, Kan
Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()
On 2/11/2021 5:14 PM, Vince Weaver wrote: On Thu, 11 Feb 2021, Liang, Kan wrote: On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote: I'd like to reproduce it on my machine. Is this issue only found in a Haswell client machine? To reproduce the issue, can I use ./perf_fuzzer under perf_event_tests/fuzzer? Do I need to apply any parameters with ./perf_fuzzer? Usually how long does it take to reproduce the issue? On my machine if I run the commands echo 1 > /proc/sys/kernel/nmi_watchdog echo 0 > /proc/sys/kernel/perf_event_paranoid echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate ./perf_fuzzer -s 3 -r 1611784483 it is repeatable within a minute, but because of the nature of the fuzzer it probably won't work for you because the random events will diverge based on the different configs of the system. I can try to generate a simple reproducer, I've just been extremely busy here at work and haven't had the chance. If you want to try to reproduce it the hard way, run the ./fast_repro99.sh script in the perf_fuzzer directory. It will start fuzzing. My machine turned up the issue within a day or so. Sorry for the late response. Just want to let you know I'm still trying to reproduce the issue. I only have a Haswell server on hand. I run the ./fast_repro99.sh on the machine for 3 days, but I didn't observe any crash. Now, I'm looking for a HSW client in my lab. I will check if I can reproduce it on a client machine. If you have a simple reproducer, please let me know. Thanks, Kan
Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()
On 2/11/2021 9:53 AM, Peter Zijlstra wrote: Kan, do you have time to look at this? On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote: On Thu, 28 Jan 2021, Vince Weaver wrote: the perf_fuzzer has turned up a repeatable crash on my haswell system. addr2line is not being very helpful, it points to DECLARE_PER_CPU_FIRST. I'll investigate more when I have the chance. so I poked around some more. This seems to be caused in __intel_pmu_pebs_event() get_next_pebs_record_by_bit() ds.c line 1639 get_pebs_status(at) ds.c line 1317 return ((struct pebs_record_nhm *)n)->status; where "n" has the value of 0xc0 rather than a proper pointer. The issue is pretty strange. I haven't found the root cause yet. The base->status (aka "n") was just used in the intel_pmu_drain_pebs_nhm(). for (at = base; at < top; at += x86_pmu.pebs_record_size) { struct pebs_record_nhm *p = at; u64 pebs_status; pebs_status = p->status & cpuc->pebs_enabled; pebs_status &= mask; Then it seems to be modified to 0xc0, and crash at get_pebs_status() based on your investigation. The "base" is a local variable. The ds->pebs_buffer_base is assigned to "base" at the beginning of the function. After that, no one change it. this does seem to be repetable, I'd like to reproduce it on my machine. Is this issue only found in a Haswell client machine? To reproduce the issue, can I use ./perf_fuzzer under perf_event_tests/fuzzer? Do I need to apply any parameters with ./perf_fuzzer? Usually how long does it take to reproduce the issue? Thanks, Kan but fairly deep in a fuzzing run so I don't have a quick reproducer. 
>> Vince [96289.009646] BUG: kernel NULL pointer dereference, address: 0150 [96289.017094] #PF: supervisor read access in kernel mode [96289.022588] #PF: error_code(0x) - not-present page [96289.028069] PGD 0 P4D 0 [96289.030796] Oops: [#1] SMP PTI [96289.034549] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 5.11.0-rc5+ #151 [96289.043059] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014 [96289.050946] RIP: 0010:intel_pmu_drain_pebs_nhm+0x464/0x5f0 [96289.056817] Code: 09 00 00 0f b6 c0 49 39 c4 74 2a 48 63 82 78 09 00 00 48 01 c5 48 39 6c 24 08 76 17 0f b6 05 14 70 3f 01 83 e0 0f 3c 03 77 a4 <48> 8b 85 90 00 00 00 eb 9f 31 ed 83 eb 01 83 fb 01 0f 85 30 ff ff [96289.076876] RSP: :822039e0 EFLAGS: 00010097 [96289.082468] RAX: 0002 RBX: 0155 RCX: 0008 [96289.090095] RDX: 88811ac118a0 RSI: 82203980 RDI: 82203980 [96289.097746] RBP: 00c0 R08: R09: [96289.105376] R10: R11: R12: 0001 [96289.113008] R13: 82203bc0 R14: 88801c3cf800 R15: 829814a0 [96289.120671] FS: () GS:88811ac0() knlGS: [96289.129346] CS: 0010 DS: ES: CR0: 80050033 [96289.135526] CR2: 0150 CR3: 0220c003 CR4: 001706f0 [96289.143159] DR0: DR1: DR2: [96289.150803] DR3: DR6: 0ff0 DR7: 0600 [96289.158414] Call Trace: [96289.161041] ? update_blocked_averages+0x532/0x620 [96289.166152] ? update_group_capacity+0x25/0x1d0 [96289.171025] ? cpumask_next_and+0x19/0x20 [96289.175339] ? update_sd_lb_stats.constprop.0+0x702/0x820 [96289.181105] intel_pmu_drain_pebs_buffer+0x33/0x50 [96289.186259] ? x86_pmu_commit_txn+0xbc/0xf0 [96289.190749] ? _raw_spin_lock_irqsave+0x1d/0x30 [96289.195603] ? timerqueue_add+0x64/0xb0 [96289.199720] ? update_load_avg+0x6c/0x5e0 [96289.204001] ? enqueue_task_fair+0x98/0x5a0 [96289.208464] ? timerqueue_del+0x1e/0x40 [96289.212556] ? 
uncore_msr_read_counter+0x10/0x20 [96289.217513] intel_pmu_pebs_disable+0x12a/0x130 [96289.222324] x86_pmu_stop+0x48/0xa0 [96289.226076] x86_pmu_del+0x40/0x160 [96289.229813] event_sched_out.isra.0+0x81/0x1e0 [96289.234602] group_sched_out.part.0+0x4f/0xc0 [96289.239257] __perf_event_disable+0xef/0x1d0 [96289.243831] event_function+0x8c/0xd0 [96289.247785] remote_function+0x3e/0x50 [96289.251797] flush_smp_call_function_queue+0x11b/0x1a0 [96289.257268] flush_smp_call_function_from_idle+0x38/0x60 [96289.262944] do_idle+0x15f/0x240 [96289.266421] cpu_startup_entry+0x19/0x20 [96289.270639] start_kernel+0x7df/0x804 [96289.274558] ? apply_microcode_early.cold+0xc/0x27 [96289.279678] secondary_startup_64_no_verify+0xb0/0xbb [96289.285078] Modules linked in: nf_tables libcrc32c nfnetlink intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi x86_pkg_temp_thermal ledtrig_audio intel_powerclamp snd_hda_intel coretemp snd_intel_dspcfg snd_hda_codec snd_hda_core kvm_intel kvm
Re: [PATCH 00/49] Add Alder Lake support for perf
On 2/11/2021 6:40 AM, Jiri Olsa wrote: On Mon, Feb 08, 2021 at 07:24:57AM -0800, kan.li...@linux.intel.com wrote: SNIP Jin Yao (24): perf jevents: Support unit value "cpu_core" and "cpu_atom" perf util: Save pmu name to struct perf_pmu_alias perf pmu: Save detected hybrid pmus to a global pmu list perf pmu: Add hybrid helper functions perf list: Support --cputype option to list hybrid pmu events perf stat: Hybrid evsel uses its own cpus perf header: Support HYBRID_TOPOLOGY feature perf header: Support hybrid CPU_PMU_CAPS tools headers uapi: Update tools's copy of linux/perf_event.h perf parse-events: Create two hybrid hardware events perf parse-events: Create two hybrid cache events perf parse-events: Support hardware events inside PMU perf list: Display pmu prefix for partially supported hybrid cache events perf parse-events: Support hybrid raw events perf stat: Support --cputype option for hybrid events perf stat: Support metrics with hybrid events perf evlist: Create two hybrid 'cycles' events by default perf stat: Add default hybrid events perf stat: Uniquify hybrid event name perf stat: Merge event counts from all hybrid PMUs perf stat: Filter out unmatched aggregation for hybrid event perf evlist: Warn as events from different hybrid PMUs in a group perf Documentation: Document intel-hybrid support perf evsel: Adjust hybrid event and global event mixed group Kan Liang (22): perf/x86/intel: Hybrid PMU support for perf capabilities perf/x86: Hybrid PMU support for intel_ctrl perf/x86: Hybrid PMU support for counters perf/x86: Hybrid PMU support for unconstrained perf/x86: Hybrid PMU support for hardware cache event perf/x86: Hybrid PMU support for event constraints perf/x86: Hybrid PMU support for extra_regs perf/x86/intel: Factor out intel_pmu_check_num_counters perf/x86/intel: Factor out intel_pmu_check_event_constraints perf/x86/intel: Factor out intel_pmu_check_extra_regs perf/x86: Expose check_hw_exists perf/x86: Remove temporary pmu assignment in 
event_init perf/x86: Factor out x86_pmu_show_pmu_cap perf/x86: Register hybrid PMUs perf/x86: Add structures for the attributes of Hybrid PMUs perf/x86/intel: Add attr_update for Hybrid PMUs perf/x86: Support filter_match callback perf/x86/intel: Add Alder Lake Hybrid support perf: Introduce PERF_TYPE_HARDWARE_PMU and PERF_TYPE_HW_CACHE_PMU perf/x86/intel/uncore: Add Alder Lake support perf/x86/msr: Add Alder Lake CPU support perf/x86/cstate: Add Alder Lake CPU support Ricardo Neri (2): x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit x86/cpu: Describe hybrid CPUs in cpuinfo_x86 Zhang Rui (1): perf/x86/rapl: Add support for Intel Alder Lake hi, would you have a git branch with all this somewhere? Here is the git branch: https://github.com/kliang2/perf.git adl_enabling Please note that the branch is on top of Peter's perf/core branch, which doesn't include the latest perf tool changes. The perf tool patches in the branch only include the critical changes. There will be more tool patches later, e.g., patches for perf mem, perf test, etc. Thanks, Kan
Re: [PATCH 23/49] perf/x86/msr: Add Alder Lake CPU support
On 2/8/2021 10:58 PM, kernel test robot wrote: Hi, Thank you for the patch! Yet something to improve: [auto build test ERROR on tip/perf/core] [cannot apply to tip/master linus/master tip/x86/core v5.11-rc6 next-20210125] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/kan-liang-linux-intel-com/Add-Alder-Lake-support-for-perf/20210209-070642 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 32451614da2a9cf4296f90d3606ac77814fb519d config: x86_64-randconfig-s021-20210209 (attached as .config) compiler: gcc-9 (Debian 9.3.0-15) 9.3.0 reproduce: # apt-get install sparse # sparse version: v0.6.3-215-g0fb77bb6-dirty # https://github.com/0day-ci/linux/commit/ef3d3e5028f5f70a78fa37d642e8e7e65c60dee7 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review kan-liang-linux-intel-com/Add-Alder-Lake-support-for-perf/20210209-070642 git checkout ef3d3e5028f5f70a78fa37d642e8e7e65c60dee7 # save the attached .config to linux build tree make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All errors (new ones prefixed by >>): arch/x86/events/msr.c: In function 'test_intel': arch/x86/events/msr.c:104:7: error: 'INTEL_FAM6_ALDERLAKE_L' undeclared (first use in this function); did you mean 'INTEL_FAM6_ALDERLAKE'? 104 | case INTEL_FAM6_ALDERLAKE_L: | ^~ | INTEL_FAM6_ALDERLAKE arch/x86/events/msr.c:104:7: note: each undeclared identifier is reported only once for each function it appears in The patchset is on top of PeterZ's perf/core branch plus commit id 6e1239c13953 ("x86/cpu: Add another Alder Lake CPU to the Intel family") The above patch is also missed in the tip/perf/core branch. 
All the issues should be gone once tip/perf/core syncs with tip/x86/urgent. Thanks, Kan
Re: [PATCH 02/49] x86/cpu: Describe hybrid CPUs in cpuinfo_x86
On 2/8/2021 12:56 PM, Borislav Petkov wrote: On Mon, Feb 08, 2021 at 07:24:59AM -0800, kan.li...@linux.intel.com wrote: diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index c20a52b..1f25ac9 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -139,6 +139,16 @@ struct cpuinfo_x86 { u32 microcode; /* Address space bits used by the cache internally */ u8 x86_cache_bits; + /* +* In hybrid processors, there is a CPU type and a native model ID. The +* CPU type (x86_cpu_type[31:24]) describes the type of micro- +* architecture families. The native model ID (x86_cpu_type[23:0]) +* describes a specific microarchitecture version. Combining both +* allows to uniquely identify a CPU. +* +* Please note that the native model ID is not related to x86_model. +*/ + u32 x86_cpu_type; Why are you adding it here instead of simply using X86_FEATURE_HYBRID_CPU at the call site? How many uses in this patchset? /me searches... Exactly one. Just query X86_FEATURE_HYBRID_CPU at the call site and read what you need from CPUID and use it there - no need for this. I think it's good enough for perf, but I'm not sure whether other codes need the CPU type information. Ricardo, do you know? Maybe we should implement a generic function as below for this? (Not test. Just use as an example.) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index a66c1fd..679f5fe 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2056,3 +2056,11 @@ void arch_smt_update(void) /* Check whether IPI broadcasting can be enabled */ apic_smt_update(); } + +u32 x86_read_hybrid_type(void) +{ + if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU)) + return cpuid_eax(0x001a); + + return 0; +} Thanks, Kan
Re: [PATCH 6/9] perf report: Support instruction latency
On 2/6/2021 3:09 AM, Namhyung Kim wrote: On Fri, Feb 5, 2021 at 11:38 PM Liang, Kan wrote: On 2/5/2021 6:08 AM, Namhyung Kim wrote: On Wed, Feb 3, 2021 at 5:14 AM wrote: From: Kan Liang The instruction latency information can be recorded on some platforms, e.g., the Intel Sapphire Rapids server. With both memory latency (weight) and the new instruction latency information, users can easily locate the expensive load instructions, and also understand the time spent in different stages. The users can optimize their applications in different pipeline stages. The 'weight' field is shared among different architectures. Reusing the 'weight' field may impact other architectures. Add a new field to store the instruction latency. Like the 'weight' support, introduce an 'ins_lat' for the global instruction latency, and a 'local_ins_lat' for the local instruction latency version. Could you please clarify the difference between the global latency and the local latency? The global means the total latency. The local means the average latency, i.e. total / number of samples. Thanks for the explanation, but I think it's confusing. Why not call it just total_latency and avg_latency? The instruction latency field is an extension of the weight field, so I followed the same naming convention. I still think we should keep the naming consistent. To address the confusion, I think we can update the documentation for both the weight and the instruction latency fields. How about the patch below? From d5e80f541cb7288b24a7c5661ae5faede4747807 Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Mon, 8 Feb 2021 05:27:03 -0800 Subject: [PATCH] perf documentation: Add comments to the local/global weight related fields The current 'local' and 'global' prefixes are confusing for the weight-related fields, e.g., weight, instruction latency. Add comments to clarify. 'global' means the total weight/instruction latency sum.
'local' means average weight/instruction latency per sample Signed-off-by: Kan Liang --- tools/perf/Documentation/perf-report.txt | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt index f546b5e..acc1c1d 100644 --- a/tools/perf/Documentation/perf-report.txt +++ b/tools/perf/Documentation/perf-report.txt @@ -92,8 +92,9 @@ OPTIONS - srcfile: file name of the source file of the samples. Requires dwarf information. - weight: Event specific weight, e.g. memory latency or transaction - abort cost. This is the global weight. - - local_weight: Local weight version of the weight above. + abort cost. This is the global weight (total weight sum). + - local_weight: Local weight (average weight per sample) version of the + weight above. - cgroup_id: ID derived from cgroup namespace device and inode numbers. - cgroup: cgroup pathname in the cgroupfs. - transaction: Transaction abort flags. @@ -110,8 +111,9 @@ OPTIONS --time-quantum (default 100ms). Specify with overhead and before it. - code_page_size: the code page size of sampled code address (ip) - ins_lat: Instruction latency in core cycles. This is the global instruction - latency - - local_ins_lat: Local instruction latency version + latency (total instruction latency sum) + - local_ins_lat: Local instruction latency (average instruction latency per + sample) version By default, comm, dso and symbol keys are used. (i.e. --sort comm,dso,symbol) -- 2.7.4 Thanks, Kan
Re: [PATCH 2/9] perf tools: Support the auxiliary event
On 2/5/2021 10:26 AM, Arnaldo Carvalho de Melo wrote: Em Fri, Feb 05, 2021 at 09:13:34AM -0500, Liang, Kan escreveu: On 2/5/2021 5:52 AM, Namhyung Kim wrote: On Wed, Feb 3, 2021 at 5:14 AM wrote: From: Kan Liang On the Intel Sapphire Rapids server, an auxiliary event has to be enabled simultaneously with the load latency event to retrieve complete Memory Info. Add X86 specific perf_mem_events__name() to handle the auxiliary event. - Users are only interested in the samples of the mem-loads event. Sample read the auxiliary event. - The auxiliary event must be in front of the load latency event in a group. Assume the second event to sample if the auxiliary event is the leader. - Add a weak is_mem_loads_aux_event() to check the auxiliary event for X86. For other ARCHs, it always return false. Parse the unique event name, mem-loads-aux, for the auxiliary event. Signed-off-by: Kan Liang --- tools/perf/arch/x86/util/Build| 1 + tools/perf/arch/x86/util/mem-events.c | 44 +++ tools/perf/util/evsel.c | 3 +++ tools/perf/util/mem-events.c | 5 tools/perf/util/mem-events.h | 2 ++ tools/perf/util/parse-events.l| 1 + tools/perf/util/record.c | 5 +++- 7 files changed, 60 insertions(+), 1 deletion(-) create mode 100644 tools/perf/arch/x86/util/mem-events.c diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build index 347c39b..d73f548 100644 --- a/tools/perf/arch/x86/util/Build +++ b/tools/perf/arch/x86/util/Build @@ -6,6 +6,7 @@ perf-y += perf_regs.o perf-y += topdown.o perf-y += machine.o perf-y += event.o +perf-y += mem-events.o perf-$(CONFIG_DWARF) += dwarf-regs.o perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c new file mode 100644 index 000..11b8469 --- /dev/null +++ b/tools/perf/arch/x86/util/mem-events.c @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "util/pmu.h" +#include "map_symbol.h" +#include "mem-events.h" + +static char mem_loads_name[100]; 
+static bool mem_loads_name__init; + +#define MEM_LOADS_AUX 0x8203 +#define MEM_LOADS_AUX_NAME "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S" + +bool is_mem_loads_aux_event(struct evsel *leader) +{ + if (!pmu_have_event("cpu", "mem-loads-aux")) + return false; + + return leader->core.attr.config == MEM_LOADS_AUX; +} + +char *perf_mem_events__name(int i) +{ + struct perf_mem_event *e = perf_mem_events__ptr(i); + + if (!e) + return NULL; + + if (i == PERF_MEM_EVENTS__LOAD) { + if (mem_loads_name__init) + return mem_loads_name; + + mem_loads_name__init = true; + + if (pmu_have_event("cpu", "mem-loads-aux")) { + scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME), + MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat); It changes "%u" to an actual latency value, right? What if the value takes 3 or more digits? I'm not sure scnprintf() will handle it properly. Yes, you are right. We should use the sizeof(mem_loads_name) as below. I will submit a patch to fix it. diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c index 11b8469..588110f 100644 --- a/tools/perf/arch/x86/util/mem-events.c +++ b/tools/perf/arch/x86/util/mem-events.c @@ -31,7 +31,7 @@ char *perf_mem_events__name(int i) mem_loads_name__init = true; if (pmu_have_event("cpu", "mem-loads-aux")) { - scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME), + scnprintf(mem_loads_name, sizeof(mem_loads_name), MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat); } else { scnprintf(mem_loads_name, sizeof(mem_loads_name), I'll fold this in the relevant cset. Thanks! Kan
Re: [PATCH 2/9] perf tools: Support the auxiliary event
On 2/5/2021 5:52 AM, Namhyung Kim wrote: On Wed, Feb 3, 2021 at 5:14 AM wrote: From: Kan Liang On the Intel Sapphire Rapids server, an auxiliary event has to be enabled simultaneously with the load latency event to retrieve complete Memory Info. Add X86 specific perf_mem_events__name() to handle the auxiliary event. - Users are only interested in the samples of the mem-loads event. Sample read the auxiliary event. - The auxiliary event must be in front of the load latency event in a group. Assume the second event to sample if the auxiliary event is the leader. - Add a weak is_mem_loads_aux_event() to check the auxiliary event for X86. For other ARCHs, it always return false. Parse the unique event name, mem-loads-aux, for the auxiliary event. Signed-off-by: Kan Liang --- tools/perf/arch/x86/util/Build| 1 + tools/perf/arch/x86/util/mem-events.c | 44 +++ tools/perf/util/evsel.c | 3 +++ tools/perf/util/mem-events.c | 5 tools/perf/util/mem-events.h | 2 ++ tools/perf/util/parse-events.l| 1 + tools/perf/util/record.c | 5 +++- 7 files changed, 60 insertions(+), 1 deletion(-) create mode 100644 tools/perf/arch/x86/util/mem-events.c diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build index 347c39b..d73f548 100644 --- a/tools/perf/arch/x86/util/Build +++ b/tools/perf/arch/x86/util/Build @@ -6,6 +6,7 @@ perf-y += perf_regs.o perf-y += topdown.o perf-y += machine.o perf-y += event.o +perf-y += mem-events.o perf-$(CONFIG_DWARF) += dwarf-regs.o perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c new file mode 100644 index 000..11b8469 --- /dev/null +++ b/tools/perf/arch/x86/util/mem-events.c @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "util/pmu.h" +#include "map_symbol.h" +#include "mem-events.h" + +static char mem_loads_name[100]; +static bool mem_loads_name__init; + +#define MEM_LOADS_AUX 0x8203 +#define MEM_LOADS_AUX_NAME 
"{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S" + +bool is_mem_loads_aux_event(struct evsel *leader) +{ + if (!pmu_have_event("cpu", "mem-loads-aux")) + return false; + + return leader->core.attr.config == MEM_LOADS_AUX; +} + +char *perf_mem_events__name(int i) +{ + struct perf_mem_event *e = perf_mem_events__ptr(i); + + if (!e) + return NULL; + + if (i == PERF_MEM_EVENTS__LOAD) { + if (mem_loads_name__init) + return mem_loads_name; + + mem_loads_name__init = true; + + if (pmu_have_event("cpu", "mem-loads-aux")) { + scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME), + MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat); It changes "%u" to an actual latency value, right? What if the value takes 3 or more digits? I'm not sure scnprintf() will handle it properly. Yes, you are right. We should use the sizeof(mem_loads_name) as below. I will submit a patch to fix it. diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c index 11b8469..588110f 100644 --- a/tools/perf/arch/x86/util/mem-events.c +++ b/tools/perf/arch/x86/util/mem-events.c @@ -31,7 +31,7 @@ char *perf_mem_events__name(int i) mem_loads_name__init = true; if (pmu_have_event("cpu", "mem-loads-aux")) { - scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME), + scnprintf(mem_loads_name, sizeof(mem_loads_name), MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat); } else { scnprintf(mem_loads_name, sizeof(mem_loads_name), Thanks, Kan
Re: [PATCH 6/9] perf report: Support instruction latency
On 2/5/2021 7:55 AM, Athira Rajeev wrote: Because on other archs, the var2_w of 'perf_sample_weight' could be used to capture something other than the Local INSTR Latency. Can we have some weak function to populate the header string? I agree that var2_w has different meanings among architectures. We should not force it to data->ins_lat. The patch below should fix it. Does it work for you? My point about a weak function was actually for the arch-specific header string. But I guess we should not force it to data->ins_lat. Yes, I don't think PowerPC should force var2_w to data->ins_lat. I think you can create your own field, as you mentioned. I checked the patch below defining an 'arch_perf_parse_sample_weight' for powerpc and it works. But one observation is that, for cases where the kernel supports PERF_SAMPLE_WEIGHT_STRUCT but lacks arch-specific support for 'arch_perf_parse_sample_weight', it will report 'Local Weight' incorrectly, since the weak function treats the value as 64-bit. Not sure if that is a valid case to consider though. Currently, PERF_SAMPLE_WEIGHT_STRUCT is only enabled on X86 by default. https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.li...@linux.intel.com/ For PowerPC, PERF_SAMPLE_WEIGHT is still the default setting. There is no way to set PERF_SAMPLE_WEIGHT_STRUCT via the perf tool. I don't think the above case will happen. Thanks, Kan
Re: [PATCH 3/9] perf tools: Support data block and addr block
On 2/5/2021 6:02 AM, Namhyung Kim wrote: On Wed, Feb 3, 2021 at 5:14 AM wrote: From: Kan Liang Two new data source fields, to indicate the block reasons of a load instruction, are introduced on the Intel Sapphire Rapids server. The fields can be used by the memory profiling. Add a new sort function, SORT_MEM_BLOCKED, for the two fields. For the previous platforms or the block reason is unknown, print "N/A" for the block reason. Add blocked as a default mem sort key for perf report and perf mem report. Signed-off-by: Kan Liang --- [SNIP] +int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info) +{ + size_t l = 0; + u64 mask = PERF_MEM_BLK_NA; + + sz -= 1; /* -1 for null termination */ + out[0] = '\0'; + + if (mem_info) + mask = mem_info->data_src.mem_blk; + + if (!mask || (mask & PERF_MEM_BLK_NA)) { + l += scnprintf(out + l, sz - l, " N/A"); + return l; + } + if (mask & PERF_MEM_BLK_DATA) + l += scnprintf(out + l, sz - l, " Data"); + if (mask & PERF_MEM_BLK_ADDR) + l += scnprintf(out + l, sz - l, " Addr"); So this means it's possible to have BLK_DATA and BLK_ADDR together and in that case it'll print "DataAddr", right? Yes, it's possible. If so, it will print "Data Addr". Thanks, Kan + + return l; +} + int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_info) { int i = 0; @@ -348,6 +371,8 @@ int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_in i += perf_mem__tlb_scnprintf(out + i, sz - i, mem_info); i += scnprintf(out + i, sz - i, "|LCK "); i += perf_mem__lck_scnprintf(out + i, sz - i, mem_info); + i += scnprintf(out + i, sz - i, "|BLK "); + i += perf_mem__blk_scnprintf(out + i, sz - i, mem_info); return i; }
Re: [PATCH 6/9] perf report: Support instruction latency
On 2/5/2021 6:08 AM, Namhyung Kim wrote: On Wed, Feb 3, 2021 at 5:14 AM wrote: From: Kan Liang The instruction latency information can be recorded on some platforms, e.g., the Intel Sapphire Rapids server. With both memory latency (weight) and the new instruction latency information, users can easily locate the expensive load instructions, and also understand the time spent in different stages. The users can optimize their applications in different pipeline stages. The 'weight' field is shared among different architectures. Reusing the 'weight' field may impact other architectures. Add a new field to store the instruction latency. Like the 'weight' support, introduce an 'ins_lat' for the global instruction latency, and a 'local_ins_lat' for the local instruction latency version. Could you please clarify the difference between the global latency and the local latency? The global means the total latency. The local means the average latency, i.e. total / number of samples. Thanks, Kan
Re: [PATCH V3 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT
On 2/5/2021 5:43 AM, Namhyung Kim wrote: On Fri, Feb 5, 2021 at 12:24 AM Liang, Kan wrote: On 2/4/2021 9:00 AM, Namhyung Kim wrote: Hi Kan, On Sat, Jan 30, 2021 at 2:25 AM Liang, Kan wrote: [SNIP] diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index b15e344..c50718a 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -145,12 +145,14 @@ enum perf_event_sample_format { PERF_SAMPLE_CGROUP = 1U << 21, PERF_SAMPLE_DATA_PAGE_SIZE = 1U << 22, PERF_SAMPLE_CODE_PAGE_SIZE = 1U << 23, + PERF_SAMPLE_WEIGHT_STRUCT = 1U << 24, - PERF_SAMPLE_MAX = 1U << 24, /* non-ABI */ + PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */ __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */ }; +#define PERF_SAMPLE_WEIGHT_TYPE(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT) I'm not sure you want to expose it in the uapi header as it's not intended to be used IMHO. I'm not sure I understood, but it's indeed used in the tool patch set. https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.li...@linux.intel.com/ Well, it's not a big deal.. but I just worried if some users might do event.attr.sample_type = PERF_SAMPLE_WEIGHT_TYPE; Yes, it's possible. For this case, Kernel returns -EINVAL. Actually, even without the macro, users may still set PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT manually. We cannot control what a user sets. I don't think it's a problem as long as the kernel can handle it properly. Thanks, Kan
Re: [PATCH V3 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT
On 2/4/2021 9:00 AM, Namhyung Kim wrote: Hi Kan, On Sat, Jan 30, 2021 at 2:25 AM Liang, Kan wrote: [SNIP] diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index b15e344..c50718a 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -145,12 +145,14 @@ enum perf_event_sample_format { PERF_SAMPLE_CGROUP = 1U << 21, PERF_SAMPLE_DATA_PAGE_SIZE = 1U << 22, PERF_SAMPLE_CODE_PAGE_SIZE = 1U << 23, + PERF_SAMPLE_WEIGHT_STRUCT = 1U << 24, - PERF_SAMPLE_MAX = 1U << 24, /* non-ABI */ + PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */ __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */ }; +#define PERF_SAMPLE_WEIGHT_TYPE(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT) I'm not sure you want to expose it in the uapi header as it's not intended to be used IMHO. I'm not sure I understood, but it's indeed used in the tool patch set. https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.li...@linux.intel.com/ Thanks, Kan
Re: [PATCH 6/9] perf report: Support instruction latency
On 2/4/2021 8:11 AM, Athira Rajeev wrote: On 03-Feb-2021, at 1:39 AM, kan.li...@linux.intel.com wrote: From: Kan Liang The instruction latency information can be recorded on some platforms, e.g., the Intel Sapphire Rapids server. With both memory latency (weight) and the new instruction latency information, users can easily locate the expensive load instructions, and also understand the time spent in different stages. The users can optimize their applications in different pipeline stages. The 'weight' field is shared among different architectures. Reusing the 'weight' field may impacts other architectures. Add a new field to store the instruction latency. Like the 'weight' support, introduce a 'ins_lat' for the global instruction latency, and a 'local_ins_lat' for the local instruction latency version. Add new sort functions, INSTR Latency and Local INSTR Latency, accordingly. Add local_ins_lat to the default_mem_sort_order[]. Signed-off-by: Kan Liang --- tools/perf/Documentation/perf-report.txt | 6 +++- tools/perf/util/event.h | 1 + tools/perf/util/evsel.c | 4 ++- tools/perf/util/hist.c | 12 ++-- tools/perf/util/hist.h | 2 ++ tools/perf/util/intel-pt.c | 5 ++-- tools/perf/util/session.c| 8 -- tools/perf/util/sort.c | 47 +++- tools/perf/util/sort.h | 3 ++ tools/perf/util/synthetic-events.c | 4 ++- 10 files changed, 81 insertions(+), 11 deletions(-) diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt index 826b5a9..0565b7c 100644 --- a/tools/perf/Documentation/perf-report.txt +++ b/tools/perf/Documentation/perf-report.txt @@ -108,6 +108,9 @@ OPTIONS - period: Raw number of event count of sample - time: Separate the samples by time stamp with the resolution specified by --time-quantum (default 100ms). Specify with overhead and before it. + - ins_lat: Instruction latency in core cycles. 
This is the global + instruction latency + - local_ins_lat: Local instruction latency version By default, comm, dso and symbol keys are used. (i.e. --sort comm,dso,symbol) @@ -154,7 +157,8 @@ OPTIONS - blocked: reason of blocked load access for the data at the time of the sample And the default sort keys are changed to local_weight, mem, sym, dso, - symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'. + symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat, + see '--mem-mode'. If the data file has tracepoint event(s), following (dynamic) sort keys are also available: diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h index ff403ea..5d50a49 100644 --- a/tools/perf/util/event.h +++ b/tools/perf/util/event.h @@ -141,6 +141,7 @@ struct perf_sample { u16 insn_len; u8 cpumode; u16 misc; + u16 ins_lat; bool no_hw_idx; /* No hw_idx collected in branch_stack */ char insn[MAX_INSN]; void *raw_data; diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index 0a2a307..24c0b59 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -2337,8 +2337,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event, weight.full = *array; if (type & PERF_SAMPLE_WEIGHT) data->weight = weight.full; - else + else { data->weight = weight.var1_dw; + data->ins_lat = weight.var2_w; + } array++; } diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c index 6866ab0..8a432f8 100644 --- a/tools/perf/util/hist.c +++ b/tools/perf/util/hist.c @@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h) hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12); hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12); hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10); + hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13); + hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13); if (symbol_conf.nanosecs) hists__new_col_len(hists, HISTC_TIME, 16); else @@ -286,12 +288,13 @@ static long hist_time(unsigned long 
htime) } static void he_stat__add_period(struct he_stat *he_stat, u64 period, - u64 weight) + u64 weight, u64 ins_lat) { he_stat->period += period; he_stat->weight += weight; he_stat->nr_events += 1; + he_stat->ins_lat += ins_lat; } static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src) @@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat
Re: [PATCH 2/9] perf tools: Support the auxiliary event
On 2/3/2021 3:02 PM, Arnaldo Carvalho de Melo wrote: Em Tue, Feb 02, 2021 at 12:09:06PM -0800, kan.li...@linux.intel.com escreveu: From: Kan Liang diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index c26ea822..c48f6de 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -2689,6 +2689,9 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target, if (perf_missing_features.aux_output) return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel."); break; + case ENODATA: + return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. " + "Please add an auxiliary event in front of the load latency event."); Are you sure this is the only case where ENODATA comes out from perf_event_open()? Well, according to your comment in 61b985e3e775a3a7 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids"), it should be at that point in time, so it's safe to merge as-is. But then I think this is fragile: what if someone else, in the future, not knowing that ENODATA is supposed to be used only with that ancient CPU, Sapphire Rapids, uses it? :-) Please consider adding a check before assuming ENODATA is for this specific case. Sure, I will add a check in V2. Thanks, Kan
Re: [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT
On 2/3/2021 3:31 PM, Arnaldo Carvalho de Melo wrote: --- a/tools/perf/util/perf_event_attr_fprintf.c +++ b/tools/perf/util/perf_event_attr_fprintf.c @@ -35,7 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value) bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER), bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC), bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX), - bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), + bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), bit_name(WEIGHT_STRUCT), I have CODE_PAGE_SIZE in my perf/core branch, was this somehow removed? https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core=c1de7f3d84ca324c7cda85c3ce27b11741af2124 I see, you did this patchkit on top of upstream, that has just DATA_PAGE_SIZE, while my perf/core branch has CODE_PAGE_SIZE. I'm adjusting it, please next time do tooling development on acme/perf/core. Sorry, I will rebase the patchset on acme/perf/core. Thanks, Kan
Re: [PATCH V3 0/5] perf core PMU support for Sapphire Rapids (Kernel)
On 2/1/2021 9:30 AM, Peter Zijlstra wrote: I made the below changes, does that work? Yes, it works well. Thanks, Kan --- --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3627,6 +3627,15 @@ static int core_pmu_hw_config(struct per return intel_pmu_bts_config(event); } +#define INTEL_TD_METRIC_AVAILABLE_MAX (INTEL_TD_METRIC_RETIRING + \ +((x86_pmu.num_topdown_events - 1) << 8)) + +static bool is_available_metric_event(struct perf_event *event) +{ + return is_metric_event(event) && + event->attr.config <= INTEL_TD_METRIC_AVAILABLE_MAX; +} + static inline bool is_mem_loads_event(struct perf_event *event) { return (event->attr.config & INTEL_ARCH_EVENT_MASK) == X86_CONFIG(.event=0xcd, .umask=0x01); @@ -3711,7 +3720,7 @@ static int intel_pmu_hw_config(struct pe if (event->attr.config & X86_ALL_EVENT_FLAGS) return -EINVAL; - if (is_metric_event(event)) { + if (is_available_metric_event(event)) { struct perf_event *leader = event->group_leader; /* The metric events don't support sampling. 
*/ --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -87,7 +87,14 @@ static inline bool is_topdown_count(stru return event->hw.flags & PERF_X86_EVENT_TOPDOWN; } -static inline bool is_metric_event(struct perf_event *event); +static inline bool is_metric_event(struct perf_event *event) +{ + u64 config = event->attr.config; + + return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) && + ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) && + ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_MAX); +} static inline bool is_slots_event(struct perf_event *event) { @@ -901,18 +908,6 @@ static struct perf_pmu_events_ht_attr ev struct pmu *x86_get_pmu(void); extern struct x86_pmu x86_pmu __read_mostly; -#define INTEL_TD_METRIC_AVAILABLE_MAX (INTEL_TD_METRIC_RETIRING + \ -((x86_pmu.num_topdown_events - 1) << 8)) - -static inline bool is_metric_event(struct perf_event *event) -{ - u64 config = event->attr.config; - - return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) && - ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) && - ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_AVAILABLE_MAX); -} - static __always_inline struct x86_perf_task_context_opt *task_context_opt(void *ctx) { if (static_cpu_has(X86_FEATURE_ARCH_LBR)) --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -284,10 +284,12 @@ struct x86_pmu_capability { #define INTEL_TD_METRIC_BAD_SPEC 0x8100 /* Bad speculation metric */ #define INTEL_TD_METRIC_FE_BOUND 0x8200 /* FE bound metric */ #define INTEL_TD_METRIC_BE_BOUND 0x8300 /* BE bound metric */ -#define INTEL_TD_METRIC_HEAVY_OPS 0x8400 /* Heavy Operations metric */ -#define INTEL_TD_METRIC_BR_MISPREDICT 0x8500 /* Branch Mispredict metric */ -#define INTEL_TD_METRIC_FETCH_LAT 0x8600 /* Fetch Latency metric */ -#define INTEL_TD_METRIC_MEM_BOUND 0x8700 /* Memory bound metric */ +/* Level 2 metrics */ +#define INTEL_TD_METRIC_HEAVY_OPS 0x8400 /* Heavy Operations metric */ +#define 
INTEL_TD_METRIC_BR_MISPREDICT 0x8500 /* Branch Mispredict metric */ +#define INTEL_TD_METRIC_FETCH_LAT 0x8600 /* Fetch Latency metric */ +#define INTEL_TD_METRIC_MEM_BOUND 0x8700 /* Memory bound metric */ + #define INTEL_TD_METRIC_MAX INTEL_TD_METRIC_MEM_BOUND #define INTEL_TD_METRIC_NUM 8 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -908,7 +908,6 @@ enum perf_event_type { * u32 var1_dw; * } && PERF_SAMPLE_WEIGHT_STRUCT * #endif -* * } * } * { u64 data_src; } && PERF_SAMPLE_DATA_SRC @@ -1276,29 +1275,23 @@ struct perf_branch_entry { reserved:40; }; -#if defined(__LITTLE_ENDIAN_BITFIELD) union perf_sample_weight { __u64 full; +#if defined(__LITTLE_ENDIAN_BITFIELD) struct { __u32 var1_dw; __u16 var2_w; __u16 var3_w; }; -}; - #elif defined(__BIG_ENDIAN_BITFIELD) - -union perf_sample_weight { - __u64 full; struct { __u16 var3_w; __u16 var2_w; __u32 var1_dw; }; -}; - #else #error "Unknown endianness" #endif +}; #endif /* _UAPI_LINUX_PERF_EVENT_H */
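The union as finally merged above shares one u64 between the legacy PERF_SAMPLE_WEIGHT view (`full`) and the split PERF_SAMPLE_WEIGHT_STRUCT view. A minimal user-space sketch, assuming a little-endian host so that `var1_dw` maps to the low 32 bits (the big-endian variant simply reverses the member order, as in the hunk above):

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian layout of the uapi union. */
union perf_sample_weight {
	uint64_t full;
	struct {
		uint32_t var1_dw;
		uint16_t var2_w;
		uint16_t var3_w;
	};
};

/*
 * Pack three weight factors the way a consumer would read them back:
 * var1_dw carries the primary latency, var2_w/var3_w the extra factors.
 */
static union perf_sample_weight pack_weight(uint32_t var1, uint16_t var2,
					    uint16_t var3)
{
	union perf_sample_weight w;

	w.var1_dw = var1;
	w.var2_w = var2;
	w.var3_w = var3;
	return w;
}
```

Either view of the same 8 bytes can be consumed, which is what lets old PERF_SAMPLE_WEIGHT users keep working unchanged.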
Re: [PATCH V3 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT
On 1/28/2021 5:40 PM, kan.li...@linux.intel.com wrote: From: Kan Liang Current PERF_SAMPLE_WEIGHT sample type is very useful to expresses the cost of an action represented by the sample. This allows the profiler to scale the samples to be more informative to the programmer. It could also help to locate a hotspot, e.g., when profiling by memory latencies, the expensive load appear higher up in the histograms. But current PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This could be a problem, if users want two or more factors to contribute to the weight. For example, Golden Cove core PMU can provide both the instruction latency and the cache Latency information as factors for the memory profiling. For current X86 platforms, although meminfo::latency is defined as a u64, only the lower 32 bits include the valid data in practice (No memory access could last than 4G cycles). The higher 32 bits can be used to store new factors. Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new sample weight structure. It shares the same space as the PERF_SAMPLE_WEIGHT sample type. Users can apply either the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but they cannot apply both sample types simultaneously. Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type. - For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT sample type. There is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT sample type. PowerPC can re-struct the weight field similarly later. - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now. The following patches will apply the new factors for the PERF_SAMPLE_WEIGHT_STRUCT sample type. The field in the union perf_sample_weight should be shared among different architectures. A generic name is required, but it's hard to abstract a name that applies to all architectures. 
For example, on X86, the fields are to store all kinds of latency. While on PowerPC, it stores MMCRA[TECX/TECM], which should not be latency. So a general name prefix 'var$NUM' is used here. Suggested-by: Peter Zijlstra (Intel) Signed-off-by: Kan Liang --- arch/powerpc/perf/core-book3s.c | 2 +- arch/x86/events/intel/ds.c | 17 +++--- include/linux/perf_event.h | 4 ++-- include/uapi/linux/perf_event.h | 49 +++-- kernel/events/core.c| 11 + 5 files changed, 66 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 28206b1..869d999 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -2195,7 +2195,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val, if (event->attr.sample_type & PERF_SAMPLE_WEIGHT && ppmu->get_mem_weight) - ppmu->get_mem_weight(); + ppmu->get_mem_weight(); if (perf_event_overflow(event, , regs)) power_pmu_stop(event, 0); diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 67dbc91..2f54b1f 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -960,7 +960,8 @@ static void adaptive_pebs_record_size_update(void) } #define PERF_PEBS_MEMINFO_TYPE (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC | \ - PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT | \ + PERF_SAMPLE_PHYS_ADDR | \ + PERF_SAMPLE_WEIGHT_TYPE |\ PERF_SAMPLE_TRANSACTION |\ PERF_SAMPLE_DATA_PAGE_SIZE) @@ -987,7 +988,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event) gprs = (sample_type & PERF_SAMPLE_REGS_INTR) && (attr->sample_regs_intr & PEBS_GP_REGS); - tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT) && + tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT_TYPE) && ((attr->config & INTEL_ARCH_EVENT_MASK) == x86_pmu.rtm_abort_event); @@ -1369,8 +1370,8 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event, /* * Use latency for weight (only avail with PEBS-LL) */ - if (fll && (sample_type & PERF_SAMPLE_WEIGHT)) - 
data->weight = pebs->lat; + if (fll && (sample_type & PERF_SAMPLE_WEIGHT_TYPE)) + data->weight.full = pebs->lat; /* * data.data_src encodes the data source @@ -1462,8 +1463,8 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event, if (x86_pmu.intel_cap.pebs_format >= 2) { /* Only set the TSX weight when no memory weight. */ -
Re: [PATCH V2 1/5] perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT
On 1/27/2021 2:03 PM, Peter Zijlstra wrote: On Wed, Jan 27, 2021 at 07:38:41AM -0800, kan.li...@linux.intel.com wrote: diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index b15e344..13b4019 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -145,12 +145,14 @@ enum perf_event_sample_format { PERF_SAMPLE_CGROUP = 1U << 21, PERF_SAMPLE_DATA_PAGE_SIZE = 1U << 22, PERF_SAMPLE_CODE_PAGE_SIZE = 1U << 23, + PERF_SAMPLE_WEIGHT_STRUCT = 1U << 24, - PERF_SAMPLE_MAX = 1U << 24, /* non-ABI */ + PERF_SAMPLE_MAX = 1U << 25, /* non-ABI */ __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */ }; +#define PERF_SAMPLE_WEIGHT_TYPE (PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT) /* * values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set * @@ -890,7 +892,16 @@ enum perf_event_type { *char data[size]; *u64 dyn_size; } && PERF_SAMPLE_STACK_USER * -* { u64 weight; } && PERF_SAMPLE_WEIGHT +* { union perf_sample_weight +* { +* u64 full; && PERF_SAMPLE_WEIGHT +* struct { +* u32 low_dword; +* u16 high_word; +* u16 higher_word; +* } && PERF_SAMPLE_WEIGHT_STRUCT +* } +* } * { u64 data_src; } && PERF_SAMPLE_DATA_SRC * { u64 transaction; } && PERF_SAMPLE_TRANSACTION * { u64 abi; # enum perf_sample_regs_abi @@ -1248,4 +1259,13 @@ struct perf_branch_entry { reserved:40; }; +union perf_sample_weight { + __u64 full; + struct { + __u32 low_dword; + __u16 high_word; + __u16 higher_word; + }; +}; *urgh*, my naming lives ... anybody got a better suggestion? I think we need a generic name here, but the problem is that the 'weight' field has different meanings among architectures. The 'weight' fields are to store all kinds of latency on X86. On PowerPC, it stores MMCRA[TECX/TECM], which doesn't look like a latency. I don't think I can use the name, 'cache_lat' or 'instruction_lat', here. Right? If so, how about 'var'? u32 var_1_dw; u16 var_2_w; u16 var_3_w; Also, do we want to care about byte order? Sure. 
I will add the big-endian and little-endian support. Thanks, Kan
Re: [PATCH V2 3/5] perf/x86/intel: Filter unsupported Topdown metrics event
On 1/27/2021 2:13 PM, Peter Zijlstra wrote: On Wed, Jan 27, 2021 at 07:38:43AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang Current perf doesn't check the index of a Topdown metrics event before updating the event. A perf tool user may get a value from an unsupported Topdown metrics event. For example, the L2 topdown event, cpu/event=0x00,umask=0x84/, is not supported on Ice Lake. A perf tool user may mistakenly use the unsupported events via RAW format. In this case, the scheduler follows the unknown event handling and assigns a GP counter to the event. The icl_get_topdown_value() returns the value of the slots to the event. The perf tool user will get the value for the unsupported Topdown metrics event. Add a check in the __icl_update_topdown_event() and avoid updating unsupported events. I was struggling trying to understand how we end up here. Because userspace can add whatever crap it wants, and why is only this thing a problem.. But the actual problem is the next patch changing INTEL_TD_METRIC_NUM, which then means is_metric_idx() will change, and that means that for ICL we'll accept these raw events as metric events on creation and then at runtime we get into trouble. This isn't spelled out. I do think this is entirely the wrong fix for that though. You're now adding cycles to the relative hot path, instead of fixing the problem at event creation, which is the slow path. Why can't we either refuse the event on ICL or otherwise wreck it on construction to avoid getting into trouble here? To reject the unsupported topdown events, I think perf needs to know the number of the supported topdown events. Maybe we can add a variable num_topdown_events in x86_pmu as below. Is it OK? diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 3d6fdcf..c7f2602 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1061,6 +1061,18 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) return unsched ? 
-EINVAL : 0; } +#define INTEL_TD_METRIC_AVAILABLE_MAX (INTEL_TD_METRIC_RETIRING + \ +((x86_pmu.num_topdown_events - 1) << 8)) + +inline bool is_metric_event(struct perf_event *event) +{ + u64 config = event->attr.config; + + return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) && + ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) && + ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_AVAILABLE_MAX); +} + static int add_nr_metric_event(struct cpu_hw_events *cpuc, struct perf_event *event) { diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index cd4d542..eab1eba 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -5769,6 +5769,7 @@ __init int intel_pmu_init(void) x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xc9, .umask=0x04); x86_pmu.lbr_pt_coexist = true; intel_pmu_pebs_data_source_skl(pmem); + x86_pmu.num_topdown_events = 4; x86_pmu.update_topdown_event = icl_update_topdown_event; x86_pmu.set_topdown_event_period = icl_set_topdown_event_period; pr_cont("Icelake events, "); diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 15977d0..8b05893 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -87,14 +87,7 @@ static inline bool is_topdown_count(struct perf_event *event) return event->hw.flags & PERF_X86_EVENT_TOPDOWN; } -static inline bool is_metric_event(struct perf_event *event) -{ - u64 config = event->attr.config; - - return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) && - ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) && - ((config & INTEL_ARCH_EVENT_MASK) <= INTEL_TD_METRIC_MAX); -} +inline bool is_metric_event(struct perf_event *event); static inline bool is_slots_event(struct perf_event *event) { @@ -782,6 +775,7 @@ struct x86_pmu { /* * Intel perf metrics */ + int num_topdown_events; u64 (*update_topdown_event)(struct perf_event *event); int (*set_topdown_event_period)(struct perf_event *event); Thanks, Kan
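The proposed check can be exercised in isolation. The constants below mirror the x86 perf headers, and `num_topdown_events = 4` matches what the hunk above sets for Ice Lake, so the accepted metric range tops out at 0x8300 (backend bound) and the L2 event 0x8400 from the commit message is rejected:

```c
#include <assert.h>
#include <stdint.h>

#define ARCH_PERFMON_EVENTSEL_EVENT	0x000000FFULL
#define INTEL_ARCH_EVENT_MASK		0x0000FFFFULL	/* umask | event */
#define INTEL_TD_METRIC_RETIRING	0x8000

static int num_topdown_events = 4;	/* Ice Lake: 4 level-1 metrics */

static uint64_t td_metric_available_max(void)
{
	/* INTEL_TD_METRIC_AVAILABLE_MAX from the patch above */
	return INTEL_TD_METRIC_RETIRING + ((num_topdown_events - 1) << 8);
}

static int is_metric_event(uint64_t config)
{
	return ((config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
	       ((config & INTEL_ARCH_EVENT_MASK) >= INTEL_TD_METRIC_RETIRING) &&
	       ((config & INTEL_ARCH_EVENT_MASK) <= td_metric_available_max());
}
```

With this shape the filtering happens at event creation (the slow path), which is the point Peter raised.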
Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
On 1/26/2021 10:37 AM, Peter Zijlstra wrote: On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.li...@linux.intel.com wrote: @@ -1577,9 +1668,20 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event, } if (format_size & PEBS_DATACFG_MEMINFO) { + if (sample_type & PERF_SAMPLE_WEIGHT) { + u64 weight = meminfo->latency; + + if (x86_pmu.flags & PMU_FL_INSTR_LATENCY) + weight >>= PEBS_CACHE_LATENCY_OFFSET; + data->weight = weight & PEBS_LATENCY_MASK ?: intel_get_tsx_weight(meminfo->tsx_tuning); + } + + if (sample_type & PERF_SAMPLE_WEIGHT_EXT) { + data->weight_ext.val = 0; + if (x86_pmu.flags & PMU_FL_INSTR_LATENCY) + data->weight_ext.instr_latency = meminfo->latency & PEBS_LATENCY_MASK; + } if (sample_type & PERF_SAMPLE_DATA_SRC) data->data_src.val = get_data_src(event, meminfo->aux); Talk to me about that SAMPLE_WEIGHT stuff I'm not liking it. Sure you want multiple dimensions, but urgh. Also, afaict, as proposed you're wasting 80/128 bits. That is, all data you want to export fits in a single u64 and yet you're using two, which is mighty daft. Sure, pebs::lat / pebs_meminfo::latency is defined as a u64, but you can't tell me that that is ever actually more than 4G cycles. Even the TSX block latency is u32. So how about defining SAMPLE_WEIGHT_STRUCT which uses the exact same data as SAMPLE_WEIGHT but unions it with a struct. I'm not sure if we want: union sample_weight { u64 weight; struct { u32 low_dword; u32 high_dword; }; /* or */ struct { u32 low_dword; u16 high_word; u16 higher_word; }; }; Then have the core code enforce SAMPLE_WEIGHT ^ SAMPLE_WEIGHT_STRUCT and make the existing code never set the high dword. So the kernel will only accept either SAMPLE_WEIGHT type or SAMPLE_WEIGHT_STRUCT type. It should error out if both types are set, right? I will check if u32 is enough for meminfo::latency on the previous platforms. Thanks, Kan
Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
On 1/26/2021 9:43 AM, Peter Zijlstra wrote: On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.li...@linux.intel.com wrote: @@ -2319,6 +2474,17 @@ static void __icl_update_topdown_event(struct perf_event *event, { u64 delta, last = 0; + /* +* Although the unsupported topdown events are not exposed to users, +* users may mistakenly use the unsupported events via RAW format. +* For example, using L2 topdown event, cpu/event=0x00,umask=0x84/, +* on Ice Lake. In this case, the scheduler follows the unknown +* event handling and assigns a GP counter to the event. +* Check the case, and avoid updating unsupported events. +*/ + if (event->hw.idx < INTEL_PMC_IDX_FIXED) + return; + delta = icl_get_topdown_value(event, slots, metrics); if (last_slots) last = icl_get_topdown_value(event, last_slots, last_metrics); Is this a separate patch? I will move it to a separate patch. Thanks, Kan
Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
On 1/26/2021 9:44 AM, Peter Zijlstra wrote: On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.li...@linux.intel.com wrote: @@ -3671,6 +3853,31 @@ static int intel_pmu_hw_config(struct perf_event *event) } } + /* +* To retrieve complete Memory Info of the load latency event, an +* auxiliary event has to be enabled simultaneously. Add a check for +* the load latency event. +* +* In a group, the auxiliary event must be in front of the load latency +* event. The rule is to simplify the implementation of the check. +* That's because perf cannot have a complete group at the moment. +*/ + if (x86_pmu.flags & PMU_FL_MEM_LOADS_AUX && + (event->attr.sample_type & PERF_SAMPLE_DATA_SRC) && + is_mem_loads_event(event)) { + struct perf_event *leader = event->group_leader; + struct perf_event *sibling = NULL; + + if (!is_mem_loads_aux_event(leader)) { + for_each_sibling_event(sibling, leader) { + if (is_mem_loads_aux_event(sibling)) + break; + } + if (list_entry_is_head(sibling, >sibling_list, sibling_list)) + return -ENODATA; + } + } + if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY)) return 0; I have vague memories of this getting mentioned in a call at some point. Pretend I don't know anything and tell me more. Adding the auxiliary event is for the new data source fields, data block & address block. If perf only samples the load latency event, the value of the data block & address block fields in a sample is not correct. To get the correct value, we have to sample both the auxiliary event and the load latency together on SPR. So I add the check in the kernel. I also modify the perf mem in the perf tool accordingly. Thanks, Kan
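The group walk in the hunk above reduces to one rule: accept the load latency event only if the auxiliary event is the group leader or one of its siblings (otherwise return -ENODATA). A reduced stand-alone sketch of that rule, with events modeled as plain flags (1 = mem-loads-aux) rather than real perf_event structs, and `events[0]` standing for the leader:

```c
#include <assert.h>

/* Sketch: 1 = mem-loads-aux event, 0 = anything else; events[0] is the leader. */
static int group_ok(const int *events, int n)
{
	int i;

	if (events[0])			/* leader is the auxiliary event */
		return 1;
	for (i = 1; i < n; i++)		/* ... or some sibling is */
		if (events[i])
			return 1;
	return 0;			/* kernel would return -ENODATA */
}
```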
Re: [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT
On 1/26/2021 9:42 AM, Peter Zijlstra wrote: On Tue, Jan 19, 2021 at 12:38:20PM -0800, kan.li...@linux.intel.com wrote: @@ -900,6 +901,13 @@ enum perf_event_type { *char data[size]; } && PERF_SAMPLE_AUX * { u64 data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE * { u64 code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE +* { union { +* u64 weight_ext; +* struct { +* u64 instr_latency:16, +* reserved:48; +* }; +* } && PERF_SAMPLE_WEIGHT_EXT * }; */ PERF_RECORD_SAMPLE = 9, @@ -1248,4 +1256,12 @@ struct perf_branch_entry { reserved:40; }; +union perf_weight_ext { + __u64 val; + struct { + __u64 instr_latency:16, + reserved:48; + }; +}; + #endif /* _UAPI_LINUX_PERF_EVENT_H */ diff --git a/kernel/events/core.c b/kernel/events/core.c index 55d1879..9363d12 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1903,6 +1903,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type) if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE) size += sizeof(data->code_page_size); + if (sample_type & PERF_SAMPLE_WEIGHT_EXT) + size += sizeof(data->weight_ext); + event->header_size = size; } @@ -6952,6 +6955,9 @@ void perf_output_sample(struct perf_output_handle *handle, perf_aux_sample_output(event, handle, data); } + if (sample_type & PERF_SAMPLE_WEIGHT_EXT) + perf_output_put(handle, data->weight_ext); + if (!event->attr.watermark) { int wakeup_events = event->attr.wakeup_events; This patch is broken and will expose uninitialized kernel stack. Could we initialize the 'weight_ext' in perf_sample_data_init()? I understand that we prefer not to set the field in perf_sample_data_init() to minimize the cachelines touched. However, the perf_sample_data_init() should be the most proper place to do the initialization. Also, the 'weight' is already initialized in it. As an extension, I think the 'weight_ext' should be initialized in it as well. In the perf_prepare_sample(), I think we can only clear the unused fields. The [0:15] bits may still leak the data. Thanks, Kan
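A reduced stand-alone sketch of the perf_sample_data_init() change being argued for, so that no uninitialized stack bits can reach user space via the new field. The struct here is cut down to a few fields for illustration and is not the real kernel layout:

```c
#include <assert.h>
#include <stdint.h>

union perf_weight_ext {
	uint64_t val;
	struct {
		uint64_t instr_latency:16,
			 reserved:48;
	};
};

struct perf_sample_data {	/* reduced sketch, not the kernel struct */
	uint64_t addr;
	uint64_t weight;
	union perf_weight_ext weight_ext;
};

/*
 * Mirror of the proposal: zero the whole union at init time, exactly
 * like 'weight' already is, so PERF_SAMPLE_WEIGHT_EXT never exports
 * stale stack contents.
 */
static void perf_sample_data_init(struct perf_sample_data *data, uint64_t addr)
{
	data->addr = addr;
	data->weight = 0;
	data->weight_ext.val = 0;	/* proposed addition */
}
```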
Re: [PATCH] perf stat: Add Topdown metrics events as default events
On 1/21/2021 2:57 PM, Arnaldo Carvalho de Melo wrote: Em Thu, Jan 21, 2021 at 05:37:52AM -0800, kan.li...@linux.intel.com escreveu: From: Kan Liang The Topdown Microarchitecture Analysis (TMA) Method is a structured analysis methodology to identify critical performance bottlenecks in out-of-order processors. From the Ice Lake and later platforms, the Topdown information can be retrieved from the dedicated "metrics" register, which isn't impacted by other events. Also, the Topdown metrics support both per thread/process and per core measuring. Adding Topdown metrics events as default events can enrich the default measuring information, and would not cost any extra multiplexing. Introduce arch_evlist__add_default_attrs() to allow architecture specific default events. Add the Topdown metrics events in the X86 specific arch_evlist__add_default_attrs(). Other architectures can add their own default events later separately. With the patch, $perf stat sleep 1 Performance counter stats for 'sleep 1': 0.82 msec task-clock:u #0.001 CPUs utilized 0 context-switches:u#0.000 K/sec 0 cpu-migrations:u #0.000 K/sec 61 page-faults:u #0.074 M/sec 319,941 cycles:u #0.388 GHz 242,802 instructions:u#0.76 insn per cycle 54,380 branches:u# 66.028 M/sec 4,043 branch-misses:u #7.43% of all branches 1,585,555 slots:u # 1925.189 M/sec 238,941 topdown-retiring:u# 15.0% retiring 410,378 topdown-bad-spec:u# 25.8% bad speculation 634,222 topdown-fe-bound:u# 39.9% frontend bound 304,675 topdown-be-bound:u# 19.2% backend bound Shouldn't we be adding this to one of the -d levels? The -d levels' events are architecture events, which should be available on all architectures. The Topdown events are X86 specific for now. I don't think we should add the Topdown events into any of the -d levels. Different architectures should have their unique events. I think we should allow architectures to default their unique and very useful events. 
But you say that this is essentially for free, so you check if this extra register is in place and if it is, hey, its for free, add it, is that the rationale? > I.e. it doesn't cause any impact to what we were getting before, so we should default to use free stuff? Free is just one of the benefits. The key motivation is that we think the Topdown information is very useful information, which can help users identify performance bottlenecks in modern out-of-order processors. Here is an example that triggers lots of itlb misses. With the patch, users can simply identify the performance issue with default perf stat (98.0% frontend bound, which leads the users to check the fetch bandwidth or the fetch latency related events, e.g., itlb misses). $perf stat ./itlb -b1 -n200 -c0 Performance counter stats for './itlb -b1 -n200 -c0': 14,567.03 msec task-clock:u #1.000 CPUs utilized 0 context-switches:u#0.000 K/sec 0 cpu-migrations:u #0.000 K/sec 683 page-faults:u #0.047 K/sec 55,890,568,522 cycles:u #3.837 GHz 3,601,934,137 instructions:u#0.06 insn per cycle 400,353,806 branches:u# 27.484 M/sec 24,078 branch-misses:u #0.01% of all branches 279,391,452,475 slots:u # 19179.715 M/sec 2,230,224,702 topdown-retiring:u# 0.8% retiring 3,286,958,264 topdown-bad-spec:u# 1.2% bad speculation 273,953,648,626 topdown-fe-bound:u# 98.0% frontend bound 4,579,804 topdown-be-bound:u# 0.0% backend bound Without the patch, users cannot get any useful hints from the default perf stat. Even with -d -d, users still cannot locate the issue to itlb misses, because we don't have an architecture iTLB-loads event. The -d -d also triggers the multiplexing. $ perf stat -d -d ./itlb -b1 -n200 -c0 Performance counter stats for './itlb -b1 -n200 -c0': 14,633.58 msec task-clock:u #1.000 CPUs utilized 0 context-switches:u#0.000 K/sec 0 cpu-migrations:u #0.000 K/sec 682 page-faults:u #0.047 K/sec 56,198,668,109 cycles:u #3.840 GHz (38.41%) 3,602,460,111 instructions:u#0.06 insn per cycle (46.12%) 400,445,992
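For reference, the percentages perf stat prints can be reproduced from the raw counts. Each level-1 metric appears to be shown as its share of the four metric counts combined; that denominator is an assumption here, but it matches the first `perf stat sleep 1` output above to the printed precision:

```c
#include <assert.h>

static double absd(double x)
{
	return x < 0 ? -x : x;
}

/* One topdown metric as a percentage of the four level-1 metrics. */
static double td_percent(double metric, double retiring, double bad_spec,
			 double fe_bound, double be_bound)
{
	double total = retiring + bad_spec + fe_bound + be_bound;

	return 100.0 * metric / total;
}
```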
Re: [PATCH V4 4/6] perf script: Add support for PERF_SAMPLE_CODE_PAGE_SIZE
On 1/15/2021 2:25 PM, Arnaldo Carvalho de Melo wrote: Em Tue, Jan 05, 2021 at 11:57:50AM -0800,kan.li...@linux.intel.com escreveu: From: Stephane Eranian Display sampled code page sizes when PERF_SAMPLE_CODE_PAGE_SIZE was set. For example, perf script --fields comm,event,ip,code_page_size dtlb mem-loads:uP:445777 4K dtlb mem-loads:uP:40f724 4K dtlb mem-loads:uP:474926 4K dtlb mem-loads:uP:401075 4K dtlb mem-loads:uP:401095 4K dtlb mem-loads:uP:401095 4K dtlb mem-loads:uP:4010cc 4K dtlb mem-loads:uP:440b6f 4K Acked-by: Namhyung Kim Acked-by: Jiri Olsa Signed-off-by: Stephane Eranian You missed your Signed-off-by, I'm adding it, please ack this change. The patch 4 and 5 are from Stephane. I only made minor changes so that the code can be rebased to the latest perf/core branch (c07b45a355ee). May add a tag as below. [kan.li...@linux.intel.com: Rebase on top of acme's perf/core branch commit c07b45a355ee] Signed-off-by: Kan Liang Thanks, Kan
Re: [PATCH V4 0/6] Add the page size in the perf record (user tools)
On 1/12/2021 12:24 AM, Athira Rajeev wrote: On 06-Jan-2021, at 1:27 AM, kan.li...@linux.intel.com wrote: From: Kan Liang Changes since V3: - Rebase on top of acme's perf/core branch commit c07b45a355ee ("perf record: Tweak "Lowering..." warning in record_opts__config_freq") Changes since V2: - Rebase on top of acme perf/core branch commit eec7b53d5916 ("perf test: Make sample-parsing test aware of PERF_SAMPLE_{CODE,DATA}_PAGE_SIZE") - Use unit_number__scnprintf() in get_page_size_name() - Emit warning about kernel not supporting the code page size sample_type bit Changes since V1: - Fix the compile warning with GCC 10 - Add Acked-by from Namhyung Kim Current perf can report both virtual addresses and physical addresses, but not the page size. Without the page size information of the utilized page, users cannot decide whether to promote/demote large pages to optimize memory usage. The kernel patches have been merged into tip perf/core branch, commit 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE") commit 76a5433f95f3 ("perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE") commit 4cb6a42e4c4b ("powerpc/perf: Support PERF_SAMPLE_DATA_PAGE_SIZE") commit 995f088efebe ("perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE") commit 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs") and Peter's perf/core branch commit 524680ce47a1 ("mm/gup: Provide gup_get_pte() more generic") commit 44a35d6937d2 ("mm: Introduce pXX_leaf_size()") commit 2f1e2f091ad0 ("perf/core: Fix arch_perf_get_page_size()") commit 7649e44aacdd ("arm64/mm: Implement pXX_leaf_size() support") commit 1df1ae7e262c ("sparc64/mm: Implement pXX_leaf_size() support") This patch set is to enable the page size support in user tools. Hi Kan Liang, I am trying to check this series on powerpc. 
# perf mem --phys-data --data-page-size record To my observation, some of the samples returned zero size and comes as ’N/A’ in the perf report # perf mem --phys-data --data-page-size report For fetching the page size, though initially there was a weak function added ( as arch_perf_get_page_size ) here: https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core=51b646b2d9f84d6ff6300e3c1d09f2be4329a424 later I see it got removed here: https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core=8af26be062721e52eba1550caf50b712f774c5fd I picked kernel changes from git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git , or I am missing something ? I believe all the kernel changes have been merged. According to the commit message of the recent changes, only Power/8xxx is supported for power for now. I guess that may be the reason of some 'N/A's. https://lore.kernel.org/patchwork/cover/1345521/ Thanks, Kan Thanks Athira Kan Liang (3): perf mem: Clean up output format perf mem: Support data page size perf tools: Add support for PERF_SAMPLE_CODE_PAGE_SIZE Stephane Eranian (3): perf script: Add support for PERF_SAMPLE_CODE_PAGE_SIZE perf report: Add support for PERF_SAMPLE_CODE_PAGE_SIZE perf test: Add test case for PERF_SAMPLE_CODE_PAGE_SIZE tools/perf/Documentation/perf-mem.txt | 3 + tools/perf/Documentation/perf-record.txt | 3 + tools/perf/Documentation/perf-report.txt | 1 + tools/perf/Documentation/perf-script.txt | 2 +- tools/perf/builtin-mem.c | 111 +++--- tools/perf/builtin-record.c | 2 + tools/perf/builtin-script.c | 13 ++- tools/perf/tests/sample-parsing.c | 4 + tools/perf/util/event.h | 1 + tools/perf/util/evsel.c | 18 +++- tools/perf/util/evsel.h | 1 + tools/perf/util/hist.c| 2 + tools/perf/util/hist.h| 1 + tools/perf/util/perf_event_attr_fprintf.c | 2 +- tools/perf/util/record.h | 1 + tools/perf/util/session.c | 3 + tools/perf/util/sort.c| 26 + tools/perf/util/sort.h| 2 + tools/perf/util/synthetic-events.c| 8 
++ 19 files changed, 144 insertions(+), 60 deletions(-) -- 2.25.1
Re: [PATCH 0/2] perf/x86/intel/uncore: Derive die id from NUMA info with more than 8 nodes
On 1/8/2021 10:35 AM, Steve Wahl wrote: For Intel uncore, the registers being used to identify the die don't contain enough bits to uniquely identify more than 8 dies. On systems with more than 8 dies, this results in error messages of the form "skx_uncore: probe of :XX:XX.X failed with error -22", and some perf counters showing up as "". On such systems, use NUMA information to determine die id. Continue to use the register information with 8 or fewer numa nodes to cover cases like NUMA not being enabled. The first patch moves translation from physical to logical die id earlier in the code, and stores only the logical id. The logical id is the only one that is really used. Without this change the second patch would have to store both physical and logical id, which was much more complicated. The second patch adds the alternative of deriving the logical die id from the NUMA information when there are more than 8 nodes. Steve Wahl (2): perf/x86/intel/uncore: Store the logical die id instead of the physical die id. perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info arch/x86/events/intel/uncore.c | 58 +- arch/x86/events/intel/uncore.h | 5 +- arch/x86/events/intel/uncore_snb.c | 2 +- arch/x86/events/intel/uncore_snbep.c | 114 ++- 4 files changed, 99 insertions(+), 80 deletions(-) Thanks Steve for working on the issue. The patch set looks good to me. Reviewed-by: Kan Liang Thanks, Kan
Re: [PATCH] perf intel-pt: Fix 'CPU too large' error
On 1/7/2021 12:41 PM, Adrian Hunter wrote: In some cases, the number of cpus (nr_cpus_online) is confused with the maximum cpu number (nr_cpus_avail), which results in the error in the example below: Example on system with 8 cpus: Before: # echo 0 > /sys/devices/system/cpu/cpu2/online # ./perf record --kcore -e intel_pt// taskset --cpu-list 7 uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.147 MB perf.data ] # ./perf script --itrace=e Requested CPU 7 too large. Consider raising MAX_NR_CPUS 0x25908 [0x8]: failed to process type: 68 [Invalid argument] After: # ./perf script --itrace=e # Fixes: 8c7274691f0d ("perf machine: Replace MAX_NR_CPUS with perf_env::nr_cpus_online") Fixes: 7df4e36a4785 ("perf session: Replace MAX_NR_CPUS with perf_env::nr_cpus_online") Cc: sta...@vger.kernel.org Signed-off-by: Adrian Hunter Tested-by: Kan Liang Thanks, Kan --- tools/perf/util/machine.c | 4 ++-- tools/perf/util/session.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c index 1ae32a81639c..46844599d25d 100644 --- a/tools/perf/util/machine.c +++ b/tools/perf/util/machine.c @@ -2977,7 +2977,7 @@ int machines__for_each_thread(struct machines *machines, pid_t machine__get_current_tid(struct machine *machine, int cpu) { - int nr_cpus = min(machine->env->nr_cpus_online, MAX_NR_CPUS); + int nr_cpus = min(machine->env->nr_cpus_avail, MAX_NR_CPUS); if (cpu < 0 || cpu >= nr_cpus || !machine->current_tid) return -1; @@ -2989,7 +2989,7 @@ int machine__set_current_tid(struct machine *machine, int cpu, pid_t pid, pid_t tid) { struct thread *thread; - int nr_cpus = min(machine->env->nr_cpus_online, MAX_NR_CPUS); + int nr_cpus = min(machine->env->nr_cpus_avail, MAX_NR_CPUS); if (cpu < 0) return -EINVAL; diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c index 3b3c50b12791..2777c2df7d87 100644 --- a/tools/perf/util/session.c +++ b/tools/perf/util/session.c @@ 
-2391,7 +2391,7 @@ int perf_session__cpu_bitmap(struct perf_session *session, { int i, err = -1; struct perf_cpu_map *map; - int nr_cpus = min(session->header.env.nr_cpus_online, MAX_NR_CPUS); + int nr_cpus = min(session->header.env.nr_cpus_avail, MAX_NR_CPUS); for (i = 0; i < PERF_TYPE_MAX; ++i) { struct evsel *evsel;
Re: [PATCH V3 5/9] perf mem: Support data page size
On 12/19/2020 3:56 PM, Arnaldo Carvalho de Melo wrote: Em Wed, Dec 16, 2020 at 10:58:01AM -0800, kan.li...@linux.intel.com escreveu: From: Kan Liang Add option --data-page-size in "perf mem" to record/report data page size. Here are some examples. perf mem --phys-data --data-page-size report -D So I stopped at this cset, it isn't applying to my tree, I'll test what I have, which is up to the patch before this one and push to Linus, as the window is closing. Hi Arnaldo, Sorry for the late response. I was on vacation. I will rebase the rest of the patches on top of your perf/core branch and send them out shortly. Thanks, Kan - Arnaldo # PID, TID, IP, ADDR, PHYS ADDR, DATA PAGE SIZE, LOCAL WEIGHT, DSRC, # SYMBOL 20134 20134 0xb5bd2fd0 0x0169a274e96a308 0x00044e96a308 4K 1168 0x5080144 /lib/modules/4.18.0-rc7+/build/vmlinux:perf_ctx_unlock 20134 20134 0xb63f645c 0xb752b814 0xcfb52b814 2M 225 0x26a100142 /lib/modules/4.18.0-rc7+/build/vmlinux:_raw_spin_lock 20134 20134 0xb660300c 0xfe00016b8bb0 0x0 4K 0 0x5080144 /lib/modules/4.18.0-rc7+/build/vmlinux:__x86_indirect_thunk_rax perf mem --phys-data --data-page-size report --stdio # To display the perf.data header info, please use # --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 5K of event 'cpu/mem-loads,ldlat=30/P' # Total weight : 281234 # Sort order : # mem,sym,dso,symbol_daddr,dso_daddr,tlb,locked,phys_daddr,data_page_size # # Overhead Samples Memory access Symbol # Shared Object Data Symbol Data # Object TLB access Locked Data Physical # Address Data Page Size # # # ... ... # .. .. .. # .. # 28.54% 1826 L1 or L1 hit [k] __x86_indirect_thunk_rax [kernel.vmlinux] [k] 0xb0df31b0ff28 [unknown]L1 or L2 hitNo [k] 4K 6.02% 256 L1 or L1 hit [.] touch_buffer dtlb [.] 0x7ffd50109da8 [stack] L1 or L2 hitNo [.] 
0x00042454ada8 4K 3.23% 5 L1 or L1 hit [k] clear_huge_page [kernel.vmlinux] [k] 0x9a2753b8ce60 [unknown] L1 or L2 hitNo [k] 0x000453b8ce60 2M 2.98% 4 L1 or L1 hit [k] clear_page_erms [kernel.vmlinux] [k] 0xb0df31b0fd00 [unknown] L1 or L2 hitNo [k] 4K Acked-by: Namhyung Kim Acked-by: Jiri Olsa Signed-off-by: Kan Liang --- tools/perf/Documentation/perf-mem.txt | 3 +++ tools/perf/builtin-mem.c | 20 +++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/tools/perf/Documentation/perf-mem.txt b/tools/perf/Documentation/perf-mem.txt index 199ea0f0a6c0..66177511c5c4 100644 --- a/tools/perf/Documentation/perf-mem.txt +++ b/tools/perf/Documentation/perf-mem.txt @@ -63,6 +63,9 @@ OPTIONS --phys-data:: Record/Report sample physical addresses +--data-page-size:: + Record/Report sample data address page size + RECORD OPTIONS -- -e:: diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c index 7d6ee2208709..f3aac85aa9d4 100644 --- a/tools/perf/builtin-mem.c +++ b/tools/perf/builtin-mem.c @@ -30,6 +30,7 @@ struct perf_mem { booldump_raw; boolforce; boolphys_addr; + booldata_page_size; int operation; const char *cpu_list; DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS); @@ -124,6 +125,9 @@ static int __cmd_record(int argc, const char **argv, struct perf_mem *mem) if (mem->phys_addr) rec_argv[i++] = "--phys-data"; + if (mem->data_page_size) + rec_argv[i++] = "--data-page-size"; + for (j = 0; j < PERF_MEM_EVENTS__MAX; j++) { e = perf_mem_events__ptr(j); if (!e->record) @@ -173,6 +177,7 @@ dump_raw_samples(struct perf_tool *tool, struct perf_mem *mem = container_of(tool, struct perf_mem, tool); struct addr_location al; const char *fmt, *field_sep; + char str[PAGE_SIZE_NAME_LEN]; if (machine__resolve(machine, , sample) < 0) { fprintf(stderr, "problem processing %d event, skipping it.\n", @@ -209,6 +214,12 @@ dump_raw_samples(struct perf_tool *tool, symbol_conf.field_sep); } + if (mem->data_page_size) { + printf("%s%s", +
Re: [PATCH V2 3/3] perf: Optimize sched_task() in a context switch
On 12/10/2020 2:13 AM, Namhyung Kim wrote: Hi Peter and Kan, How can we move this forward? Hi Namhyung, Thanks for the test. The changes look good to me. Hi Peter, Should we resend the patch set for further review? Thanks, Kan Thanks, Namhyung On Fri, Dec 4, 2020 at 4:14 PM Namhyung Kim wrote: Hi Peter, On Wed, Dec 2, 2020 at 2:29 AM Peter Zijlstra wrote: On Mon, Nov 30, 2020 at 11:38:42AM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang Some calls to sched_task() in a context switch can be avoided. For example, large PEBS only requires flushing the buffer in context switch out. The current code still invokes the sched_task() for large PEBS in context switch in. I still hate this one, how's something like this then? Which I still don't really like.. but at least its simpler. (completely untested, may contain spurious edits, might ICE the compiler and set your pets on fire if it doesn't) I've tested this version... and it worked well besides the optimization.. :) [SNIP] +static void context_sched_task(struct perf_cpu_context *cpuctx, + struct perf_event_context *ctx, + bool sched_in) +{ + struct pmu *pmu = ctx->pmu; + + if (cpuctx->sched_cb_dir[sched_in] && pmu->sched_task) + pmu->sched_task(ctx, false); applied: s/false/sched_in/ +} + static void perf_event_context_sched_out(struct task_struct *task, int ctxn, struct task_struct *next) { @@ -3424,9 +3433,7 @@ static void perf_event_context_sched_out WRITE_ONCE(next_ctx->task, task); perf_pmu_disable(pmu); - - if (cpuctx->sched_cb_usage && pmu->sched_task) - pmu->sched_task(ctx, false); + context_sched_task(cpuctx, ctx, false); /* * PMU specific parts of task perf context can require @@ -3465,8 +3472,7 @@ static void perf_event_context_sched_out raw_spin_lock(>lock); perf_pmu_disable(pmu); - if (cpuctx->sched_cb_usage && pmu->sched_task) - pmu->sched_task(ctx, false); + context_sched_task(cpuctx, ctx, false); task_ctx_sched_out(cpuctx, ctx, EVENT_ALL); perf_pmu_enable(pmu); [SNIP] @@ -3563,8 +3582,7 @@ void 
__perf_event_task_sched_out(struct { int ctxn; - if (__this_cpu_read(perf_sched_cb_usage)) - perf_pmu_sched_task(task, next, false); + perf_pmu_sched_task(task, next, false); I think the reason is this change. It now calls perf_pmu_sched_task() without checking the counter. And this is for per-cpu events. if (atomic_read(_switch_events)) perf_event_switch(task, next, false); @@ -3828,8 +3846,7 @@ static void perf_event_context_sched_in( cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE); perf_event_sched_in(cpuctx, ctx, task); - if (cpuctx->sched_cb_usage && pmu->sched_task) - pmu->sched_task(cpuctx->task_ctx, true); + context_sched_task(cpuctx, ctx, true); perf_pmu_enable(pmu); @@ -3875,8 +3892,7 @@ void __perf_event_task_sched_in(struct t if (atomic_read(_switch_events)) perf_event_switch(task, prev, true); - if (__this_cpu_read(perf_sched_cb_usage)) - perf_pmu_sched_task(prev, task, true); + perf_pmu_sched_task(prev, task, true); Ditto. } static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count) So I made a change like below.. and it could bring the optimization back. Thanks, Namhyung diff --git a/kernel/events/core.c b/kernel/events/core.c index 9107e7c3ccfb..a30243a9fab5 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3528,6 +3528,9 @@ static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in { struct pmu *pmu; + if (!cpuctx->sched_cb_dir[sched_in]) + return; + pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */ if (WARN_ON_ONCE(!pmu->sched_task))
Re: [PATCH V2 02/12] perf record: Support new sample type for data page size
On 12/7/2020 12:07 PM, Arnaldo Carvalho de Melo wrote: Em Mon, Nov 30, 2020 at 09:27:53AM -0800, kan.li...@linux.intel.com escreveu: From: Kan Liang Support new sample type PERF_SAMPLE_DATA_PAGE_SIZE for page size. Add new option --data-page-size to record sample data page size. So, trying this on a kernel without this feature I get: [acme@five perf]$ perf record --data-page-size sleep 1 Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cycles:u). /bin/dmesg | grep -i perf may provide additional information. [acme@five perf]$ I'm adding the following patch right after yours, next time please test this and provide a similar error message. Sorry, I missed it. Besides the PERF_SAMPLE_DATA_PAGE_SIZE, I think we have to fix the PERF_SAMPLE_CODE_PAGE_SIZE as well. Should I send a separate patch to fix it? Thanks, Kan - Arnaldo commit 2044fec7fcc6070b09f9b6a67922b0b9e4295dba Author: Arnaldo Carvalho de Melo Date: Mon Dec 7 14:04:05 2020 -0300 perf evsel: Emit warning about kernel not supporting the data page size sample_type bit Before we had this unhelpful message: $ perf record --data-page-size sleep 1 Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cycles:u). /bin/dmesg | grep -i perf may provide additional information. $ Add support to the perf_missing_features variable to remember what caused evsel__open() to fail and then use that information in evsel__open_strerror(). $ perf record --data-page-size sleep 1 Error: Asking for the data page size isn't supported by this kernel. 
$ Cc: Kan Liang Cc: Namhyung Kim Cc: Andi Kleen Cc: Jiri Olsa Cc: Mark Rutland Cc: Michael Ellerman Cc: Stephane Eranian Cc: Will Deacon Signed-off-by: Arnaldo Carvalho de Melo diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index 5e6085c3fc761a55..c26ea82220bd8625 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -1873,7 +1873,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus, * Must probe features in the order they were added to the * perf_event_attr interface. */ -if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) { +if (!perf_missing_features.data_page_size && + (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)) { + perf_missing_features.data_page_size = true; + pr_debug2_peo("Kernel has no PERF_SAMPLE_DATA_PAGE_SIZE support, bailing out\n"); + goto out_close; + } else if (!perf_missing_features.cgroup && evsel->core.attr.cgroup) { perf_missing_features.cgroup = true; pr_debug2_peo("Kernel has no cgroup sampling support, bailing out\n"); goto out_close; @@ -2673,6 +2678,8 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target, "We found oprofile daemon running, please stop it and try again."); break; case EINVAL: + if (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE && perf_missing_features.data_page_size) + return scnprintf(msg, size, "Asking for the data page size isn't supported by this kernel."); if (evsel->core.attr.write_backward && perf_missing_features.write_backward) return scnprintf(msg, size, "Reading from overwrite event is not supported by this kernel."); if (perf_missing_features.clockid) diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h index 79a860d8e3eefe23..cd1d8dd431997b84 100644 --- a/tools/perf/util/evsel.h +++ b/tools/perf/util/evsel.h @@ -144,6 +144,7 @@ struct perf_missing_features { bool aux_output; bool branch_hw_idx; bool cgroup; + bool data_page_size; }; extern struct perf_missing_features perf_missing_features;
Re: [PATCH V2 06/12] perf mem: Clean up output format
On 12/4/2020 6:27 PM, Jiri Olsa wrote: On Mon, Nov 30, 2020 at 09:27:57AM -0800, kan.li...@linux.intel.com wrote: SNIP @@ -172,7 +172,7 @@ dump_raw_samples(struct perf_tool *tool, { struct perf_mem *mem = container_of(tool, struct perf_mem, tool); struct addr_location al; - const char *fmt; + const char *fmt, *field_sep; if (machine__resolve(machine, , sample) < 0) { fprintf(stderr, "problem processing %d event, skipping it.\n", @@ -186,60 +186,41 @@ dump_raw_samples(struct perf_tool *tool, if (al.map != NULL) al.map->dso->hit = 1; - if (mem->phys_addr) { - if (symbol_conf.field_sep) { - fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s0x%016"PRIx64 - "%s%"PRIu64"%s0x%"PRIx64"%s%s:%s\n"; - } else { - fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64 - "%s0x%016"PRIx64"%s%5"PRIu64"%s0x%06"PRIx64 - "%s%s:%s\n"; - symbol_conf.field_sep = " "; - } - - printf(fmt, - sample->pid, - symbol_conf.field_sep, - sample->tid, - symbol_conf.field_sep, - sample->ip, - symbol_conf.field_sep, - sample->addr, - symbol_conf.field_sep, - sample->phys_addr, - symbol_conf.field_sep, - sample->weight, - symbol_conf.field_sep, - sample->data_src, - symbol_conf.field_sep, - al.map ? (al.map->dso ? al.map->dso->long_name : "???") : "???", - al.sym ? al.sym->name : "???"); + field_sep = symbol_conf.field_sep; hum, what's the point of having field_sep? To keep the fmt consistent. The patch divides the "printf(fmt,..." into two part. In the first half part, the symbol_conf.field_sep may be modified to " "; In the second half part, we should not use the modified symbol_conf.field_sep for the below check. The field_sep keeps the original value and should be used. 
+ if (field_sep) { + fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s"; } else { - if (symbol_conf.field_sep) { - fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s%"PRIu64 - "%s0x%"PRIx64"%s%s:%s\n"; - } else { - fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64 - "%s%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n"; - symbol_conf.field_sep = " "; - } + fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x016%"PRIx64"%s"; + symbol_conf.field_sep = " "; + } + printf(fmt, + sample->pid, + symbol_conf.field_sep, + sample->tid, + symbol_conf.field_sep, + sample->ip, + symbol_conf.field_sep, + sample->addr, + symbol_conf.field_sep); - printf(fmt, - sample->pid, - symbol_conf.field_sep, - sample->tid, - symbol_conf.field_sep, - sample->ip, - symbol_conf.field_sep, - sample->addr, - symbol_conf.field_sep, - sample->weight, - symbol_conf.field_sep, - sample->data_src, - symbol_conf.field_sep, - al.map ? (al.map->dso ? al.map->dso->long_name : "???") : "???", - al.sym ? al.sym->name : "???"); + if (mem->phys_addr) { + printf("0x%016"PRIx64"%s", + sample->phys_addr, + symbol_conf.field_sep); } + + if (field_sep) + fmt = "%"PRIu64"%s0x%"PRIx64"%s%s:%s\n"; + else + fmt = "%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n"; + + printf(fmt, + sample->weight, + symbol_conf.field_sep, + sample->data_src, + symbol_conf.field_sep, + al.map ? (al.map->dso ? al.map->dso->long_name : "???") : "???", + al.sym ? al.sym->name : "???"); out_put: addr_location__put(); return 0; @@ -287,10 +268,12 @@ static int report_raw_events(struct perf_mem *mem) if (ret < 0) goto out_delete; + printf("# PID, TID, IP, ADDR, "); + if (mem->phys_addr) - printf("# PID, TID, IP, ADDR, PHYS ADDR, LOCAL WEIGHT, DSRC, SYMBOL\n"); - else - printf("# PID, TID, IP, ADDR, LOCAL WEIGHT, DSRC, SYMBOL\n"); +
Re: [PATCH 00/12] Add the page size in the perf record (user tools)
On 11/27/2020 3:22 PM, Jiri Olsa wrote: hi, I can't compile this on Fedora 32, check the log below The patch set was compiled with GCC 9. I will install a GCC 10 and fix all the warnings. Thanks, Kan jirka --- $ make JOBS=1 BUILD: Doing 'make -j1' parallel build Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h' diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h CC util/parse-events.o CC util/session.o util/session.c: In function ‘machines__deliver_event’: util/session.c:1278:37: error: ‘%lu’ directive output may be truncated writing between 1 and 20 bytes into a region of size 10 [-Werror=format-truncation=] 1278 | snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]); | ^~~ util/session.c:1278:36: note: directive argument in the range [1, 18446744073709551615] 1278 | snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]); |^~~ In file included from /usr/include/stdio.h:867, from /home/jolsa/kernel/linux-perf/tools/lib/perf/include/perf/cpumap.h:6, from util/session.c:13: /usr/include/bits/stdio2.h:67:10: note: ‘__builtin___snprintf_chk’ output between 3 and 22 bytes into a destination of size 10 67 | return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1, | ^~~~ 68 |__bos (__s), __fmt, __va_arg_pack ()); |~ util/session.c:1278:37: error: ‘%lu’ directive output may be truncated writing between 1 and 20 bytes into a region of size 10 [-Werror=format-truncation=] 1278 | snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]); | ^~~ util/session.c:1278:36: note: directive argument in the range [1, 18446744073709551615] 1278 | snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]); |^~~ In file included from /usr/include/stdio.h:867, from /home/jolsa/kernel/linux-perf/tools/lib/perf/include/perf/cpumap.h:6, from util/session.c:13: /usr/include/bits/stdio2.h:67:10: note: ‘__builtin___snprintf_chk’ output between 3 and 22 
bytes into a destination of size 10 67 | return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1, | ^~~~ 68 |__bos (__s), __fmt, __va_arg_pack ()); |~ util/session.c: In function ‘get_page_size_name’: util/session.c:1278:37: error: ‘%lu’ directive output may be truncated writing between 1 and 20 bytes into a region of size 10 [-Werror=format-truncation=] 1278 | snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]); | ^~~ util/session.c:1278:36: note: directive argument in the range [1, 18446744073709551615] 1278 | snprintf(str, PAGE_SIZE_NAME_LEN, "%lu%c", size, suffixes[i]); |^~~ In file included from /usr/include/stdio.h:867, from /home/jolsa/kernel/linux-perf/tools/lib/perf/include/perf/cpumap.h:6, from util/session.c:13: /usr/include/bits/stdio2.h:67:10: note: ‘__builtin___snprintf_chk’ output between 3 and 22 bytes into a destination of size 10 67 | return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1, | ^~~~ 68 |__bos (__s), __fmt, __va_arg_pack ()); |~ cc1: all warnings being treated as errors make[4]: *** [/home/jolsa/kernel/linux-perf/tools/build/Makefile.build:97: util/session.o] Error 1 make[3]: *** [/home/jolsa/kernel/linux-perf/tools/build/Makefile.build:139: util] Error 2 make[2]: *** [Makefile.perf:647: perf-in.o] Error 2 make[1]: *** [Makefile.perf:233: sub-make] Error 2 make: *** [Makefile:70: all] Error 2
Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events
On 11/24/2020 12:42 AM, Madhavan Srinivasan wrote: On 11/24/20 10:21 AM, Namhyung Kim wrote: Hello, On Mon, Nov 23, 2020 at 8:00 PM Michael Ellerman wrote: Namhyung Kim writes: Hi Peter and Kan, (Adding PPC folks) On Tue, Nov 17, 2020 at 2:01 PM Namhyung Kim wrote: Hello, On Thu, Nov 12, 2020 at 4:54 AM Liang, Kan wrote: On 11/11/2020 11:25 AM, Peter Zijlstra wrote: On Mon, Nov 09, 2020 at 09:49:31AM -0500, Liang, Kan wrote: - When the large PEBS was introduced (9c964efa4330), the sched_task() should be invoked to flush the PEBS buffer in each context switch. However, The perf_sched_events in account_event() is not updated accordingly. The perf_event_task_sched_* never be invoked for a pure per-CPU context. Only per-task event works. At that time, the perf_pmu_sched_task() is outside of perf_event_context_sched_in/out. It means that perf has to double perf_pmu_disable() for per-task event. - The patch 1 tries to fix broken per-CPU events. The CPU context cannot be retrieved from the task->perf_event_ctxp. So it has to be tracked in the sched_cb_list. Yes, the code is very similar to the original codes, but it is actually the new code for per-CPU events. The optimization for per-task events is still kept. For the case, which has both a CPU context and a task context, yes, the __perf_pmu_sched_task() in this patch is not invoked. Because the sched_task() only need to be invoked once in a context switch. The sched_task() will be eventually invoked in the task context. The thing is; your first two patches rely on PERF_ATTACH_SCHED_CB and only set that for large pebs. Are you sure the other users (Intel LBR and PowerPC BHRB) don't need it? I didn't set it for LBR, because the perf_sched_events is always enabled for LBR. But, yes, we should explicitly set the PERF_ATTACH_SCHED_CB for LBR. 
if (has_branch_stack(event)) inc = true; If they indeed do not require the pmu::sched_task() callback for CPU events, then I still think the whole perf_sched_cb_{inc,dec}() interface No, LBR requires the pmu::sched_task() callback for CPU events. Now, The LBR registers have to be reset in sched in even for CPU events. To fix the shorter LBR callstack issue for CPU events, we also need to save/restore LBRs in pmu::sched_task(). https://lore.kernel.org/lkml/1578495789-95006-4-git-send-email-kan.li...@linux.intel.com/ is confusing at best. Can't we do something like this instead? I think the below patch may have two issues. - PERF_ATTACH_SCHED_CB is required for LBR (maybe PowerPC BHRB as well) now. - We may disable the large PEBS later if not all PEBS events support large PEBS. The PMU need a way to notify the generic code to decrease the nr_sched_task. Any updates on this? I've reviewed and tested Kan's patches and they all look good. Maybe we can talk to PPC folks to confirm the BHRB case? Can we move this forward? I saw patch 3/3 also adds PERF_ATTACH_SCHED_CB for PowerPC too. But it'd be nice if ppc folks can confirm the change. Sorry I've read the whole thread, but I'm still not entirely sure I understand the question. Thanks for your time and sorry about not being clear enough. We found per-cpu events are not calling pmu::sched_task() on context switches. So PERF_ATTACH_SCHED_CB was added to indicate the core logic that it needs to invoke the callback. The patch 3/3 added the flag to PPC (for BHRB) with other changes (I think it should be split like in the patch 2/3) and want to get ACKs from the PPC folks. Sorry for delay. I guess first it will be better to split the ppc change to a separate patch, Both PPC and X86 invokes the perf_sched_cb_inc() directly. The patch changes the parameters of the perf_sched_cb_inc(). I think we have to update the PPC and X86 codes together. 
Otherwise, there will be a compile error if someone only applies the change to perf_sched_cb_inc() but forgets to apply the corresponding changes in the PPC or X86 specific code. Secondly, we are missing the changes needed in power_pmu_bhrb_disable(), where perf_sched_cb_dec() needs the "state" to be included. Ah, right. The below patch should fix the issue. diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index bced502f64a1..6756d1602a67 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -391,13 +391,18 @@ static void power_pmu_bhrb_enable(struct perf_event *event) static void power_pmu_bhrb_disable(struct perf_event *event) { struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events); + int state = PERF_SCHED_CB_SW_IN; if (!ppmu->bhrb_nr) return; WARN_ON_ONCE(!cpuhw->bhrb_users); cpuhw->bhrb_users--; - perf_sched_cb_dec(event->ctx->pmu); + + if (!(event->attach_state & PERF_ATTACH_TASK)) + state |= PERF_SCHED_CB_CPU; + + perf_
Re: [PATCH 3/5] perf/core: Fix arch_perf_get_page_size()
On 11/13/2020 6:19 AM, Peter Zijlstra wrote: The (new) page-table walker in arch_perf_get_page_size() is broken in various ways. Specifically while it is used in a lockless manner, it doesn't depend on CONFIG_HAVE_FAST_GUP nor uses the proper _lockless offset methods, nor is it careful to read each entry only once. Also the hugetlb support is broken due to calling pte_page() without first checking pte_special(). Rewrite the whole thing to be a proper lockless page-table walker and employ the new pXX_leaf_size() pgtable functions to determine the TLB size without looking at the page-frames. Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs") Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE") The issue (https://lkml.kernel.org/r/8e88ba79-7c40-ea32-a7ed-bdc4fc04b...@linux.intel.com) has been fixed by this patch set. Tested-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) --- arch/arm64/include/asm/pgtable.h|3 + arch/sparc/include/asm/pgtable_64.h | 13 arch/sparc/mm/hugetlbpage.c | 19 -- include/linux/pgtable.h | 16 + kernel/events/core.c| 102 +--- 5 files changed, 82 insertions(+), 71 deletions(-) --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7001,90 +7001,62 @@ static u64 perf_virt_to_phys(u64 virt) return phys_addr; } -#ifdef CONFIG_MMU - /* - * Return the MMU page size of a given virtual address. - * - * This generic implementation handles page-table aligned huge pages, as well - * as non-page-table aligned hugetlbfs compound pages. - * - * If an architecture supports and uses non-page-table aligned pages in their - * kernel mapping it will need to provide it's own implementation of this - * function. + * Return the MMU/TLB page size of a given virtual address.
*/ -__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr) +static u64 perf_get_tlb_page_size(struct mm_struct *mm, unsigned long addr) { - struct page *page; - pgd_t *pgd; - p4d_t *p4d; - pud_t *pud; - pmd_t *pmd; - pte_t *pte; + u64 size = 0; - pgd = pgd_offset(mm, addr); - if (pgd_none(*pgd)) - return 0; +#ifdef CONFIG_HAVE_FAST_GUP + pgd_t *pgdp, pgd; + p4d_t *p4dp, p4d; + pud_t *pudp, pud; + pmd_t *pmdp, pmd; + pte_t *ptep, pte; - p4d = p4d_offset(pgd, addr); - if (!p4d_present(*p4d)) + pgdp = pgd_offset(mm, addr); + pgd = READ_ONCE(*pgdp); + if (pgd_none(pgd)) return 0; - if (p4d_leaf(*p4d)) - return 1ULL << P4D_SHIFT; + if (pgd_leaf(pgd)) + return pgd_leaf_size(pgd); - pud = pud_offset(p4d, addr); - if (!pud_present(*pud)) + p4dp = p4d_offset_lockless(pgdp, pgd, addr); + p4d = READ_ONCE(*p4dp); + if (!p4d_present(p4d)) return 0; - if (pud_leaf(*pud)) { -#ifdef pud_page - page = pud_page(*pud); - if (PageHuge(page)) - return page_size(compound_head(page)); -#endif - return 1ULL << PUD_SHIFT; - } + if (p4d_leaf(p4d)) + return p4d_leaf_size(p4d); - pmd = pmd_offset(pud, addr); - if (!pmd_present(*pmd)) + pudp = pud_offset_lockless(p4dp, p4d, addr); + pud = READ_ONCE(*pudp); + if (!pud_present(pud)) return 0; - if (pmd_leaf(*pmd)) { -#ifdef pmd_page - page = pmd_page(*pmd); - if (PageHuge(page)) - return page_size(compound_head(page)); -#endif - return 1ULL << PMD_SHIFT; - } + if (pud_leaf(pud)) + return pud_leaf_size(pud); - pte = pte_offset_map(pmd, addr); - if (!pte_present(*pte)) { - pte_unmap(pte); + pmdp = pmd_offset_lockless(pudp, pud, addr); + pmd = READ_ONCE(*pmdp); + if (!pmd_present(pmd)) return 0; - } - page = pte_page(*pte); - if (PageHuge(page)) { - u64 size = page_size(compound_head(page)); - pte_unmap(pte); - return size; - } - - pte_unmap(pte); - return PAGE_SIZE; -} + if (pmd_leaf(pmd)) + return pmd_leaf_size(pmd); -#else + ptep = pte_offset_map(, addr); + pte = ptep_get_lockless(ptep); + if (pte_present(pte)) + size = 
pte_leaf_size(pte); + pte_unmap(ptep); +#endif /* CONFIG_HAVE_FAST_GUP */ -static u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr) -{ - return 0; + return size; } -#endif - static u64 perf_get_page_size(unsigned long addr) { struct mm_struct *mm; @@ -7109,7 +7081,7 @@ static u64 perf_get_page_size(unsigned l mm = _mm; } - size = arch_perf_get_page_size(mm, addr); +
Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events
On 11/11/2020 11:25 AM, Peter Zijlstra wrote: On Mon, Nov 09, 2020 at 09:49:31AM -0500, Liang, Kan wrote: - When the large PEBS was introduced (9c964efa4330), the sched_task() should be invoked to flush the PEBS buffer in each context switch. However, The perf_sched_events in account_event() is not updated accordingly. The perf_event_task_sched_* never be invoked for a pure per-CPU context. Only per-task event works. At that time, the perf_pmu_sched_task() is outside of perf_event_context_sched_in/out. It means that perf has to double perf_pmu_disable() for per-task event. - The patch 1 tries to fix broken per-CPU events. The CPU context cannot be retrieved from the task->perf_event_ctxp. So it has to be tracked in the sched_cb_list. Yes, the code is very similar to the original codes, but it is actually the new code for per-CPU events. The optimization for per-task events is still kept. For the case, which has both a CPU context and a task context, yes, the __perf_pmu_sched_task() in this patch is not invoked. Because the sched_task() only need to be invoked once in a context switch. The sched_task() will be eventually invoked in the task context. The thing is; your first two patches rely on PERF_ATTACH_SCHED_CB and only set that for large pebs. Are you sure the other users (Intel LBR and PowerPC BHRB) don't need it? I didn't set it for LBR, because the perf_sched_events is always enabled for LBR. But, yes, we should explicitly set the PERF_ATTACH_SCHED_CB for LBR. if (has_branch_stack(event)) inc = true; If they indeed do not require the pmu::sched_task() callback for CPU events, then I still think the whole perf_sched_cb_{inc,dec}() interface No, LBR requires the pmu::sched_task() callback for CPU events. Now, The LBR registers have to be reset in sched in even for CPU events. To fix the shorter LBR callstack issue for CPU events, we also need to save/restore LBRs in pmu::sched_task(). 
https://lore.kernel.org/lkml/1578495789-95006-4-git-send-email-kan.li...@linux.intel.com/ is confusing at best. Can't we do something like this instead? I think the below patch may have two issues. - PERF_ATTACH_SCHED_CB is required for LBR (maybe PowerPC BHRB as well) now. - We may disable the large PEBS later if not all PEBS events support large PEBS. The PMU need a way to notify the generic code to decrease the nr_sched_task. The current related code in Intel PMU is as below. static void pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, struct perf_event *event, bool add) { struct pmu *pmu = event->ctx->pmu; /* * Make sure we get updated with the first PEBS * event. It will trigger also during removal, but * that does not hurt: */ bool update = cpuc->n_pebs == 1; if (needed_cb != pebs_needs_sched_cb(cpuc)) { if (!needed_cb) perf_sched_cb_inc(pmu); else perf_sched_cb_dec(pmu); update = true; } Thanks, Kan --- diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 546cc89217bb..672d6f039fce 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3565,8 +3565,10 @@ static int intel_pmu_hw_config(struct perf_event *event) if (!(event->attr.freq || (event->attr.wakeup_events && !event->attr.watermark))) { event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; if (!(event->attr.sample_type & - ~intel_pmu_large_pebs_flags(event))) + ~intel_pmu_large_pebs_flags(event))) { event->hw.flags |= PERF_X86_EVENT_LARGE_PEBS; + event->attach_state |= PERF_ATTACH_SCHED_CB; + } } if (x86_pmu.pebs_aliases) x86_pmu.pebs_aliases(event); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 9a38f579bc76..af9ee539c179 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -606,6 +606,7 @@ struct swevent_hlist { #define PERF_ATTACH_TASK 0x04 #define PERF_ATTACH_TASK_DATA 0x08 #define PERF_ATTACH_ITRACE0x10 +#define PERF_ATTACH_SCHED_CB 0x20 struct perf_cgroup; struct perf_buffer; @@ -817,6 
+818,7 @@ struct perf_event_context { int is_active; int nr_stat; int nr_freq; + int nr_sched_task; int rotate_disable; /* * Set when nr_events != nr_active, except tolerant to events not @@ -872,7 +874,7 @@ struct perf_cpu_context { struct list_headcgrp_cpuc
Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 11/4/2020 12:11 PM, Liang, Kan wrote: On 10/13/2020 12:34 PM, Peter Zijlstra wrote: Subject: perf,mm: Handle non-page-table-aligned hugetlbfs From: Peter Zijlstra Date: Fri, 9 Oct 2020 11:09:27 +0200 A limited number of architectures support hugetlbfs sizes that do not align with the page-tables (ARM64, Power, Sparc64). Add support for this to the generic perf_get_page_size() implementation, and also allow an architecture to override this implementation. This latter is only needed when it uses non-page-table aligned huge pages in its kernel map. Signed-off-by: Peter Zijlstra (Intel) --- kernel/events/core.c | 39 +-- 1 file changed, 33 insertions(+), 6 deletions(-) --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6996,10 +6996,18 @@ static u64 perf_virt_to_phys(u64 virt) #ifdef CONFIG_MMU /* - * Return the MMU page size of a given virtual address + * Return the MMU page size of a given virtual address. + * + * This generic implementation handles page-table aligned huge pages, as well + * as non-page-table aligned hugetlbfs compound pages. + * + * If an architecture supports and uses non-page-table aligned pages in their + * kernel mapping it will need to provide it's own implementation of this + * function.
*/ -static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr) +__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr) { + struct page *page; pgd_t *pgd; p4d_t *p4d; pud_t *pud; @@ -7021,15 +7029,27 @@ static u64 __perf_get_page_size(struct m if (!pud_present(*pud)) return 0; - if (pud_leaf(*pud)) + if (pud_leaf(*pud)) { +#ifdef pud_page + page = pud_page(*pud); + if (PageHuge(page)) + return page_size(compound_head(page)); +#endif return 1ULL << PUD_SHIFT; + } pmd = pmd_offset(pud, addr); if (!pmd_present(*pmd)) return 0; - if (pmd_leaf(*pmd)) + if (pmd_leaf(*pmd)) { +#ifdef pmd_page + page = pmd_page(*pmd); + if (PageHuge(page)) + return page_size(compound_head(page)); +#endif return 1ULL << PMD_SHIFT; + } pte = pte_offset_map(pmd, addr); if (!pte_present(*pte)) { @@ -7037,13 +7057,20 @@ static u64 __perf_get_page_size(struct m return 0; } + page = pte_page(*pte); + if (PageHuge(page)) { + u64 size = page_size(compound_head(page)); + pte_unmap(pte); + return size; + } + The PageHuge() check for PTE crashes my machine when I did page size test. (Sorry, I didn't find the issue earlier. I just found some time to re-run the page size test.) It seems we don't need the check for PTE here. The size should be always PAGE_SIZE, no? After I remove the check, everything looks good. Hi Peter, Any comments for this issue? As my understanding, we don't need the PageHuge() check for PTE page here. However, I don't expect that the page fault crashes here. Because perf already check the pte_present(*pte) before invoking PageHuge(). I'm not sure if it's triggered by other potential issues, so I'm asking here rather than submiting a patch to simply remove the PageHuge() check. 
Thanks, Kan [ 167.383120] BUG: unable to handle page fault for address: fca803fb8000 [ 167.383121] #PF: supervisor read access in kernel mode [ 167.383121] #PF: error_code(0x) - not-present page [ 167.383121] PGD 4adfc8067 P4D 4adfc8067 PUD 4adfc7067 PMD 0 [ 167.383122] Oops: [#1] SMP NOPTI [ 167.383122] CPU: 0 PID: 2461 Comm: perf Not tainted 5.10.0-rc1_page_size+ #54 [ 167.383123] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3162.A00.1904162000 04/16/2019 [ 167.383123] traps: PANIC: double fault, error_code: 0x0 [ 167.383123] double fault: [#2] SMP NOPTI [ 167.383123] CPU: 0 PID: 2461 Comm: perf Not tainted 5.10.0-rc1_page_size+ #54 [ 167.383124] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3162.A00.1904162000 04/16/2019 [ 167.383124] RIP: 0010:__sprint_symbol+0xb/0x100 [ 167.383124] Code: 85 e4 75 b9 48 85 c0 75 bc 49 89 d8 4c 89 e1 4c 89 fa 4c 89 f6 4c 89 ef e8 42 70 04 00 eb a6 0f 1f 44 00 00 55 48 89 e5 41 57 <41> 56 49 89 f6 41 55 41 89 cd 48 8d 4d b8 41 54 49 89 fc 53 48 63 [ 167.383125] RSP: 0018:fe00b000 EFLAGS: 00010046 [ 167.383125] RAX: 0053 RBX: 0a00 RCX: 0001 [ 167.383125] RDX: RSI: 9f8a6176 RDI: fe00b029 [ 167.383126] RBP: fe00b008 R08: a0bf8661 R09: 0001 [ 167.383126] R10: 000f R11: 9e641dc189c8 R12: 9e641dc189c9 [ 167.383126] R13: 9e641dc1a7e0 R14: 0a00ff05 R15: fe00b029 [ 167.383126] FS: () GS:9e641dc0() knlG
Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events
On 11/9/2020 12:33 PM, Peter Zijlstra wrote: On Mon, Nov 09, 2020 at 09:49:31AM -0500, Liang, Kan wrote: Maybe we can frob x86_pmu_enable()... Could you please elaborate? Something horrible like this. It will detect the first time we enable the PMU on a new task (IOW we did a context switch) and wipe the counters when user RDPMC is on... Oh, you mean the RDPMC patch. It should be doable, but I think sched_task() may be a better place, especially with the new optimization (patch 3). We can set PERF_SCHED_CB_SW_IN bit for the case, so we only do the check for per-task events in sched in. It looks like the below patch has to unconditionally do the check (even for the non-RDPMC cases), which should be unnecessary. Anyway, I think the RDPMC patch should depend on the implementation of the sched_task(). We may have further discussion when the design of sched_task() is finalized. Thanks, Kan diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 77b963e5e70a..d862927baaef 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1289,6 +1289,15 @@ static void x86_pmu_enable(struct pmu *pmu) perf_events_lapic_init(); } + if (cpuc->current != current) { + struct mm_struct *mm = current->mm; + + cpuc->current = current; + + if (mm && atomic_read(>context.perf_rdpmc_allowed)) + wipe_dirty_counters(); + } + cpuc->enabled = 1; barrier(); diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 7895cf4c59a7..d16118cb3bd0 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -248,6 +248,8 @@ struct cpu_hw_events { unsigned inttxn_flags; int is_fake; + void *current; + /* * Intel DebugStore bits */
Re: [PATCH 1/3] perf/core: Flush PMU internal buffers for per-CPU events
On 11/9/2020 6:04 AM, Peter Zijlstra wrote: On Mon, Nov 09, 2020 at 10:52:35AM +0100, Peter Zijlstra wrote: On Fri, Nov 06, 2020 at 01:29:33PM -0800, kan.li...@linux.intel.com wrote: From: Kan Liang Sometimes the PMU internal buffers have to be flushed for per-CPU events during a context switch, e.g., large PEBS. Otherwise, the perf tool may report samples in locations that do not belong to the process where the samples are processed in, because PEBS does not tag samples with PID/TID. The current code only flush the buffers for a per-task event. It doesn't check a per-CPU event. Add a new event state flag, PERF_ATTACH_SCHED_CB, to indicate that the PMU internal buffers have to be flushed for this event during a context switch. Add sched_cb_entry and perf_sched_cb_usages back to track the PMU/cpuctx which is required to be flushed. Only need to invoke the sched_task() for per-CPU events in this patch. The per-task events have been handled in perf_event_context_sched_in/out already. Fixes: 9c964efa4330 ("perf/x86/intel: Drain the PEBS buffer during context switches") Are you sure? In part this patch looks like a revert of: 44fae179ce73a26733d9e2d346da4e1a1cb94647 556cccad389717d6eb4f5a24b45ff41cad3aaabf The patch tries to fix a long term PEBS bug with per-CPU event. It's not a revert. I think we still want to keep the optimization for per-task event with 44fae179ce73 and 556cccad3897. Here is the story. - When the large PEBS was introduced (9c964efa4330), the sched_task() should be invoked to flush the PEBS buffer in each context switch. However, The perf_sched_events in account_event() is not updated accordingly. The perf_event_task_sched_* never be invoked for a pure per-CPU context. Only per-task event works. At that time, the perf_pmu_sched_task() is outside of perf_event_context_sched_in/out. It means that perf has to double perf_pmu_disable() for per-task event. 
- The recent optimization (44fae179ce73 and 556cccad3897) only optimize the per-task event by moving the sched_task() into perf_event_context_sched_in/out. So perf only need to invoke perf_pmu_disable() once. But it doesn't fix the issue in a pure per-CPU context. The per-CPU events are still broken. - The patch 1 tries to fix broken per-CPU events. The CPU context cannot be retrieved from the task->perf_event_ctxp. So it has to be tracked in the sched_cb_list. Yes, the code is very similar to the original codes, but it is actually the new code for per-CPU events. The optimization for per-task events is still kept. For the case, which has both a CPU context and a task context, yes, the __perf_pmu_sched_task() in this patch is not invoked. Because the sched_task() only need to be invoked once in a context switch. The sched_task() will be eventually invoked in the task context. - The following patch 3 tries to optimize the sched_task() further. - We handle the per-CPU event and per-task event in different places. In a pure per-task context, we don't need to go through the sched_cb_list. We can get the information from task->perf_event_ctxp. We introduce a flag to distinguish them. - For most of the cases, we don't need to invoke the sched_task() in both sched in and sched out. We also introduce a flag for PMUs to avoid unnecessary calls. All the implementation and optimization are generic. For example, for power, the BHRB entries are reset in context switch in. They can also benefit from the optimization. *groan*... I think I might've made a mistake with those two patches. I assumed the whole cpuctx->task_ctx thing was relevant, it is not. As per perf_sched_cb_{inc,dec}(struct pmu *), the thing we care about is that the *PMU* gets a context switch callback, we don't give a crap about the actual task context. Except that LBR code does, but I'm thinking that started the whole confusion -- and I'm still not sure it's actually correct either. Now,.. 
how did we end up with the above two patches anyway... /me frobs around in the inbox... Ah! that daft user RDPMC thing. I wanted to avoid yet another pmu::method(). Hmm.. and the reason I proposed that change is because we'd end up with the sched_cb for every context switch now, not just large-pebs and or lbr crud. And this form avoids the double perf_pmu_disable() and all that. With the patch set, I think the double perf_pmu_disable() is still avoided. Maybe we can frob x86_pmu_enable()... Could you please elaborate? Thanks, Kan
Re: [RFC 1/2] perf/core: Enable sched_task callbacks if PMU has it
On 11/5/2020 7:53 PM, Namhyung Kim wrote: On Fri, Nov 6, 2020 at 4:01 AM Liang, Kan wrote: On 11/5/2020 10:45 AM, Namhyung Kim wrote: Hello, On Thu, Nov 5, 2020 at 11:47 PM Liang, Kan wrote: On 11/2/2020 9:52 AM, Namhyung Kim wrote: If an event associated with a PMU which has a sched_task callback, it should be called regardless of cpu/task context. For example, I don't think it's necessary. We should call it when we have to. Otherwise, it just waste cycles. Shouldn't the patch 2 be enough? I'm not sure, without this patch __perf_event_task_sched_in/out cannot be called for per-cpu events (w/o cgroups) IMHO. And I could not find any other place to check the perf_sched_cb_usages. Yes, it should a bug for large PEBS, and it should has always been there since the large PEBS was introduced. I just tried some older kernels (before recent change). Large PEBS is not flushed with per-cpu events. But from your description, it looks like the issue is only found after recent change. Could you please double check if the issue can also be reproduced before the recent change? Yep, actually Gabriel reported this problem on v4.4 kernel. Thanks for the confirm. So large PEBS never works with per-cpu events. :( I will send a new patch set to address the issue. Thanks, Kan
Re: [RFC 2/2] perf/core: Invoke pmu::sched_task callback for per-cpu events
On 11/5/2020 4:15 PM, Stephane Eranian wrote: On Thu, Nov 5, 2020 at 11:40 AM Liang, Kan wrote: On 11/5/2020 10:54 AM, Namhyung Kim wrote: -void perf_sched_cb_inc(struct pmu *pmu) +void perf_sched_cb_inc(struct pmu *pmu, bool systemwide) { struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context); - if (!cpuctx->sched_cb_usage++) - list_add(>sched_cb_entry, this_cpu_ptr(_cb_list)); + cpuctx->sched_cb_usage++; - this_cpu_inc(perf_sched_cb_usages); + if (systemwide) { + this_cpu_inc(perf_sched_cb_usages); + list_add(>sched_cb_entry, this_cpu_ptr(_cb_list)); You need to check the value and make sure it's added only once. Right, maybe we have to add a new variable for that. Sure, I tend to agree here that we need a narrower scope trigger, only when needed, i.e., an event or PMU feature that requires ctxsw work. In fact, I am also interested in splitting ctxswin and ctswout callbacks. The reason is that you have overhead in the way the callback is invoked. You may end up calling the sched_task on ctxswout when only ctxwin is needed. In doing that you pay the cost of stopping/starting the PMU for possibly nothing. Stopping the PMU can be expensive, like on AMD where you need multiple wrmsr. So splitting or adding a flag to convey that either CTXSW_IN or CTXSW_OUT is needed would help. I am suggesting this now given you are adding a flag. Yes, adding a new flag would avoid the unnecessary PMU stop/restart for both per-cpu and per-process event. How about the patch as below? (Just did simple test. Should need another patch to split the callback.) 
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 6586f7e71cfb..e5837fdf82e3 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -370,6 +370,7 @@ static void power_pmu_bhrb_reset(void) static void power_pmu_bhrb_enable(struct perf_event *event) { struct cpu_hw_events *cpuhw = this_cpu_ptr(_hw_events); + int state = PERF_SCHED_CB_SW_IN; if (!ppmu->bhrb_nr) return; @@ -380,7 +381,11 @@ static void power_pmu_bhrb_enable(struct perf_event *event) cpuhw->bhrb_context = event->ctx; } cpuhw->bhrb_users++; - perf_sched_cb_inc(event->ctx->pmu); + + if (!(event->attach_state & PERF_ATTACH_TASK)) + state |= PERF_SCHED_CB_CPU; + + perf_sched_cb_inc(event->ctx->pmu, state); } static void power_pmu_bhrb_disable(struct perf_event *event) diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 444e5f061d04..f4f9d737ca85 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -1019,12 +1019,16 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, * that does not hurt: */ bool update = cpuc->n_pebs == 1; + int state = PERF_SCHED_CB_SW_OUT; if (needed_cb != pebs_needs_sched_cb(cpuc)) { + if (!(event->attach_state & PERF_ATTACH_TASK)) + state |= PERF_SCHED_CB_CPU; + if (!needed_cb) - perf_sched_cb_inc(pmu); + perf_sched_cb_inc(pmu, state); else - perf_sched_cb_dec(pmu); + perf_sched_cb_dec(pmu, state); update = true; } diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 8961653c5dd2..e4c500580df5 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -660,6 +660,7 @@ void intel_pmu_lbr_add(struct perf_event *event) { struct kmem_cache *kmem_cache = event->pmu->task_ctx_cache; struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events); + int state = PERF_SCHED_CB_SW_IN; if (!x86_pmu.lbr_nr) return; @@ -693,7 +694,13 @@ void intel_pmu_lbr_add(struct perf_event *event) */ if (x86_pmu.intel_cap.pebs_baseline && 
event->attr.precise_ip > 0) cpuc->lbr_pebs_users++; - perf_sched_cb_inc(event->ctx->pmu); + + if (!(event->attach_state & PERF_ATTACH_TASK)) + state |= PERF_SCHED_CB_CPU; + if (event->attach_state & PERF_ATTACH_TASK_DATA) + state |= PERF_SCHED_CB_SW_OUT; + + perf_sched_cb_inc(event->ctx->pmu, state); if (!cpuc->lbr_users++ && !event->total_time_running) intel_pmu_lbr_reset(); @@ -724,6 +731,7 @@ void release_lbr_buffers(void) void intel_pmu_lbr_del(struct perf_event *event) { struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events); + int state = 0; if (!x86_pmu.lbr_nr) return; @@ -740,7 +748,10 @@ void intel_pmu_lbr_del(struct perf_event *event) cpuc->lbr_use
Re: [RFC 2/2] perf/core: Invoke pmu::sched_task callback for per-cpu events
On 11/5/2020 10:54 AM, Namhyung Kim wrote: -void perf_sched_cb_inc(struct pmu *pmu) +void perf_sched_cb_inc(struct pmu *pmu, bool systemwide) { struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context); - if (!cpuctx->sched_cb_usage++) - list_add(>sched_cb_entry, this_cpu_ptr(_cb_list)); + cpuctx->sched_cb_usage++; - this_cpu_inc(perf_sched_cb_usages); + if (systemwide) { + this_cpu_inc(perf_sched_cb_usages); + list_add(>sched_cb_entry, this_cpu_ptr(_cb_list)); You need to check the value and make sure it's added only once. Right, maybe we have to add a new variable for that. diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 6586f7e71cfb..63c9b87cab5e 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -380,7 +380,7 @@ static void power_pmu_bhrb_enable(struct perf_event *event) cpuhw->bhrb_context = event->ctx; } cpuhw->bhrb_users++; - perf_sched_cb_inc(event->ctx->pmu); + perf_sched_cb_inc(event->ctx->pmu, !(event->attach_state & PERF_ATTACH_TASK)); } static void power_pmu_bhrb_disable(struct perf_event *event) diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c index 444e5f061d04..a34b90c7fa6d 100644 --- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -1022,9 +1022,9 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc, if (needed_cb != pebs_needs_sched_cb(cpuc)) { if (!needed_cb) - perf_sched_cb_inc(pmu); + perf_sched_cb_inc(pmu, !(event->attach_state & PERF_ATTACH_TASK)); else - perf_sched_cb_dec(pmu); + perf_sched_cb_dec(pmu, !(event->attach_state & PERF_ATTACH_TASK)); update = true; } diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 8961653c5dd2..8d4d02cde3d4 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -693,7 +693,7 @@ void intel_pmu_lbr_add(struct perf_event *event) */ if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0) cpuc->lbr_pebs_users++; - 
perf_sched_cb_inc(event->ctx->pmu); + perf_sched_cb_inc(event->ctx->pmu, !(event->attach_state & PERF_ATTACH_TASK)); if (!cpuc->lbr_users++ && !event->total_time_running) intel_pmu_lbr_reset(); @@ -740,7 +740,7 @@ void intel_pmu_lbr_del(struct perf_event *event) cpuc->lbr_users--; WARN_ON_ONCE(cpuc->lbr_users < 0); WARN_ON_ONCE(cpuc->lbr_pebs_users < 0); - perf_sched_cb_dec(event->ctx->pmu); + perf_sched_cb_dec(event->ctx->pmu, !(event->attach_state & PERF_ATTACH_TASK)); } static inline bool vlbr_exclude_host(void) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index a1b91f2de264..14f936385cc8 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -875,6 +875,7 @@ struct perf_cpu_context { struct list_headsched_cb_entry; int sched_cb_usage; + int sched_cb_sw_usage; int online; /* @@ -967,8 +968,8 @@ extern const struct perf_event_attr *perf_event_attrs(struct perf_event *event); extern void perf_event_print_debug(void); extern void perf_pmu_disable(struct pmu *pmu); extern void perf_pmu_enable(struct pmu *pmu); -extern void perf_sched_cb_dec(struct pmu *pmu); -extern void perf_sched_cb_inc(struct pmu *pmu); +extern void perf_sched_cb_dec(struct pmu *pmu, bool systemwide); +extern void perf_sched_cb_inc(struct pmu *pmu, bool systemwide); extern int perf_event_task_disable(void); extern int perf_event_task_enable(void); diff --git a/kernel/events/core.c b/kernel/events/core.c index 66a9bd71f3da..af75859c9138 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3484,22 +3484,32 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn, static DEFINE_PER_CPU(struct list_head, sched_cb_list); -void perf_sched_cb_dec(struct pmu *pmu) +void perf_sched_cb_dec(struct pmu *pmu, bool systemwide) { struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context); + --cpuctx->sched_cb_usage; + + if (!systemwide) + return; + this_cpu_dec(perf_sched_cb_usages); - if (!--cpuctx->sched_cb_usage) + if 
(!--cpuctx->sched_cb_sw_usage) list_del(>sched_cb_entry); } -void perf_sched_cb_inc(struct pmu *pmu) +void perf_sched_cb_inc(struct pmu *pmu, bool systemwide) { struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context); - if (!cpuctx->sched_cb_usage++) + cpuctx->sched_cb_usage++; + + if (!systemwide) + return; + + if
Re: [RFC 1/2] perf/core: Enable sched_task callbacks if PMU has it
On 11/5/2020 10:45 AM, Namhyung Kim wrote: Hello, On Thu, Nov 5, 2020 at 11:47 PM Liang, Kan wrote: On 11/2/2020 9:52 AM, Namhyung Kim wrote: If an event associated with a PMU which has a sched_task callback, it should be called regardless of cpu/task context. For example, I don't think it's necessary. We should call it when we have to. Otherwise, it just waste cycles. Shouldn't the patch 2 be enough? I'm not sure, without this patch __perf_event_task_sched_in/out cannot be called for per-cpu events (w/o cgroups) IMHO. And I could not find any other place to check the perf_sched_cb_usages. Yes, it should a bug for large PEBS, and it should has always been there since the large PEBS was introduced. I just tried some older kernels (before recent change). Large PEBS is not flushed with per-cpu events. But from your description, it looks like the issue is only found after recent change. Could you please double check if the issue can also be reproduced before the recent change? diff --git a/kernel/events/core.c b/kernel/events/core.c index b458ed3dc81b..aaa0155c4142 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4696,6 +4696,8 @@ static void unaccount_event(struct perf_event *event) dec = true; if (has_branch_stack(event)) dec = true; + if (event->pmu->sched_task) + dec = true; I think sched_task is a too big hammer. The __perf_event_task_sched_in/out will be invoked even for non-pebs per-cpu events, which is not necessary. Maybe we can introduce a flag to indicate the case. How about the patch as below? 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index c79748f6921d..953a4bb98330 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3565,8 +3565,10 @@ static int intel_pmu_hw_config(struct perf_event *event) if (!(event->attr.freq || (event->attr.wakeup_events && !event->attr.watermark))) { event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD; if (!(event->attr.sample_type & - ~intel_pmu_large_pebs_flags(event))) + ~intel_pmu_large_pebs_flags(event))) { event->hw.flags |= PERF_X86_EVENT_LARGE_PEBS; + event->attach_state |= PERF_ATTACH_SCHED_DATA; + } } if (x86_pmu.pebs_aliases) x86_pmu.pebs_aliases(event); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 0defb526cd0c..3eef7142aa11 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -606,6 +606,7 @@ struct swevent_hlist { #define PERF_ATTACH_TASK 0x04 #define PERF_ATTACH_TASK_DATA 0x08 #define PERF_ATTACH_ITRACE 0x10 +#define PERF_ATTACH_SCHED_DATA 0x20 struct perf_cgroup; struct perf_buffer; diff --git a/kernel/events/core.c b/kernel/events/core.c index dba4ea4e648b..a38133b5543a 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4675,7 +4675,7 @@ static void unaccount_event(struct perf_event *event) if (event->parent) return; - if (event->attach_state & PERF_ATTACH_TASK) + if (event->attach_state & (PERF_ATTACH_TASK | PERF_ATTACH_SCHED_DATA)) dec = true; if (event->attr.mmap || event->attr.mmap_data) atomic_dec(_mmap_events); @@ -11204,7 +11204,7 @@ static void account_event(struct perf_event *event) if (event->parent) return; - if (event->attach_state & PERF_ATTACH_TASK) + if (event->attach_state & (PERF_ATTACH_TASK | PERF_ATTACH_SCHED_DATA)) inc = true; if (event->attr.mmap || event->attr.mmap_data) atomic_inc(_mmap_events); Thanks, Kan
Re: [RFC 2/2] perf/core: Invoke pmu::sched_task callback for per-cpu events
On 11/2/2020 9:52 AM, Namhyung Kim wrote: The commit 44fae179ce73 ("perf/core: Pull pmu::sched_task() into perf_event_context_sched_out()") moved the pmu::sched_task callback to be called for task event context. But it missed to call it for per-cpu events to flush PMU internal buffers (i.e. for PEBS, ...). This commit basically reverts the commit but keeps the optimization for task events and only add missing calls for cpu events. Fixes: 44fae179ce73 ("perf/core: Pull pmu::sched_task() into perf_event_context_sched_out()") > Signed-off-by: Namhyung Kim --- include/linux/perf_event.h | 1 + kernel/events/core.c | 38 -- 2 files changed, 37 insertions(+), 2 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 0defb526cd0c..abb70a557cb5 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -872,6 +872,7 @@ struct perf_cpu_context { struct list_headcgrp_cpuctx_entry; #endif + struct list_head sched_cb_entry; int sched_cb_usage; intonline; diff --git a/kernel/events/core.c b/kernel/events/core.c index aaa0155c4142..c04d9a913537 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -383,6 +383,7 @@ static DEFINE_MUTEX(perf_sched_mutex); static atomic_t perf_sched_count; static DEFINE_PER_CPU(atomic_t, perf_cgroup_events); +static DEFINE_PER_CPU(int, perf_sched_cb_usages); static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events); static atomic_t nr_mmap_events __read_mostly; @@ -3480,11 +3481,16 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn, } } +static DEFINE_PER_CPU(struct list_head, sched_cb_list); + void perf_sched_cb_dec(struct pmu *pmu) { struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context); - --cpuctx->sched_cb_usage; + this_cpu_dec(perf_sched_cb_usages); + + if (!--cpuctx->sched_cb_usage) + list_del(>sched_cb_entry); } @@ -3492,7 +3498,10 @@ void perf_sched_cb_inc(struct pmu *pmu) { struct perf_cpu_context *cpuctx = 
this_cpu_ptr(pmu->pmu_cpu_context); - cpuctx->sched_cb_usage++; + if (!cpuctx->sched_cb_usage++) + list_add(>sched_cb_entry, this_cpu_ptr(_cb_list)); + + this_cpu_inc(perf_sched_cb_usages); } /* @@ -3521,6 +3530,24 @@ static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in perf_ctx_unlock(cpuctx, cpuctx->task_ctx); } +static void perf_pmu_sched_task(struct task_struct *prev, + struct task_struct *next, + bool sched_in) +{ + struct perf_cpu_context *cpuctx; + + if (prev == next) + return; + + list_for_each_entry(cpuctx, this_cpu_ptr(_cb_list), sched_cb_entry) { + /* will be handled in perf_event_context_sched_in/out */ + if (cpuctx->task_ctx) + continue; + + __perf_pmu_sched_task(cpuctx, sched_in); + } +} + static void perf_event_switch(struct task_struct *task, struct task_struct *next_prev, bool sched_in); @@ -3543,6 +3570,9 @@ void __perf_event_task_sched_out(struct task_struct *task, { int ctxn; + if (__this_cpu_read(perf_sched_cb_usages)) + perf_pmu_sched_task(task, next, false); + if (atomic_read(_switch_events)) perf_event_switch(task, next, false); @@ -3850,6 +3880,9 @@ void __perf_event_task_sched_in(struct task_struct *prev, if (atomic_read(_switch_events)) perf_event_switch(task, prev, true); + + if (__this_cpu_read(perf_sched_cb_usages)) + perf_pmu_sched_task(prev, task, true); } It looks like the ("perf/core: Pull pmu::sched_task() into perf_event_context_sched_in()") is also reverted in the patch. static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count) @@ -12999,6 +13032,7 @@ static void __init perf_event_init_all_cpus(void) #ifdef CONFIG_CGROUP_PERF INIT_LIST_HEAD(_cpu(cgrp_cpuctx_list, cpu)); #endif + INIT_LIST_HEAD(_cpu(sched_cb_list, cpu)); } } Can we only update the perf_sched_cb_usages and sched_cb_list for per-cpu event as below patch (not tested)? If user only uses the per-process event, we don't need to go through the list. 
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 6586f7e71cfb..63c9b87cab5e 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -380,7 +380,7 @@ static void power_pmu_bhrb_enable(struct perf_event *event) cpuhw->bhrb_context = event->ctx; } cpuhw->bhrb_users++; - perf_sched_cb_inc(event->ctx->pmu); + perf_sched_cb_inc(event->ctx->pmu,
Re: [RFC 1/2] perf/core: Enable sched_task callbacks if PMU has it
On 11/2/2020 9:52 AM, Namhyung Kim wrote: If an event associated with a PMU which has a sched_task callback, it should be called regardless of cpu/task context. For example, I don't think it's necessary. We should call it when we have to. Otherwise, it just waste cycles. Shouldn't the patch 2 be enough? Thanks, Kan a per-cpu event might enable large PEBS buffers so it needs to flush the buffer whenever task scheduling happens. > The underlying PMU may or may not require this for the given event, but it will be handled in the pmu::sched_task() callback anyway. Signed-off-by: Namhyung Kim --- kernel/events/core.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/events/core.c b/kernel/events/core.c index b458ed3dc81b..aaa0155c4142 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4696,6 +4696,8 @@ static void unaccount_event(struct perf_event *event) dec = true; if (has_branch_stack(event)) dec = true; + if (event->pmu->sched_task) + dec = true; if (event->attr.ksymbol) atomic_dec(_ksymbol_events); if (event->attr.bpf_event) @@ -11225,6 +11227,8 @@ static void account_event(struct perf_event *event) inc = true; if (is_cgroup_event(event)) inc = true; + if (event->pmu->sched_task) + inc = true; if (event->attr.ksymbol) atomic_inc(_ksymbol_events); if (event->attr.bpf_event)
Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 10/13/2020 12:34 PM, Peter Zijlstra wrote: Subject: perf,mm: Handle non-page-table-aligned hugetlbfs From: Peter Zijlstra Date: Fri, 9 Oct 2020 11:09:27 +0200 A limited nunmber of architectures support hugetlbfs sizes that do not align with the page-tables (ARM64, Power, Sparc64). Add support for this to the generic perf_get_page_size() implementation, and also allow an architecture to override this implementation. This latter is only needed when it uses non-page-table aligned huge pages in its kernel map. Signed-off-by: Peter Zijlstra (Intel) --- kernel/events/core.c | 39 +-- 1 file changed, 33 insertions(+), 6 deletions(-) --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6996,10 +6996,18 @@ static u64 perf_virt_to_phys(u64 virt) #ifdef CONFIG_MMU /* - * Return the MMU page size of a given virtual address + * Return the MMU page size of a given virtual address. + * + * This generic implementation handles page-table aligned huge pages, as well + * as non-page-table aligned hugetlbfs compound pages. + * + * If an architecture supports and uses non-page-table aligned pages in their + * kernel mapping it will need to provide it's own implementation of this + * function. 
*/ -static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr) +__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr) { + struct page *page; pgd_t *pgd; p4d_t *p4d; pud_t *pud; @@ -7021,15 +7029,27 @@ static u64 __perf_get_page_size(struct m if (!pud_present(*pud)) return 0; - if (pud_leaf(*pud)) + if (pud_leaf(*pud)) { +#ifdef pud_page + page = pud_page(*pud); + if (PageHuge(page)) + return page_size(compound_head(page)); +#endif return 1ULL << PUD_SHIFT; + } pmd = pmd_offset(pud, addr); if (!pmd_present(*pmd)) return 0; - if (pmd_leaf(*pmd)) + if (pmd_leaf(*pmd)) { +#ifdef pmd_page + page = pmd_page(*pmd); + if (PageHuge(page)) + return page_size(compound_head(page)); +#endif return 1ULL << PMD_SHIFT; + } pte = pte_offset_map(pmd, addr); if (!pte_present(*pte)) { @@ -7037,13 +7057,20 @@ static u64 __perf_get_page_size(struct m return 0; } + page = pte_page(*pte); + if (PageHuge(page)) { + u64 size = page_size(compound_head(page)); + pte_unmap(pte); + return size; + } + The PageHuge() check for PTE crashes my machine when I did page size test. (Sorry, I didn't find the issue earlier. I just found some time to re-run the page size test.) It seems we don't need the check for PTE here. The size should be always PAGE_SIZE, no? After I remove the check, everything looks good. 
Thanks, Kan [ 167.383120] BUG: unable to handle page fault for address: fca803fb8000 [ 167.383121] #PF: supervisor read access in kernel mode [ 167.383121] #PF: error_code(0x) - not-present page [ 167.383121] PGD 4adfc8067 P4D 4adfc8067 PUD 4adfc7067 PMD 0 [ 167.383122] Oops: [#1] SMP NOPTI [ 167.383122] CPU: 0 PID: 2461 Comm: perf Not tainted 5.10.0-rc1_page_size+ #54 [ 167.383123] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3162.A00.1904162000 04/16/2019 [ 167.383123] traps: PANIC: double fault, error_code: 0x0 [ 167.383123] double fault: [#2] SMP NOPTI [ 167.383123] CPU: 0 PID: 2461 Comm: perf Not tainted 5.10.0-rc1_page_size+ #54 [ 167.383124] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3162.A00.1904162000 04/16/2019 [ 167.383124] RIP: 0010:__sprint_symbol+0xb/0x100 [ 167.383124] Code: 85 e4 75 b9 48 85 c0 75 bc 49 89 d8 4c 89 e1 4c 89 fa 4c 89 f6 4c 89 ef e8 42 70 04 00 eb a6 0f 1f 44 00 00 55 48 89 e5 41 57 <41> 56 49 89 f6 41 55 41 89 cd 48 8d 4d b8 41 54 49 89 fc 53 48 63 [ 167.383125] RSP: 0018:fe00b000 EFLAGS: 00010046 [ 167.383125] RAX: 0053 RBX: 0a00 RCX: 0001 [ 167.383125] RDX: RSI: 9f8a6176 RDI: fe00b029 [ 167.383126] RBP: fe00b008 R08: a0bf8661 R09: 0001 [ 167.383126] R10: 000f R11: 9e641dc189c8 R12: 9e641dc189c9 [ 167.383126] R13: 9e641dc1a7e0 R14: 0a00ff05 R15: fe00b029 [ 167.383126] FS: () GS:9e641dc0() knlGS: [ 167.383127] CS: 0010 DS: ES: CR0: 80050033 [ 167.383127] CR2: fe00aff8 CR3: 00014a0ba004 CR4: 00770ef0 [ 167.383127] PKRU: 5554 [ 167.383127] Call Trace: [ 167.383127] [ 167.383128] sprint_symbol+0x15/0x20 [ 167.383128] symbol_string+0x49/0x90 [
Re: [PATCH] perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
On 10/19/2020 12:09 PM, Andi Kleen wrote: Reported-by: Andi Kleen Signed-off-by: Kan Liang I guess this should have a Fixes: tag and also be proposed for stable. I will send V2 shortly to update the tag and Cc. Thanks, Kan
Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 10/12/2020 4:48 AM, Will Deacon wrote: On Sat, Oct 10, 2020 at 12:28:39AM +1100, Michael Ellerman wrote: Peter Zijlstra writes: Patch 4 makes it all far worse by exposing it to pretty much everybody. Now, I think we can fix at least the user mappings with the below delta, but if archs are using non-page-table MMU sizes we'll need arch helpers. ARM64 is in that last boat. I think we probably need it to be weak so we can provide our own version. I guess the same thing applies to us, but I can't really tell how accurate this stuff needs to be for userspace. If it's trying to use the page-table configuration to infer properties of the TLB, that's never going to be reliable because the TLB can choose both to split and coalesce entries as long as software can't tell. Hi Peter, It looks like everybody wants a __weak function. If so, I guess we should drop the generic code in this patch. For X86, we have existing functions to retrieve the page level and the page size. I think we don't need the generic code either. https://lkml.kernel.org/r/1549648509-12704-2-git-send-email-kan.li...@linux.intel.com/ Should I send a V10 patch to drop the generic code and implement an X86 specific perf_get_page_size()? Will, Michael, and others can implement their version later separately. Thanks, Kan
Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 10/9/2020 5:09 AM, Peter Zijlstra wrote: (we might not need the #ifdef gunk, but I've not yet dug out my cross compilers this morning) --- --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7009,6 +7009,7 @@ static u64 perf_virt_to_phys(u64 virt) */ static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr) { + struct page *page; pgd_t *pgd; p4d_t *p4d; pud_t *pud; @@ -7030,15 +7031,27 @@ static u64 __perf_get_page_size(struct m if (!pud_present(*pud)) return 0; - if (pud_leaf(*pud)) + if (pud_leaf(*pud)) { +#ifdef pud_page + page = pud_page(*pud); + if (PageHuge(page)) + return page_size(compound_head(page)); I think the page_size() returns the Kernel Page Size of a compound page. What we want is the MMU page size. If it's for the generic code, I think it should be a problem for X86. Thanks, Kan +#endif return 1ULL << PUD_SHIFT; + } pmd = pmd_offset(pud, addr); if (!pmd_present(*pmd)) return 0; - if (pmd_leaf(*pmd)) + if (pmd_leaf(*pmd)) { +#ifdef pmd_page + page = pmd_page(*pmd); + if (PageHuge(page)) + return page_size(compound_head(page)); +#endif return 1ULL << PMD_SHIFT; + } pte = pte_offset_map(pmd, addr); if (!pte_present(*pte)) { @@ -7046,6 +7059,10 @@ static u64 __perf_get_page_size(struct m return 0; } + page = pte_page(*pte); + if (PageHuge(page)) + return page_size(compound_head(page)); + pte_unmap(pte); return PAGE_SIZE; }
Re: [PATCH] perf/x86/intel: Fix n_metric for the canceled group
On 10/2/2020 7:02 AM, Peter Zijlstra wrote: On Wed, Sep 30, 2020 at 07:29:35AM -0700, kan.li...@linux.intel.com wrote: From: Kan Liang When a group that has TopDown members is failed to be scheduled, any later TopDown groups will not return valid values. Here is an example. A background perf that occupies all the GP counters and the fixed counter 1. $perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles, cycles,cycles}:D" -a A user monitors a TopDown group. It works well, because the fixed counter 3 and the PERF_METRICS are available. $perf stat -x, --topdown -- ./workload retiring,bad speculation,frontend bound,backend bound, 18.0,16.1,40.4,25.5, Then the user tries to monitor a group that has TopDown members. Because of the cycles event, the group is failed to be scheduled. $perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound, topdown-fe-bound,topdown-bad-spec,cycles}' -- ./workload ,,slots,0,0.00,, ,,topdown-retiring,0,0.00,, ,,topdown-be-bound,0,0.00,, ,,topdown-fe-bound,0,0.00,, ,,topdown-bad-spec,0,0.00,, ,,cycles,0,0.00,, The user tries to monitor a TopDown group again. It doesn't work anymore. $perf stat -x, --topdown -- ./workload , In a txn, cancel_txn() is to truncate the event_list for a canceled group and update the number of events added in this transaction. However, the number of TopDown events added in this transaction is not updated. The kernel will probably fail to add new Topdown events. Check if the canceled group has Topdown events. If so, subtract the TopDown events from n_metric accordingly. 
Fixes: 7b2c05a15d29 ("perf/x86/intel: Generic support for hardware TopDown metrics") Reported-by: Andi Kleen Reviewed-by: Andi Kleen Signed-off-by: Kan Liang --- arch/x86/events/core.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 0f3d01562ded..4cb3ccbe2d62 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2017,6 +2017,7 @@ static void x86_pmu_cancel_txn(struct pmu *pmu) { unsigned int txn_flags; struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + int i; WARN_ON_ONCE(!cpuc->txn_flags); /* no txn in flight */ @@ -2031,6 +2032,15 @@ static void x86_pmu_cancel_txn(struct pmu *pmu) */ __this_cpu_sub(cpu_hw_events.n_added, __this_cpu_read(cpu_hw_events.n_txn)); __this_cpu_sub(cpu_hw_events.n_events, __this_cpu_read(cpu_hw_events.n_txn)); + + /* Subtract Topdown events in the canceled group from n_metric */ + if (x86_pmu.intel_cap.perf_metrics && cpuc->n_metric) { + for (i = 0; i < cpuc->n_txn; i++) { + if (is_metric_event(cpuc->event_list[i + cpuc->n_events])) + __this_cpu_dec(cpu_hw_events.n_metric); + } + WARN_ON_ONCE(__this_cpu_read(cpu_hw_events.n_metric) < 0); + } perf_pmu_enable(pmu); } Urgh, I'd much rather we add n_txn_metric. But also, while looking at this, don't we have the same problem with n_pair ? Something like this perhaps... Sure. For the perf metric, the below patch fixes the problem.
Tested-by: Kan Liang Thanks, Kan --- diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 757e49755e7c..9b7792c0b6fb 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -1066,6 +1066,7 @@ static int add_nr_metric_event(struct cpu_hw_events *cpuc, if (cpuc->n_metric == INTEL_TD_METRIC_NUM) return -EINVAL; cpuc->n_metric++; + cpuc->n_txn_metric++; } return 0; @@ -1089,8 +1090,10 @@ static int collect_event(struct cpu_hw_events *cpuc, struct perf_event *event, return -EINVAL; cpuc->event_list[n] = event; - if (is_counter_pair(&event->hw)) + if (is_counter_pair(&event->hw)) { cpuc->n_pair++; + cpuc->n_txn_pair++; + } return 0; } @@ -2062,6 +2065,8 @@ static void x86_pmu_start_txn(struct pmu *pmu, unsigned int txn_flags) perf_pmu_disable(pmu); __this_cpu_write(cpu_hw_events.n_txn, 0); + __this_cpu_write(cpu_hw_events.n_txn_metric, 0); + __this_cpu_write(cpu_hw_events.n_txn_pair, 0); } /* @@ -2087,6 +2092,8 @@ static void x86_pmu_cancel_txn(struct pmu *pmu) */ __this_cpu_sub(cpu_hw_events.n_added, __this_cpu_read(cpu_hw_events.n_txn)); __this_cpu_sub(cpu_hw_events.n_events, __this_cpu_read(cpu_hw_events.n_txn)); + __this_cpu_sub(cpu_hw_events.n_metric, __this_cpu_read(cpu_hw_events.n_txn_metric)); + __this_cpu_sub(cpu_hw_events.n_pair, __this_cpu_read(cpu_hw_events.n_txn_pair)); perf_pmu_enable(pmu); } diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/30/2020 6:45 PM, Stephane Eranian wrote: On Wed, Sep 30, 2020 at 10:30 AM Peter Zijlstra wrote: On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote: On 9/30/20 7:42 AM, Liang, Kan wrote: When I tested on my kernel, it panicked because I suspect current->active_mm could be NULL. Adding a check for NULL avoided the problem. But I suspect this is not the correct solution. I guess the NULL active_mm should be a rare case. If so, I think it's not bad to add a check and return 0 page size. I think it would be best to understand why ->active_mm is NULL instead of just papering over the problem. If it is papered over, and this is common, you might end up effectively turning off your shiny new feature inadvertently. context_switch() can set prev->active_mm to NULL when it transfers it to @next. It does this before @current is updated. So an NMI that comes in between this active_mm swizzling and updating @current will see !active_mm. I think Peter is right. This code is called in the context of NMI, so if active_mm is set to NULL inside a locked section, this is not enough to protect the perf_events code from seeing it. In general though; I think using ->active_mm is a mistake though. That code should be doing something like: mm = current->mm; if (!mm) mm = &init_mm; I will send a V9 with the changes Peter suggests. Thanks, Kan
Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/30/2020 3:15 AM, Stephane Eranian wrote: +static u64 perf_get_page_size(unsigned long addr) +{ + unsigned long flags; + u64 size; + + if (!addr) + return 0; + + /* + * Software page-table walkers must disable IRQs, + * which prevents any tear down of the page tables. + */ + local_irq_save(flags); + + size = __perf_get_page_size(current->active_mm, addr); + When I tested on my kernel, it panicked because I suspect current->active_mm could be NULL. Adding a check for NULL avoided the problem. But I suspect this is not the correct solution. I guess the NULL active_mm should be a rare case. If so, I think it's not bad to add a check and return 0 page size. Thanks, Kan
Re: [RESEND PATCH V2 5/6] perf/x86/intel/uncore: Generic support for the PCI sub driver
On 9/21/2020 6:19 PM, Bjorn Helgaas wrote: On Mon, Sep 14, 2020 at 07:34:19AM -0700, kan.li...@linux.intel.com wrote: From: Kan Liang Some uncore counters may be located in the configuration space of a PCI device, which already has a bonded driver. Currently, the uncore driver cannot register a PCI uncore PMU for these counters, because, to register a PCI uncore PMU, the uncore driver must be bound to the device. However, one device can only have one bonded driver. Add an uncore PCI sub driver to support such kinds of devices. The sub driver doesn't own the device. During initialization, the sub driver searches for the device via pci_get_device() and registers the corresponding PMU for the device. In the meantime, the sub driver registeris a PCI bus notifier, which is used to notify the sub driver once the device is removed. The sub driver can unregister the PMU accordingly. s/registeris/registers/ It looks like this only handles hot-remove of the device, not hot-add. Maybe that's OK for your use case, I dunno, so just pointing it out. Hi Bjorn, Thanks for the review. Yes, the patch only supports the hot-remove for now, because - I didn't get any requests for the hot-add support. I doubt anyone hot-adds a Root Port device in practice. - To be honest, I don't have a test case to verify the hot-add of a Root Port device. I used the command line below to test the hot-remove. sudo bash -c 'echo 1 > /sys/bus/pci/devices/\:14\:04.0/remove' Then the device is gone. I have no idea how to add it back. So I only implemented the hot-remove in this patch. If anyone requires the hot-add, I can add it later separately. Thanks, Kan
Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/17/2020 6:02 PM, Dave Hansen wrote: The problem is that the warning from the perf tool usually includes some hints regarding the cause of the warning or possible solution to workaround/fix the warning. What message should we deliver to the users? "Warning: Too many error page size. Address space isolation feature may be enabled, please check"? That's not particularly actionable for an end user. Do any of the perf errors just tell them to submit a bug report? "Excessive number of page size lookup errors, please report to..." Yes, I found some perf tool warnings that end with "Consider reporting to linux-kernel@vger.kernel.org." I will add a warning "Excessive number of page size lookup errors, please report to linux-kernel@vger.kernel.org.", if 90% of the samples have a page size of 0. Thanks, Kan
Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/17/2020 5:24 PM, Dave Hansen wrote: On 9/17/20 2:16 PM, Liang, Kan wrote: One last concern as I look at this: I wish it was a bit more future-proof. There are lots of weird things folks are trying to do with the page tables, like Address Space Isolation. For instance, if you get a perf NMI when running userspace, current->mm->pgd is *different* than the PGD that was in use when userspace was running. It's close enough today, but it might not stay that way. But I can't think of any great ways to future proof this code, other than spitting out an error message if too many of the page table walks fail when they shouldn't. If the page table walks fail, a page size of 0 will be returned. So the worst case is that the page size is not available for users, which is not a fatal error. Right, it's not a fatal error. It will just more or less silently break this feature. If my understanding is correct, when the above case happens, there is nothing we can do for now (because we have no idea what it will become), except disabling the page size support and throwing an error/warning. From the user's perspective, throwing an error message or marking the page size unavailable should be the same. I think we may leave the code as-is. We can fix the future case later separately. The only thing I can think of is to record the number of consecutive page walk errors without a success. Occasional failures are OK and expected, such as if reclaim zeroes a PTE and a later perf event goes and looks at it. But a *LOT* of consecutive errors indicates a real problem somewhere. Maybe if you have 10,000 or 1,000,000 successive walk failures, you do a WARN_ON_ONCE(). The user space perf tool looks like a better place for this kind of warning. The perf tool knows the total number of the samples. It also knows the number of the page size 0 samples. We can set a threshold, e.g., 90%. If 90% of the samples have a page size of 0, the perf tool will throw out a warning message.
The problem is that the warning from the perf tool usually includes some hints regarding the cause of the warning or possible solution to workaround/fix the warning. What message should we deliver to the users? "Warning: Too many error page size. Address space isolation feature may be enabled, please check"? Thanks, Kan
Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
On 9/17/2020 3:00 PM, Dave Hansen wrote: On 9/17/20 6:52 AM, kan.li...@linux.intel.com wrote: + mm = current->mm; + if (!mm) { + /* + * For kernel threads and the like, use init_mm so that + * we can find kernel memory. + */ + mm = &init_mm; + } I think it might be better to use current->active_mm instead of current->mm. Kernel threads can "take over" the mm of the threads that switched to them, so they're not actually using all of the page tables from the init_mm all the time. It's not _common_, thought. The only time that I think they can diverge is when vmalloc PGD sync'ing needs to be done, and there's even an effort to remove that. But, it's probably more more precise to just use ->active_mm since that's what will actually be more consistent with the values loaded into CR3. I _think_ ->active_mm is always non-NULL, too. Thanks. Yes, active_mm looks better here. I will use active_mm to replace mm and init_mm. One last concern as I look at this: I wish it was a bit more future-proof. There are lots of weird things folks are trying to do with the page tables, like Address Space Isolation. For instance, if you get a perf NMI when running userspace, current->mm->pgd is *different* than the PGD that was in use when userspace was running. It's close enough today, but it might not stay that way. But I can't think of any great ways to future proof this code, other than spitting out an error message if too many of the page table walks fail when they shouldn't. If the page table walks fail, a page size of 0 will be returned. So the worst case is that the page size is not available for users, which is not a fatal error. If my understanding is correct, when the above case happens, there is nothing we can do for now (because we have no idea what it will become), except disabling the page size support and throwing an error/warning. From the user's perspective, throwing an error message or marking the page size unavailable should be the same. I think we may leave the code as-is.
We can fix the future case later separately. Thanks, Kan
Re: [PATCH V2 3/4] perf stat: Support new per thread TopDown metrics
On 9/10/2020 11:37 PM, Namhyung Kim wrote: Hello, On Thu, Sep 10, 2020 at 10:48 PM wrote: From: Andi Kleen Icelake has support for reporting per thread TopDown metrics. These are reported differently than the previous TopDown support, each metric is standalone, but scaled to pipeline "slots". We don't need to do anything special for HyperThreading anymore. Teach perf stat --topdown to handle these new metrics and print them in the same way as the previous TopDown metrics. The restrictions of only being able to report information per core is gone. Acked-by: Jiri Olsa Co-developed-by: Kan Liang Signed-off-by: Kan Liang Signed-off-by: Andi Kleen --- tools/perf/Documentation/perf-stat.txt | 7 +- tools/perf/builtin-stat.c | 30 - tools/perf/util/stat-shadow.c | 89 ++ tools/perf/util/stat.c | 4 ++ tools/perf/util/stat.h | 8 +++ 5 files changed, 134 insertions(+), 4 deletions(-) diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index c9bfefc051fb..e803dbdc88a8 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -357,6 +357,11 @@ if the workload is actually bound by the CPU and not by something else. For best results it is usually a good idea to use it with interval mode like -I 1000, as the bottleneck of workloads can change often. +This enables --metric-only, unless overridden with --no-metric-only. + +The following restrictions only apply to older Intel CPUs and Atom, +on newer CPUs (IceLake and later) TopDown can be collected for any thread: + The top down metrics are collected per core instead of per CPU thread. Per core mode is automatically enabled and -a (global monitoring) is needed, requiring root rights or @@ -368,8 +373,6 @@ echo 0 > /proc/sys/kernel/nmi_watchdog for best results. Otherwise the bottlenecks may be inconsistent on workload with changing phases. -This enables --metric-only, unless overridden with --no-metric-only. 
- To interpret the results it is usually needed to know on which CPUs the workload runs on. If needed the CPUs can be forced using taskset. diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 5583e22ca808..6290da5bd142 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -128,6 +128,15 @@ static const char * topdown_attrs[] = { NULL, }; +static const char *topdown_metric_attrs[] = { + "slots", + "topdown-retiring", + "topdown-bad-spec", + "topdown-fe-bound", + "topdown-be-bound", + NULL, +}; + static const char *smi_cost_attrs = { "{" "msr/aperf/," @@ -1691,6 +1700,24 @@ static int add_default_attributes(void) char *str = NULL; bool warn = false; + if (!force_metric_only) + stat_config.metric_only = true; + + if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) { + pr_err("Out of memory\n"); + return -1; + } + if (topdown_metric_attrs[0] && str) { + if (!stat_config.interval && !stat_config.metric_only) { + fprintf(stat_config.output, + "Topdown accuracy may decrease when measuring long periods.\n" + "Please print the result regularly, e.g. -I1000\n"); + } + goto setup_metrics; + } + + str = NULL; zfree() ? Yes, even if the topdown events don't exist, str is still allocated. The str should be freed. I will send V3 to fix it shortly. Thanks, Kan Thanks Namhyung + if (stat_config.aggr_mode != AGGR_GLOBAL && stat_config.aggr_mode != AGGR_CORE) { pr_err("top down event configuration requires --per-core mode\n"); @@ -1702,8 +1729,6 @@ static int add_default_attributes(void) return -1; } - if (!force_metric_only) - stat_config.metric_only = true; if (topdown_filter_events(topdown_attrs, &str, arch_topdown_check_group()) < 0) { pr_err("Out of memory\n"); @@ -1712,6 +1737,7 @@ static int add_default_attributes(void) if (topdown_attrs[0] && str) { if (warn) arch_topdown_group_warn(); +setup_metrics: err = parse_events(evsel_list, str, &errinfo); if (err) { fprintf(stderr,
Re: [PATCH V2 3/3] perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
On 9/8/2020 11:58 AM, pet...@infradead.org wrote: > On Mon, Sep 07, 2020 at 06:01:15PM +0200, pet...@infradead.org wrote: >> On Fri, Aug 21, 2020 at 12:57:54PM -0700, kan.li...@linux.intel.com >>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c >>> index 0f3d01562ded..fa08d810dcd2 100644 >>> --- a/arch/x86/events/core.c >>> +++ b/arch/x86/events/core.c >>> @@ -1440,7 +1440,10 @@ static void x86_pmu_start(struct perf_event >>> *event, int flags) >>> >>> cpuc->events[idx] = event; >>> __set_bit(idx, cpuc->active_mask); >>> - __set_bit(idx, cpuc->running); >>> + /* The cpuc->running is only used by the P4 PMU */ >>> + if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON) && >>> + (boot_cpu_data.x86 == 0xf)) >>> + __set_bit(idx, cpuc->running); >>> x86_pmu.enable(event); >>> perf_event_update_userpage(event); >>> } >> >> Yuck! Use a static_branch() or something. This is a gnarly nest of >> code >> that runs 99.9% of the time to conclude a negative. IOW a complete >> waste >> of cycles for everybody not running a P4 space heater. > > Better still, move it into p4_pmu_enable_event(). > Sure, I will move the "cpuc->running" into P4 specific code. On 9/7/2020 12:01 PM, pet...@infradead.org wrote: @@ -1544,6 +1547,9 @@ static void x86_pmu_del(struct perf_event *event, int flags) if (cpuc->txn_flags & PERF_PMU_TXN_ADD) goto do_del; + if (READ_ONCE(x86_pmu.attr_rdpmc) && x86_pmu.sched_task && + test_bit(event->hw.idx, cpuc->active_mask)) + __set_bit(event->hw.idx, cpuc->dirty); And that too seems like an overly complicated set of tests and branches. This should be effectively true for the 99% common case. I made a mistake in the V2 version regarding whether the P4 PMU supports RDPMC. SDM explicitly documents that the RDPMC instruction was introduced in P6. I once thought P4 is older than P6, but I was wrong. P4 has the NetBurst microarchitecture, which is the successor to P6. So P4 should also support the RDPMC instruction.
We should not share the memory space between the 'dirty' and the 'running' variable. I will modify it in V3. I will also unconditionally set cpuc->dirty here. The worst case for an RDPMC task is that we may have to clear all counters for the first time in x86_pmu_event_mapped. After that, the sched_task() will clear/update the 'dirty'. Only the real 'dirty' counters are cleared. For a non-RDPMC task, it's harmless to unconditionally set the cpuc->dirty. static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm) { if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED)) return; + /* + * Enable sched_task() for the RDPMC task, + * and clear the existing dirty counters. + */ + if (x86_pmu.sched_task && event->hw.target && !is_sampling_event(event)) { + perf_sched_cb_inc(event->ctx->pmu); + x86_pmu_clear_dirty_counters(); + } I'm failing to see the correctness of the !is_sampling_event() part there. It looks like an RDPMC task can do sampling as well? I once thought it only does counting. I will fix it in V3. /* * This function relies on not being called concurrently in two * tasks in the same mm. Otherwise one task could observe @@ -2246,6 +2286,9 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED)) return; + if (x86_pmu.sched_task && event->hw.target && !is_sampling_event(event)) + perf_sched_cb_dec(event->ctx->pmu); + Idem.
if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed)) on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1); } diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index c72e4904e056..e67713bfa33a 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -4166,11 +4166,39 @@ static void intel_pmu_cpu_dead(int cpu) intel_cpuc_finish(&per_cpu(cpu_hw_events, cpu)); } +static void intel_pmu_rdpmc_sched_task(struct perf_event_context *ctx) +{ + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + struct perf_event *event; + + if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX)) + return; + + /* + * If the new task has the RDPMC enabled, clear the dirty counters to + * prevent the potential leak. If the new task doesn't have the RDPMC + * enabled, do nothing. + */ + list_for_each_entry(event, &ctx->event_list, event_entry) { + if (event->hw.target && + (event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED) && + !is_sampling_event(event) && + atomic_read(&event->mmap_count)) + break; + } + if (&event->event_entry == &ctx->event_list) + return; That's
Re: [PATCH V2 0/6] Support PCIe3 uncore PMU on Snow Ridge
Hi Peter, Could you please share your comments on this patch set? Thanks, Kan On 8/5/2020 4:24 PM, kan.li...@linux.intel.com wrote: From: Kan Liang Changes since V1: - Drop the platform device solution - A new uncore PCI sub driver solution is introduced which searches the PCIe Root Port device via pci_get_device() and id table. Register a PCI bus notifier for the remove notification. Once the device is removed, the uncore driver can be notified to unregister the corresponding PMU. - Modify the parameters of uncore_pci_pmu_unregister() function. The Snow Ridge integrated PCIe3 uncore performance monitoring unit (PMU) can be used to collect the performance data, e.g., the utilization between the PCIe devices and the components (in M2IOSF) which are responsible for translating and managing the requests to/from the device. The performance data is very useful for analyzing the performance of the PCIe devices. The PCIe3 uncore PMU was once supported in the Linux kernel, but it was removed by the commit 2167f1625c2f ("perf/x86/intel/uncore: Remove PCIe3 unit for SNR") due to the conflict between the PCIe Root Port driver and the perf uncore driver. The counters of the PCIe3 uncore PMU are located in the configuration space of the PCIe Root Port device, which already has a bonded driver portdrv_pci. One device can only have one bonded driver. The uncore driver always fails to load. To re-enable the PCIe3 uncore PMU support in the uncore driver, a new solution should be introduced, which has to meet the requirements below: - must have a reliable way to find the PCIe Root Port device from the uncore driver; - must be able to access the uncore counters of the PCIe Root Port device from the uncore driver; - must support hotplug. When the PCIe Root Port device is removed, the uncore driver has to be notified and unregister the PCIe3 uncore PMU. In the v1 patch set, a new platform device solution was implemented to address the issue, but it was rejected.
https://lkml.kernel.org/r/20200708183034.GA466341@bjorn-Precision-5520 A new uncore PCI sub driver solution is introduced from the V2 patch set, which: - searches the PCIe Root Port device via pci_get_device() and id table. The matched pci_dev can be used to register a PMU for accessing the counters in the PCIe Root Port device. - registers a PCI bus notifier. Once the device is removed, the uncore driver can be notified to unregister the corresponding PMU. Kan Liang (6): perf/x86/intel/uncore: Factor out uncore_pci_get_dev_die_info() perf/x86/intel/uncore: Factor out uncore_pci_find_dev_pmu() perf/x86/intel/uncore: Factor out uncore_pci_pmu_register() perf/x86/intel/uncore: Factor out uncore_pci_pmu_unregister() perf/x86/intel/uncore: Generic support for the PCI sub driver perf/x86/intel/uncore: Support PCIe3 unit on Snow Ridge arch/x86/events/intel/uncore.c | 273 +++ arch/x86/events/intel/uncore.h | 1 + arch/x86/events/intel/uncore_snbep.c | 53 +++ 3 files changed, 265 insertions(+), 62 deletions(-)
Re: [PATCH 0/4] TopDown metrics support for Ice Lake (perf tool)
On 8/26/2020 11:54 AM, Jiri Olsa wrote: On Thu, Aug 20, 2020 at 09:45:28AM -0700, kan.li...@linux.intel.com wrote: From: Kan Liang The kernel patches have been merged into the tip's perf/core branch. The patch set is on top of commit 2cb5383b30d4 ("perf/x86/intel: Support per-thread RDPMC TopDown metrics") of the tip's perf/core branch. The changes for the perf tool include: - Extend --topdown option to support per thread TopDown metrics - Support sample-read topdown metric group - Add a complete document for the TopDown usage. Ice Lake has support for measuring the level 1 TopDown metrics directly in hardware. This is implemented by an additional METRICS register, and a new Fixed Counter 3 that measures pipeline SLOTS. New in Icelake - Do not require generic counters. This allows to collect TopDown always in addition to other events. - Measuring TopDown per thread/process instead of only per core For the Ice Lake implementation of performance metrics, the values in PERF_METRICS MSR are derived from fixed counter 3. Software should start both registers, PERF_METRICS and fixed counter 3, from zero. Additionally, software is recommended to periodically clear both registers in order to maintain accurate measurements. The latter is required for certain scenarios that involve sampling metrics at high rates. Software should always write fixed counter 3 before writing to PERF_METRICS. IA32_PERF_GLOBAL_STATUS. OVF_PERF_METRICS[48]: If this bit is set, it indicates that some PERF_METRICS-related counter has overflowed and a PMI is triggered. Software has to synchronize, e.g. re-start, PERF_METRICS as well as fixed counter 3. Otherwise, PERF_METRICS may return invalid values. Limitation - To get an accurate result and avoid reading the METRICS register multiple times, the TopDown metrics events and SLOTS event have to be in the same group. - METRICS and SLOTS registers have to be cleared after each read by SW. That is to prevent the loss of precision.
- Cannot do sampling read SLOTS and TopDown metric events Please refer to SDM Vol3, 18.3.9.3 Performance Metrics for the details of TopDown metrics. Andi Kleen (2): perf stat: Support new per thread TopDown metrics perf, tools: Add documentation for topdown metrics Kan Liang (2): perf tools: Rename group to topdown perf record: Support sample-read topdown metric group I don't have Ice lake to actualy check, but it looks good Acked-by: Jiri Olsa Thanks Jirka for the review. Hi Arnaldo, Andi has a comment regarding a grammar error in the printf message. I planned to apply a fix in the next version. I'm not sure whether you will have more comments for this patch set. If yes, I will wait for the comments. If not, should I send a V2 patch set with this fixed? On 8/20/2020 4:05 PM, Andi Kleen wrote: >> + fprintf(stat_config.output, >> + "Topdown accuracy may decreases when measuring long period.\n" > "may decrease" ... "long periods". > > (s needs to move to the end) Thanks, Kan
Re: [PATCH V6 06/16] perf script: Use ULL for enum perf_output_field
On 8/12/2020 8:21 AM, Arnaldo Carvalho de Melo wrote: Em Mon, Aug 10, 2020 at 02:24:26PM -0700, Kan Liang escreveu: The Bitwise-Shift operator (1U << ) is used in the enum perf_output_field, which has already reached its capacity (32 items). If more items are added, a compile error will be triggered. Change the U to ULL, which extends the capacity to 64 items. The enum perf_output_field is only used to calculate a value for the 'fields' in the output structure. The 'fields' is u64. The change doesn't break anything. Jiri did this already: https://git.kernel.org/torvalds/c/60e5eeb56a1 Thanks for pointing it out. I will rebase the code on top of it. Thanks, Kan Signed-off-by: Kan Liang --- tools/perf/builtin-script.c | 64 ++--- 1 file changed, 32 insertions(+), 32 deletions(-) diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c index 447457786362..214bec350971 100644 --- a/tools/perf/builtin-script.c +++ b/tools/perf/builtin-script.c @@ -82,38 +82,38 @@ static bool native_arch; unsigned int scripting_max_stack = PERF_MAX_STACK_DEPTH; enum perf_output_field { - PERF_OUTPUT_COMM= 1U << 0, - PERF_OUTPUT_TID = 1U << 1, - PERF_OUTPUT_PID = 1U << 2, - PERF_OUTPUT_TIME= 1U << 3, - PERF_OUTPUT_CPU = 1U << 4, - PERF_OUTPUT_EVNAME = 1U << 5, - PERF_OUTPUT_TRACE = 1U << 6, - PERF_OUTPUT_IP = 1U << 7, - PERF_OUTPUT_SYM = 1U << 8, - PERF_OUTPUT_DSO = 1U << 9, - PERF_OUTPUT_ADDR= 1U << 10, - PERF_OUTPUT_SYMOFFSET = 1U << 11, - PERF_OUTPUT_SRCLINE = 1U << 12, - PERF_OUTPUT_PERIOD = 1U << 13, - PERF_OUTPUT_IREGS = 1U << 14, - PERF_OUTPUT_BRSTACK = 1U << 15, - PERF_OUTPUT_BRSTACKSYM = 1U << 16, - PERF_OUTPUT_DATA_SRC= 1U << 17, - PERF_OUTPUT_WEIGHT = 1U << 18, - PERF_OUTPUT_BPF_OUTPUT = 1U << 19, - PERF_OUTPUT_CALLINDENT = 1U << 20, - PERF_OUTPUT_INSN= 1U << 21, - PERF_OUTPUT_INSNLEN = 1U << 22, - PERF_OUTPUT_BRSTACKINSN = 1U << 23, - PERF_OUTPUT_BRSTACKOFF = 1U << 24, - PERF_OUTPUT_SYNTH = 1U << 25, - PERF_OUTPUT_PHYS_ADDR = 1U << 26, - PERF_OUTPUT_UREGS = 1U <<
27, - PERF_OUTPUT_METRIC = 1U << 28, - PERF_OUTPUT_MISC= 1U << 29, - PERF_OUTPUT_SRCCODE = 1U << 30, - PERF_OUTPUT_IPC = 1U << 31, + PERF_OUTPUT_COMM= 1ULL << 0, + PERF_OUTPUT_TID = 1ULL << 1, + PERF_OUTPUT_PID = 1ULL << 2, + PERF_OUTPUT_TIME= 1ULL << 3, + PERF_OUTPUT_CPU = 1ULL << 4, + PERF_OUTPUT_EVNAME = 1ULL << 5, + PERF_OUTPUT_TRACE = 1ULL << 6, + PERF_OUTPUT_IP = 1ULL << 7, + PERF_OUTPUT_SYM = 1ULL << 8, + PERF_OUTPUT_DSO = 1ULL << 9, + PERF_OUTPUT_ADDR= 1ULL << 10, + PERF_OUTPUT_SYMOFFSET = 1ULL << 11, + PERF_OUTPUT_SRCLINE = 1ULL << 12, + PERF_OUTPUT_PERIOD = 1ULL << 13, + PERF_OUTPUT_IREGS = 1ULL << 14, + PERF_OUTPUT_BRSTACK = 1ULL << 15, + PERF_OUTPUT_BRSTACKSYM = 1ULL << 16, + PERF_OUTPUT_DATA_SRC= 1ULL << 17, + PERF_OUTPUT_WEIGHT = 1ULL << 18, + PERF_OUTPUT_BPF_OUTPUT = 1ULL << 19, + PERF_OUTPUT_CALLINDENT = 1ULL << 20, + PERF_OUTPUT_INSN= 1ULL << 21, + PERF_OUTPUT_INSNLEN = 1ULL << 22, + PERF_OUTPUT_BRSTACKINSN = 1ULL << 23, + PERF_OUTPUT_BRSTACKOFF = 1ULL << 24, + PERF_OUTPUT_SYNTH = 1ULL << 25, + PERF_OUTPUT_PHYS_ADDR = 1ULL << 26, + PERF_OUTPUT_UREGS = 1ULL << 27, + PERF_OUTPUT_METRIC = 1ULL << 28, + PERF_OUTPUT_MISC= 1ULL << 29, + PERF_OUTPUT_SRCCODE = 1ULL << 30, + PERF_OUTPUT_IPC = 1ULL << 31, }; struct output_option { -- 2.17.1